How to evaluate the performance of fraud detection strategies and models?
Some useful and hands-on evaluation methods
Why is it difficult to evaluate the results of fraud detection?
When it comes to fraud detection, the absence of labeled data is common, particularly in the early stages when a blacklist may not yet exist. The performance of strategies and models is therefore hard to evaluate, especially precision and recall, the two most common metrics in the machine learning field. Even if you have ample labeled data, it only reflects past fraud patterns: when fraudsters change their behavior, the data distribution shifts. Moreover, fraud detection affects user experience and business growth, so the quality of detection results is critical.
Some useful and hands-on evaluation methods
1. Black Sample Validation
Black sample validation involves using historical blacklists, such as blacklisted accounts or devices, to calculate the intersection of the detection results and the blacklist. Precision and recall can then be estimated: the precision rate is the number of intersection cases divided by the number of detection results, while the recall rate is the number of intersection cases divided by the total number of entries in the blacklist.
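As a minimal sketch of this calculation (assuming both the detection results and the blacklist are available as sets of account IDs; the function and variable names here are illustrative):

```python
def estimate_precision_recall(detected_ids: set, blacklist_ids: set):
    """Estimate precision and recall of detection results against a historical blacklist.

    detected_ids: account/device IDs flagged by the strategy or model under evaluation.
    blacklist_ids: IDs from the historical blacklist, used as approximate ground truth.
    """
    intersection = detected_ids & blacklist_ids
    precision = len(intersection) / len(detected_ids) if detected_ids else 0.0
    recall = len(intersection) / len(blacklist_ids) if blacklist_ids else 0.0
    return precision, recall


# Illustrative usage with made-up IDs
detected = {"acct_1", "acct_2", "acct_3", "acct_9"}
blacklist = {"acct_2", "acct_3", "acct_7"}
print(estimate_precision_recall(detected, blacklist))  # (0.5, 0.666...)
```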
This method is relatively straightforward to execute and automate. However, it takes time to accumulate a comprehensive blacklist, and it is hard to evaluate performance against new or unknown fraud patterns. Additionally, the scenarios from which blacklists are derived, such as login, often differ significantly from the scenarios that need to be evaluated, such as withdrawal, so the estimated precision and recall may be biased. P.S. When there is no traditional blacklist, certain user profile labels can be used as an approximate blacklist.
2. Manual Check
Manual check involves sampling the results and inspecting them manually. This is a universal method and easy to implement. However, its drawbacks are apparent: it is time- and labor-consuming and relies on experts’ experience, making it nearly impossible to automate.
3. User Feedback
User feedback, sometimes in the form of complaints, is an important signal of the real-world performance of a fraud detection system. If precision is low, legitimate users will be hit, leading to an increase in user feedback.
To provide a standardized measure, it is common to use the average number of feedbacks per million Daily Active Users (DAU), calculated by dividing the total number of feedbacks by DAU and then multiplying by 1,000,000. Why not use the number of feedbacks directly? Because the user scale varies. For a rapidly growing business, if precision remains unchanged while the number of users doubles, the amount of user feedback will also double. So the total feedback count and DAU should be used together.
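A minimal sketch of this normalization; the function name and numbers are made up:

```python
def feedback_per_million_dau(total_feedback: int, dau: int) -> float:
    """Normalize the daily feedback count by user scale: feedback per 1,000,000 DAU."""
    return total_feedback / dau * 1_000_000


# Example: 30 valid complaints on a day with 12 million DAU -> 2.5 per million DAU
print(feedback_per_million_dau(30, 12_000_000))
```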
Another thing to pay attention to is distinguishing whether feedback is valid, since fraudsters sometimes complain as well. A well-trained customer communication process can help differentiate between legitimate and fraudulent feedback.
4. Bad Case Validation
A bad case is a common concept. In the fraud detection field, businesses subject to regulation and compliance requirements always pay close attention to bad cases, especially those reported by government agencies. Checking whether these confirmed cases are covered by the detection results gives a direct, if narrow, measure of recall on the cases that matter most.
5. Threshold Strategy Estimation
When there are no data labels, or it is difficult to distinguish between black and white samples, threshold strategies can be used to obtain black samples. The precision and recall of the results under evaluation can then be estimated against them.
In general, the thresholds for these strategies are set far beyond normal human behavior. For example, in a video playback scenario, users who play more than 8,000 videos per day or watch more than 24 hours of video per day can be considered abnormal. In a marketing campaign scenario, users who perform key actions without the corresponding frontend exposure, or without following the critical behavior path, can be regarded as abnormal. These abnormal users then serve as data labels.
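A minimal sketch of this approach, assuming per-user daily aggregates are available in a pandas DataFrame (the column names, numbers, and thresholds are illustrative):

```python
import pandas as pd

# Hypothetical per-user daily aggregates; "detected" is the output of the
# strategy or model under evaluation.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "daily_play_count": [120, 9500, 40, 8200],
    "daily_play_seconds": [3600, 90000, 1800, 50000],
    "detected": [False, True, False, False],
})

# Threshold strategy: label users far beyond normal human behavior as pseudo-black samples.
pseudo_black = (df["daily_play_count"] > 8000) | (df["daily_play_seconds"] > 24 * 3600)

tp = (df["detected"] & pseudo_black).sum()
precision = tp / df["detected"].sum()
recall = tp / pseudo_black.sum()
print(precision, recall)
```

In practice the same estimates would be computed for two versions of a strategy and compared, in line with the caveat below.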
It should be noted that this method is suitable for comparing the effectiveness of two different versions of a strategy, but not for evaluating a single version in isolation. Additionally, any overlap between the threshold strategy and the strategy being evaluated must be avoided.
6. Honeypot Testing
Honeypot is a term from the field of cybersecurity. After the strategies or models are deployed, we can hire fraudsters or hackers to attack the business, then analyze the data to see whether the strategies or models successfully intercepted the attack traffic. The recall rate can be estimated as the ratio of intercepted traffic to total attack traffic. In addition, this method can be used to observe fraud patterns and provides insight into the weaknesses of the current systems.
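A sketch of how the honeypot traffic could be scored, assuming the hired attackers’ request IDs and the intercepted request IDs are both logged (the data layout is an assumption); breaking the interception rate down by attack technique also helps surface the weak spots mentioned above:

```python
from collections import defaultdict

def interception_rate_by_technique(attacks, intercepted_ids):
    """attacks: iterable of (request_id, technique) pairs for the hired attack traffic.
    intercepted_ids: set of request IDs that the deployed strategies/models blocked.
    Returns overall recall plus a per-technique breakdown to highlight weak spots."""
    total, hit = 0, 0
    by_tech = defaultdict(lambda: [0, 0])  # technique -> [attacks, intercepted]
    for req_id, tech in attacks:
        total += 1
        by_tech[tech][0] += 1
        if req_id in intercepted_ids:
            hit += 1
            by_tech[tech][1] += 1
    overall = hit / total if total else 0.0
    breakdown = {t: c[1] / c[0] for t, c in by_tech.items()}
    return overall, breakdown


# Illustrative usage with made-up traffic
attacks = [("r1", "device_farm"), ("r2", "device_farm"), ("r3", "proxy_ip")]
print(interception_rate_by_technique(attacks, intercepted_ids={"r1", "r2"}))
```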
7. Fraudsters Market Price Tracking
Tracking fraudsters’ market prices is another method that observes performance from an external perspective. If fraudsters are hit after a strategy or model is deployed, the prices of their services can be expected to rise, or they may no longer be able to offer those services at all. By collecting the prices of various fraud services, we can perceive changes in risk level and even compare the risk level of competing businesses.
8. Post-Analysis Metrics
There are three kinds of post-analysis metrics. The first is behavioral metrics that need a period of time to observe, such as retention rate and activity rate. The second is statistical metrics that are unrelated to the strategy being evaluated. The last is the concentration of certain feature attributes.
Retention rate and activity rate are commonly used in the context of channel fraud prevention, user acquisition, and engagement. Generally, after completing fraudulent activity to generate new users or inflate user activity, fraudsters tend not to invest resources in maintaining long-term retention and activity. This shows up in the metrics: cheated channels or suppliers may show poor retention and activity overall, or high next-day retention and activity but significantly lower 30-day retention and activity compared to the overall average.
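A minimal sketch of this comparison, assuming a new-user table with an acquisition channel and day-1 / day-30 activity flags (the column names and values are made up):

```python
import pandas as pd

# Hypothetical new-user table: acquisition channel plus day-1 / day-30 activity flags.
users = pd.DataFrame({
    "channel":    ["A", "A", "A", "B", "B", "B"],
    "d1_active":  [1, 1, 1, 1, 0, 1],
    "d30_active": [0, 0, 1, 1, 1, 0],
})

per_channel = users.groupby("channel")[["d1_active", "d30_active"]].mean()
overall = users[["d1_active", "d30_active"]].mean()

# A channel with next-day retention near (or above) the overall average but a much
# lower 30-day retention is a candidate for channel fraud.
print(per_channel)
print(overall)
```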
Unrelated statistical metrics provide another perspective for assessing how the fraud detection results differ from the overall population, which is particularly relevant when examining metrics that are suspicious in the context of the fraudulent activity but were not used by the strategy itself. For example, in a video playback scenario, suppose some accounts were detected by play-count threshold strategies. The play-to-interaction ratio, i.e., how many plays occur per interaction, can then be used to confirm the suspicion.
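A minimal sketch, assuming per-account play and interaction counts are available (the column names and numbers are made up):

```python
import pandas as pd

# Hypothetical per-account statistics; "detected" marks accounts flagged by the
# play-count threshold strategy being double-checked.
accounts = pd.DataFrame({
    "plays":        [500, 9000, 450, 8800, 300],
    "interactions": [50,  3,    40,  2,    35],
    "detected":     [False, True, False, True, False],
})

# Plays needed per interaction; unusually high values for detected accounts
# (lots of plays but almost no likes/comments) reinforce the suspicion.
accounts["plays_per_interaction"] = accounts["plays"] / accounts["interactions"]
print(accounts.groupby("detected")["plays_per_interaction"].median())
```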
The concentration of feature attributes can be measured using the information entropy of feature values. The lower the entropy, the more concentrated the features are, indicating a higher suspicion of fraud.
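A minimal sketch of the entropy calculation (the attribute values are made up):

```python
from collections import Counter
import math

def attribute_entropy(values) -> float:
    """Shannon entropy (in bits) of a feature attribute, e.g. device model or IP prefix.
    Lower entropy means the attribute is more concentrated, which is more suspicious."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A detected group concentrated on one device model vs. a diverse normal group.
print(attribute_entropy(["modelX"] * 9 + ["modelY"]))         # ~0.47 bits
print(attribute_entropy(["m1", "m2", "m3", "m4", "m5"] * 2))  # ~2.32 bits
```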
Comparison of these methods
Summary
Many methods have been proposed in this article, which in itself shows that there is no universal method for evaluating performance in business security. What we can do is cope with shifting circumstances by sticking to a fundamental principle. The fundamental principle here is that fraudsters’ purpose never changes: they want money. While their fraud methods will continually evolve, they cannot perfectly mimic normal user behavior, so there is always some perspective that distinguishes them from normal users. What is that perspective? Sometimes fraudsters change their behavior significantly to avoid detection while normal users do not; other times, some features of fraudsters remain constant while the same features of normal users change. The key perspective, therefore, lies between change and permanence.