All application security (AppSec) companies claim to give you security testing results with very low false positive rates. Only some of them even mention false negative rates. But does any security testing provider offer you service-level agreements (SLAs) for accuracy?
Many of us have made, or continue to make, the mistake of believing that low false positive or false negative rates, as independent measures, are sufficient grounds for determining that a security testing solution is "accurate" and worth continuing to use to assess our software products.
To understand why this is a pitfall, let's review what false positives and false negatives are and see, through an illustrative example, how their rates can be misleading (for simplicity, the example uses tiny numbers).
On false positive and false negative rates
Let's suppose that your company develops and provides its customers with a web application. At a certain point, someone reminds you that, at least for compliance with standards, you must submit it to security tests. Due to time and cost constraints, you decide to use an automated tool that we will call "A" (such an inventive name, I know!). Once tool A completes the tests, it reports 12 security vulnerabilities in your application.
Soon after, members of your development team, who are in charge of remediating these vulnerabilities, start telling you that the "SQL injection" the tool reported seems wrong, that the supposed "data encryption issue" doesn't really seem to be a problem, and so on.
Later on, a couple of people in your company who are knowledgeable about cybersecurity confirm these suspicions: tool A gave you several erroneous reports. They even ask you: what if the tool isn't reporting everything? That's when you decide that your application should undergo a more rigorous examination. After seeking advice inside and outside the company, you end up with a recommendation from one of your partners: submit the application to comprehensive testing that combines automated tools and pentesting experts (aka pentesters).
Once these new tests are completed on your application, which had not been modified since the first analysis, it turns out that your software product actually has 18 vulnerabilities. But watch out: tool A had reported only 5 of them. How can that be? Look at the following figure:
These 5 correct results of tool A are what we know as true positives (TP). The other 7 supposed vulnerabilities this tool reported are lies or false alarms, technically known as false positives (FP). Likewise, the 13 vulnerabilities the tool failed to detect and report are false negatives (FN). True negatives (TN), on the other hand, are those portions of code or elements of the application that could be vulnerable but are not (14 in this case). (In everyday applications, i.e., those not specifically designed to benchmark security testing solutions, TNs may be "impossible" to determine, mainly because of the "countless" ways in which a given code fragment might be vulnerable.)
From these values, we can calculate the well-known false positive rate (FPR) and false negative rate (FNR) with the formulas below:
-
FPR = FP/(FP+TN)
-
FNR = FN/(FN+TP)
Thus, assuming for the sake of argument that the knowledge gained about the web application through the comprehensive security tests was "absolute," tool A had, in this case, an FPR of 7/(7+14) = 0.33 and an FNR of 13/(13+5) = 0.72.
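As a quick check, here is a minimal sketch in Python that computes both rates from tool A's confusion-matrix values in this example:

```python
# Confusion-matrix values for tool A, taken from the example above
tp_a = 5    # vulnerabilities correctly reported (true positives)
fp_a = 7    # false alarms: 12 reported - 5 correct (false positives)
fn_a = 13   # missed vulnerabilities: 18 actual - 5 reported (false negatives)
tn_a = 14   # non-vulnerable elements correctly left unreported (true negatives)

def fpr(fp: int, tn: int) -> float:
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def fnr(fn: int, tp: int) -> float:
    """False negative rate: FN / (FN + TP)."""
    return fn / (fn + tp)

print(f"Tool A FPR: {fpr(fp_a, tn_a):.2f}")  # 0.33
print(f"Tool A FNR: {fnr(fn_a, tp_a):.2f}")  # 0.72
```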
Why should relying on these values alone be seen as a mistake?
Let's say that, without knowing the overall results of the previous comprehensive security tests, you had your web application evaluated by tool B. In its report, this tool informs you of only 3 vulnerabilities, two of which turn out to be false positives. Assuming that everything else that could be vulnerable is not, this tool would have a better FPR than tool A: 2/(2+36) ≈ 0.05. Moreover, this rate would still be lower even knowing all the true negatives: 2/(2+19) ≈ 0.10.
Now, suppose that, instead of those mentioned above, you used tool C — again without knowing the previous results — which reports 39 vulnerabilities! It is as if it took every element (every circle in the figure) under evaluation as vulnerable. Therefore, its FNR is zero since it did not miss anything. Moreover, this rate would remain the same even knowing all true positives: 0/(0+18) = 0.
Both tools, B and C, leave you in a bind. Tool B, even with its super low FPR, poses a considerable problem for your company's security, generating a false sense of safety: it reported only one of your application's 18 vulnerabilities, that is, it had 17 false negatives. Tool C, with its staggering zero FNR, poses a significant problem for the use of resources such as your development and security teams' time and effort, since they have to deal with 21 false positives, which they will only discover when trying to remediate them.
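A similar sketch, using the figures above (18 real vulnerabilities and 21 non-vulnerable elements, so tool B ends up with 17 FNs and 19 TNs, and tool C with 21 FPs and no TNs), makes the trade-off explicit:

```python
# Confusion-matrix values for tools B and C, derived from the example
tools = {
    "B": {"tp": 1, "fp": 2, "fn": 17, "tn": 19},
    "C": {"tp": 18, "fp": 21, "fn": 0, "tn": 0},
}

for name, m in tools.items():
    fpr = m["fp"] / (m["fp"] + m["tn"])  # false positive rate
    fnr = m["fn"] / (m["fn"] + m["tp"])  # false negative rate
    print(f"Tool {name}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")

# Tool B: FPR = 0.10, FNR = 0.94 -> great FPR, yet it misses almost everything
# Tool C: FPR = 1.00, FNR = 0.00 -> misses nothing, but flags every clean element
```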
In short, relying on FPR and FNR as separate values is not a viable or recommended way to define the accuracy of a security testing solution and, consequently, the certainty it can provide. For this kind of judgment, the FP and FN values must always be considered together and in relation to a universe of vulnerabilities. Here's where the F-scores come in.
What are the F-scores?
In information search, detection, and reporting systems, precision and recall are performance metrics. Precision is the fraction of relevant instances among everything reported by the system. Recall is the fraction of relevant instances reported by the system out of all the relevant cases in the target of evaluation. Both metrics depend on values such as TP, FP, and FN (note that TNs are not taken into account, perhaps because of the difficulty of determining them, as discussed above). These are their formulas:
-
Precision = TP/(TP+FP)
-
Recall = TP/(TP+FN)
As we can see, precision highlights the impact of false positives, while recall emphasizes that of false negatives. These two metrics can be combined into a single equation that allows a broader definition of system performance, one we can understand as accuracy. We are talking about the F-score equation:
Fᵦ = (1 + β²) · {[precision · recall] / [(β² · precision) + recall]}
In this equation, the usual values of beta (β) are 1, 0.5, and 2. F1 is the harmonic mean of precision and recall, while F0.5 gives more weight to precision than to recall (i.e., its value is closer to that of the former), and F2 gives more weight to recall than to precision. Setting β to these last two values yields the following formulas:
-
F0.5 = 1.25 · {[precision · recall] / [(0.25 · precision) + recall]}
-
F2 = 5 · {[precision · recall] / [(4 · precision) + recall]}
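As a quick illustration, here is a minimal sketch of the general Fβ formula in code; the precision and recall values passed to it are arbitrary, made-up examples:

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# With beta = 1, precision and recall weigh equally (harmonic mean);
# beta = 0.5 favors precision, and beta = 2 favors recall.
print(f_score(0.8, 0.4, beta=1))    # ≈ 0.53
print(f_score(0.8, 0.4, beta=0.5))  # ≈ 0.67, closer to precision (0.8)
print(f_score(0.8, 0.4, beta=2))    # ≈ 0.44, closer to recall (0.4)
```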
Returning to the example tools
As we showed above, tool B stood out for its low FPR, and tool C for its zero FNR. Tool B, with so few FPs and TPs, had obtained an FPR of around 5%. Looking only at its positive reports, its precision is actually not too bad, considering it is an automated tool:
Precision: 1/(1+2) = 0.33 = 33%
Nonetheless, as we saw, the biggest problem for you and your company with tool B lay in its false negatives, i.e., all those unreported vulnerabilities that could be detected and exploited by cybercriminals. Hence, we find that its recall is quite poor:
Recall: 1/18 = 0.06 = 6%
According to the above, the F0.5 for this tool would be about 0.17, or 16.7%, a value closer to its precision, since, as we know, the F0.5 gives more relevance to this metric. Likewise, the F2 would be about 0.07, or 6.7%, a value closer to its recall, since, as we know, the F2 gives more relevance to that metric.
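Here is a small sketch that reproduces these figures from tool B's counts in the example (1 TP, 2 FP, 17 FN):

```python
# Tool B's counts from the example
tp, fp, fn = 1, 2, 17

precision = tp / (tp + fp)  # 1/3  ≈ 0.33
recall = tp / (tp + fn)     # 1/18 ≈ 0.06

def f_score(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f"Precision: {precision:.2f}")                        # 0.33
print(f"Recall:    {recall:.2f}")                           # 0.06
print(f"F0.5:      {f_score(precision, recall, 0.5):.3f}")  # 0.167
print(f"F2:        {f_score(precision, recall, 2):.3f}")    # 0.067
```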
All these figures, especially the recall and F2, are strong clues that tool B, no matter how "good" its FPR, would not be a good choice to assess the security of your application. (Something similar emerges if you examine tool C's data.)
Fluid Attacks offers you an accuracy SLA
At Fluid Attacks, we have long recognized the problem with focusing only on FPR or FNR. While, like almost all AppSec companies, we have made the case that we maintain low FPRs and, like few others, low FNRs, we knew that, in the interest of optimizing the accuracy of our tests for our own and our clients' benefit, we couldn't just stick with that data. So we turned to the precision and recall metrics and, from there, to the F1 score.
Beyond this, as something even more advantageous for all parties involved and, from what we have seen, something no one else does, we decided to start offering our customers a minimum F1 score that our comprehensive security testing had to meet when evaluating their software products, as a performance guarantee or service-level agreement (SLA). Thus, this matter took on a legal character. What's more, we did not offer an insignificant number. We aimed big from the beginning: that minimum F1 score was 0.9, or 90%!
But if no one else is doing it, why stick our necks out? Well, more than anything, it is a challenge and, at the same time, a commitment: to guarantee the optimal performance of our automated tools and pentesters and, consequently, the most accurate and exhaustive reports possible for our customers on the security status of their applications.
Recently, we decided to take an even bolder step. We put ourselves in the shoes of our clients' executives and of their security and development teams and chose to give more weight to FNs and FPs, respectively. From there, we decided to move from F1 to F2 and F0.5 and, in both cases, offer the same minimum as before: 90%. Therefore, our renewed accuracy SLA, which started to apply in January 2025, says, "F2 and F0.5 scores of at least 90% are achieved in reports of a client's software's risk exposure and vulnerabilities, respectively."
A couple of things should be underlined here. First, we talk about "F2 and F0.5," a conjunction that implies that both scores, not just one of them, must reach at least 90%. Second, for F2 we focus on risk exposure, and for F0.5, on vulnerabilities. Why?
At Fluid Attacks, we talk about "risk exposure" in terms of CVSSF units. This metric transforms CVSS severity values through the formula CVSSF = 4^(CVSS-4), mainly to establish more evident differences between vulnerabilities according to their severity. Thus, for example, while the difference between a vulnerability with a CVSS score of 8.0 and one with a score of 10.0 is two units, in CVSSF it is 3,840 units. Likewise, a vulnerability with a CVSS score of 10.0 could wrongly be assumed to be equivalent to 10 vulnerabilities of severity 1.0. In terms of CVSSF, however, the difference becomes striking: the former has a value of 4,096, while the latter ten accumulate a value of barely 0.16. (For more details, we invite you to read our post "What Your Risk Management's Missing.")
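For reference, here is a minimal sketch of that CVSSF transformation applied to the figures just mentioned:

```python
# CVSSF transforms a CVSS severity score: CVSSF = 4 ** (CVSS - 4)
def cvssf(cvss: float) -> float:
    return 4 ** (cvss - 4)

print(cvssf(10.0))      # 4096.0 risk exposure units
print(cvssf(8.0))       # 256.0 (a gap of 3,840 units from a 10.0)
print(10 * cvssf(1.0))  # ≈ 0.16: ten low-severity issues combined
```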
Accordingly, it is more valuable for a client to know metrics on false negatives based on risk exposure than on the number of vulnerabilities. For example, the fact that 4,096 risk exposure units went unreported, i.e., appeared as an FN (the product of a single critical vulnerability), will attract more attention than knowing that there was only one (1) FN in the reports. By the same reasoning, this FN expressed in CVSSF units implies a greater reduction in the F2 we offer the client than it would if expressed as a number of vulnerabilities. For us, such a larger reduction is a more meaningful wake-up call about the accuracy of our tests.
As far as false positives are concerned, clients' developers care more about being informed in terms of the number of vulnerabilities. For example, they will be more interested in the fact that 20 supposedly medium-severity vulnerabilities were actually FPs (mainly because of the time and effort invested in discovering this) than in the fact that a single vulnerability with the highest risk exposure was an FP. Hence, we decided to continue determining the F0.5 from the number of reported vulnerabilities.
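To make the contrast concrete, here is a hypothetical sketch, not our actual SLA calculation (the exact criteria are specified in our Knowledge Base). It weights recall by CVSSF units instead of by vulnerability counts and shows how a single critical false negative drags a risk-exposure-based F2 down far more than a count-based one; every finding and score in it is made up for illustration.

```python
# Hypothetical illustration only: comparing a count-based F2 with a
# risk-exposure-based (CVSSF-weighted) F2 when one critical finding is missed.
def cvssf(cvss: float) -> float:
    return 4 ** (cvss - 4)

def f_score(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

reported_cvss = [9.0, 7.5, 5.0, 5.0]  # true positives (made-up findings)
missed_cvss = [10.0]                  # a single critical false negative
precision = 1.0                       # assume no FPs, to isolate the FN effect

# Count-based recall: 4 of the 5 real vulnerabilities were reported.
recall_count = len(reported_cvss) / (len(reported_cvss) + len(missed_cvss))

# Risk-exposure-based recall: weight each finding by its CVSSF units.
tp_units = sum(cvssf(s) for s in reported_cvss)
fn_units = sum(cvssf(s) for s in missed_cvss)
recall_units = tp_units / (tp_units + fn_units)

print(f"F2 by count:         {f_score(precision, recall_count, 2):.2f}")  # ≈ 0.83
print(f"F2 by risk exposure: {f_score(precision, recall_units, 2):.2f}")  # ≈ 0.26
```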
At Fluid Attacks, we calculate the F2 and F0.5 scores every quarter as cumulative values for each client, specifically for their groups in the Advanced plan. We take into account their entire vulnerability history, even including software groups that have since been removed from evaluation. To learn about the other criteria of this accuracy SLA in our comprehensive security tests, we invite you to visit our Knowledge Base.
We sincerely hope that our commitment to your security and our performance will provide you with as much gratification as it does us and will inspire you to continue to improve and mature in your cybersecurity posture.