When "critical" is not critical: Rethinking vulnerability severity

[Cover image: flying dice, via Unsplash (https://unsplash.com/photos/a-group-of-dices-are-flying-in-the-air-Tv1L6bskVUA)]
Jason Chavarría

Content writer and editor

Updated

Feb 20, 2026

9 min

If you have ever compared how two different security tools score the same vulnerability, you have probably noticed something odd. One tool flags a hardcoded secret as critical; another calls it medium; a third labels it low. Or you look up a CVE on the NVD and find it carries two different CVSS scores. The vulnerability is the same; the only things that change are who is doing the scoring and what assumptions you are expected to accept.

Those comparisons lead to a question: is it accurate to call something critical when no one has checked whether it is even likely to be exploited?

We are not writing this to pick a fight with any vendor, database, or institution, but because many security teams have noticed the same discrepancy in their own environments and deserve a frank, well-sourced conversation about what severity scores actually mean, why they so often disagree, and what to do about it.

The scale of the noise

The sheer volume of vulnerabilities disclosed every year makes this conversation urgent. In 2024, over 40,000 new CVEs were published, a 38% increase from the roughly 29,000 recorded in 2023. By the end of 2025, this number had grown another 21%, landing at 48,185. That is more than 130 new vulnerability identifiers per day. A large portion of those receive high or critical severity labels under the standard CVSS base score framework (last year, that portion was almost 40%). If an organization tried to treat every critical CVSS finding as a genuine emergency, it would exhaust its people and its budget chasing alleged risks.

However, of the tens of thousands of vulnerabilities published each year, only a small fraction ever gets exploited in the wild. An analysis of 2024 data found that approximately 0.9% of reported CVEs were actively weaponized by threat actors. A separate analysis put the proportion of CVEs used in ransomware or by advanced persistent threat (APT) groups at roughly 0.2%.

These figures do not mean that the remaining vulnerabilities are harmless. They do, however, strongly suggest that the label "critical" is applied far more often than real-world exploitation patterns would justify.

Where the scores disagree

If severity were an entirely objective measurement, you would expect different authorities scoring the same vulnerability to arrive at similar numbers. But they frequently do not.

An examination of approximately 120,000 CVEs with CVSS v3.0 scores in the NVD found that about 20% of them (nearly 25,000 entries) carried two scores: one assigned by NIST and another by the vendor or a third-party CNA. Of that subset, 56% had conflicting scores. So, in more than half the cases where both the NVD and the vendor weighed in, they disagreed on how severe the problem was.

Take CVE-2023-21557, a vulnerability in the Windows LDAP protocol, examined in the analysis cited above. Microsoft, the company that wrote the code and understands the exploitation requirements intimately, scored it 7.5 (high), whereas the NVD assigned it 9.1 (critical). That 1.6-point gap is not cosmetic. In many organizations, a score above 9.0 triggers emergency patching protocols, potentially pulling engineers off other work and disrupting release cycles, while a score of 7.5 typically falls within the regular monthly patch cadence. The same vulnerability provokes two very different responses depending on which number you trust.

This kind of gap is not a fluke. Research has found that CVSS evaluations are not even consistent for the same person over time: in a follow-up survey conducted nine months after the initial evaluation, 68% of participants assigned a different severity rating to a vulnerability they had scored before. The very system designed to standardize severity allows subjectivity to seep in.

Moreover, a separate study identified over 12,800 entries in the NVD where vulnerabilities with identical or semantically equivalent descriptions received meaningfully different CVSS scores. And the inconsistency was not limited to edge cases, but spanned common vulnerability types and well-known products.

Understanding the NVD's approach

It is important to be fair about why the NVD scores the way it does. The NVD explains on its Vulnerability Metrics page that CVSS, as used by the NVD, is a measure of severity, not risk. That same page acknowledges that when a vendor or maintainer declines to provide certain details, NVD enrichment assigns CVSS metric values "using a worst case scenario approach." Information that is "overly vague or unavailable" is exactly what triggers this enrichment, and it is why the scores end up differing. For example, if the source does not specify whether a vulnerability requires authentication, the NVD assumes it does not, which raises the score.
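The effect of a single worst-case assumption can be quantified with the published CVSS v3.1 base score formula. The sketch below uses the official v3.1 metric constants, fixing everything except privileges required to a common worst-case profile (network attack vector, low complexity, no user interaction, high impact across the board; these fixed values are our illustrative choice, not tied to any particular CVE):

```python
import math

def roundup(value: float) -> float:
    """CVSS v3.1 Roundup: smallest number, to one decimal place, >= value."""
    int_input = round(value * 100000)
    if int_input % 10000 == 0:
        return int_input / 100000
    return (math.floor(int_input / 10000) + 1) / 10

def base_score(privileges_required: float) -> float:
    """CVSS v3.1 base score with scope unchanged.

    Fixed for illustration: AV:N (0.85), AC:L (0.77), UI:N (0.85),
    and C/I/A all High (0.56 each). Only PR varies.
    """
    iss = 1 - (1 - 0.56) ** 3                # impact sub-score for C/I/A = High
    impact = 6.42 * iss                      # scope unchanged
    exploitability = 8.22 * 0.85 * 0.77 * privileges_required * 0.85
    return roundup(min(impact + exploitability, 10))

# If an advisory is silent about authentication, worst-case enrichment
# assumes none is required (PR:N = 0.85) instead of low privileges (PR:L = 0.62):
print(base_score(0.85))  # 9.8 -> critical
print(base_score(0.62))  # 8.8 -> high
```

A single unverified assumption moves the same flaw across the critical threshold, which is exactly the kind of shift that changes how downstream tools and teams react.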

This is a defensible policy in the abstract. The NVD serves a broad, diverse audience, and erring on the side of caution for a generic advisory has a certain logic. But the downstream effect is that a score originally intended as a conservative estimate for the most exposed possible configuration becomes the default severity that automated tools, dashboards, and compliance frameworks consume and act upon.

Daniel Stenberg, the creator and lead developer of cURL, probably one of the most widely deployed software components in the world, has written extensively about this dynamic. In a 2023 blog post, he described how the cURL project rated a vulnerability (CVE-2022-42915) as medium severity, given the very limited conditions under which it could be exploited. The NVD scored it 9.8 (critical), which is how it appears on GitHub. Stenberg noted in another post that the NVD did not ask the cURL team for clarification or details; it simply applied its own assessment. In another case, a cURL vulnerability involving an integer overflow in the --retry-delay command line option (a problem Stenberg described as a bug, but not a security problem) was initially scored 9.8 by the NVD before eventually being reduced to 3.3 after repeated disputes. In a later post titled "CVSS is dead to us," Stenberg discusses yet another such case and announces that cURL no longer uses CVSS at all. Its CVE advisories now show only the severity level assigned by the cURL security team, with no CVSS score in sight.

We cite these examples not to vilify the NVD, as it provides an essential public service, but to illustrate a pattern. When the entity that knows the most about a piece of software disagrees with the entity that scores it for the world, and this happens frequently, something in the system deserves scrutiny.

The incentives at play

Why does this pattern persist? It helps to consider the possible incentives, without assuming bad faith.

For aggregators like the NVD and CISA's Vulnrichment project, the mandate is broad public protection. As the NVD's own documentation states, when details are unclear, the policy is to assume the worst case. This is a reasonable default for a public-facing database, but it favors higher scores.

For security testing tools, the incentive structure may also lean toward higher severity. A tool that reports fewer critical findings than a competitor risks being perceived as less thorough, as if finding more were the same as being better. For known vulnerabilities, tools can simply inherit CVSS base scores from the NVD or from vendor advisories and surface them without adjusting for reachability, exploitability, or the customer's specific context. The conservative assumptions baked into those upstream scores are passed through unquestioned, and the tool's dashboard ends up reinforcing the same inflation it could, in principle, help correct.

For software vendors acting as CNAs, the picture is more mixed. On one hand, a vendor that consistently rates its own vulnerabilities as low may face accusations of downplaying security issues. On the other, a vendor that rates everything as critical may alarm its customers.

For independent security researchers, CVEs carry professional currency. A critical-severity finding attracts more attention and can translate into bounty payouts or career advancement. Researcher Florian Hantke tested this firsthand. He searched GitHub for SQL injection patterns, found a vulnerable online exam application whose repository had not been updated in three years and whose README stated it was essentially abandoned, exploited it in minutes, and submitted a CVE request. It was accepted. A Shodan search had revealed no publicly accessible instances of the software, yet the vulnerability was published on the NVD and, as Hantke put it, rated with a high score. His conclusion: "we need to recalibrate how we perceive and value CVEs."

A better question than "how bad could this be?"

The CVSS base score answers a specific question: in a worst-case, generic environment, how much damage could this vulnerability cause if exploited? That is a valid question. But it is not the question most security teams actually need answered on a daily basis. The more useful question is: how likely is it that this vulnerability will be exploited in our environment, and what would the impact be in our specific context?

Two frameworks have emerged to help answer that question more precisely.

EPSS (Exploit Prediction Scoring System), maintained by FIRST, uses machine learning trained on real-world exploitation data to estimate the probability that a given CVE will be exploited within the next 30 days. EPSS produces a score between 0 and 1 (0% to 100%). Most vulnerabilities (even those with high CVSS scores) have very low EPSS scores. According to FIRST's own performance data, a strategy of remediating all CVEs with a CVSS score of 7 or above yields an efficiency of roughly 4%; that is, only about 4 out of every 100 vulnerabilities remediated under that policy were actually exploited in the following 30 days. The original EPSS research paper also cites prior research finding that as few as 1.4% of published vulnerabilities have been exploited in the wild. Fluid Attacks has integrated EPSS into its platform precisely to help teams focus on that small fraction.
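A minimal sketch of what EPSS-aware triage looks like in practice. The CVE identifiers, CVSS values, and EPSS probabilities below are invented for illustration, and the 10% threshold is an arbitrary policy choice, not a recommendation:

```python
# Hypothetical findings: (CVE ID, CVSS base score, EPSS probability).
# None of these identifiers or numbers correspond to real vulnerabilities.
findings = [
    ("CVE-A", 9.8, 0.97),   # critical CVSS, very likely to be exploited soon
    ("CVE-B", 9.1, 0.008),  # critical CVSS, but exploitation is improbable
    ("CVE-C", 7.5, 0.42),   # high CVSS with meaningful exploitation odds
    ("CVE-D", 8.6, 0.001),  # high CVSS, near-zero predicted exploitation
]

def urgent(findings, epss_threshold=0.1):
    """Keep findings whose predicted 30-day exploitation probability
    meets the threshold, sorted by likelihood, most probable first."""
    hits = [f for f in findings if f[2] >= epss_threshold]
    return sorted(hits, key=lambda f: f[2], reverse=True)

for cve, cvss, epss in urgent(findings):
    print(f"{cve}: CVSS {cvss}, EPSS {epss:.1%}")
```

Note how two of the four "high or critical" findings drop out of the urgent queue entirely, mirroring the efficiency gap between CVSS-only and likelihood-aware remediation described above.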

SSVC (Stakeholder-Specific Vulnerability Categorization), developed by Carnegie Mellon University's Software Engineering Institute in collaboration with CISA, takes a different approach. Instead of producing a numeric score, SSVC uses decision trees that factor in the current exploitation status, the mission criticality of the affected asset, and the prevalence of the vulnerable component. The output is an action-oriented recommendation (track, attend, or act) rather than a number that may or may not map to a real-world response.
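The decision-tree idea can be sketched in a few lines. This is a drastically simplified, illustrative tree; real SSVC trees have more decision points and stakeholder-specific values, and the branches below are our assumption for demonstration only:

```python
def ssvc_decision(exploitation: str, mission_impact: str, prevalence: str) -> str:
    """Toy SSVC-style decision tree mapping context to an action.

    exploitation:   "none" | "poc" | "active"   (current exploitation status)
    mission_impact: "low" | "high"              (criticality of the asset)
    prevalence:     "rare" | "widespread"       (how common the component is)
    Returns one of SSVC's action-oriented outputs: track, attend, or act.
    """
    if exploitation == "active":
        # Confirmed in-the-wild exploitation forces a response either way.
        return "act" if mission_impact == "high" else "attend"
    if exploitation == "poc" and mission_impact == "high" and prevalence == "widespread":
        return "attend"
    # No known exploitation and limited exposure: just keep watching.
    return "track"

print(ssvc_decision("active", "high", "widespread"))  # act
print(ssvc_decision("none", "high", "widespread"))    # track
```

The point is the shape of the output: an instruction a team can act on, not a number whose meaning depends on who computed it.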

Neither EPSS nor SSVC replaces CVSS. They complement it by adding the likelihood and context dimensions that turn a severity rating into a measure of risk.

What this means for organizations that build software

If your team develops applications, you are probably consuming CVSS scores from multiple sources: your SAST tools, your SCA scanners, your container scanning pipeline, the NVD itself. The discrepancies described in this post are not somebody else's problem; they land directly in your dashboards and your sprint backlogs.

A few principles can help cut through the noise:

  • Understand how tools grade vulnerabilities: Not all tools score findings the same way. Some inherit CVSS base scores directly from the NVD; others apply their own proprietary severity logic; others allow you to adjust based on environmental factors. Knowing which approach your tools use, and where their scores come from, helps you interpret a "critical" label correctly instead of reacting to it reflexively.

  • Verify before you escalate: A critical label on a hardcoded secret means very little if nobody has confirmed whether the credential is active. A critical label on a library vulnerability means very little if the vulnerable function is never called in your codebase. The teams that handle severity well invest in verification, that is, checking whether findings are reachable, functional, and exploitable in context, before treating them as emergencies.

  • Use multiple signals: CVSS tells you about potential impact; EPSS tells you about likelihood; CISA's KEV catalog tells you what is already being exploited. Relying on any one of these in isolation creates blind spots.

  • Document your reasoning: If you decide to deprioritize a finding that carries a critical CVSS score because EPSS is near zero and the component is not reachable, document that decision. Regulations like NIS2 in Europe and the SEC's cybersecurity disclosure rules emphasize demonstrated risk management processes, not the absence of any vulnerability labeled "critical."

  • Beware alert fatigue: Teams that are constantly fighting fires on findings that turn out to have no real-world impact eventually stop paying attention, and that is when a genuinely dangerous vulnerability slips through. We have written about this problem before.
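The principles above can be combined into a single triage step. The thresholds and the ordering of checks in this sketch are illustrative policy choices we made up for demonstration, not an industry standard:

```python
def triage(cvss: float, epss: float, in_kev: bool, reachable: bool) -> str:
    """Combine severity (CVSS), likelihood (EPSS), confirmed exploitation
    (CISA KEV membership), and code reachability into one action.
    All thresholds here are hypothetical policy choices.
    """
    if not reachable:
        # Vulnerable code is never invoked: record the reasoning for auditors.
        return "document and deprioritize"
    if in_kev or epss >= 0.5:
        # Already exploited in the wild, or exploitation is more likely than not.
        return "fix now"
    if cvss >= 7.0 and epss >= 0.05:
        # Serious potential impact with non-negligible likelihood.
        return "fix this sprint"
    return "track"

# A "critical" CVSS score with no reachability gets documented, not escalated;
# a "high" score with strong exploitation odds jumps the queue.
print(triage(cvss=9.1, epss=0.002, in_kev=False, reachable=False))
print(triage(cvss=7.5, epss=0.6, in_kev=False, reachable=True))
```

A function like this is only as good as its inputs, which is why the verification and documentation habits above matter more than the exact thresholds.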

We want to be clear about the spirit of this piece. The NVD, CISA, FIRST, the CNA ecosystem, security testing solutions, and the security research community all play essential roles in making software safer. The intent here is not to undermine them but to acknowledge that severity scoring, as practiced today, has limitations that security teams should understand and account for.

Severity inflation is not the product of incompetence or malice. It is the natural outcome of a system where different actors, each acting rationally within their own incentive structures, produce or deliver numbers that do not always serve the people who must act on them. Recognizing this is the first step toward better decisions.

The second step is to complement CVSS with context: real exploitation data, asset criticality, reachability analysis, and professional judgment. Tool developers, for their part, can contribute by investing in accuracy. Rather than relaying an upstream CVSS base score as-is, a tool can try to verify whether the vulnerability is reachable in the customer's code, whether the conditions for exploitation actually exist, and whether the affected component is exposed. This is harder and more expensive than passing through a number, but it is also what separates a useful finding from noise. At Fluid Attacks, this is the philosophy behind our approach to scoring. We would rather tell you that a leaked secret is low severity because we have not confirmed the risk it entails than label everything critical and leave you to sort through the noise.

Get started with Fluid Attacks' RBVM solution right now

Tags:

company

security-testing

software

cybersecurity

Subscribe to our newsletter

Stay updated on our upcoming events and latest blog posts, advisories and other engaging resources.

Start your 21-day free trial

Discover the benefits of our Continuous Hacking solution, which organizations of all sizes are already enjoying.

Fluid Attacks' solutions enable organizations to identify, prioritize, and remediate vulnerabilities in their software throughout the SDLC. Supported by AI, automated tools, and pentesters, Fluid Attacks accelerates companies' risk exposure mitigation and strengthens their cybersecurity posture.

Get an AI summary of Fluid Attacks

© 2026 Fluid Attacks. We hack your software.

Meet us at RSA Conference™ 2026 at booth N-4614! Book a demo on-site.
