Affected Functions: A Key to Understanding Open-Source Vulnerabilities
Explore the public availability of affected functions for OSS vulnerabilities and why vendors are spending millions to build private datasets.
Note: This blog discusses the public availability of affected functions for open-source software vulnerabilities. If you're interested in how vendors build automation to expand publicly available data into private datasets, let me know!
The software composition analysis (SCA) market has been flooded with vendors offering "reachability analysis," a feature intended to reduce noise by determining if and when a function relevant to a given OSS package vulnerability is actually used (see the example below). But how do these vendors determine which functions introduce risk?
CVE-2024-0815
The advisory details a command injection vulnerability in the PaddlePaddle PyPI package versions up to and including 2.6.0. Specifically, the vulnerability affects one function: paddle.utils.download._wget_download().
Legacy SCA tools will only check whether your project uses one of the affected PaddlePaddle versions. In contrast, reachability analysis enables SCA tools to verify whether you're actually using the paddle.utils.download._wget_download() function.
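To make the difference concrete, here's a small, hypothetical project snippet (the structure and paddle calls are illustrative, not taken from any real codebase). A version-only SCA match flags it because an affected PaddlePaddle release is installed; a reachability-aware scan should not, because nothing in it ever calls the vulnerable helper.

```python
# Hypothetical project pinned to paddlepaddle==2.6.0, an affected version per CVE-2024-0815.
# A version-only SCA check flags this project. Reachability analysis should not,
# because paddle.utils.download._wget_download() is never reached from this code.
import paddle


def build_model():
    # Only core tensor APIs are used here; no download helpers.
    x = paddle.to_tensor([1.0, 2.0, 3.0])
    return paddle.sum(x)


if __name__ == "__main__":
    print(build_model())
```

The hard part for the tooling, of course, is knowing which function to look for in the first place, which is exactly what the rest of this post is about.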
Publicly Available Sources
First, I'd like to say that the industry is doing itself a disservice by keeping this data under wraps in proprietary databases, spending six or even seven figures annually to maintain and grow these datasets. Each vendor uses its own non-standard schema, and the data quality is often highly questionable.
Note: I've long said this would be a great business idea that can, and probably should, be bootstrapped: selling this vulnerability data to vendors to power reachability analysis. That said, I'd much prefer it be a community effort that benefits everyone.
Today, some people trust the data (consumers who see reachability as a required feature), while others don't (and prefer to drown in endless false positives). Where are the people interested in measuring data quality and improving consumer confidence?
The simple truth is that vendors are unlikely to take this seriously unless the market speaks up and is willing to pay for it.
GitHub maintains a vulnerability database that includes CVEs and GitHub-originated security advisories for open-source software. Advisories are stored as individual files and formatted in OSV Schema. Per the schema, affected functions should be noted under the affected[].ecosystem_specific field.
Ecosystem (only ecosystems with at least one affected function are listed) | Total Advisories | Advisories w/ at least one affected function |
---|---|---|
Total | 20,058 | 393 |
pypi | 3,407 | 164 |
npm | 3,653 | 108 |
crates.io | 864 | 114 |
packagist | 4,274 | 3 |
go | 2,054 | 2 |
maven | 5,177 | 1 |
nuget | 629 | 1 |
Note: An advisory may include details for more than one affected function or ecosystem.
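For reference, here is roughly what such an entry looks like, using the CVE-2024-0815 example from above. Treat it as a sketch: the shape (affected[].ecosystem_specific.affected_functions, shown here as a Python dict) mirrors what my counting script at the end of this post checks for, and the advisory ID is a placeholder.

```python
# Abridged, illustrative OSV-style record; only the fields relevant to affected functions are shown.
advisory = {
    "id": "GHSA-xxxx-xxxx-xxxx",  # placeholder ID
    "aliases": ["CVE-2024-0815"],
    "affected": [
        {
            "package": {"ecosystem": "PyPI", "name": "paddlepaddle"},
            "ranges": [
                {
                    "type": "ECOSYSTEM",
                    "events": [{"introduced": "0"}, {"last_affected": "2.6.0"}],
                }
            ],
            "ecosystem_specific": {
                "affected_functions": ["paddle.utils.download._wget_download"]
            },
        }
    ],
}
```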
Like GitHub's Advisory DB, Google maintains an open-source software security advisory database (osv.dev) and ingests vulnerabilities from several sources, including GitHub. Typically, if I'm looking at language-package ecosystems like npm and PyPI, I'll go to GitHub, but Google's OSV also contains vulnerabilities for open-source operating systems like Red Hat and Ubuntu. It also adheres to the OSV Schema; Google is the one who donated the OSV Schema to the OpenSSF in the first place! (If you want to pull these records yourself, see the short API sketch after the table below.)
Ecosystem (only ecosystems with at least one affected function are listed) | Total Advisories | Advisories w/ at least one affected function |
---|---|---|
Total | 25,916 | 810 |
pypi | 6,538 | 164 |
npm | 3,744 | 119 |
crates.io | 1,590 | 254 (see comment below) |
packagist | 4,294 | 3 |
go | 3,841 | 560 (see comment below) |
maven | 5,241 | 1 |
nuget | 668 | 1 |
Note: Counts on osv.dev itself will be greater because I removed malware advisories; none of those include affected functions.
*I noticed that OSV hasn't effectively deduplicated GHSA and the other databases (RustSec and the Go Vulnerability Database), causing duplicates. For example, CVE-2021-38191 returns both the GitHub Security Advisory and the RustSec advisory.
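If you want to spot-check any of this yourself, osv.dev also exposes a query API. Below is a minimal sketch (assuming the third-party requests package) that asks for advisories affecting paddlepaddle 2.6.0 and reports which results carry any ecosystem_specific data under affected.

```python
# Minimal sketch: query osv.dev for advisories affecting paddlepaddle 2.6.0 and
# report which ones carry ecosystem_specific data (where affected functions live).
import requests

resp = requests.post(
    "https://api.osv.dev/v1/query",
    json={"version": "2.6.0", "package": {"name": "paddlepaddle", "ecosystem": "PyPI"}},
    timeout=30,
)
resp.raise_for_status()

for vuln in resp.json().get("vulns", []):
    has_eco_specific = any("ecosystem_specific" in aff for aff in vuln.get("affected", []))
    print(vuln["id"], "ecosystem_specific present:", has_eco_specific)
```

Individual records can also be fetched by ID via https://api.osv.dev/v1/vulns/.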
Are you wondering where GitHub and Google get these affected functions from? Some ecosystems capture these fields in their own vulnerability databases: the Go Vulnerability Database and the RustSec Advisory Database, covered below. If you're familiar with others, please let me know!
Total Advisories (Go Vulnerability Database) | Advisories w/ at least one affected symbol |
---|---|
1,761 | 547 |
For the RustSec Advisory Database, I suspect the actual count is roughly the same as on osv.dev and GHSA; I decided it wasn't worth calculating this more carefully, given that the repository stores advisories in Markdown, which sometimes, but not always, uses consistent formatting.
Total Advisories (RustSec Advisory Database) | Advisories w/ at least one affected symbol |
---|---|
724 | 69 (but more likely 114) |
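For completeness, these two ecosystems encode affected symbols differently from GHSA's affected_functions list. The sketches below use made-up package and symbol names; the field shapes are the two my counting script at the end of this post looks for: imports[].symbols for Go and affects.functions for RustSec advisories converted to OSV.

```python
# Illustrative 'affected' entries only; package names and symbols are hypothetical.

# Go Vulnerability Database: affected symbols are listed per import path.
go_affected = {
    "package": {"ecosystem": "Go", "name": "example.com/somemodule"},
    "ecosystem_specific": {
        "imports": [
            {
                "path": "example.com/somemodule/somepkg",
                "symbols": ["ParseThing", "parseHeader"],
            }
        ]
    },
}

# RustSec (converted to OSV): affected function paths live under affects.functions.
rust_affected = {
    "package": {"ecosystem": "crates.io", "name": "some-crate"},
    "ecosystem_specific": {
        "affects": {"functions": ["some_crate::parser::parse_header"]}
    },
}
```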
Open Discussions
Conclusion
What does it say about our understanding of open-source software security when we only have visibility into affected functions for approximately 1% of security advisories? It's important to note that the data presented above only includes ecosystems with at least one affected function.
In contrast, the Go ecosystem boasts a commendable coverage rate of 31%. So, what accounts for this discrepancy? Although I have no data on this, I have found the affected symbols in the Go Vulnerability Database to be the most accurate and comprehensive (trust me, bro).
Vendors promoting reachability analysis assert that their tools can reduce findings by 85-98%. If true, it underscores the urgent need for better publicly available data on these advisories. Yet, these vendors' lack of transparency regarding their proprietary affected function datasets raises questions about the effectiveness of reachability analysis. Without independent data quality evaluations, we are left in the dark about how reliable these claims are.
Script Used For Counting
The following script was used for GitHub Security Advisories, OSV, and the Go Vulnerability Database. Parsing the Markdown files in the Rust Advisory Database was an ugly attempt, but I did adapt the script to properly capture RustSec vulnerabilities that have been converted to the OSV Schema and hosted on osv.dev.
```python
import os
import json
from collections import defaultdict


def count_json_files_with_affected_functions_per_ecosystem(directory):
    ecosystem_files = defaultdict(list)  # Dictionary to store file names per ecosystem

    # Walk through the directory
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.json'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r') as f:
                        data = json.load(f)

                    # Check if 'affected' exists in the JSON structure
                    if 'affected' in data:
                        for item in data['affected']:
                            if 'ecosystem_specific' in item:
                                ecosystem_name = item['package']['ecosystem']

                                # Check for affected_functions or functions
                                affected_functions = item['ecosystem_specific'].get('affected_functions')
                                functions = item['ecosystem_specific'].get('affects', {}).get('functions')
                                imports = item['ecosystem_specific'].get('imports', [])

                                # Check if symbols are present and non-empty
                                symbols_present = any('symbols' in imp and imp['symbols'] for imp in imports)

                                # Count the file if affected_functions is not None,
                                # functions is a non-empty list, or symbols are present
                                if affected_functions is not None or (functions and isinstance(functions, list)) or symbols_present:
                                    ecosystem_files[ecosystem_name].append(file_path)
                                    break  # No need to check further in this file
                except (json.JSONDecodeError, FileNotFoundError) as e:
                    print(f"Error reading {file_path}: {e}")

    return ecosystem_files


if __name__ == "__main__":
    current_directory = os.getcwd()  # Get the current working directory
    ecosystem_files = count_json_files_with_affected_functions_per_ecosystem(current_directory)

    print("JSON files containing 'affected_functions', non-empty 'functions', or 'symbols' per ecosystem:")
    for ecosystem, files in ecosystem_files.items():
        print(f"{ecosystem}: {len(files)} files")
```
Until Next Time!
Hey, you made it to the bottom. Thanks for sticking around!
Questions, ideas, or want to chat? Slide into my inbox!
Don't hesitate to forward this if someone could benefit from it.
See you next Monday!
-Kyle
P.S. CramHacks has a Supporter tier! You can upgrade here to support CramHacks and its free weekly content.