
Affected Functions: A Key to Understanding Open-Source Vulnerabilities

Explore the public availability of affected functions for OSS vulnerabilities and why vendors are spending millions to build private datasets.

👋 This post looks at the public availability of affected functions for open-source software vulnerabilities. If you're interested in how vendors use automation to build private datasets that expand on publicly available data, let me know!

The software composition analysis (SCA) market has been flooded with vendors offering "reachability analysis," a feature intended to reduce noise by determining if and when a function relevant to a given OSS package vulnerability is used (see the example below). But how do these vendors determine which functions introduce risk?

CVE-2024-0815
The advisory details a command injection vulnerability in the PaddlePaddle PyPI package versions up to and including 2.6.0. Specifically, the vulnerability affects one function: paddle.utils.download._wget_download().

Legacy SCA tools will check if your project uses one of these PaddlePaddle versions. In contrast, the reachability analysis feature enables SCA tools to check whether your code actually calls the paddle.utils.download._wget_download() function.
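To make the difference concrete, here is a minimal sketch of the idea in Python. It is not how any vendor implements reachability analysis: it only looks for direct calls in a single source file, using the standard-library ast module, and it ignores call graphs, transitive dependencies, and dynamic dispatch. The helper names dotted_name and calls_affected_function are made up for this illustration.

```python
import ast

# Fully qualified function name taken from the advisory discussed above.
AFFECTED = "paddle.utils.download._wget_download"

def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def calls_affected_function(source, affected=AFFECTED):
    """Naive syntactic check: does `source` call `affected` directly?"""
    tree = ast.parse(source)
    aliases = {}  # local name -> fully qualified name
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    # `import a.b.c` binds only the top-level name `a`.
                    top = alias.name.split(".")[0]
                    aliases[top] = top
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                aliases[alias.asname or alias.name] = f"{node.module}.{alias.name}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is None:
                continue
            head, _, rest = name.partition(".")
            resolved = aliases.get(head, head) + ("." + rest if rest else "")
            if resolved == affected:
                return True
    return False

direct = "import paddle\npaddle.utils.download._wget_download('url', 'path')\n"
unrelated = "import paddle\nprint(paddle.__version__)\n"
print(calls_affected_function(direct))     # the vulnerable function is called
print(calls_affected_function(unrelated))  # the package is used, the function is not
```

A version-only check would flag both snippets; a function-level check flags only the first. The hard part is not the matching, it is knowing which function name to look for in the first place.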

Publicly Available Sources

Firstly, I'd like to start by saying the industry is doing itself a disservice by keeping this data under wraps in proprietary databases, spending six or even seven figures annually to maintain and grow their datasets. Each vendor uses its own non-standard schema, and the data quality is often highly questionable.

👋 I've long said this would be a great business idea, one that can and probably should be bootstrapped: selling this vulnerability data to vendors to power their reachability analysis. However, I'd much prefer this be a community effort that benefits everyone.

Today, some people trust the data (consumers who see reachability as a required feature), while others don't (and prefer to drown in endless false positives). Where are the people interested in measuring data quality and improving consumer confidence?

The simple truth is that vendors are unlikely to take this seriously unless the market speaks up and is willing to pay for it.

GitHub maintains a vulnerability database that includes CVEs and GitHub-originated security advisories for open-source software. Advisories are stored as individual files and formatted in OSV Schema. Per the schema, affected functions should be noted under the affected[].ecosystem_specific field.
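As a rough illustration of what that looks like in practice, the sketch below parses a trimmed-down advisory and pulls out the affected functions. The record is hand-written for this example (the GHSA ID is a placeholder and most required OSV fields are omitted); the affected_functions key under ecosystem_specific is the one the counting script at the end of this post looks for.

```python
import json

# Hypothetical, trimmed-down advisory; the GHSA ID is a placeholder.
ADVISORY_JSON = """
{
  "id": "GHSA-xxxx-xxxx-xxxx",
  "aliases": ["CVE-2024-0815"],
  "affected": [
    {
      "package": {"ecosystem": "PyPI", "name": "paddlepaddle"},
      "ecosystem_specific": {
        "affected_functions": ["paddle.utils.download._wget_download"]
      }
    }
  ]
}
"""

def affected_functions(advisory):
    """Map (ecosystem, package name) -> list of affected function names."""
    result = {}
    for entry in advisory.get("affected", []):
        specific = entry.get("ecosystem_specific") or {}
        functions = specific.get("affected_functions") or []
        if functions:
            package = entry.get("package", {})
            key = (package.get("ecosystem"), package.get("name"))
            result.setdefault(key, []).extend(functions)
    return result

advisory = json.loads(ADVISORY_JSON)
for (ecosystem, name), functions in affected_functions(advisory).items():
    print(f"{ecosystem}/{name}: {', '.join(functions)}")
```

Advisories without the field simply produce an empty result, which is the common case according to the counts below.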

| Ecosystem | Total advisories (only including ecosystems with at least one affected function) | Advisories w/ at least one affected function |
| --- | --- | --- |
| Total | 20,058 | 393 |
| pypi | 3,407 | 164 |
| npm | 3,653 | 108 |
| crates.io | 864 | 114 |
| packagist | 4,274 | 3 |
| go | 2,054 | 2 |
| maven | 5,177 | 1 |
| nuget | 629 | 1 |

👋 An advisory may include details for more than one affected function or ecosystem.

Like GitHub's Advisory DB, Google maintains an open-source software security advisories database (OSV) and ingests vulnerabilities from several sources, including GitHub. Typically, if I'm looking at language-package ecosystems like npm and PyPI, I'll go to GitHub, but Google's OSV also contains vulnerabilities for open-source operating systems like Red Hat and Ubuntu. Google also adheres to the OSV Schema, and is the one who donated it to OpenSSF!

| Ecosystem | Total advisories (only including ecosystems with at least one affected function) | Advisories w/ at least one affected function |
| --- | --- | --- |
| Total | 25,916 | 810 |
| pypi | 6,538 | 164 |
| npm | 3,744 | 119 |
| crates.io | 1,590 | 254 (see comment below) |
| packagist | 4,294 | 3 |
| go | 3,841 | 560 (see comment below) |
| maven | 5,241 | 1 |
| nuget | 668 | 1 |

👋 Counts on osv.dev will be greater because I removed malware advisories; none of these include affected functions.

*I noticed that OSV hasn't effectively deduplicated GHSA and other databases (RustSec & GoVuln), which causes duplicates; e.g., CVE-2021-38191 returns both the GitHub Security Advisory and the Rust Security Advisory.
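One way to collapse such duplicates when counting is to group records that share an alias. The sketch below is an assumption about how that could be done, not what osv.dev does: it relies on the OSV schema's aliases field and uses a small union-find to merge records connected through a common identifier. Apart from CVE-2021-38191, the IDs are placeholders.

```python
def group_duplicates(advisories):
    """Group advisory IDs that are connected through shared aliases."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for adv in advisories:
        find(adv["id"])  # register records that have no aliases at all
        for alias in adv.get("aliases", []):
            union(adv["id"], alias)

    groups = {}
    for adv in advisories:
        groups.setdefault(find(adv["id"]), []).append(adv["id"])
    return sorted(sorted(ids) for ids in groups.values())

# Placeholder IDs; only the CVE number comes from the example above.
records = [
    {"id": "GHSA-aaaa-aaaa-aaaa", "aliases": ["CVE-2021-38191"]},
    {"id": "RUSTSEC-0000-0001", "aliases": ["CVE-2021-38191"]},
    {"id": "GO-0000-0001", "aliases": []},
]
groups = group_duplicates(records)
print(groups)
print(f"{len(records)} records, {len(groups)} distinct vulnerabilities")
```

Counting groups instead of files would bring the crates.io and Go rows closer to the number of distinct vulnerabilities, at the cost of trusting that aliases are filled in consistently.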

Are you wondering where GitHub and Google get these affected functions from? Some ecosystems, such as Go and Rust, capture these fields in their vulnerability reports. If you're familiar with others, please let me know!
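Because each source records these names in a slightly different place, anything consuming the data has to normalize it first. The sketch below flattens the three layouts that the counting script at the end of this post checks for: affected_functions, affects.functions, and imports[].symbols. The path key used to qualify symbols is my assumption about the import entries, and every name in the sample data is made up.

```python
def extract_symbols(ecosystem_specific):
    """Return affected function/symbol names from any of three known layouts."""
    spec = ecosystem_specific or {}
    names = []
    # Layout 1: flat list under "affected_functions".
    names.extend(spec.get("affected_functions") or [])
    # Layout 2: nested list under "affects" -> "functions".
    names.extend((spec.get("affects") or {}).get("functions") or [])
    # Layout 3: per-import "symbols" lists, qualified with the import path
    # when one is present ("path" is an assumed key name).
    for imp in spec.get("imports") or []:
        path = imp.get("path", "")
        for symbol in imp.get("symbols") or []:
            names.append(f"{path}.{symbol}" if path else symbol)
    return names

# Made-up examples of each layout.
samples = [
    {"affected_functions": ["pkg.module.func"]},
    {"affects": {"functions": ["some_crate::module::func"]}},
    {"imports": [{"path": "example.com/mod/pkg", "symbols": ["Parse", "Reader.Read"]}]},
]
for sample in samples:
    print(extract_symbols(sample))
```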

| Total advisories | Advisories w/ at least one affected symbol |
| --- | --- |
| 1,761 | 547 |

I suspect the actual count here is the same as on osv.dev and GHSA; I decided it wasn't worth calculating more precisely, given that the repository stores advisories in markdown, which sometimes, but not always, follows a consistent format. 🤮

| Total advisories | Advisories w/ at least one affected symbol |
| --- | --- |
| 724 | 69 (but more likely 114) |

Open Discussions

Conclusion

What does it say about our understanding of open-source software security when we only have visibility into affected functions for approximately 1% of security advisories? It's important to note that the data presented above only includes ecosystems with at least one affected function.

In contrast, the Go ecosystem boasts a commendable coverage rate of 31%. So, what accounts for this discrepancy? Although I have no data on this, I have found the affected symbols in the Go Vulnerability Database to be the most accurate and comprehensive (trust me, bro).
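For reference, the coverage rates can be recomputed from the totals in the tables above. The labels are my reading of those tables; in particular, I am assuming the 1,761 / 547 row belongs to the Go Vulnerability Database, since it reproduces the 31% figure. Because the GHSA and osv.dev totals only cover ecosystems with at least one affected function, their rates are upper bounds for the databases as a whole.

```python
# (total advisories, advisories with at least one affected function/symbol),
# copied from the tables above; the labels are assumptions about which table is which.
COUNTS = {
    "GitHub Advisory Database (listed ecosystems)": (20_058, 393),
    "osv.dev (listed ecosystems)": (25_916, 810),
    "Go Vulnerability Database": (1_761, 547),
}

def coverage(total, with_functions):
    """Percentage of advisories that name at least one affected function."""
    return 100.0 * with_functions / total

for source, (total, with_functions) in COUNTS.items():
    print(f"{source}: {coverage(total, with_functions):.1f}%")
```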

Vendors promoting reachability analysis assert that their tools can reduce findings by 85-98%. If true, it underscores the urgent need for better publicly available data on these advisories. Yet, these vendors' lack of transparency regarding their proprietary affected function datasets raises questions about the effectiveness of reachability analysis. Without independent data quality evaluations, we are left in the dark about how reliable these claims are.

Script Used For Counting

The following script was used for GitHub Security Advisories, OSV, and the Go Vulnerability Database. My attempt at parsing the markdown files in the Rust Advisory Database was ugly, but I did adapt the script to properly capture RustSec vulnerabilities that have been converted to the OSV schema and are hosted on osv.dev.

import os
import json
from collections import defaultdict

def count_json_files_with_affected_functions_per_ecosystem(directory):
    ecosystem_files = defaultdict(list)  # Dictionary to store file names per ecosystem

    # Walk through the directory
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.json'):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        data = json.load(f)
                        # Check if 'affected' exists in the JSON structure
                        if 'affected' in data:
                            for item in data['affected']:
                                if 'ecosystem_specific' in item:
                                    # Fall back to 'unknown' if the entry has no package block
                                    ecosystem_name = item.get('package', {}).get('ecosystem', 'unknown')
                                    
                                    # Check for affected_functions or functions
                                    affected_functions = item['ecosystem_specific'].get('affected_functions')
                                    functions = item['ecosystem_specific'].get('affects', {}).get('functions')
                                    imports = item['ecosystem_specific'].get('imports', [])

                                    # Check if symbols are present and non-empty
                                    symbols_present = any('symbols' in imp and imp['symbols'] for imp in imports)

                                    # Count the file if affected_functions is not None, functions is a non-empty list, or symbols are present
                                    if affected_functions is not None or (functions and isinstance(functions, list) and functions) or symbols_present:
                                        ecosystem_files[ecosystem_name].append(file_path)
                                        break  # No need to check further in this file
                except (json.JSONDecodeError, FileNotFoundError) as e:
                    print(f"Error reading {file_path}: {e}")

    return ecosystem_files

if __name__ == "__main__":
    current_directory = os.getcwd()  # Get the current working directory
    ecosystem_files = count_json_files_with_affected_functions_per_ecosystem(current_directory)
    
    print("JSON files containing 'affected_functions', non-empty 'functions', or 'symbols' per ecosystem:")
    for ecosystem, files in ecosystem_files.items():
        print(f"{ecosystem}: {len(files)} files")

Until Next Time! 👋

Hey, you made it to the bottom – thanks for sticking around!

Questions, ideas, or want to chat? Slide into my inbox! 💌

Don't hesitate to forward if someone could benefit from this.

See you next Monday!
-Kyle
