
Historical provenance: Mapping Git tags to package versions to verify proof of origin for open source packages

We've developed a way to help determine provenance (or proof of origin) for open source software packages that we believe can serve as a viable alternative when sigstore provenance is not available. Called “historical provenance,” it involves looking back at historical Git releases and tags in a source repo, and mapping those to published package versions.

Author: Nigel Brown / 8 mins read / Jan 15, 2024

Join Stacklok CTO Luke Hinds and Staff Data Scientist Nigel Brown for a live demo and discussion about historical provenance on January 16, 2024 at 9 AM PT!

The demo and discussion will be streamed here on YouTube Live.

As humans, we aren’t allowed to do sensitive things, like opening a bank account or accessing our healthcare information, without first proving our identity. And yet, when it comes to software, we’ve largely accepted the practice of injecting third-party code into our projects without being able to prove that the code is authentic, and that it will do what it says it will do.

Malicious actors are taking full advantage of this. Sonatype noted that the number of malicious packages tripled in the past year, to over 245K. “Masquerading” and “typosquatting” are common attacks: malicious actors copy the metadata from a popular package and reuse it for their own malicious package, published under a slightly different name. The resemblance is intentional, and it makes it really hard for developers to tell the difference.

Example of starjacked packages
Which package is the original?

How can we help guard against these types of attacks? One key way is by establishing proof of origin and build provenance for open source packages.

sigstore: Establishing cryptographic links to source code

The open source project sigstore was founded by Stacklok CTO Luke Hinds to give developers an easier way to digitally sign and verify software artifacts. When you use sigstore to sign an artifact, your signature is recorded in a public, tamper-evident transparency log. This practice establishes a cryptographic link from the package back to its source code, acting like a digital ID.

But as of today, only a fraction of packages have been signed using sigstore. While we need to continue to make it easier for developers to use sigstore to digitally sign and verify OSS packages—and this will be a key goal for Minder—widespread adoption of sigstore by developer communities will take time. From a software security perspective, we can’t afford to wait.  

Introducing historical provenance: Establishing links to source code through Git tags and releases

At Stacklok, we’ve developed a way to help determine provenance (or proof of origin) for software packages that we believe can serve as a viable alternative when sigstore provenance is not available. We’re calling it “historical provenance,” because it involves looking back at historical Git releases and tags in a source repo, and mapping those to published package versions. 

It’s important to note that historical provenance does not replace the value of using sigstore and SLSA to establish cryptographically strong links, or of using a setup similar to Go Modules. Notably, sigstore can verify the connection between a specific package version and its source repo, while historical provenance can only link the overall package repository to the source repo. But in the absence of a cryptographic link, historical provenance can still provide strong linkage between a package and its source code, giving developers a better signal as to whether a package is what it says it is.

How does historical provenance work?

When developers release code, it is very common practice, though not mandated, to tag the source code.

Tags in source code serve as reference points, marking a specific state of the code that corresponds to a given release (via a commit). When issues are found in a production environment, tags allow developers to quickly check out the exact code that was running to reproduce and troubleshoot the issue. 

Tags provide a clear history of the project's progression. Critically, they also act as a form of documentation that indicates when certain features were introduced or when bugs were fixed.

An example of commit data in source code, tied to a version release tag

Tags also carry a second advantage in Git. Each commit is identified by a hash of its contents, which cover the source tree, commit message, author, and date, as well as the hash of the previous commit(s). This chaining means that every commit hash vouches for the entire history of the repository up to that point. If any part of a commit's data were to change, its hash would change, and so would the hashes of all subsequent commits. This makes the history tamper-evident, and even more so when combined with a signature.
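To make the chaining idea concrete, here is a minimal sketch of a toy hash chain. It is a simplified model, not Git's real object format: the field layout, the `commit_hash` and `build_chain` helpers, and the sample history are all assumptions made up for illustration. SHA-1 is used only because it is Git's default hash.

```python
import hashlib

def commit_hash(content, message, author, date, parents):
    """Hash a toy 'commit': its own fields plus the hashes of its parents."""
    payload = "\n".join([content, message, author, date, *parents])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def build_chain(commits):
    """Return one hash per commit; each hash covers the previous commit's hash."""
    hashes = []
    parents = []  # the first commit has no parent
    for c in commits:
        h = commit_hash(c["content"], c["message"], c["author"], c["date"], parents)
        hashes.append(h)
        parents = [h]
    return hashes

# Hypothetical three-commit history.
history = [
    {"content": "v1 source", "message": "initial", "author": "alice", "date": "2023-01-01"},
    {"content": "v2 source", "message": "add feature", "author": "alice", "date": "2023-02-01"},
    {"content": "v3 source", "message": "fix bug", "author": "bob", "date": "2023-03-01"},
]
original = build_chain(history)

# Tamper with the middle commit only.
tampered_history = [dict(c) for c in history]
tampered_history[1]["content"] = "v2 source with a backdoor"
tampered = build_chain(tampered_history)

# The first hash is unchanged; the edited commit and every later one differ.
changed = [a != b for a, b in zip(original, tampered)]
print(changed)
```

Because the third commit's hash includes the second commit's hash, editing the second commit also changes the third, even though the third commit's own fields were left alone.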

The package registries also record and publish a timestamp for each release event, such as the moment a new version is published.

Linking Git tags to package version dates

For packages in the crates.io, npm, and PyPI ecosystems, we compared the timestamps of Git tags in the source code to the published timestamps of the versions listed in the package registry. We found a strong correlation between the two.

For example, when we look at the number of releases vs. the number of tags in the repository and compare the times at which they happened, we see they are similar. If we also do a fuzzy match on the strings themselves, we get an even stronger correlation. If the repo and the package share even a small number of versions, we can reasonably assume the package came from the repo.

This comparison is very hard to fake, especially for a longer-lived package. To do so would involve going back in time and making fraudulent releases at the same time as valid tags. As long as we trust the tag producer (e.g., GitHub) and the packaging infrastructure owners (e.g., pypi.org, npmjs.com, crates.io), we can trust these mappings. (If GitHub or a packaging provider is hacked, we are all in trouble!)

Automating the process

While visual inspection is possible for matching the tag -> publish timestamps, there are millions of open source packages in these ecosystems. We needed to automate this, so that we could display this data in Trusty for developers to use.

Here’s how we approach that. We start with two streams of timestamps with slightly different version strings, and we want to consider how similar they are: 

  1. Create a list of tags and package versions

  2. Take each tag in the repo and look for a ‘core’ of the form #.#.# using a regex.

  3. Look for that core in the package versions

  4. Match up the dates and count them

  5. Report the count as ‘common’ matching tags.

Note that a low-percentage overlap can mean there are multiple packages in one repo, or it can mean that tags are used for things other than releases.
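The five steps above can be sketched roughly as follows. This is one possible reading of them, not the implementation used in Trusty: the exact regex, the `core_version` and `count_common` helpers, the one-day date tolerance, and the sample tags and versions are all assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Step 2: the '#.#.#' core of a version string (assumed pattern).
CORE = re.compile(r"\d+\.\d+\.\d+")

def core_version(label: str) -> Optional[str]:
    """Return the first #.#.# core found in a tag or version string, if any."""
    match = CORE.search(label)
    return match.group(0) if match else None

def count_common(tags, versions, tolerance=timedelta(days=1)):
    """Count tags whose core matches a package version published close in time.

    tags and versions map a label to a datetime. The tolerance is an assumption.
    """
    # Step 3: index package versions by their core.
    published = {}
    for label, when in versions.items():
        core = core_version(label)
        if core is not None:
            published.setdefault(core, when)

    # Steps 4 and 5: match up the dates and count the common entries.
    common = 0
    for label, tagged in tags.items():
        core = core_version(label)
        if core in published and abs(published[core] - tagged) <= tolerance:
            common += 1
    return common

# Step 1: hypothetical lists of tags and package versions.
tags = {
    "v1.0.0": datetime(2023, 1, 10, 12, 0),
    "v1.1.0": datetime(2023, 3, 2, 9, 30),
    "docs-update": datetime(2023, 3, 5),      # not a release tag
    "v2.0.0-rc1": datetime(2023, 6, 1),       # tagged long before publishing
}
versions = {
    "1.0.0": datetime(2023, 1, 10, 12, 20),
    "1.1.0": datetime(2023, 3, 2, 10, 0),
    "2.0.0": datetime(2023, 7, 15),
}

print(count_common(tags, versions))  # 2
```

In this made-up data, two of the four tags line up with a published version. The `docs-update` tag has no version core, and `v2.0.0-rc1` shares a core with `2.0.0` but falls outside the date tolerance.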

Initial results 

The initial set of results is promising. There is a clear diagonal relationship between the packages and the repositories they claim. This means that when we compare a package with its own repo, we get a high score; when we compare it to another repo, we get a low (0) score.

Results of historical provenance

There are some anomalies to consider. Notably, we can’t compare repos that contain no tags, so these score 0 on the leading diagonal.

Mapping the data

We intend to use historical provenance data to catch possible “starjacking” and “typosquatting” attempts—cases in which a bad actor is using copied metadata and slightly misspelled package names to get developers to install malicious code. Here’s how we can do this. 

Example 1: Identifying which packages are most likely to be associated with a specific repo

In this case, we have more than one package that claims to come from the same source repo, and we want to prove which packages actually do come from that repo. From the graphs above, we can see this in action. The diagonal line gives a strong signal. We can create a confusion matrix to see if we can select the best match from all the others in the test set:

Confusion matrix to prove which packages came from the same repo

This is a perfect test, within this sample set.
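As a rough sketch of how the best match might be selected from a grid of scores, the snippet below picks, for each package, the repo with the highest count of common tags. The `best_repo` helper, the package and repo names, and the scores are all hypothetical and are not the data behind the results above.

```python
def best_repo(scores):
    """Map each package to its highest-scoring repo, or None if nothing overlaps."""
    best = {}
    for package, row in scores.items():
        repo, score = max(row.items(), key=lambda item: item[1])
        best[package] = repo if score > 0 else None
    return best

# Hypothetical scores: rows are packages, columns are candidate repos.
scores = {
    "pkg-a": {"org/repo-a": 14, "org/repo-b": 0, "org/repo-c": 0},
    "pkg-b": {"org/repo-a": 0, "org/repo-b": 9, "org/repo-c": 0},
    "pkg-c": {"org/repo-a": 0, "org/repo-b": 0, "org/repo-c": 0},  # repo has no tags
}

for package, repo in best_repo(scores).items():
    print(package, "->", repo)
```

Returning `None` for an all-zero row keeps the untagged-repo anomaly visible rather than silently picking an arbitrary repo.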

Example 2: Identifying whether a given package is from a specific repo

To figure out whether a package matches its claimed repository, we need to have some kind of cutoff. What score is good enough? This can be seen in the image below.

The blue section represents the “correct” packages. We expect these to have some overlap, which they mostly do. The orange ones are mismatched pairs, which we expect to have no overlap. This is true apart from one case, which does legitimately share the same repo. So, in this case it looks like any overlap is enough for discrimination.

We can use it to create a confusion matrix.

This is a very good test.
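The cutoff idea can be illustrated with a small sketch that labels a package and repo pair as a match when its overlap score exceeds a cutoff, then tallies the confusion matrix. The `confusion` helper and the scores are invented for illustration and only mimic the shape of the cases described above (an untagged repo among the correct pairs, and one mismatched pair that shares a repo).

```python
def confusion(pairs, cutoff=0):
    """Tally a confusion matrix, predicting 'match' when score > cutoff.

    pairs is a list of (score, is_correct_pair) tuples.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, is_correct_pair in pairs:
        predicted = score > cutoff
        if predicted and is_correct_pair:
            counts["tp"] += 1
        elif predicted and not is_correct_pair:
            counts["fp"] += 1
        elif not predicted and is_correct_pair:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Hypothetical (score, is_correct_pair) data.
pairs = [
    (12, True),   # correct pair with clear overlap
    (7, True),
    (0, True),    # correct pair, but the repo has no tags
    (0, False),   # mismatched pairs with no overlap
    (0, False),
    (3, False),   # mismatched pair that happens to share a repo
]

print(confusion(pairs))
```

With a cutoff of 0, any overlap counts as a match, which mirrors the observation that any overlap is enough for discrimination in this sample.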

The samples above are mostly from PyPI (crates 11, npm 12, PyPI 40), but we found that the same holds for the Rust and npm packages we examined. There are occasional edge cases where things go wrong, but all things considered, this is a very useful approach.

Conclusion

With historical provenance, we can often show, with a high degree of accuracy, that a package comes from its claimed source repository by comparing the history of its releases. Again, historical provenance is not a replacement for cryptographic provenance (we consider sigstore to be the “chef's kiss” solution), but it is a very useful tool for understanding the true origin of a package and knowing whether it is what it says it is.

For Stacklok, this method of provenance will help us better identify malicious packages and provide stronger indicators to developer communities, because it gives us more insight and observability over a package’s metadata claims. For that reason, we’ve integrated historical provenance into Trusty, our free-to-use service for vetting the safety and trustworthiness of open source packages. You can check out historical provenance in action now by heading to www.trustypkg.dev.

We’d love to hear your feedback on our approach with historical provenance. Join our Discord channel to chat with us and share your thoughts.
