The importance of historical provenance in identifying malicious packages

Author: Nigel Brown

Jan 15, 2024

Historical provenance: Mapping Git tags to package versions to verify proof of origin for OSS packages

In software development, ensuring the authenticity and integrity of packages is critical to the software supply chain. Source of origin verification and build provenance play key roles here, providing a framework to validate the origins of packages and the integrity of the build process. Provenance helps us with aspects such as:

Trust and Integrity: Knowing where the code comes from helps establish trust. If the provenance is clear, users can have greater confidence that the code has not been tampered with from its original state.

License Compliance: Open source software is subject to various licenses that dictate how the software can be used, modified, and distributed. Knowing the provenance of the software ensures that organizations comply with these licenses, avoiding legal issues and respecting the rights of the original authors.

Attack Prevention: A common attack against package managers is typo-squatting, typically coupled with starjacking. Attackers pick name variations that are close to the target package name. To appear even more credible, they hijack the metadata of a popular package (including the target package). This makes these attacks harder to detect.

Example of starjacked packages — Which package is the original?

Measuring quality: Being able to read the original package's code and see how well it is maintained allows individuals to make assumptions about the quality of the package. This depends on the link between the package and the software repository.

Sometimes these links are cryptographically strong, for example when using signing systems or attestation frameworks such as Sigstore and SLSA, or a setup similar to Go Modules. These can be trusted implicitly. This is the long game and where we will hopefully end up in the future. Currently these cryptographically strong links are rare.

Some of these links are in-between (e.g. PyPI Trusted Publishers), where some signing has been done but is not necessarily visible to the public.

The vast majority make no guarantee, and will not for the foreseeable future. Stacklok will seek to work with communities by encouraging the adoption of Sigstore.

In the absence of direct explicit provenance, which is the majority case for the foreseeable future, what can we do to discover the strength of this link?

Historical Provenance (HP)

When we do a release, it is very common practice, though not mandated, to tag the source code.

Tags in source code serve as reference points, marking a specific state of the code that corresponds to a given release (via a commit). When issues are found in a production environment, tags allow developers to quickly check out the exact code that was running to reproduce and troubleshoot the issue. Tags provide a clear history of the project's progression. They also act as a form of documentation that indicates when certain features were introduced or when bugs were fixed.

An example of commit data in source code, tied to a version release tag

Developers understand how useful tagged releases are, and we find the majority of software maintainers use tags as expected.

Tags also carry a second advantage in Git. Each commit is identified by a hash of its contents, which include the source tree, commit message, author, and date, as well as the hash of the previous commit(s). This chaining means every commit hash covers the entire history of the repository up to that point. If any part of a commit's data were to change, its hash would change, and so would the hashes of all subsequent commits. This makes the history tamper-evident (even more so when combined with a signature).

All package managers also record a timestamp for each publish event.
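The chaining described above can be illustrated with a toy model. This is a simplified sketch, not Git's actual object format (a real commit object also covers a tree, author, committer and a header); the function names and commit messages are made up for illustration.

```python
import hashlib
from typing import List


def commit_hash(parent_hash: str, content: str) -> str:
    """Hash a toy commit: its content plus the hash of its parent."""
    payload = f"{parent_hash}\n{content}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def build_chain(contents: List[str]) -> List[str]:
    """Return the hash of each toy commit, oldest first."""
    hashes = []
    parent = ""  # the first commit has no parent
    for content in contents:
        parent = commit_hash(parent, content)
        hashes.append(parent)
    return hashes


original = build_chain(["initial import", "add parser", "release 1.0.0"])
tampered = build_chain(["initial import", "add backdoor", "release 1.0.0"])

# The first commit is untouched, so its hash is the same in both histories.
print(original[0] == tampered[0])
# Changing the second commit changes its hash and every hash after it.
print(original[1] != tampered[1], original[2] != tampered[2])
```

A tag that points at the last commit therefore pins down everything before it: rewriting an earlier commit would produce a different hash for the tagged commit.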

Tags + publishing dates

We compared the timestamps of tags and the published timestamp in the packaging system over many packages from several different ecosystems (crates, npm and pypi) and we found a strong correlation between the two.

When we look at the number of releases vs the number of tags in the repository and compare the times at which they happened, we see they are similar. If we also do a fuzzy match on the strings themselves, we get an even stronger correlation. If the repo and the package share even a small number of versions, we can assume the package came from the repo.

This comparison is very hard to fake, especially for a longer-lived package. To do so would involve going back in time and making fraudulent releases at the same time as valid tags.

This means that as long as we trust the tag producer (typically GitHub, GitLab, etc.) and the packaging infrastructure owners (pypi.org, npmjs.com, crates.io), we can trust the mappings. (If GitHub or a packaging provider is hacked, all of us are in trouble.)

Visual inspection of tag -> publish timestamps is possible, but given the number of packages we have to consider (millions), it is useful to automate this.

How we automated Historical Provenance

We have two streams of timestamps with slightly different version strings, and we want to consider how similar they are.

Create a list of tags and a list of package versions.
Take each tag in the repo and look for a ‘core’ of the form #.#.# using a regex.
Look for that core in the package versions.
Match up the dates of the tag and the package version, and count the matches.
Report the count as ‘common’.
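The steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not the implementation used for the results in this post: the function names, the exact regex, and the max_gap tolerance used to decide that two dates "match" are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Optional

# A 'core' of the form #.#.# (assumed regex; the original pattern is not shown).
CORE = re.compile(r"\d+\.\d+\.\d+")


def core_version(name: str) -> Optional[str]:
    """Extract the #.#.# core from a tag or version string, if there is one."""
    match = CORE.search(name)
    return match.group(0) if match else None


def common_versions(
    tags: Dict[str, datetime],
    releases: Dict[str, datetime],
    max_gap: timedelta = timedelta(days=2),  # assumed tolerance
) -> int:
    """Count cores that appear in both streams with dates within max_gap."""
    release_dates: Dict[str, datetime] = {}
    for version, published in releases.items():
        core = core_version(version)
        if core is not None:
            release_dates.setdefault(core, published)

    matched = set()
    for tag, tagged in tags.items():
        core = core_version(tag)
        if core in release_dates and abs(release_dates[core] - tagged) <= max_gap:
            matched.add(core)
    return len(matched)


# Hypothetical data: three release tags and one tag used for something else.
tags = {
    "v1.0.0": datetime(2023, 1, 10),
    "v1.1.0": datetime(2023, 3, 2),
    "docs-preview": datetime(2023, 3, 5),
    "v2.0.0": datetime(2023, 6, 1),
}
releases = {
    "1.0.0": datetime(2023, 1, 10, 12),
    "1.1.0": datetime(2023, 3, 2, 9),
    "2.0.0": datetime(2023, 9, 1),  # published months after the tag
}

print(common_versions(tags, releases))  # 2
```

Comparing extracted cores for equality, rather than searching for the core as a substring, avoids a tag such as v1.2.3 matching a package version such as 1.2.30.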

Note that a low percentage overlap can mean there are multiple packages in one repo or that tags are used for things other than releases.

Results. The initial set of results is promising. There is a clear diagonal relationship between the packages and the repositories they claim. This means that when we compare a package with its own repo we get a high score, and when we compare it with another repo we get a low (0) score.

There are some anomalies to consider. Repositories with no tags in them have nothing to compare with. There isn’t much we can say about these repositories, so we have to pass on them. They score 0 on the leading diagonal.

What can we ask?

Which package is most likely to be associated with a specific repo? This would be useful in the case of star-jacking. We have a number of packages claiming to come from the same repo. We want to know which one is the most likely to be correct.

From the visuals above we can see this in action. The diagonal line gives a strong signal. We can create a confusion matrix to see if we can select the best match from all the others in the test set.

Confusion matrix to prove which packages came from the same repo

This is a perfect test, within this sample set.
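Selecting the best match for a repo can be sketched as picking the candidate with the highest ‘common’ count. The function name and package names below are hypothetical, and refusing to answer when every candidate scores 0 is an assumption, chosen because a repo without tags gives us nothing to compare.

```python
from typing import Dict, Optional


def most_likely_package(common_by_package: Dict[str, int]) -> Optional[str]:
    """Return the package with the highest 'common' count for one repo.

    Returns None when there are no candidates or none has any overlap.
    On a tie, the first candidate in the mapping wins.
    """
    best = max(common_by_package, key=common_by_package.get, default=None)
    if best is None or common_by_package[best] == 0:
        return None
    return best


# Hypothetical 'common' counts for packages that all claim the same repo.
candidates = {"examplepkg": 14, "examp1epkg": 0, "example-pkg": 0}
print(most_likely_package(candidates))  # examplepkg
```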

Is this package from a specific repo? To decide if a package matches its claimed repository we need to have some kind of cutoff. What score is good enough? This can be seen in the image below.

The blue section shows the ‘correct’ pairs, which we expect to have some overlap, and they mostly do. The orange section shows mismatched pairs, which we expect to have no overlap. This holds apart from one case where the package does legitimately share the same repo. So, in this case it looks like any overlap is enough for discrimination.

We can use it to create a confusion matrix.

This is a very good test.
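A cutoff of "any overlap" turns the ‘common’ count into a yes/no decision, which can then be tallied into a confusion matrix. The sketch below assumes a cutoff of 1 and uses invented sample pairs purely to show the mechanics; they are not the data behind the figures in this post.

```python
from typing import Dict, Iterable, Tuple


def confusion_matrix(
    pairs: Iterable[Tuple[int, bool]], cutoff: int = 1
) -> Dict[str, int]:
    """Tally predictions for (common_count, really_same_repo) pairs.

    A pair is predicted to match when its common count reaches the cutoff.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for common, same_repo in pairs:
        predicted = common >= cutoff
        if predicted and same_repo:
            counts["tp"] += 1
        elif predicted and not same_repo:
            counts["fp"] += 1
        elif same_repo:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Invented pairs: (common count, whether the package really is from the repo).
sample = [
    (12, True),
    (3, True),
    (0, True),   # e.g. a repo with no tags, so nothing to compare
    (0, False),
    (0, False),
    (4, False),  # overlap with a repo labelled as a mismatch
]
print(confusion_matrix(sample))
```

Raising the cutoff trades false positives for false negatives, so the right value depends on how costly each kind of mistake is.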

The samples above are mostly from Python (crates 11, npm 12, pypi 40), but we found the same holds for the Rust and npm packages we examined. There are occasional edge cases where things go wrong, but this looks like a very useful approach.

Conclusion

By comparing the history of its releases with the history of a repository's tags, we can tell with a high degree of accuracy whether a package comes from that repository. Historical Provenance is not the same as cryptographic provenance (we consider Sigstore to be the chef's kiss solution), but it is a very useful tool for understanding the true source of origin of a package. It improves the signals around a package's claims about its source of origin.

Historical Provenance will also improve our ability to identify malicious packages, as it gives us better insight into, and observability of, the claims made in package metadata.

Keep an eye out: we plan to ship this into trustypkg.dev shortly. This will allow Stacklok to provide even sharper indicators to our community of users. We will also continue to add more layers of guarantees on the security of the supply chain with respect to package managers.