Artificial intelligence seems to be everywhere now: helping us write, draw, compose videos, and enhance pictures… no matter where you look, almost all applications are getting an AI update whether they need it or not. With the rise of AI, the risks of running the machine learning models that power these features also multiply.
Machine learning (ML) models are software and, as such, the models and the processes used to build them are susceptible to the same supply chain attacks we’ve been trying to defend against for the past couple of years. But models are also software of a different nature, and they open new attack vectors for adversaries to try. Some of the differences between how AI models and traditional software are built pose new problems that we need to solve to effectively harden machine learning pipelines.
ML model development involves three main steps: (a) data collection, (b) model training and validation, and (c) model deployment.
(a) Data collection: In this step, the modeler gathers raw data that is relevant for training the ML model. The modeler may acquire the data from an external public source or from an internal data repository. After acquiring the data, the modeler may perform exploratory data analysis, pre-process the data, and identify a set of features. A features dataset is then created by extracting the selected features from the raw data. The features dataset is used in the next step.
(b) Model training and validation: In this step, the modeler creates an ML model using a library like scikit-learn, PyTorch, or TensorFlow. The features dataset is split into three parts: training, validation, and test. The modeler trains the model using the training data. The validation data is used for hyperparameter tuning, and the test data is used for evaluating the model’s performance. Until the desired performance is achieved, the modeler repeats training, tuning, and evaluation using various model types or architectures. The final trained model is used in the next step.
(c) Model deployment: In this step, the modeler deploys the trained model in an inferencing endpoint or service. An application that needs to use the ML model invokes this endpoint to perform inferencing.
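To make the three steps concrete, here is a minimal, end-to-end sketch using scikit-learn, one of the libraries mentioned in step (b). The synthetic data stands in for step (a)’s features dataset, the hyperparameter values are placeholders, and persisting the model behind a small predict function stands in for a real inferencing endpoint; treat it as an illustration, not a recipe.

```python
# Minimal sketch of the (a) -> (b) -> (c) pipeline described above.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# (a) Data collection: synthetic data stands in for the features dataset
# produced from the raw data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# (b) Model training and validation: split into training, validation, and
# test sets, tune one hyperparameter on the validation split, and evaluate
# on the held-out test split.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

best_model, best_score = None, -1.0
for n_estimators in (50, 100, 200):   # hyperparameter tuning loop
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_model, best_score = model, score

print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# (c) Model deployment: persist the trained model; an inferencing service
# would load this artifact and expose predictions over an endpoint.
joblib.dump(best_model, "model.joblib")

def predict(features):
    """Stand-in for the inferencing endpoint an application would invoke."""
    model = joblib.load("model.joblib")
    return model.predict([features]).tolist()
```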
This pipeline is potentially vulnerable to the same supply chain attacks as traditional software: the source repositories could be compromised, malicious dependencies could make their way into the final model, the build system could be hacked, malicious models could be pushed to the registries, etc.
Let’s enumerate some of the considerations needed to build a pipeline that generates enough non-falsifiable metadata to allow an admission controller to make the best call possible when deploying a model (a rough admission-check sketch in code follows these considerations):
We need to ensure the model source is coming from our trusted repository. Presumably, code is only committed after following good practices, such as two-person reviews, and is written by the expected collaborators. This can be achieved by verifying signed commits when checking out the code; commits signed with gitsign can be verified against Sigstore’s transparency log.
The model should be built using the expected dependencies. Third-party packages and modules introduce foreign code into the model, and changes to them must be controlled. We could enforce a policy on a trusted SBOM to allow only certain dependencies and verify that they are pinned.
The hyperparameters used to tune the model should not be changed, as changes to them may compromise the integrity of the model’s outputs. Hyperparameters can be recorded in the build definition of a SLSA provenance attestation.
We must ensure the build toolchain is trusted and executed by the expected identities. Again, SLSA provenance can help. A provenance attestation will tell me which toolchain was used, down to the commit SHA, and the attestation signature guarantees that the build was not run on someone else’s systems.
The resulting model must be tamper-proof: no one should be able to falsify or alter the model from the moment it is built to the time it is deployed. This is achieved by signing the model blob.
Finally, we need to have assurances about the integrity of the datasets used to train the model. We want to make sure that only models trained on known, trusted datasets are admitted.
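Putting everything except the dataset question together, here is a rough, non-authoritative sketch of what such an admission check could look like. Everything in it is an assumption chosen for illustration: the file names, the expected builder ID, the dependency allowlist, the signing identity, and the particular combination of git/gitsign, CycloneDX, SLSA provenance, and cosign are one possible toolchain, not a prescribed one.

```python
# Hypothetical admission checks for a newly built model, covering the
# considerations above (except dataset integrity).
import json
import subprocess
import sys

EXPECTED_BUILDER = "https://example.com/ml-training-pipeline"        # hypothetical
EXPECTED_HYPERPARAMETERS = {"learning_rate": 0.001, "epochs": 20}    # hypothetical
ALLOWED_DEPENDENCIES = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}  # hypothetical pinned allowlist


def check_source_commit():
    # 1. Trusted source: verify the HEAD commit signature. Assumes gitsign is
    #    configured as git's x509 signing program, so verification goes
    #    through Sigstore.
    result = subprocess.run(["git", "verify-commit", "HEAD"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"commit signature verification failed:\n{result.stderr}")


def check_dependencies(sbom_path):
    # 2. Expected, pinned dependencies: enforce the allowlist against a
    #    CycloneDX SBOM, whose components carry a name and a version.
    with open(sbom_path) as f:
        sbom = json.load(f)
    for component in sbom.get("components", []):
        name, version = component.get("name"), component.get("version")
        if ALLOWED_DEPENDENCIES.get(name) != version:
            sys.exit(f"disallowed or unpinned dependency: {name}=={version}")


def check_provenance(provenance_path):
    # 3. Hyperparameters and build toolchain: read a SLSA v1 provenance
    #    statement and compare the recorded values against policy. Storing
    #    hyperparameters under externalParameters is an assumption; verifying
    #    the attestation signature itself is a separate step not shown here.
    with open(provenance_path) as f:
        predicate = json.load(f)["predicate"]
    builder_id = predicate["runDetails"]["builder"]["id"]
    hyperparameters = predicate["buildDefinition"]["externalParameters"].get(
        "hyperparameters", {})
    if builder_id != EXPECTED_BUILDER:
        sys.exit(f"unexpected builder: {builder_id}")
    if hyperparameters != EXPECTED_HYPERPARAMETERS:
        sys.exit(f"hyperparameters drifted from policy: {hyperparameters}")


def check_model_signature(model_path, signature, certificate):
    # 4. Tamper-proof model: verify the signature over the model blob with
    #    cosign, pinning the expected signing identity and OIDC issuer.
    subprocess.run(
        ["cosign", "verify-blob",
         "--signature", signature,
         "--certificate", certificate,
         "--certificate-identity", "release@example.com",       # hypothetical identity
         "--certificate-oidc-issuer", "https://accounts.google.com",
         model_path],
        check=True,
    )


if __name__ == "__main__":
    check_source_commit()
    check_dependencies("model-sbom.cdx.json")          # hypothetical paths
    check_provenance("model.provenance.json")
    check_model_signature("model.onnx", "model.sig", "model.pem")
    print("all checks passed; model admitted")
```

Note that the sketch deliberately stops short of the last consideration: the integrity of the training datasets.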
That last item, the datasets, is where things get tricky.
The main differentiator between a traditional software build and an ML model training process is the datasets. The model’s final form and behavior will vary depending on the data it was trained on. This means that tracking those datasets is just as important as tracking software dependencies, if not more.
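One building block we can use today, at least when the training data is a fixed set of files, is hashing the dataset so the model’s metadata can point at an exact snapshot. A small sketch, with a hypothetical directory layout:

```python
# Sketch: compute a stable digest over a dataset tree so it can be recorded
# next to the model as a revision identifier. Only works when the dataset is
# a fixed set of files; the path below is a placeholder.
import hashlib
from pathlib import Path

def dataset_digest(root):
    """Return a sha256 digest over every file (path + contents) under root."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

print("training-data digest:", dataset_digest("datasets/training-set"))   # hypothetical path
```

A digest like this gives us an identifier, but as the rest of this piece argues, the formats that would carry it are not there yet.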
Dataset poisoning is a serious concern. Injecting malicious data can give attackers the capability to degrade an AI application or, worse, to alter its behavior enough to control the way it works. In the real world, we’ve seen examples of data poisoning attempting to circumvent Gmail’s spam filter, break various antivirus products by uploading adversarial samples to VirusTotal, and turn a chatbot from Microsoft into the ugliest human that ever existed.
Just as with a vulnerable software dependency, an AI SBOM can help determine where and when models trained on compromised datasets are or were deployed, so I can understand the risk my organization was exposed to. Armed with an SBOM, an admission controller should be able to know if a newly built model was trained on a poisoned dataset and reject it from future deployments.
Unfortunately, from the software supply chain security side, we cannot offer these transparency guarantees today. Features to properly point to the datasets used when building models are simply not there yet in the SBOM formats.
Just like the models themselves, datasets can be specified in both major SBOM formats. The way to describe them differs in each, and both lack the features to properly point to a dataset revision.
Why is a revision important? If a breach occurs and my dataset is compromised, I need to know which models were trained on the tainted data. This is the ideal use case for an SBOM: the power to quickly locate software cooked with a specific ingredient.
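To make that concrete, here is a purely hypothetical sketch of the query we would like to run: scan the SBOMs of deployed models for a dataset ingredient pinned to the compromised revision. The component layout, the names, and the use of a version field as a revision are all assumptions; the missing revision field is precisely the gap discussed below.

```python
# Purely hypothetical: find deployed models whose SBOM lists a compromised
# dataset at a specific revision. Today's SBOM formats do not define the
# dataset-revision field this check depends on.
import json
from pathlib import Path

COMPROMISED = ("customer-feedback-corpus", "rev-2024-03-01")   # hypothetical dataset + revision

for sbom_path in Path("deployed-model-sboms").glob("*.json"):  # hypothetical SBOM store
    sbom = json.loads(sbom_path.read_text())
    for component in sbom.get("components", []):
        if (component.get("name"), component.get("version")) == COMPROMISED:
            print(f"{sbom_path.name}: trained on the tainted dataset revision")
```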
To add AI/ML features, CycloneDX chose to embed a full machine-readable version of the Model Card framework in the document. The format has support for specifying datasets in the model card, but beyond a name and a location, there aren’t any more fields to track the origin of the data.
SPDX 3.0 defines a new, top-tier Dataset class alongside the class for the model itself, called AIPackage. While it is encouraging to see the royal treatment given to the Dataset class, just as with its CycloneDX counterpart, its current fields also lack support for specifying revisions of the dataset.
To be fair, the SBOM standards can only reflect the state of the data science world. A short survey revealed that most projects proposing some sort of document to describe datasets are focused on other risk areas such as ethics, copyright, and privacy. These projects have different target audiences, but they all lack a way to point to a specific revision of a dataset.
For example, the Data Nutrition Project produces beautiful webpages called Data Nutrition Labels that tell anyone researching a new dataset whether the data is prone to gender, socioeconomic, or racial bias, along with details about the data license and even how frequently the data gets updated. But, just as with an AI SBOM, there is no way to specify the revision of the dataset a nutrition label applies to. What if the dataset is improved to fix some of those biases? How would I know if I’m using the improved dataset or the old one?
This is not unique to the Data Nutrition Project. The most promising proposal to describe datasets is probably Datasheets for Datasets. Datasheets cover all sorts of aspects of a dataset, grouped into seven areas roughly related to the dataset lifecycle. The Maintenance section is the closest to what we need: it touches on revisions, but there are no details on how to specify them or any identifier scheme. The example in the paper uses simple version numbers, but without an identifier or another machine-readable structure, we cannot use them in supply chain security applications for AI.
Other projects and ideas are in a similar state. For example, the DALL·E 2 system card has a nice blurb for humans about its training and other supporting datasets, but when it comes to referencing them, the authors still have to resort to English text in the footnotes. FactSheets are machine-readable, but just as with the SBOM formats, the dataset descriptor is limited to a link that serves as an identifier yet still lacks context.
One promising project is Data Version Control (DVC). DVC works similarly to git, where you add files to commits. It has cool features such as tracking data processing pipelines, which is essential to capture how data is pre-processed before running a training job, and database imports, since datasets are not always stored as files.
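As a sketch of how that helps with revisions: DVC’s Python API can resolve a dataset path at a specific git revision, because the .dvc metadata committed to the repository pins the exact version of the data. The repository URL, path, and tag below are hypothetical.

```python
# Sketch: pin a DVC-tracked dataset to a git revision. The repo, path, and
# tag are placeholders; dvc.api resolves them through the committed .dvc
# metadata, so the git revision identifies an exact version of the data.
import dvc.api

url = dvc.api.get_url(
    "data/training-set.csv",                       # hypothetical dataset tracked by DVC
    repo="https://github.com/example/ml-project",  # hypothetical repo
    rev="v1.2.0",                                  # git tag/commit pinning the dataset revision
)
print("resolved dataset artifact:", url)

# Stream the pinned revision of the data directly, without a full checkout.
with dvc.api.open(
    "data/training-set.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2.0",
) as f:
    header = f.readline()
```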
We also need to consider the fact that tracking revisions may not be possible in some cases. For example, what happens when a dataset changes faster than a training process that runs for weeks or months? How do I specify a revision when my training data is “the whole Internet”?
There is a lot of interest in the supply chain security space to help the AI world be more secure and transparent. We are eager to apply the lessons learned in hardening software builds to model building pipelines, but we need help from our data science colleagues to solve the missing pieces, such as this one. Let’s keep the conversations going.
Adolfo "Puerco" García Veytia
Staff Engineer
Adolfo “Puerco” García Veytia is a staff software engineer at Stacklok, based in Mexico City. He is a technical lead with Kubernetes SIG Release specializing in improvements to the software that drives the automation behind the Kubernetes release process. He is also the creator of the Protobom and OpenVEX open source projects.