Manifest confusion is a problem in the architecture of npm, pointed out by Darcy Clarke: An npm package’s manifest is independently published from its tarball and never fully validated.
When I read about manifest confusion, my first question was: How often do the manifest and package.json not match? Frustratingly, the answer is: Almost all the time! And that’s when I went down the rabbit hole of package normalization.
npm’s package normalization
When a developer wants to release a package to the npm registry and types npm publish into their console, a lot of things happen in the background.
In short, npm reads the package data from package.json, collects all the necessary files into a tarball and uploads some metadata to the registry. This metadata is what I’ll call the manifest for the rest of this article. In theory, the manifest is just the data read from package.json, but before the package is uploaded, a myriad of small and large changes are made to this data: the “normalization”.
I put the word “normalization” in quotation marks because the changes are not limited to bringing some values into a uniform format, but are sometimes quite significant. For those interested in the actual code, check out the @npmcli/package-json package and start with the function prepare. If you are only interested in the results, you can skip the implementation details.
Implementation details
prepare applies the following list of manipulations to the package.json:
_attributes
bundledDependencies
bundleDependencies
bundleDependenciesDeleteFalse
gypfile
serverjs
scriptpath
authors
readme
mans
binDir
gitHead
fillTypes
normalizeData
binRefs
The harmless-sounding step “normalizeData” leads us even further, to the package normalize-package-data and its function normalize. Instead of merely normalizing the data, it applies a list of fixes:
name
version
description
repository
modules
scripts
files
bin
man
bugs
keywords
readme
homepage
license
dependencies
people
typos
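To give a feeling for what such fixes look like, here is a minimal Python sketch of three of them. It is an illustration only, not the code from normalize-package-data: the function name normalize_sketch and the example package are made up, and the three behaviours (expanding a bare repository string, expanding a bare bin string, folding bundledDependencies into bundleDependencies) are simplified from how I understand the real fixes.

```python
import copy


def normalize_sketch(package_json: dict) -> dict:
    """Apply three simplified, illustrative fixes and return a new dict."""
    data = copy.deepcopy(package_json)

    # A bare repository string becomes an object with an assumed type of "git".
    if isinstance(data.get("repository"), str):
        data["repository"] = {"type": "git", "url": data["repository"]}

    # A bare bin string becomes a mapping from the package name to the script.
    # (Scoped package names would need extra handling; omitted here.)
    if isinstance(data.get("bin"), str) and "name" in data:
        data["bin"] = {data["name"]: data["bin"]}

    # The alternative spelling bundledDependencies is folded into bundleDependencies.
    if "bundledDependencies" in data and "bundleDependencies" not in data:
        data["bundleDependencies"] = data.pop("bundledDependencies")

    return data


# Hypothetical package.json content, used only for this demonstration.
original = {
    "name": "example-package",
    "version": "1.0.0",
    "repository": "https://github.com/example/example-package",
    "bin": "./cli.js",
    "bundledDependencies": ["left-pad"],
}
normalized = normalize_sketch(original)

# Top-level keys whose value differs between package.json and the "manifest".
changed = sorted(
    key
    for key in set(original) | set(normalized)
    if original.get(key) != normalized.get(key)
)
print(changed)
```

Even with just three toy fixes, four top-level keys already differ between the input and the result, which hints at why an exact comparison of manifest and package.json almost never succeeds.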
And while each and every one of these changes has a story as to why it was deemed necessary, it raises the question: Which source should we trust?
- The optimized manifest, where a lot of thought has gone into putting package information into a consistent, readable form?
- Or the package.json, which has an integrity check and is ideally even signed by the author of the package¹?
Sadly the answer is: Neither of them.
It gets worse
While we can argue about the pros and cons of the manifest or package.json, the authority on this question is npm itself. What does npm use to install a package?
Again, the answer is not satisfactory: Both.
When installing a package with npm, the code responsible for loading the manifest lives in the package pacote, which implements a separate “fetcher” for each kind of package source.
When loading a package from a directory, file, generic remote location or git repository, the package.json is the relevant source of information. However, when loading a package from a registry, the manifest is used, unless, of course, the package is already cached.
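The rule described above can be written down as a small decision function. This is a rough Python sketch, not how pacote works internally: the function name metadata_source and the crude prefix and suffix checks are my own assumptions (npm parses install specs far more carefully), and the cache case is ignored entirely.

```python
def metadata_source(spec: str) -> str:
    """Return which metadata source decides the dependencies for an install spec.

    The spec classification is deliberately crude and only for illustration.
    """
    if spec.startswith(("git+", "git://", "github:")):
        return "package.json"  # git repository
    if spec.startswith(("http://", "https://")):
        return "package.json"  # generic remote location, e.g. a tarball URL
    if spec.startswith(("./", "../", "/", "file:")) or spec.endswith((".tgz", ".tar.gz")):
        return "package.json"  # local directory or local tarball file
    return "manifest"  # everything else is treated as a registry package


for spec in (
    "trace-employed-spider-sensitize",
    "https://registry.npmjs.org/trace-employed-spider-sensitize/-/trace-employed-spider-sensitize-1.0.1.tgz",
    "./my-local-package",
):
    print(f"{spec} -> {metadata_source(spec)}")
```

The point of the sketch is that the same package can end up in either branch, depending only on how the user happens to refer to it.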
To illustrate the problem, I have created a package that has different dependencies in manifest and package.json.
If we install it from the registry, it will install lodash (as defined in the manifest):
root@5db52d3e57e7:~/registry# npm install trace-employed-spider-sensitize
added 2 packages in 522ms
root@5db52d3e57e7:~/registry# ls node_modules/
lodash trace-employed-spider-sensitize
But if we install the tarball directly, it will install jquery (as defined in the package.json):
root@5db52d3e57e7:~/remote# npm install https://registry.npmjs.org/trace-employed-spider-sensitize/-/trace-employed-spider-sensitize-1.0.1.tgz
added 2 packages in 549ms
root@5db52d3e57e7:~/remote# ls node_modules/
jquery trace-employed-spider-sensitize
This not only means that there is no way to trust either the manifest or package.json alone, but also that package integrity checks become almost meaningless: The manifest contains a signature¹ of the values for name, version and dist.integrity, where dist.integrity is a checksum of the tarball. Since dependencies is not included in the signature, the signature cannot protect against the inclusion of arbitrary dependencies.
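A few lines of Python make the gap concrete. The sketch below only builds the string that gets signed; it does not verify any real signature. The layout name@version:integrity is how I understand the signed message to be assembled, so treat it as an assumption, and the integrity value is a placeholder rather than a real checksum.

```python
def signed_payload(manifest: dict) -> str:
    """Build the string that the signature covers (layout is an assumption)."""
    # Only name, version and dist.integrity enter the payload.
    return f'{manifest["name"]}@{manifest["version"]}:{manifest["dist"]["integrity"]}'


honest_manifest = {
    "name": "trace-employed-spider-sensitize",
    "version": "1.0.1",
    "dist": {"integrity": "sha512-PLACEHOLDER"},  # placeholder, not a real hash
    "dependencies": {"jquery": "*"},
}

# Same package, but the manifest advertises a different dependency.
tampered_manifest = dict(honest_manifest, dependencies={"lodash": "*"})

same_payload = signed_payload(honest_manifest) == signed_payload(tampered_manifest)
print(same_payload)
```

Because both manifests produce the same payload, a signature that is valid for one is valid for the other, regardless of what the dependencies field says.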
How big is the problem?
Now that we’ve established that npm has a huge conceptual problem with serious security implications, let’s look at how frequently we encounter this problem in the wild. Think of this as a glimpse into what research still needs to be done. For this I took a look at the most popular 5000 repositories on www.npmjs.com.
Since the whole package normalization creates a lot of noise, I hacked together a Python script that tries to filter out most of these changes. It still produces some noise, but little enough to be able to see the picture:
The figure shows keys in the package.json where a deviation from the manifest was found that could not be explained by normalization. The keys are explained in the package.json documentation.
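The basic idea of such a comparison can be sketched in a few lines of Python. This is not the script I used, just a minimal illustration under assumptions: the helper deviating_keys, the pluggable normalize callback, the choice to skip dist and every key starting with an underscore (fields I assume are added during publishing), and the two sample packages are all hypothetical.

```python
from collections import Counter


def deviating_keys(manifest, package_json, normalize=lambda data: data, ignore=frozenset()):
    """Return the top-level keys that differ after applying a normalization."""
    expected = normalize(package_json)
    keys = {
        key
        for key in set(manifest) | set(expected)
        # Skip explicitly ignored keys and underscore-prefixed ones, which are
        # assumed here to be bookkeeping added at publish time.
        if key not in ignore and not key.startswith("_")
    }
    return sorted(key for key in keys if manifest.get(key) != expected.get(key))


# Hypothetical (manifest, package.json) pairs for two made-up packages.
pairs = [
    (
        {"name": "a", "version": "1.0.0", "dependencies": {"lodash": "^4"},
         "dist": {"integrity": "sha512-PLACEHOLDER"}, "_id": "a@1.0.0"},
        {"name": "a", "version": "1.0.0", "dependencies": {"jquery": "^3"}},
    ),
    (
        {"name": "b", "version": "2.0.0", "scripts": {"test": "jest"}, "gitHead": "abc123"},
        {"name": "b", "version": "2.0.0", "scripts": {"test": "jest"}},
    ),
]

# Tally how often each key deviates across all packages.
counts = Counter()
for manifest, package_json in pairs:
    counts.update(deviating_keys(manifest, package_json, ignore={"dist"}))

print(dict(counts))
```

The quality of the result depends entirely on how faithfully the normalize callback reproduces what npm does at publish time; every fix it misses shows up as noise in the tally.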
The deviations for scripts are most likely due to some normalization that I haven’t implemented yet. I am not quite sure where the deviations with maintainers are coming from, but it could be from another toolchain or the registry itself.
gitHead has the potential to indicate an inconsistency with the advertised version, but the instances I checked by hand seemed to result from outdated values in the package.json. However, this may be worth looking into again.
What struck me were the devDependencies (and the dependencies, before I filtered out the optionalDependencies that somehow end up in there). As far as I have seen, the devDependencies deviations result from dependencies that are installed not from the registry but from a git repository. Somewhere the version format is changed, so that e.g. git+https://github.com/parse-community/parse-server#alpha could become git+https://github.com/parse-community/parse-server.git#alpha.
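A comparison that should tolerate this particular rewrite could bring both values into the same form first. The Python sketch below handles only the one pattern from the example, a missing .git suffix before the #ref part; the helper names are made up, and other rewrites of git specifiers may well exist that it does not cover.

```python
def canonical_git_spec(spec: str) -> str:
    """Append '.git' to the repository part of a git+ spec if it is missing."""
    url, separator, ref = spec.partition("#")
    if url.startswith("git+") and not url.endswith(".git"):
        url += ".git"
    return url + separator + ref


def same_dependency_spec(a: str, b: str) -> bool:
    """Compare two dependency specifiers, ignoring the '.git' suffix difference."""
    return canonical_git_spec(a) == canonical_git_spec(b)


in_package_json = "git+https://github.com/parse-community/parse-server#alpha"
in_manifest = "git+https://github.com/parse-community/parse-server.git#alpha"

print(same_dependency_spec(in_package_json, in_manifest))
```

Ordinary version ranges do not start with git+ and pass through unchanged, so the helper can be applied to every entry of a dependency map.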
Although I was able to explain (or at least justify) most of the inconsistencies I saw, the sheer number of them didn’t leave me with much hope that this is a problem that can somehow be contained.
What can we do to solve the problem?
This flaw in the architecture of the npm ecosystem needs to be fixed. This means, at the very least, rigorous validation for both the manifest and the package.json, and perhaps getting rid of some (or all?) of the normalization in the release process. Instead of applying dozens of fixes every time a package is released, it would be better to just fix it at the source, the package.json. And as long as the dependencies are taken from the manifest, they should be signed the same as the name, version and tarball.
Likewise, security researchers and the infosec industry need to address the problem in a way that keeps everyone secure. This means that package scanners also need to make sure that they do not trust either the manifest or the package.json alone, but always check both. There is also still a lot of work to be done to find instances in the npm registry where this issue is being exploited.
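As a starting point for such a check, here is a minimal Python sketch that reads the package.json out of a tarball and compares its dependency fields with a manifest. It rests on several assumptions: that the file sits at package/package.json inside the archive (common for npm tarballs, but not guaranteed), that comparing the three listed fields is enough, and that manifest and tarball have already been downloaded. All function names and the demo data are hypothetical.

```python
import io
import json
import tarfile

DEPENDENCY_FIELDS = ("dependencies", "optionalDependencies", "peerDependencies")


def package_json_from_tarball(tarball: bytes) -> dict:
    """Extract and parse package/package.json from a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as archive:
        member = archive.extractfile("package/package.json")  # assumed path
        return json.load(member)


def dependency_mismatches(manifest: dict, tarball: bytes) -> dict:
    """Return {field: (manifest value, package.json value)} for differing fields."""
    package_json = package_json_from_tarball(tarball)
    return {
        field: (manifest.get(field, {}), package_json.get(field, {}))
        for field in DEPENDENCY_FIELDS
        if manifest.get(field, {}) != package_json.get(field, {})
    }


def build_demo_tarball(package_json: dict) -> bytes:
    """Create an in-memory tarball so the sketch runs without network access."""
    payload = json.dumps(package_json).encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo("package/package.json")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Demo data: the manifest advertises lodash, the tarball asks for jquery.
demo_manifest = {"name": "demo", "version": "1.0.1", "dependencies": {"lodash": "^4.17.21"}}
demo_tarball = build_demo_tarball(
    {"name": "demo", "version": "1.0.1", "dependencies": {"jquery": "^3.7.1"}}
)

mismatches = dependency_mismatches(demo_manifest, demo_tarball)
print(mismatches)
```

A real scanner would additionally have to account for the normalization described earlier, otherwise it would flag nearly every package.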