Manifest confusion is a problem in the architecture of npm, pointed out by Darcy Clarke: An npm package’s manifest is published independently of its tarball and is never fully validated against it.
When I read about manifest confusion, my first question was: How often do the manifest and
package.json fail to match? Frustratingly, the answer is: Almost all the time! And that’s when I went down the rabbit hole of package normalization.
npm’s package normalization
When a developer wants to release a package to the npm registry and types
npm publish into their console, a lot of things happen in the background.
In short, npm reads the package data from
package.json, collects all the necessary files into a tarball and uploads some metadata to the registry. This metadata is what I’ll call the manifest for the rest of this article. In theory, the manifest is just the data read from
package.json, but before the package is uploaded, a myriad of small and large changes are made to this data, the “normalization”.
I put the word “normalization” in quotation marks because the changes are not limited to bringing some values into a uniform format, but are sometimes quite significant. For those interested in the actual code, check out the
@npmcli/package-json package and start with the function
prepare. If you are just interested in the results, you can also skip the implementation details.
prepare applies the following list of manipulations to the
The harmless-sounding step “normalizeData” leads us even further, to the package
normalize-package-data and its function
normalize. Rather than merely normalizing the data, it applies a list of fixes:
And while each and every one of these changes has a story as to why it was deemed necessary, it raises the question: Which source should we trust?
- The optimized manifest, where a lot of thought has gone into putting package information into a consistent, readable form?
- Or the
package.json, which has an integrity check and is ideally even signed by the author of the package¹?
Sadly the answer is: Neither of them.
It gets worse
While we can argue about the pros and cons of the manifest or
package.json, the authority on this question is npm itself. What does npm use to install a package?
Again, the answer is not satisfactory: Both.
When installing a package with npm, the code responsible for loading the manifest is in the package
pacote, which implements “fetchers” for different locations:
When loading a package from a directory, file, generic remote location or git repository, the
package.json is the relevant source of information. However, when loading a package from a registry, the manifest will be used – unless, of course, the package is already cached.
To illustrate the problem, I have created a package that has different dependencies in manifest and
If we install it from the registry, it will install
lodash (as defined in the manifest):
root@5db52d3e57e7:~/registry# npm install trace-employed-spider-sensitize
added 2 packages in 522ms
root@5db52d3e57e7:~/registry# ls node_modules/
But if we install the tarball directly, it will install
jquery (as defined in the
root@5db52d3e57e7:~/remote# npm install https://registry.npmjs.org/trace-employed-spider-sensitize/-/trace-employed-spider-sensitize-1.0.1.tgz
added 2 packages in 549ms
root@5db52d3e57e7:~/remote# ls node_modules/
This not only means that there is no way to trust either the manifest or
package.json alone, but also that package integrity checks become almost meaningless: The manifest contains a signature¹ of the values for
dist.integrity is a checksum of the tarball. Since
dependencies is not included in the signature, the signature cannot protect against the inclusion of arbitrary dependencies.
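The gap can be made concrete with a small sketch. It assumes that the signed string has the form name@version:integrity, which is my reading of how npm describes its registry signatures and should be treated as an assumption here. The package name, versions and tarball bytes are made up for illustration.

```python
import base64
import hashlib


def sri_sha512(tarball: bytes) -> str:
    """Return an SRI-style sha512 string for the given tarball bytes."""
    digest = hashlib.sha512(tarball).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def signed_payload(manifest: dict) -> str:
    """Build the string that is ASSUMED to be covered by the signature."""
    dist = manifest["dist"]
    return f"{manifest['name']}@{manifest['version']}:{dist['integrity']}"


# Stand-in bytes, not a real package tarball.
tarball = b"placeholder tarball bytes"

honest = {
    "name": "example-package",  # hypothetical package name
    "version": "1.0.0",
    "dependencies": {"jquery": "^3.7.0"},
    "dist": {"integrity": sri_sha512(tarball)},
}

# Same name, version and tarball, but a different dependency list.
tampered = dict(honest, dependencies={"lodash": "^4.17.21"})

# The dependencies differ, yet the signed string is identical.
print(signed_payload(honest) == signed_payload(tampered))
```

Under that assumption, a signature that verifies for the first manifest verifies for the second one as well, because nothing in the signed string depends on the dependency list.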
How big is the problem?
Now that we’ve established that npm has a huge conceptual problem with serious security implications, let’s look at how frequently we encounter this problem in the wild. Think of this as a glimpse into what research still needs to be done. For this I took a look at the 5000 most popular repositories on www.npmjs.com.
Since the whole package normalization creates a lot of noise, I hacked together a Python script that tries to filter out most of these changes. It still produces some noise, but little enough to be able to see the picture:
The figure shows keys in the
package.json where a deviation from the manifest was found that could not be explained by normalization. The keys are explained in the
The deviations for
scripts are most likely due to some normalization that I haven’t implemented yet. I am not quite sure where the deviations in
maintainers come from, but they could originate from another toolchain or the registry itself.
gitHead has the potential to indicate an inconsistency with the advertised version, but the instances I checked by hand seemed to result from outdated values in the package.json. However, this may be worth looking into again.
What struck me were the
devDependencies (and the
dependencies, before I filtered out the
optionalDependencies that somehow end up in there). As far as I have seen, the
devDependencies deviations stem from dependencies that are installed not from the registry but from a git repository. Somewhere along the way the version format is changed, and e.g.
git+https://github.com/parse-community/parse-server#alpha could become
Although I was able to explain (or at least justify) most of the inconsistencies I saw, the sheer number of them didn’t leave me with much hope that this is a problem that can somehow be contained.
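To give an idea of what filtering out normalization means in practice, here is a minimal sketch. It is not the script used for the figure. It models a single normalization, the merging of optionalDependencies into dependencies, and skips keys that are assumed to be added at publish time (everything starting with an underscore, plus dist). The function name, the skip rule and the sample data are all hypothetical.

```python
def unexplained_deviations(package_json: dict, manifest: dict) -> dict:
    """Return {key: (package_json_value, manifest_value)} for differing keys."""
    expected = dict(package_json)

    # Modelled normalization: optionalDependencies are copied into
    # dependencies. Which side wins on a name clash is an assumption here.
    optional = package_json.get("optionalDependencies", {})
    if optional:
        merged = dict(package_json.get("dependencies", {}))
        merged.update(optional)
        expected["dependencies"] = merged

    deviations = {}
    for key in sorted(set(expected) | set(manifest)):
        # Assumption: these keys never come from package.json.
        if key.startswith("_") or key == "dist":
            continue
        if expected.get(key) != manifest.get(key):
            deviations[key] = (package_json.get(key), manifest.get(key))
    return deviations


# Made-up sample data; the git specifiers are purely illustrative.
package_json = {
    "name": "demo",
    "version": "1.0.0",
    "dependencies": {"a": "^1.0.0"},
    "optionalDependencies": {"b": "^2.0.0"},
    "devDependencies": {"c": "git+https://example.invalid/c.git#main"},
}
manifest = {
    "name": "demo",
    "version": "1.0.0",
    "dependencies": {"a": "^1.0.0", "b": "^2.0.0"},
    "optionalDependencies": {"b": "^2.0.0"},
    "devDependencies": {"c": "example/c#main"},
    "_id": "demo@1.0.0",
    "dist": {"integrity": "sha512-..."},
}

print(sorted(unexplained_deviations(package_json, manifest)))
```

With this sample data, the difference in dependencies is explained by the merged optional dependency and only devDependencies is reported. Every further normalization step would need its own rule of this kind, which is where the remaining noise comes from.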
What can we do to solve the problem?
This flaw in the architecture of the npm ecosystem needs to be fixed. This means, at the very least, rigorous validation for both the manifest and the
package.json, and perhaps getting rid of some (or all?) of the normalization in the release process. Instead of applying dozens of fixes every time a package is released, it would be better to fix the data once at the source, the
package.json. And as long as the dependencies are taken from the manifest, they should be signed just like the name, version and tarball.
Likewise, security researchers and the infosec industry need to address the problem in a way that keeps everyone secure. This means that package scanners also need to make sure that they do not trust either the manifest or the
package.json alone, but always check both. There is also still a lot of work to be done to find instances in the npm registry where this issue is being exploited.
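What checking both could look like for a scanner is sketched below, under a few assumptions: the tarball is a gzip-compressed tar archive with the package.json at package/package.json (the layout I would expect from npm-packed tarballs), and the list of compared keys is my own pick rather than anything npm defines. The comparison is raw, so it will also flag differences that are merely due to normalization. The helper that builds a tarball in memory exists only to keep the example self-contained.

```python
import io
import json
import tarfile

# Hypothetical selection of keys that influence what gets installed or run.
INSTALL_RELEVANT_KEYS = (
    "dependencies",
    "optionalDependencies",
    "peerDependencies",
    "scripts",
)


def package_json_from_tarball(tarball: bytes) -> dict:
    """Read package/package.json out of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as archive:
        member = archive.extractfile("package/package.json")
        if member is None:
            raise ValueError("package/package.json is not a regular file")
        return json.load(member)


def mismatched_keys(manifest: dict, tarball: bytes) -> list:
    """List the install-relevant keys whose values differ between sources."""
    package_json = package_json_from_tarball(tarball)
    return [
        key
        for key in INSTALL_RELEVANT_KEYS
        if manifest.get(key, {}) != package_json.get(key, {})
    ]


def build_demo_tarball(package_json: dict) -> bytes:
    """Create an in-memory tarball for the demo (not a real npm package)."""
    payload = json.dumps(package_json).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo("package/package.json")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Made-up data mirroring the lodash/jquery demonstration.
demo_tarball = build_demo_tarball(
    {"name": "demo", "version": "1.0.1", "dependencies": {"jquery": "*"}}
)
demo_manifest = {
    "name": "demo",
    "version": "1.0.1",
    "dependencies": {"lodash": "*"},
}

print(mismatched_keys(demo_manifest, demo_tarball))
```

A real scanner would fetch the manifest and the tarball from the registry, verify the tarball against dist.integrity first, and then decide for each reported key whether a known normalization explains the difference.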