Manifest confusion

Why npm cannot be trusted

Manifest confusion is a problem in the architecture of npm, pointed out by Darcy Clarke: An npm package’s manifest is published independently of its tarball and is never fully validated against it.

When I read about manifest confusion, my first question was: How often do the manifest and package.json actually differ? Frustratingly, the answer is: Almost all the time! And that’s when I went down the rabbit hole of package normalization.

npm’s package normalization

When a developer wants to release a package to the npm registry and types npm publish into their console, a lot of things happen in the background.

In short, npm reads the package data from package.json, collects all the necessary files into a tarball and uploads some metadata to the registry. This metadata is what I’ll call the manifest for the rest of this article. In theory, the manifest is just the data read from package.json, but before the package is uploaded, a myriad of small and large changes are made to this data, the “normalization”.

I put the word “normalization” in quotation marks because the changes are not limited to bringing some values into a uniform format, but are sometimes quite significant. For those interested in the actual code, check out the @npmcli/package-json package and start with the function prepare. If you are just interested in the results, you can also skip the implementation details.

Implementation details

prepare applies the following list of manipulations to the package.json data:

  • _attributes
  • bundledDependencies
  • bundleDependencies
  • bundleDependenciesDeleteFalse
  • gypfile
  • serverjs
  • scriptpath
  • authors
  • readme
  • mans
  • binDir
  • gitHead
  • fillTypes
  • normalizeData
  • binRefs

The harmless-sounding step “normalizeData” leads us even further, to the package normalize-package-data and its function normalize. Instead of merely normalizing the data, it applies a list of fixes:

  • name
  • version
  • description
  • repository
  • modules
  • scripts
  • files
  • bin
  • man
  • bugs
  • keywords
  • readme
  • homepage
  • license
  • dependencies
  • people
  • typos
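
To give a feel for what such a fix does, here is a minimal Python sketch loosely modeled on the repository fix. The function name is hypothetical, and the real normalize-package-data does more than this (as far as I can tell it also rewrites the URL for known hosts); the sketch only shows the conversion of a string value into an object.

```python
def normalize_repository(data: dict) -> dict:
    """Return a copy of the package data with a string `repository` expanded.

    Hypothetical helper: it mirrors only the string-to-object step and is
    not the actual implementation used by npm.
    """
    result = dict(data)
    repo = result.get("repository")
    if isinstance(repo, str):
        # A bare string is assumed to be a git URL.
        result["repository"] = {"type": "git", "url": repo}
    return result


package_json = {"name": "demo", "repository": "https://example.com/demo.git"}
manifest = normalize_repository(package_json)

# The manifest no longer matches package.json, although nothing was "wrong".
print(manifest["repository"])
```

Even this small change is enough to make a naive comparison of manifest and package.json report a difference.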

And while each and every one of these changes has a story as to why it was deemed necessary, it raises the question: Which source should we trust?

  • The optimized manifest, where a lot of thought has gone into putting package information into a consistent, readable form?
  • Or the package.json, which has an integrity check and is ideally even signed by the author of the package [1]?

Sadly the answer is: Neither of them.

It gets worse

While we can argue about the pros and cons of the manifest or package.json, the authority on this question is npm itself. What does npm use to install a package?

Again, the answer is not satisfactory: Both.

When installing a package with npm, the code responsible for loading the manifest is in the package pacote, which implements a separate “fetcher” for each kind of package source.

When loading a package from a directory, file, generic remote location or git repository, the package.json is the relevant source of information. However, when loading a package from a registry, the manifest will be used – unless, of course, the package is already cached.

To illustrate the problem, I have created a package that has different dependencies in manifest and package.json.

If we install it from the registry, it will install lodash (as defined in the manifest):

root@5db52d3e57e7:~/registry# npm install trace-employed-spider-sensitize

added 2 packages in 522ms
root@5db52d3e57e7:~/registry# ls node_modules/
lodash  trace-employed-spider-sensitize

But if we install the tarball directly, it will install jquery (as defined in the package.json):

root@5db52d3e57e7:~/remote# npm install https://registry.npmjs.org/trace-employed-spider-sensitize/-/trace-employed-spider-sensitize-1.0.1.tgz

added 2 packages in 549ms
root@5db52d3e57e7:~/remote# ls node_modules/
jquery  trace-employed-spider-sensitize

This means not only that there is no way to trust either the manifest or package.json alone, but also that package integrity checks become almost meaningless: The manifest contains a signature [1] of the values for name, version and dist.integrity, where dist.integrity is a checksum of the tarball. Since dependencies is not included in the signature, the signature cannot protect against the inclusion of arbitrary dependencies.
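
The gap can be made concrete with a short Python sketch. It assumes the signed message has the form name@version:integrity, which is how I understand npm's registry signatures, and that dist.integrity is an SRI-style sha512 string; the helper names, the placeholder tarball bytes and the "*" version ranges are made up for illustration. No real signature is created or verified here; the point is only what goes into the signed message.

```python
import base64
import hashlib


def sri_sha512(tarball: bytes) -> str:
    """Return an SRI-style sha512 string for the given tarball bytes."""
    digest = hashlib.sha512(tarball).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def signed_message(manifest: dict) -> str:
    """Build the message that is signed (assumed layout: name@version:integrity)."""
    dist = manifest["dist"]
    return f"{manifest['name']}@{manifest['version']}:{dist['integrity']}"


tarball = b"placeholder tarball bytes"  # stand-in for the real .tgz content
base = {
    "name": "trace-employed-spider-sensitize",
    "version": "1.0.1",
    "dist": {"integrity": sri_sha512(tarball)},
}
manifest_a = {**base, "dependencies": {"jquery": "*"}}
manifest_b = {**base, "dependencies": {"lodash": "*"}}

# Different dependencies, identical signed message: one signature fits both.
assert signed_message(manifest_a) == signed_message(manifest_b)
```

Whatever is not part of that message can be changed in the manifest without invalidating the signature.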

How big is the problem?

Now that we’ve established that npm has a huge conceptual problem with serious security implications, let’s look at how frequently we encounter this problem in the wild. Think of this as a glimpse into what research still needs to be done. For this I took a look at the 5,000 most popular packages on www.npmjs.com.

Since the whole package normalization creates a lot of noise, I hacked together a Python script that tries to filter out most of these changes. It still produces some noise, but little enough to be able to see the picture:

[Figure: Keys in the manifest where deviations have been found.]

The figure shows keys in the package.json where a deviation from the manifest was found that could not be explained by normalization. The keys are explained in the package.json documentation.
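
The core of such a comparison can be sketched in a few lines of Python. This is not the script used for the figure, only a minimal illustration: the function name and the choice of ignored keys (dist and everything starting with an underscore, assumed here to be added on the registry side) are assumptions, and no normalization is undone.

```python
_MISSING = object()  # marks a key that is absent on one side


def deviating_keys(manifest: dict, package_json: dict) -> list[str]:
    """List top-level keys whose values differ between the two documents."""

    def ignored(key: str) -> bool:
        # Assumption: these keys only ever exist in the manifest.
        return key.startswith("_") or key == "dist"

    keys = {k for k in manifest.keys() | package_json.keys() if not ignored(k)}
    return sorted(
        k for k in keys
        if manifest.get(k, _MISSING) != package_json.get(k, _MISSING)
    )


manifest = {
    "name": "demo",
    "version": "1.0.1",
    "dependencies": {"lodash": "*"},
    "dist": {"integrity": "sha512-..."},
    "_id": "demo@1.0.1",
}
package_json = {"name": "demo", "version": "1.0.1", "dependencies": {"jquery": "*"}}

print(deviating_keys(manifest, package_json))  # ['dependencies']
```

Run against real packages, a comparison this naive reports almost every package, which is why the normalization has to be filtered out first.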

The deviations for scripts are most likely due to some normalization that I haven’t implemented yet. I am not quite sure where the deviations with maintainers are coming from, but it could be from another toolchain or the registry itself.

gitHead has the potential to indicate an inconsistency with the advertised version, but the instances I checked by hand seemed to result from outdated values in the package.json. However, this may be worth looking into again.

What struck me were the devDependencies (and the dependencies, until I filtered out the optionalDependencies that somehow end up in there). As far as I have seen, the deviations in devDependencies come from dependencies that are installed not from the registry but from a git repository. Somewhere along the way the version format is changed, so that, for example, git+https://github.com/parse-community/parse-server#alpha can become git+https://github.com/parse-community/parse-server.git#alpha.
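
A filter for this particular kind of noise could bring both values into the same form before comparing them. The following Python sketch is an assumption based solely on the example above (append .git to the repository part when it is missing); it is not taken from npm's code, and the function name is made up.

```python
def canonical_git_spec(spec: str) -> str:
    """Append a missing .git suffix to a git+ dependency spec.

    Anything after '#' (the branch, tag or commit) is left untouched.
    """
    base, sep, ref = spec.partition("#")
    if base.startswith("git+") and not base.endswith(".git"):
        base += ".git"
    return base + sep + ref


in_package_json = "git+https://github.com/parse-community/parse-server#alpha"
in_manifest = "git+https://github.com/parse-community/parse-server.git#alpha"

# After canonicalization the two values no longer count as a deviation.
assert canonical_git_spec(in_package_json) == canonical_git_spec(in_manifest)
```

Ordinary version ranges such as ^1.2.3 pass through unchanged, because they do not start with git+.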

Although I was able to explain (or at least justify) most of the inconsistencies I saw, the sheer number of them didn’t leave me with much hope that this is a problem that can somehow be contained.

What can we do to solve the problem?

This flaw in the architecture of the npm ecosystem needs to be fixed. At the very least, this means rigorous validation of both the manifest and the package.json, and perhaps getting rid of some (or all?) of the normalization in the release process. Instead of applying dozens of fixes every time a package is released, it would be better to fix the problem at the source, the package.json. And as long as the dependencies are taken from the manifest, they should be signed just like the name, version and tarball.

Likewise, security researchers and the infosec industry need to address the problem in a way that keeps everyone secure. This means that package scanners also need to make sure that they do not trust either the manifest or the package.json alone, but always check both. There is also still a lot of work to be done to find instances in the npm registry where this issue is being exploited.
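
As a starting point for such a check, here is a minimal Python sketch that reads the package.json out of a tarball and compares its dependencies with those of the manifest. It assumes the file sits at package/package.json inside the tarball, which is the usual layout but not guaranteed; all function names are hypothetical, and the demo builds its own tiny tarball instead of downloading anything.

```python
import io
import json
import tarfile


def package_json_from_tarball(data: bytes, path: str = "package/package.json") -> dict:
    """Read package.json out of a gzipped tarball held in memory."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        member = tar.extractfile(path)  # raises KeyError if the path is missing
        if member is None:
            raise ValueError(f"{path} is not a regular file")
        return json.load(member)


def dependency_mismatch(manifest: dict, tarball: bytes) -> bool:
    """True if manifest and packaged package.json disagree on dependencies."""
    packaged = package_json_from_tarball(tarball)
    return manifest.get("dependencies", {}) != packaged.get("dependencies", {})


def make_tarball(package_json: dict) -> bytes:
    """Build a tiny in-memory tarball, used only for the demo below."""
    payload = json.dumps(package_json).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo("package/package.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


tarball = make_tarball(
    {"name": "demo", "version": "1.0.1", "dependencies": {"jquery": "*"}}
)
manifest = {"name": "demo", "version": "1.0.1", "dependencies": {"lodash": "*"}}

print(dependency_mismatch(manifest, tarball))  # True
```

A real scanner would have to compare more than dependencies (scripts, for example) and account for the normalization described above before raising an alarm.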


  1. Did you know that npm uses registry signatures that do not allow you to verify the author of the package? But this is a topic for another day.

Konstantin Weddige

Managing director and co-founder

The most important job of IT security is to make risks understandable. My ambition is to live up to this challenge with Lutra Security.

July 7, 2023