Jay Kim

How It Works: PNPM

PNPM is a lesser-known sibling to NPM and Yarn. It is a JavaScript package manager that is faster and more space efficient than NPM and Yarn. It emphasizes correctness: because of the ingenious way it structures the node_modules directory, your code can only resolve its direct dependencies. PNPM also comes with really useful monorepo tooling.

PNPM is a good example of how popularity does not always mean better. It is a highly underrated package manager that I believe should be the default for all JavaScript projects. I hope to demystify PNPM's install algorithm in this blog post to help you understand why.

How Does a Dependency Tree Map to node_modules?

When you read the literature on package managers, you often hear dependencies described as a tree where the nodes are packages. This makes sense at first, but in Node.js dependencies are installed in the file system under the node_modules directory, and Node.js doesn't use a manifest to map import paths to these directories. So how does Node.js know which package to load, and how does the node_modules directory map to a tree?

To understand this, you need to understand the Node.js module resolution algorithm. It is the algorithm used to resolve file paths when you try to require or import a file from JavaScript.

require("my-package");

The actual algorithm is pretty complicated but the general idea is this:

  1. Look for node_modules/ directory in the current directory.
    • If it exists, look for node_modules/my-package/
      • If it exists, we've found the package.
      • Otherwise, go to step 2.
    • Otherwise, go to step 2.
  2. Go to the parent directory and start again at 1.
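
The steps above can be sketched in a few lines of JavaScript. This is only a simplified model of the lookup, not the real Node.js implementation (which also handles things like package.json entry points and file extensions), and findPackageDir is a name I made up for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Simplified lookup: walk up from startDir until node_modules/<name> exists.
// Returns the package directory, or null if we reach the file system root.
function findPackageDir(startDir, name) {
  let dir = path.resolve(startDir);
  while (true) {
    const candidate = path.join(dir, "node_modules", name);
    if (fs.existsSync(candidate)) return candidate; // step 1: found the package
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the root without finding it
    dir = parent; // step 2: go to the parent directory and try again
  }
}

// Prints the directory of my-package if it is reachable from the current
// working directory, otherwise null.
console.log(findPackageDir(process.cwd(), "my-package"));
```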

A node's children can be defined to be all of the packages that are resolvable from the node following the resolution algorithm above. Children is just another way of saying dependencies.

Another way to think about this is that a directed edge from node A to B is when node B can be resolved from A following the module resolution algorithm. We can simply try to require package B from package A to determine this or we can use the require.resolve function to see what path the algorithm resolves a package to.

require.resolve("B");
// => "/Users/jkim/code/A/node_modules/B/index.js"

require.resolve("C");
// => Uncaught Error: Cannot find module 'C'
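
If you want to check edges from a script instead of the REPL, here is a small sketch that wraps the same idea. It uses createRequire from the built-in module package to get a require function anchored at a given package directory. The helper name canResolve is hypothetical.

```javascript
const { createRequire } = require("node:module");
const path = require("node:path");

// Returns true if `name` can be resolved from the package located at pkgDir,
// i.e. if there is an edge from that package to `name`.
function canResolve(pkgDir, name) {
  // createRequire builds a require function that resolves relative to the
  // given file path. The file itself does not have to exist.
  const requireFromPkg = createRequire(
    path.join(path.resolve(pkgDir), "package.json")
  );
  try {
    requireFromPkg.resolve(name);
    return true;
  } catch (err) {
    if (err.code === "MODULE_NOT_FOUND") return false;
    throw err; // anything else is unexpected, so surface it
  }
}

console.log(canResolve(process.cwd(), "B"));
```

Note that built-in module names such as fs also resolve, so this check only makes sense for package names.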

The Original NPM Algorithm

Before we can understand why PNPM is better, we should understand how packages get installed in NPM and Yarn. I will assume Yarn V1 when I say Yarn. Yarn V2+ follows a different install algorithm called Plug'n'Play that I won't cover in detail since most projects I've worked with use Yarn V1.

Suppose we created a package called my-package which has the following dependencies declared in its package.json:

{
  "name": "my-package",
  "dependencies": {
    "react": "^18.2.0"
  }
}

If you tried to npm install this using an ancient version of npm, you would get a dependency tree like this:

 ┌──────────┐
 │my-package│
 └──────────┘
       │
       ▼
    ┌─────┐
    │react│
    └─────┘
       │
       ▼
┌────────────┐
│loose-envify│
└────────────┘
       │
       ▼
  ┌─────────┐
  │js-tokens│
  └─────────┘

No surprises here: this pretty much replicates the dependencies declared in each of these packages' package.json files. my-package only has access to its single direct dependency react because of how the node_modules directory is structured:

$ ls node_modules/
react

$ ls node_modules/react/node_modules/
loose-envify

$ ls node_modules/react/node_modules/loose-envify/node_modules/
js-tokens

For example, if I tried to require loose-envify from my-package, I couldn't do it:

require("loose-envify");
// => Uncaught Error: Cannot find module 'loose-envify'
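
To make the nested layout concrete, here is a toy model that computes the directories this strategy would create. The graph object and the nestedLayout function are made up for illustration, and the sketch assumes the graph has no cycles.

```javascript
// Hypothetical in-memory dependency graph: package name -> direct dependencies.
const graph = {
  "my-package": ["react"],
  react: ["loose-envify"],
  "loose-envify": ["js-tokens"],
  "js-tokens": [],
};

// Returns every directory the nested strategy would create below `name`.
function nestedLayout(graph, name, prefix = "node_modules") {
  const dirs = [];
  for (const dep of graph[name]) {
    const dir = `${prefix}/${dep}`;
    dirs.push(dir);
    // Each dependency gets its own private node_modules directory.
    dirs.push(...nestedLayout(graph, dep, `${dir}/node_modules`));
  }
  return dirs;
}

console.log(nestedLayout(graph, "my-package"));
```

Every package shows up once per dependent, which is exactly why the number of copies grows so quickly on real projects.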

The issue with this approach starts to become apparent as you begin to add more dependencies to your project and the dependencies you add bring even more dependencies and so on. Many packages in the NPM ecosystem share the same dependencies, so you would end up downloading and copying the same package over and over again. The packages could be cached in a global store so that they don't need to be downloaded again, but the file copies needed to construct the node_modules directory are expensive and contribute to the slowness of the original algorithm. This is what led NPM to switch to an install strategy called hoisting, which is the install strategy Yarn uses as well.

Package Hoisting

Hoisting is an installation strategy used to optimize away duplicate packages to reduce the number of packages you need to download and copy. It is called hoisting because the optimization is done by moving packages up to the top-most node_modules directory, so that a single copy of a package can be resolved by multiple packages instead of just one.

If there are multiple versions of the same package that are being installed, then one of the versions cannot be hoisted to the same node_modules directory so the package manager would have to create another nested node_modules directory like in the original algorithm.

Here is what roughly happens when you run npm install or yarn install on a package (say package A):

  1. Do a DFS on the dependency tree of package A. The dependency tree is recursively constructed from package.json files.

  2. For every package B we encounter in this tree:

    1. Walk up the directory tree through each ancestor's node_modules directory for as long as package B does not already exist there. Go no further than package A's node_modules directory. If you encounter an existing package B, it has already been installed, so continue with the DFS.
    2. Add package B to the node_modules directory found in 1.
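
The steps above can be sketched as a toy in-memory model. This is only my simplified reading of the algorithm, not NPM's or Yarn's actual code: the graph object, the hoist function, and the name@version ids are all made up for illustration. One liberty I take is placing a package's direct dependencies before descending into them, so that a direct dependency always ends up somewhere its dependent can resolve it.

```javascript
// Hypothetical resolved graph: "name@version" -> direct dependencies.
const graph = {
  "my-package@1.0.0": ["react@18.2.0"],
  "react@18.2.0": ["loose-envify@1.4.0"],
  "loose-envify@1.4.0": ["js-tokens@4.0.0"],
  "js-tokens@4.0.0": [],
};

const nameOf = (id) => id.slice(0, id.lastIndexOf("@"));
// [] -> "node_modules", ["react"] -> "node_modules/react/node_modules"
const dirOf = (chain) =>
  chain.map((name) => `node_modules/${name}/`).join("") + "node_modules";

function hoist(graph, rootId) {
  const layout = new Map(); // install path -> "name@version"

  // `chain` lists the nested package names leading to the node_modules
  // directory that belongs to the package `id`.
  function install(id, chain) {
    const placed = [];
    for (const depId of graph[id]) {
      const name = nameOf(depId);
      let target = null;
      let alreadyInstalled = false;
      // Walk up from the dependent's own node_modules towards the root.
      for (let depth = chain.length; depth >= 0; depth--) {
        const existing = layout.get(`${dirOf(chain.slice(0, depth))}/${name}`);
        if (existing === depId) {
          alreadyInstalled = true; // same version is already resolvable
          break;
        }
        if (existing !== undefined) break; // another version blocks hoisting
        target = chain.slice(0, depth); // free slot, keep climbing
      }
      if (alreadyInstalled) continue;
      layout.set(`${dirOf(target)}/${name}`, depId);
      placed.push([depId, [...target, name]]);
    }
    // Only descend into packages that were newly installed.
    for (const [depId, depChain] of placed) install(depId, depChain);
  }

  install(rootId, []);
  return layout;
}

console.log([...hoist(graph, "my-package@1.0.0").keys()]);
```

If my-package also declared a different version of js-tokens, the second version could not share the top-level slot and would stay nested under loose-envify, just like in the original algorithm.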

So back to our example: when you run npm install or yarn install, you would get a dependency tree like this:

         ┌──────────┐
         │my-package│
         └──────────┘
               │
   ┌───────────┼────────────┐
   ▼           ▼            ▼
┌─────┐ ┌────────────┐ ┌─────────┐
│react│ │loose-envify│ │js-tokens│
└─────┘ └────────────┘ └─────────┘

loose-envify and js-tokens have been hoisted to the top-level node_modules directory even though they have not been declared as dependencies of my-package. This dependency tree is not what we and the library authors have declared in the package.json files.

As a consequence of this, my-package can resolve loose-envify or js-tokens even though they aren't direct dependencies. In general a package can import the indirect dependencies of its directs. Taking on implicit dependencies is dangerous since this coupling could fail at any moment if say react gets upgraded and breaking changes get introduced to any of the indirects.

The issue becomes even more apparent in monorepos. If you use Yarn workspaces, Yarn will use the hoisting algorithm to hoist all of the dependencies in the monorepo to a single node_modules directory at the root of the monorepo. This means that your monorepo package can potentially take on an implicit dependency on another monorepo package's dependencies or even on another monorepo package itself. With more and more of these dependencies, your monorepo can become very brittle, as any change to a package's package.json file can break many other packages.

You might think I am overreacting since you can just promise yourself to never import indirect dependencies like loose-envify and js-tokens but with editor autocomplete and autoimport often being based off of the contents of the node_modules directory, it becomes too easy to accidentally take on an implicit dependency. If you work in large teams, this becomes hard to control.

Another issue with the hoisting algorithm is performance. When adding or changing a package, there isn't an easy way to incrementally add it to the node_modules directory, since the hoisting algorithm is only deterministic if you run it from the beginning. Any attempt to install the package without rebuilding your node_modules directory may result in a different node_modules structure, which is bad for determinism and could result in hard-to-debug issues that are only reproducible by certain people who happened to run things in a certain order.

PNPM Virtual Store

PNPM does away with hoisting by symlinking all of the direct dependencies into your node_modules directory and it hides away the indirects in what PNPM calls a virtual store. If I run PNPM on my-package, my node_modules looks like this:

$ ls -l node_modules/
total 0
lrwxr-xr-x  1 jkim  staff  37 Mar  6 21:05 react -> .pnpm/react@18.2.0/node_modules/react

If you look inside the react package, you will notice that it doesn't have its own node_modules directory. So you might be wondering: how does react resolve its direct dependency loose-envify if it is nowhere to be found in the node_modules directory or any of its parent node_modules directories? The trick here is that Node.js resolves a symlink to its real path before it starts walking up the directory tree (unless you run it with --preserve-symlinks). So you might think the Node.js module resolution algorithm will start searching from:

/Users/jkim/code/my-package/node_modules

But it actually begins from inside .pnpm like:

.pnpm/react@18.2.0/node_modules/

.pnpm is a hidden directory that PNPM creates in the node_modules directory. This is called the virtual store which is a set of dependency trees that PNPM constructs in the file system by looking at package.json files. PNPM exposes nodes of these trees as symlinks to your node_modules directory, making sure to only expose the direct dependencies which in our case is just react.
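
You can convince yourself of the symlink behavior with a small experiment. The script below builds a throwaway directory that loosely mimics the layout above and then requires react through the symlink. The store directory name and the stub file contents are made up, and it assumes a platform where creating symlinks is allowed and Node.js is running with its default symlink handling.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { createRequire } = require("node:module");

// realpathSync so that our expected paths are already free of symlinks.
const tmp = fs.realpathSync(fs.mkdtempSync(path.join(os.tmpdir(), "symlink-demo-")));
const storeModules = path.join(tmp, "store", "react@18.2.0", "node_modules");
const storeReact = path.join(storeModules, "react");
const storeEnvify = path.join(storeModules, "loose-envify");
const appModules = path.join(tmp, "app", "node_modules");

fs.mkdirSync(storeReact, { recursive: true });
fs.mkdirSync(storeEnvify, { recursive: true });
fs.mkdirSync(appModules, { recursive: true });

// Stub packages: react requires loose-envify and reports where it was loaded from.
fs.writeFileSync(path.join(storeEnvify, "index.js"), "module.exports = 'envify';\n");
fs.writeFileSync(
  path.join(storeReact, "index.js"),
  "module.exports = { dep: require('loose-envify'), loadedFrom: __dirname };\n"
);

// The app only sees a symlink to react, not loose-envify.
fs.symlinkSync(storeReact, path.join(appModules, "react"), "dir");

const requireFromApp = createRequire(path.join(tmp, "app", "index.js"));
const react = requireFromApp("react");

// react was loaded from its real location inside the store, which is why
// it could find loose-envify sitting next to it.
console.log(react.loadedFrom === storeReact);
console.log(react.dep);
```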

But how does PNPM efficiently build these dependency trees in the virtual store while avoiding hoisting? PNPM uses symlinks inside the virtual store as well. So even your indirect dependencies can only access their own direct dependencies.

You can see below that PNPM creates a node_modules directory in the virtual store that contains just react itself and symlinks to react's direct dependencies.

$ ls -l node_modules/.pnpm/react@18.2.0/node_modules
total 0
lrwxr-xr-x   1 jkim  staff   50 Mar  7 21:24 loose-envify -> ../../loose-envify@1.4.0/node_modules/loose-envify
drwxr-xr-x  12 jkim  staff  384 Mar  7 21:24 react

Similarly, from loose-envify you can only resolve its direct dependency, which is js-tokens:

$ ls -l node_modules/.pnpm/loose-envify\@1.4.0/node_modules/
total 0
lrwxr-xr-x   1 jkim  staff   44 Mar  7 21:24 js-tokens -> ../../js-tokens@4.0.0/node_modules/js-tokens
drwxr-xr-x  10 jkim  staff  320 Mar  7 21:24 loose-envify
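
The listings above follow a regular pattern, so here is a toy model that derives the same entries from a dependency graph. It is a sketch of the layout as I observed it, not PNPM's actual code. The graph object and the virtualStoreLayout function are hypothetical, and the sketch ignores scoped packages and peer dependencies.

```javascript
// Hypothetical resolved graph: "name@version" -> direct dependencies.
const graph = {
  "my-package@1.0.0": ["react@18.2.0"],
  "react@18.2.0": ["loose-envify@1.4.0"],
  "loose-envify@1.4.0": ["js-tokens@4.0.0"],
  "js-tokens@4.0.0": [],
};

const nameOf = (id) => id.slice(0, id.lastIndexOf("@"));

// Lists the directories and symlinks of a virtual-store style layout.
function virtualStoreLayout(graph, rootId) {
  const entries = [];
  for (const id of Object.keys(graph)) {
    if (id === rootId) continue;
    const modules = `node_modules/.pnpm/${id}/node_modules`;
    // The package's own files live in a real directory...
    entries.push({ path: `${modules}/${nameOf(id)}`, kind: "dir" });
    // ...and its direct dependencies sit next to it as symlinks.
    for (const dep of graph[id]) {
      entries.push({
        path: `${modules}/${nameOf(dep)}`,
        kind: "symlink",
        target: `../../${dep}/node_modules/${nameOf(dep)}`,
      });
    }
  }
  // Only the root's direct dependencies are exposed at the top level.
  for (const dep of graph[rootId]) {
    entries.push({
      path: `node_modules/${nameOf(dep)}`,
      kind: "symlink",
      target: `.pnpm/${dep}/node_modules/${nameOf(dep)}`,
    });
  }
  return entries;
}

for (const entry of virtualStoreLayout(graph, "my-package@1.0.0")) {
  console.log(entry.kind, entry.path, entry.target ?? "");
}
```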

Here is an illustration of how the virtual store constructs the dependency tree using symlinks.

                     ┌─────────────────────────────────────────────────────────┐
                     │ Virtual Store                                           │
┌──────────┐         │                                                         │
│my-package│         │     ┌─────┐                                             │
└──────────┘       ┌─┼─────│react│                                             │
      │            │ │     └─────┘                                             │
      ▼      ln -s │ │        │                                                │
   ┌ ─ ─ ┐         │ │        │                                                │
    react ■────────┘ │        ▼                                                │
   └ ─ ─ ┘           │ ┌ ─ ─ ─ ─ ─ ─   ln -s ┌────────────┐                    │
                     │  loose-envify│■───────│loose-envify│                    │
                     │ └ ─ ─ ─ ─ ─ ─         └────────────┘                    │
                     │                              │                          │
                     │                              │                          │
                     │                              ▼                          │
                     │                         ┌ ─ ─ ─ ─ ┐  ln -s ┌─────────┐  │
                     │                          js-tokens ■───────│js-tokens│  │
                     │                         └ ─ ─ ─ ─ ┘        └─────────┘  │
                     └─────────────────────────────────────────────────────────┘

The ■──── edge represents the direction of the links and the dashed border rectangles are symlinks.

Global Store

PNPM also uses hard links to avoid having to copy tons of files when it constructs the virtual store. It keeps a global content addressable cache of files in the file system called the global store. Files are then hard-linked from this cache to the virtual store so you never incur the cost of expensive file copies. This is what makes PNPM feel so much faster than NPM or Yarn.

You can see the content addressable cache in action by running ls -l:

$ ls -l node_modules/react/
total 56
-rw-r--r--  137 jkim  staff  1086 Jan 28 21:43 LICENSE
-rw-r--r--    6 jkim  staff  1162 Jan 28 21:44 README.md
drwxr-xr-x   12 jkim  staff   384 Mar  7 21:24 cjs
-rw-r--r--   14 jkim  staff   190 Jan 28 21:44 index.js
-rw-r--r--   14 jkim  staff   222 Jan 28 21:44 jsx-dev-runtime.js
-rw-r--r--   14 jkim  staff   214 Jan 28 21:44 jsx-runtime.js
drwxr-xr-x    3 jkim  staff    96 Mar  7 21:24 node_modules
-rw-r--r--    6 jkim  staff   999 Jan 28 21:44 package.json
-rw-r--r--   11 jkim  staff   218 Jan 28 21:44 react.shared-subset.js
drwxr-xr-x    5 jkim  staff   160 Mar  7 21:24 umd

The second column is the number of hard links that exist for that file. You can see that the LICENSE file has 137 hard links to it, which makes sense because the LICENSE file is probably the same in many other packages that share the same license. package.json has 6 links, probably because I have 5 other packages that have installed this exact same version of react. So you can see that PNPM can save you a lot of space by using hard links, and you avoid having to do a lot of slow file copies.
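
Here is a minimal sketch of the idea behind a content addressable store with hard links. It is not how PNPM lays out its store on disk: the linkFromStore helper, the flat store directory, and the choice of SHA-512 for file names are all my own assumptions for illustration.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Saves `content` in the store under its hash (unless it is already there)
// and hard-links it to destPath. Returns the path inside the store.
function linkFromStore(storeDir, content, destPath) {
  const hash = crypto.createHash("sha512").update(content).digest("hex");
  const storedPath = path.join(storeDir, hash);
  if (!fs.existsSync(storedPath)) fs.writeFileSync(storedPath, content);
  fs.mkdirSync(path.dirname(destPath), { recursive: true });
  fs.linkSync(storedPath, destPath); // hard link: no file contents are copied
  return storedPath;
}

const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "store-demo-"));
const storeDir = path.join(tmp, "store");
fs.mkdirSync(storeDir);

// Two packages ship an identical LICENSE file.
const license = "MIT License\n";
const stored = linkFromStore(storeDir, license, path.join(tmp, "pkg-a", "LICENSE"));
linkFromStore(storeDir, license, path.join(tmp, "pkg-b", "LICENSE"));

// The link count covers the store entry plus the two package copies.
console.log(fs.statSync(stored).nlink);
```

Because the file name is derived from the contents, identical files from different packages collapse into a single entry, which is the effect you see in the link counts above.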

Putting the global store together with the virtual store, we can see the complete illustration of how PNPM installs your packages using symlinks and hard links.

                     ┌─────────────────────────────────────────────────────────┐     ┌─────────────────┐
                     │ Virtual Store                                           │     │ Global store    │
┌──────────┐         │                                                         │     │                 │
│my-package│         │     ┌─────┐                                             │ ln  │     ┌─────┐     │
└──────────┘       ┌─┼─────│react│■────────────────────────────────────────────┼─────┼─────│react│     │
      │            │ │     └─────┘                                             │     │     └─────┘     │
      ▼      ln -s │ │        │                                                │     │                 │
   ┌ ─ ─ ┐         │ │        │                                                │     │                 │
    react ■────────┘ │        ▼                                                │     │                 │
   └ ─ ─ ┘           │ ┌ ─ ─ ─ ─ ─ ─   ln -s ┌────────────┐                    │ ln  │ ┌────────────┐  │
                     │  loose-envify│■───────│loose-envify│■───────────────────┼─────┼─│loose-envify│  │
                     │ └ ─ ─ ─ ─ ─ ─         └────────────┘                    │     │ └────────────┘  │
                     │                              │                          │     │                 │
                     │                              │                          │     │                 │
                     │                              ▼                          │     │                 │
                     │                         ┌ ─ ─ ─ ─ ┐  ln -s ┌─────────┐  │ ln  │   ┌─────────┐   │
                     │                          js-tokens ■───────│js-tokens│■─┼─────┼───│js-tokens│   │
                     │                         └ ─ ─ ─ ─ ┘        └─────────┘  │     │   └─────────┘   │
                     └─────────────────────────────────────────────────────────┘     └─────────────────┘

How about Yarn Plug'n'Play?

In Yarn V2, a new installation algorithm was introduced called Plug'n'Play which tries to fix many of the same issues we've called out here around performance and resolving indirect dependencies.

Plug'n'Play however takes a different approach to PNPM by forgoing the Node.js module resolution algorithm altogether and implementing its own. Instead of recreating the dependency tree in the file system, it creates lookup tables in a config file which basically links a package name to a location on disk.
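
To give a feel for the lookup table idea, here is a deliberately tiny sketch. It is not the format Yarn actually generates: the table contents, the /cache paths, and the resolveFromTable function are all invented for illustration, and keying the table by the requesting package is my own assumption about how such a table can keep indirect dependencies out of reach.

```javascript
// Hypothetical lookup table: requesting package -> (dependency name -> location).
const lookup = new Map([
  ["my-package", new Map([["react", "/cache/react-18.2.0"]])],
  ["react", new Map([["loose-envify", "/cache/loose-envify-1.4.0"]])],
  ["loose-envify", new Map([["js-tokens", "/cache/js-tokens-4.0.0"]])],
  ["js-tokens", new Map()],
]);

// Resolves `request` on behalf of `issuer` with a table lookup instead of
// walking node_modules directories.
function resolveFromTable(issuer, request) {
  const deps = lookup.get(issuer);
  if (!deps || !deps.has(request)) {
    throw new Error(`${issuer} does not declare a dependency on ${request}`);
  }
  return deps.get(request);
}

console.log(resolveFromTable("my-package", "react"));
```

In this model, asking for loose-envify on behalf of my-package throws, because the table has no such entry for it.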

Changing the module resolution algorithm however breaks a lot of tooling, such as your editor autocomplete and even tsc, as many of these tools assume the original module resolution algorithm. So the library authors need to write shims to make their code Plug'n'Play compatible, or you have to retrofit your tooling to make it Plug'n'Play compatible. Yarn keeps a compatibility table that lists the packages that natively support Plug'n'Play or that can be supported via a plugin. I haven't tried Yarn V2+ with Plug'n'Play in a serious project yet so I may be overstating the drawbacks.

Takeaways

  • Yarn V1 and NPM use a hoisting algorithm which can end up exposing your package to indirect dependencies.
  • Packages, especially in monorepos, can sometimes be accidentally coupled with indirect dependencies due to editor autocomplete and autoimport. This can make your repo brittle since it has these implicit dependencies that could break at any moment.
  • PNPM makes clever use of symlinks to create dependency trees inside the virtual store and then it symlinks the direct dependencies to your package's node_modules directory.
  • PNPM caches package downloads to a global content addressable store called the global store. PNPM then hard links these packages from the global store to the virtual store.

So hopefully I've convinced you that PNPM is an amazingly well thought out piece of software that deserves more attention in all your JavaScript projects. I haven't even covered the really useful monorepo tooling that PNPM offers.

PNPM has been adopted by a lot of serious JavaScript projects such as Next.js and Vue. Download trends suggest that PNPM is almost on par with Yarn in popularity. And in NPM v9.4.0, a new experimental install strategy called linked was released, which implements PNPM's install strategy. Who knows, this might even become the default for NPM.