Turborepo is now 81-91% faster to compute its task graph in our repositories, scaling with repo size. On our 1,000+ package monorepo, turbo run now feels instant. Time to First Task is now 11x faster.
After testing my changes with some open source Turborepos and asking Vercel customers to try canary releases on their repositories, I found the performance improvement could get as high as 96% depending on the size and complexity of the repository.
The process behind earning these performance gains is worth sharing, because it wasn't one optimization or one technique. It was eight days of mixing AI agents, Vercel Sandboxes, and typical, boring engineering practices.
## How Turborepo schedules your tasks
Every turbo run starts by analyzing your monorepo's structure, scripts, and dependencies to build a task graph. That graph determines execution order, creates parallelism, and powers caching so you never repeat the same work twice.
Building the task graph is overhead you pay before your repository's work begins. The larger the repo, the higher the cost. On our 1,000-package monorepo, that cost was around 10 seconds on an M4 Pro Max. I don't know about you, but I found that unacceptable.
## Starting with unattended agents
I wanted to see what agents could do about this without much guidance. I spun up 8 background coding agents from my phone before bed, each targeting a different part of the Rust codebase I suspected was too slow.
> Look for a performance speedup in our Rust code. It has to be something that is well-tested, and on our hot path. Make sure to add benches to check your work. I'm particularly interested in our hashing code.
In each prompt, I replaced the part of the codebase I was interested in with a new target. I was curious what the agents would accomplish with plenty of ambiguity, as a baseline.
By morning, 3 of the 8 had produced outputs that I could turn into shippable wins:
- PR #11872 netted a ~25% reduction in wall-clock time, reducing allocation pressure by hashing by reference instead of cloning an entire `HashMap`.
- PR #11874 replaced `twox-hash`, one of our Rust dependency crates, with `xxhash-rust`, a near 1:1 replacement that uses a faster hashing algorithm, creating a ~6% win.
- PR #11878 came from an existing `TODO` comment that we hadn't gotten to yet: replacing an unnecessary Floyd-Warshall algorithm with a multi-source depth-first search (DFS). This wasn't on the hot path of `turbo run`, but my prompts didn't specify which hot path, did they? Fair.
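The graph change in PR #11878 is easy to sketch. Instead of Floyd-Warshall's all-pairs O(V³) work, a multi-source DFS seeds a single traversal with every source node and visits each edge at most once. This is a hypothetical helper, not Turborepo's actual code:

```rust
use std::collections::{HashMap, HashSet};

// Compute every node reachable from any of the given sources with one
// stack-based DFS: O(V + E), versus Floyd-Warshall's O(V^3) all-pairs pass.
fn reachable(edges: &HashMap<&str, Vec<&str>>, sources: &[&str]) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    // Seed the stack with every source at once ("multi-source").
    let mut stack: Vec<&str> = sources.to_vec();
    while let Some(node) = stack.pop() {
        // insert() returns false if we've already visited this node.
        if seen.insert(node.to_string()) {
            if let Some(next) = edges.get(node) {
                stack.extend(next.iter().copied());
            }
        }
    }
    seen
}
```

When you only need "what can these nodes reach?", the all-pairs answer Floyd-Warshall computes is wasted work.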
These are undoubtedly meaningful successes, but reviewing all 8 chat sessions and code outputs taught me just as much about where unattended, state-of-the-art agents without proper context engineering will fall short today.
- The agent never realized it could benchmark the improvements on the Turborepo codebase itself. Turborepo dogfoods Turborepo, so it could have easily built a binary and run it right on the source code to get end-to-end results.
- The agent would hyperfixate on the first idea that it came up with and force it to work, rather than backing up and thinking abstractly about the problem (even though the chat logs showed it trying to do so).
- The agent would chase the biggest number it could get, creating microbenchmarks that were relatively meaningless when it came to real-world performance. It would then crank out a 97% improvement for the benchmark, which actually amounted to a 0.02% real-world improvement.
- Never once did an agent write a regression test.
- Never once did an agent use the `--profile` flag in the `turbo` CLI.
The agents running unattended produced some good wins, but I could tell this wouldn't be sustainable. We needed stronger testing, and a better verification loop. I had to be more involved.
## Making profiling work for agents and humans
The first normal engineering thing I did was take a profile. Shocking, I know.
I ran `turbo run build --profile` on our largest repo and opened the trace in Perfetto.
Flame graphs are informative, but can be slow to work with. As much as I do enjoy reading flame graphs and grinding out a win, Turborepo has a lot of shipping to do. I have a duty to users of Turborepo to work efficiently and effectively, using the best tools that I have at my disposal.
## Maybe Chrome Tracing JSON isn't the best format
Turborepo's profiles are JSON files in Chrome Trace Event Format.
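For readers who haven't seen it, here's a hand-written illustration of what a single span looks like in that format (event names and fields invented for this example, not taken from a real Turborepo trace):

```json
{
  "traceEvents": [
    {
      "name": "context",
      "ph": "B",
      "ts": 1730000000000,
      "pid": 1,
      "tid": 2,
      "args": {
        "module_path": "turborepo_lib::run::builder",
        "line": 483
      }
    },
    { "name": "context", "ph": "E", "ts": 1730000012345, "pid": 1, "tid": 2 }
  ]
}
```

One span of wall time becomes two events (`"ph": "B"` begin, `"ph": "E"` end) that may sit thousands of lines apart, with the identifying details buried in nested `args`.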
An LLM can theoretically read through and parse all this, but...well...just look at it. Function identifiers split across lines, irrelevant metadata mixed in with timing data, not grep-friendly. I pointed an agent at the file and watched it struggle through grep calls, trying to piece together function names from different lines, unsuccessfully trying to filter out noise. It was fumbling through this file in the same way I would.
One of my favorite heuristics for working with coding agents is that if something is poorly designed for me to work with, it's poorly designed for an agent, too. This isn't necessarily a comment about work quantity, but more so about interfaces. If something is hard for me to read, it stands to reason it's hard for an agent to read, too. This idea has its limits, but you'll see it quickly pay dividends in a moment.
## Building LLM-friendly profiles
A week prior, I saw a tweet from Jarred Sumner about how Bun shipped a new flag: `--cpu-prof-md`. It outputs profiles as Markdown, which easily fits into my view of how agents work best.
In #11880, I added a new `turborepo-profile-md` crate that generates a companion `.md` file alongside every trace. Hot functions sorted by self-time, call trees sorted by total-time, caller/callee relationships. All greppable, all on single lines.
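The exact layout is defined in the PR; a hypothetical excerpt (invented function names and timings) shows the spirit of it:

```markdown
## Hot functions (by self time)

| Self | Total | Function |
| ---- | ----- | -------- |
| 812ms | 1,204ms | turborepo_scm::package_file_hashes |
| 433ms | 433ms | globwalk::glob_files |
```

Every row is one line, so `grep package_file_hashes profile.md` hands an agent the timing data directly.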
The difference in the agent's output quality was dramatic. Same model, same codebase, same data, same agent harness. Different format, radically better optimization suggestions. The profile data was finally in a format that both I and the agent could read at a glance.
## The iterative loop
With Markdown profiles, I settled into a rhythm.
1. Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output
2. Review the proposed optimizations and decide which ones were worth pursuing
3. Have the agent implement the good proposal(s)
4. Validate with end-to-end `hyperfine` benchmarks
5. Make a PR
6. Repeat
This loop produced over 20 performance PRs in four days. The wins fell into three categories. I'll give some examples.
Parallelization was the largest. Building the git index, walking the filesystem for glob matches, parsing lockfiles, and loading `package.json` files were all sequential operations that could run concurrently. PRs #11889, #11902, #11927, and #11918 parallelized these hot paths.
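The shape of those changes is simple. Here is a minimal sketch using the standard library's `std::thread::scope` (the actual PRs use the codebase's own concurrency machinery; `load` is a stand-in for real work like parsing a lockfile):

```rust
use std::thread;

// Stand-in for an independent, blocking unit of work
// (reading a lockfile, globbing one package's files, ...).
fn load(input: &str) -> String {
    format!("parsed:{input}")
}

// Run every independent load concurrently instead of one after another.
// Results come back in input order because we join the handles in order.
fn load_all(inputs: &[&str]) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .copied()
            .map(|i| s.spawn(move || load(i)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The key property making this safe is that each unit of work is independent: no shared mutable state, so the only coordination cost is spawning and joining.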
Allocation elimination removed redundant copies and clones throughout the pipeline, including reference-based hashing in SCM operations (#11916), pre-compiling glob exclusion filters (#11891), and using a shared HTTP client instead of constructing a new one per request (#11929).
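Reference-based hashing is worth a tiny sketch of its own. This is not the code from #11916, just the pattern, using a hypothetical helper and the standard-library hasher: hash the map's entries through references instead of cloning them into an owned structure first.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hash a map's entries in a stable order without copying any String data.
fn hash_env(map: &HashMap<String, String>) -> u64 {
    // Collect references only; nothing is cloned.
    let mut entries: Vec<(&String, &String)> = map.iter().collect();
    entries.sort(); // deterministic order, so the hash is stable
    let mut hasher = DefaultHasher::new();
    for (k, v) in entries {
        k.hash(&mut hasher);
        v.hash(&mut hasher);
    }
    hasher.finish()
}
```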
Syscall reduction batched per-package git subprocess calls into a single repo-wide index (#11887), replaced git subprocesses with libgit2 library calls (#11938), and then replaced libgit2 with the faster gix-index altogether (#11950).
Again, it's typical, normal, boring software engineering stuff. I did try to turn this into a Ralph Wiggum loop, but it repeatedly made too many mistakes. The combination of the model, the harness, and the loop simply wasn't dependable enough, and it moved too much code out from underneath me too quickly. Maybe if I were working on a side project, I would have accepted that, but Turborepo powers some of the largest repositories in the world. I have to be fast and responsible.
## Your source code is the best feedback loop
The most interesting pattern I noticed during this phase was how the codebase itself served as the agent's strongest feedback mechanism.
I'd point out a performance issue in code the agent was working on. We'd fix it together. Then I'd ask, "Do you see anywhere else where we can improve in the same way?" The agent would find more instances of the same pattern across the codebase. Depending on the size of the changes, I would either add the change to the PR or write it down to do later.
In places where the existing code had a sloppy pattern, the agent would write new code in the same style. Once I corrected one instance, the agent followed the correction going forward. In future conversations, without any memory or context carrying across chats, the agent would see the merged improvements in the source and stop reproducing the old patterns.
Over time, I noticed the agent spontaneously writing tests when I wasn't expecting it to. I saw it creating abstractions that matched what I would have done, which wasn't happening before. I would revisit a place in the codebase where the agent had previously been ineffective, and, with no changes to model or harness, it would produce better code outputs.
It turns out your own source code is the best reinforcement learning out there.
## Hitting a wall at 85%
By the end of the week, Turborepo was roughly 85% faster on our largest repo. Before I started, I had arbitrarily set a goal of 95% better. The remaining gains were feeling within reach.
The problem became measurement. I had been running all benchmarks on my MacBook, and the `hyperfine` reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their own variance.
The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal.
Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science.
## Vercel Sandbox for benchmarking
Vercel Sandboxes are ephemeral Linux containers that only have what you put in them. No background daemons, no Slack notifications pulling CPU, no background programs making network requests. The machine's resources are entirely focused on what you're running.
I wrote a bash script that automated the entire benchmarking workflow. I'll put an abbreviated version of the full gist below.
You'll notice that, at the end of this script, I'm downloading the profiles back to my laptop. My agent could then inspect the benchmark results and Markdown profiles locally, and I could confidently tell whether a change was a real improvement or noise.
## Breaking through the wall
With clean signal from Sandbox, I could see real breakthroughs in low-level changes that were invisible on my noisy laptop.
### Stack-allocated git OIDs (#11984)
Every file in the git index stored its 40-character SHA-1 hash as a heap-allocated `String`. On our largest repo, `new_from_gix_index` alone was creating over 10,000 individual 40-byte heap allocations.
`OidHash` implements `Deref<Target = str>` so existing consumers work unchanged, and `Copy` means cloning is a 40-byte memcpy on the stack instead of a heap allocation. Profile data showed `new_from_gix_index` self-time dropped 15% and `get_package_file_hashes_from_index` dropped 17%.
The most notable improvement across all three sizes was the reduction in run-to-run variance, which agrees with our theory of less allocator pressure and more predictable performance.
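The shape of that change, sketched with standard-library pieces (this is my reconstruction from the description above, not the PR's actual code):

```rust
use std::ops::Deref;

// Store the 40 hex characters of a SHA-1 inline, on the stack,
// instead of as a heap-allocated String.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct OidHash([u8; 40]);

impl OidHash {
    fn from_hex(s: &str) -> Option<Self> {
        let bytes = s.as_bytes();
        // Only accept exactly 40 ASCII hex digits, so Deref below is safe.
        if bytes.len() != 40 || !bytes.iter().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        let mut buf = [0u8; 40];
        buf.copy_from_slice(bytes);
        Some(OidHash(buf))
    }
}

impl Deref for OidHash {
    type Target = str;
    fn deref(&self) -> &str {
        // Construction guarantees ASCII hex digits, so this cannot fail.
        std::str::from_utf8(&self.0).unwrap()
    }
}
```

Because the type is `Copy`, passing it around is a fixed-size stack copy; because it derefs to `str`, call sites that expected the old `String`-backed value keep compiling.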
### Syscall elimination (#11985)
Every cache fetch was performing three syscalls: a `stat` on the `.tar` path, which returned `ENOENT`, then a `stat` on the `.tar.zst` path, then an `open` of the `.tar.zst` file. Weird pattern.
After some digging, I figured out that the `.tar` fallback existed for cache artifacts from Turborepo's Golang era (2021-2022). No modern version writes uncompressed cache entries, and cache entries rotate out constantly.
Across 962 cache fetches on our largest repo, `fetch` self-time dropped from 200.5ms to 129.6ms, a 35% reduction.
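The fix reduces to a classic pattern: don't `stat` to ask whether a file exists, just `open` it and treat `ENOENT` as the answer. A sketch with a hypothetical helper, not the actual cache code:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

// One syscall per cache lookup: open the compressed artifact directly
// and interpret "not found" as a cache miss rather than an error.
fn open_cache_entry(dir: &Path, hash: &str) -> io::Result<Option<File>> {
    match File::open(dir.join(format!("{hash}.tar.zst"))) {
        Ok(f) => Ok(Some(f)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None), // cache miss
        Err(e) => Err(e), // real I/O problem: surface it
    }
}
```

This also avoids a stat-then-open race: the file's existence is checked and used in the same syscall.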
### Move instead of clone (#11986)
The visitor dispatch loop was deep-cloning a `(String, HashMap<String, String>)` from a precomputed map for each of roughly 1,700 tasks. Since each task ID appears exactly once in the dispatch stream, `HashMap::remove()` can move the value out at zero cost instead of cloning.
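Reduced to a sketch (hypothetical names, not the actual visitor code), the pattern looks like this:

```rust
use std::collections::HashMap;

// Each task ID is dispatched exactly once, so instead of cloning the
// precomputed value we remove() it, which moves ownership out of the map
// without copying any String or HashMap contents.
fn take_task_env(
    precomputed: &mut HashMap<String, (String, HashMap<String, String>)>,
    task_id: &str,
) -> Option<(String, HashMap<String, String>)> {
    precomputed.remove(task_id)
}
```

The second lookup for the same ID returning `None` is exactly the invariant that makes the move safe: nobody reads the entry again.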
## Results
After eight days, Time to First Task on our largest repo dropped from 8.1 seconds to 716 milliseconds.
I estimate this would have taken at least two months without agents, but I hope this article shows you that they didn't do the work for me. I was leading the entire time, deciding what to profile, which proposals to pursue, when to change tools, and when to change strategy. But the combination of my existing engineering knowledge, giving agents better tooling, and a clean benchmarking environment let me move at a pace that wouldn't have been possible six months ago.
## Released in Turborepo 2.9
These performance gains are now stable and ready for you to use. Visit the Turborepo 2.9 release post to learn more about the latest in Turborepo.