The Free Data Sitting in Common Crawl
Common Crawl crawls ~3 billion pages every few months and publishes all of it for free: pages, link graph, and ranks. A lot of paid SaaS is a nicer UI over data you could pull yourself. Nothing stops you from building the tool you're paying for.
- Backlink tools like Ahrefs, Semrush, and Moz charge hundreds a month for who-links-to-whom data that’s in the domain edge file.
- Domain authority scores are a pagerank-ish function over that edge list. CC ships harmonic centrality and pagerank in its ranks file.
- Historical web archives are a paid product, but CC has ~10 years of releases sitting in S3.
- Anchor text, nofollow, and first-seen dates live in the raw HTML, which CC publishes as WARCs.
All you need is a shell and DuckDB.
A barebones example: backlinks
Here’s a gist that pulls backlinks for any domain. Run it like this:
./backlinks.sh roots.io
And you get:
┌────────────────────┬───────────┐
│ linking_domain     │ num_hosts │
├────────────────────┼───────────┤
│ github.io          │    284009 │
│ substack.com       │    199844 │
│ cloudfront.net     │     89247 │
│ godaddy.com        │     60673 │
│ google.com         │     41269 │
│ ...                │       ... │
└────────────────────┴───────────┘
What Common Crawl publishes
Every few months, Common Crawl releases a hyperlink web graph derived from its crawl. Two files do the heavy lifting:
- Domain vertices: every domain it has seen, with a stable ID and host count.
- Domain edges: every (from_domain → to_domain) link, stored as ID pairs.
Domains are stored reverse-labeled (io.roots instead of roots.io) so they sort hierarchically. The script handles the flip for you.
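The flip is a one-liner; a minimal Python sketch of the same logic the script applies, for reference:

```python
def flip(domain: str) -> str:
    """Convert a reverse-labeled domain (io.roots) back to normal order (roots.io)."""
    return ".".join(reversed(domain.split(".")))

print(flip("io.roots"))         # roots.io
print(flip("com.example.www"))  # www.example.com
```

Note that the flip is its own inverse, so the same function converts queries into CC's reverse-labeled form before matching against the vertex file.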
The latest release (cc-main-2026-jan-feb-mar) is around 16 GB of gzipped edges. DuckDB scans the gzip directly, so there’s no extract step and no database to load.
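Streaming gzip without an extract step is easy to replicate outside DuckDB too. A stdlib sketch, assuming the edge shard is plain tab-separated integer pairs (a tiny in-memory buffer stands in for the real ~16 GB file here):

```python
import gzip
import io

def count_inlinks(source, target_id: int) -> int:
    """Stream a gzipped edge file (one 'from_id<TAB>to_id' pair per line)
    without extracting it, counting edges that point at target_id.
    `source` is a path or a binary file object."""
    n = 0
    with gzip.open(source, "rt") as f:
        for line in f:
            _, to = line.split("\t")
            if int(to) == target_id:
                n += 1
    return n

# In-memory stand-in for a CC edge shard (hypothetical IDs).
buf = io.BytesIO()
with gzip.open(buf, "wt") as f:
    f.write("1\t7\n2\t7\n3\t4\n")
buf.seek(0)
print(count_inlinks(buf, 7))  # 2
```

At real scale you'd let DuckDB do this scan, since it parallelizes and never materializes the decompressed file either.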
How the script works
The query does three things:
- Parse the vertex file and find the ID of the target domain.
- Scan the edge file for rows whose to_id matches that ID.
- Join back to vertices to get the human-readable from domains, sorted by host count.
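The same lookup–filter–join shape is easy to see in plain Python. A toy sketch over hypothetical in-memory stand-ins for the vertex and edge files:

```python
# Toy stand-ins for CC's files (hypothetical IDs and host counts).
# Vertices: (id, reversed_name, num_hosts); edges: (from_id, to_id).
vertices = [
    (0, "io.github", 284009),
    (1, "com.substack", 199844),
    (2, "io.roots", 57),
]
edges = [(0, 2), (1, 2), (0, 1)]

def backlinks(target: str):
    flip = lambda d: ".".join(reversed(d.split(".")))
    # Step 1: find the target's ID in the vertex table.
    target_id = next(vid for vid, name, _ in vertices if flip(name) == target)
    # Step 2: scan edges for rows whose to_id matches.
    from_ids = {f for f, t in edges if t == target_id}
    # Step 3: join back to vertices for readable names, sorted by host count.
    rows = [(flip(name), hosts) for vid, name, hosts in vertices if vid in from_ids]
    return sorted(rows, key=lambda r: -r[1])

print(backlinks("roots.io"))
# [('github.io', 284009), ('substack.com', 199844)]
```

DuckDB does the same thing, just as two table scans and a hash join instead of Python loops.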
First run downloads the two files to ~/.cache/cc-backlinks/ and then scans the edges, which takes a few minutes. Subsequent runs against a different domain reuse the cache but still scan, so the time is about the same.
More than just counts
The domain graph is one slice of what CC publishes.
- Domain ranks. Same release, separate file, with harmonic centrality and pagerank for every domain. That’s your “DR” score.
- Host-level graph. Same structure, one level finer, with subdomains kept separate.
- Historical releases. Graphs going back ~10 years, so you can chart a domain’s link growth over time.
- Raw WARCs. The underlying HTML of every crawled page. Anchor text, nofollow attributes, exact linking URLs, and first-seen dates all live here.
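To make the ranks concrete: harmonic centrality scores a node by summing 1/distance from every node that can reach it. A toy BFS version below, nothing like CC's web-scale implementation, just the metric itself on a three-node graph:

```python
from collections import deque

def harmonic_centrality(nodes, edges):
    """Toy harmonic centrality: for each node v, sum 1/d(u, v) over all
    nodes u that can reach v. One BFS per node; fine for toy graphs only."""
    # Reverse adjacency: BFS from v over reversed edges finds distances *to* v.
    radj = {n: [] for n in nodes}
    for f, t in edges:
        radj[t].append(f)
    scores = {}
    for v in nodes:
        dist = {v: 0}
        q = deque([v])
        while q:
            x = q.popleft()
            for y in radj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)
        scores[v] = sum(1 / d for n, d in dist.items() if n != v)
    return scores

# a -> b -> c, a -> c: c is reachable from both a and b at distance 1.
print(harmonic_centrality(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]))
# {'a': 0, 'b': 1.0, 'c': 2.0}
```

The appeal over pagerank is that it's trivially parallel and has no damping-factor knob, which is part of why CC ships both.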
Caveats
Common Crawl fetches static HTML only. Links rendered client-side are invisible to it, so counts will run low wherever the linking pages are JS-heavy SPAs. It's a real gap, though SPAs should be serving usable HTML to bots anyway.
You get counts, not enrichment. The domain graph tells you who links to you. It doesn’t tell you the anchor text, the nofollow status, the first-seen date, or the exact URL of the linking page. All of that lives in the raw WARCs. Free, but you have to parse them.
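The parsing side is less scary than it sounds. Once you have a page's HTML out of a WARC (a real pipeline would use a WARC reader such as warcio for that step, which is an assumption here, not something the gist does), the stdlib can pull anchor text and rel attributes:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, rel, anchor_text) triples from raw HTML, the kind of
    enrichment the domain graph alone can't give you."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # (href, rel) of the <a> we're inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self._current = (a.get("href", ""), a.get("rel", ""))
            self._text = []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current:
            href, rel = self._current
            self.links.append((href, rel, "".join(self._text).strip()))
            self._current = None

html = '<p>See <a href="https://roots.io" rel="nofollow">Roots</a> for docs.</p>'
p = LinkExtractor()
p.feed(html)
print(p.links)  # [('https://roots.io', 'nofollow', 'Roots')]
```

First-seen dates come from the WARC record's crawl timestamp rather than the HTML, so that part is a metadata lookup, not parsing.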
Go build something
Fair caveat: Ahrefs, Semrush, and Moz run their own crawlers. Their indexes are bigger than CC’s, fresher, and can render JS, so the numbers won’t line up. For a lot of use cases that difference doesn’t matter, and what you’re paying for is the index, the UI, and anchor-text extraction on top of data that’s mostly public. The gist is ~30 lines and is a decent starting point for working with CC data.
Common Crawl has been crawling the web and giving the results away since 2008 as a nonprofit. It’s one of the most useful open datasets on the internet and nobody had to pay for it. Huge thanks to the team for keeping it free and public.