I Audited 30 SaaS Sites — Half Had This Indexing Bug
Over a stretch of several months, I ran technical SEO audits on a few dozen SaaS websites — a mix of early-stage startups, funded scale-ups, and a couple of established players. They differed in size, stack, and maturity. They had different content teams, different priorities, different problems. But one pattern surfaced again and again, in roughly half the sites I looked at, and it was almost never on anyone's radar.
The pattern was an indexing bug. Not a dramatic, site-down robots.txt disaster — those are rare and get noticed fast. This was quieter: a structural defect that let a meaningful slice of each site's content sit outside Google's index, unnoticed, sometimes for a very long time. Before going further, an honest note on method: this was a qualitative pattern across a modest sample, not a controlled study. I am not going to quote a precise percentage as if it were a law of nature. What I can tell you is that the same shape of problem recurred often enough, across different enough sites, that it is worth describing in detail — because if you run a SaaS site, there is a fair chance some version of it is happening to you.
The shape of the bug
The bug was not a single misconfiguration. It was a family of related failures that all produced the same outcome: pages the business wanted in Google were not in Google, and nobody knew.
The common thread was a gap between two things that should match: the set of pages a site intends to have indexed, and the set of pages Google actually has indexed. On a healthy site those sets are close to identical. On roughly half the sites I audited, they had drifted apart, sometimes substantially. And the drift was invisible because nobody was looking at both sets at once. The content team knew what they had published. The (often nonexistent) technical team knew, vaguely, that the site was crawlable. No one was comparing intended-indexed against actually-indexed and noticing the difference.
When I did that comparison — crawling each site, pulling its sitemaps, cross-referencing against Search Console's index coverage — the gap appeared immediately. And digging into why specific pages had fallen into the gap, the causes clustered into a handful of recurring culprits.
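For readers who want to see that comparison concretely, here is a minimal sketch in Python. It assumes you already have two plain lists of URLs: one you intend to have indexed (from your crawl and sitemaps) and one exported from Search Console's indexed pages. The function names and the normalisation rules (lowercased host, fragment and trailing slash dropped) are my own illustrative choices, not a standard.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Make equivalent URLs compare equal.

    Assumption: host case, fragments, and trailing slashes do not
    distinguish pages on this site. Adjust if your site differs.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def index_gap(intended, indexed):
    """Return the two sides of the gap between intended and actual indexing."""
    intended_set = {normalize(u) for u in intended}
    indexed_set = {normalize(u) for u in indexed}
    return {
        # Pages you want in Google that Google does not have.
        "missing_from_index": sorted(intended_set - indexed_set),
        # Pages Google has that you never meant to expose.
        "indexed_but_unintended": sorted(indexed_set - intended_set),
    }
```

The first list is the gap this article is about. The second is worth a look too, since it often surfaces parameter variants and forgotten pages.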
Culprit one: the noindex tag that never got removed
The most common single cause, and the most exasperating, was a stray noindex tag on pages that were supposed to be indexed.
The story behind it was almost always the same. Early in a site's life — during a rebuild, a migration, or a staging phase — noindex was applied broadly to keep unfinished work out of Google. That is correct practice. The failure was that noindex was applied at a level that outlived its purpose: a template, a section default, a CMS setting, a theme configuration. When the site went live, someone remembered to remove noindex from the homepage and the main pages. They forgot the blog template, or the resources section, or the changelog, or the help centre.
The result: a whole category of pages, often a content-heavy, SEO-important category, silently carrying a directive that told Google to keep them out of its index. The pages looked perfect to human visitors. They were well written. They were linked. And they were invisible to search, because of a configuration nobody had revisited since launch. On several sites this had been true for many months. The blog had simply never had a chance to rank, and the team had concluded their content "just wasn't getting traction" when in fact Google had been explicitly told not to keep it.
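A stray noindex can live in two places: a robots meta tag in the HTML, or an X-Robots-Tag response header. The sketch below checks both for a page you have already fetched. It is a simplified illustration: the class and function names are hypothetical, and it does not handle header values scoped to a specific crawler (for example a value prefixed with a user-agent name).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html, headers=None):
    """True if the meta tags or the X-Robots-Tag header ask not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives |= {d.strip().lower() for d in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives
```

Running this over one URL per template (blog post, resource page, changelog entry, help article) is usually enough to catch a directive applied at the template level.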
Culprit two: the rendering gap
The second recurring culprit was specific to how modern SaaS sites are built. A large share of them run on JavaScript frameworks that render content client-side. The HTML that arrives in the initial response is a near-empty shell; the actual content — the words, the headings, the body of the article — is loaded and inserted by JavaScript after the fact.
Google can render JavaScript, so this often works. But "often" is not "always." On several of the sites I audited, rendering was failing or degrading in ways the team had never checked. Sometimes a critical script was blocked in robots.txt. Sometimes content depended on a data fetch that was slow or unreliable when Googlebot rendered the page. Sometimes content only appeared after an interaction Googlebot does not perform.
When I used Search Console's URL Inspection to view the rendered HTML Google actually received, the problem was stark: pages that were rich and complete in a normal browser arrived at Google as thin shells with the main content missing. Google was crawling these pages and seeing almost nothing — so it either declined to index them or indexed a contentless version that could never rank. The team had no idea, because they only ever looked at the site in their own browsers, where everything rendered fine.
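One rough way to quantify that difference is to count the visible words in the HTML your browser ends up with and in the rendered HTML copied out of URL Inspection, then compare. This is a sketch under assumptions: both HTML strings are pasted in by hand, the helper names are mine, and any threshold you apply to the ratio is a judgment call.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects words that sit outside script, style, noscript, and template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def count_words(html):
    parser = VisibleText()
    parser.feed(html)
    return len(parser.words)


def render_ratio(browser_html, google_html):
    """Share of the browser's visible words that also reached Google's render."""
    browser_words = count_words(browser_html)
    if browser_words == 0:
        return 1.0  # nothing to lose, so nothing is missing
    return count_words(google_html) / browser_words
```

A ratio near 1.0 suggests Google saw roughly what a visitor sees. A ratio near zero is the thin shell described above.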
Culprit three: canonical tags pointing the wrong way
The third culprit was canonical tags doing damage instead of good. Canonical tags exist to tell Google which version of a page is the master copy when several similar URLs exist. Used correctly, they consolidate. Used carelessly, they de-index.
The pattern I kept seeing: a templating error that caused a whole set of pages to declare a canonical pointing somewhere other than themselves. Sometimes every blog post canonicalised to the blog homepage. Sometimes paginated pages all canonicalised to page one. Sometimes a hardcoded canonical from a template got copied across many pages, so dozens of distinct pages all told Google "the real version of me is this one other URL."
Google generally respects a clear canonical. So it did exactly what it was told: it treated those pages as duplicates of the canonical target and did not index them in their own right. The pages existed, they were crawlable, they had no noindex — and they still were not indexed, because they had instructed Google to index something else instead. This one was particularly insidious because canonical tags are invisible in normal browsing and rarely audited, so the misfire could persist indefinitely.
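A template-level canonical misfire tends to show up as many distinct pages declaring the same target. The sketch below extracts the canonical from already-fetched HTML, classifies it, and counts shared targets. The names and the `min_pages` cut-off are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalParser(HTMLParser):
    """Collects href values from <link rel="canonical"> tags."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {k: (v or "") for k, v in attrs}
        if "canonical" in attr.get("rel", "").lower().split():
            self.canonicals.append(attr.get("href", ""))


def _key(url):
    # Assumption: host case and trailing slashes do not distinguish pages.
    p = urlsplit(url)
    return (p.scheme.lower(), p.netloc.lower(), p.path.rstrip("/") or "/", p.query)


def canonical_status(page_url, html):
    """Return (status, target) where status is missing, conflicting, self, or points_elsewhere."""
    parser = CanonicalParser()
    parser.feed(html)
    if not parser.canonicals:
        return ("missing", None)
    if len(set(parser.canonicals)) > 1:
        return ("conflicting", None)
    target = urljoin(page_url, parser.canonicals[0])  # resolve relative hrefs
    status = "self" if _key(target) == _key(page_url) else "points_elsewhere"
    return (status, target)


def shared_canonical_targets(pages, min_pages=2):
    """Map each canonical target to how many other pages point at it."""
    counts = Counter()
    for url, html in pages.items():
        status, target = canonical_status(url, html)
        if status == "points_elsewhere":
            counts[target] += 1
    return {t: n for t, n in counts.items() if n >= min_pages}
```

A single page pointing elsewhere may be deliberate. Dozens pointing at the same URL is the pattern worth investigating.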
Culprit four: orphan pages with no way in
The fourth culprit was structural. SaaS sites accumulate pages that are not linked from anywhere — old landing pages, standalone campaign pages, articles that were published but never added to the blog index, pages from a previous site structure that survived a redesign. These orphan pages have no internal links pointing to them.
Google finds pages mainly by following links. A page with no internal links gives Googlebot no path to reach it unless it happens to be in the sitemap, and on these sites the sitemap was frequently incomplete or stale. So the orphan pages sat there: published, real, sometimes genuinely good content, and undiscovered. They were in the gap not because Google had judged them or been told to avoid them, but because Google had simply never found them.
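Finding orphans comes down to comparing the pages you know exist against the pages that receive at least one internal link. Here is a minimal sketch, assuming you have a list of known pages (say, from a CMS export) and a link map produced by a crawler. The function name, the input shapes, and the `entry` default are hypothetical.

```python
def find_orphans(known_pages, links, entry="https://example.com/"):
    """Return known pages that no other page links to.

    known_pages: iterable of URLs that exist on the site.
    links: dict mapping a source URL to the URLs it links to.
    entry: the homepage, which needs no inbound link to be found.
    """
    linked = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself does not count
    }
    return sorted(p for p in known_pages if p not in linked and p != entry)
```

The quality of the result depends entirely on where `known_pages` comes from. If it comes from the crawl itself, orphans cannot appear in it by definition, so use a source that does not rely on links.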
Why nobody noticed
The most interesting finding was not any single culprit — it was why these problems survived. On every affected site, the bug had persisted for months. Why?
First, indexing failures are silent. A page that is not indexed does not generate an error, a warning, or any visible symptom. It simply does not appear in search. The absence of traffic from a page is not alarming the way a broken page is — it just looks like a page that has not taken off yet.
Second, SaaS teams rarely have a dedicated technical SEO owner. Content gets written by marketers. The site gets built by engineers focused on the product. Indexing falls in the gap between them — the marketers assume the engineers handled crawlability, the engineers assume the marketers monitor SEO, and the actual coverage report goes unread.
Third, nobody was doing the one check that surfaces all of this: comparing the pages the site intends to have indexed against the pages Google actually has indexed. That comparison is the diagnostic. It is not technically hard. It is just nobody's job, so it never happens, and the gap grows quietly. Our breakdown of why Google isn't indexing your pages covers how to interpret each failure once you have found it — but you have to look first.
Culprit five: the sitemap that lied
A fifth culprit showed up on enough sites to mention on its own: the XML sitemap that had quietly stopped telling the truth.
A sitemap is supposed to be a clean, current list of the canonical URLs a site wants indexed. On several audited sites it was nothing of the sort. Some sitemaps were static files generated once, long ago, and never regenerated, so every page published since was simply absent, and Google's most direct discovery channel had been frozen in the past. Others were generated automatically but badly: they included URLs that returned 404s, URLs that redirected elsewhere, URLs carrying noindex tags, and non-canonical parameter variations. A sitemap full of those is worse than no sitemap, because it wastes Google's crawl attention on dead ends and tells Google the site's own signals are unreliable.
The damage cut both ways. New pages missing from the sitemap leaned entirely on internal linking for discovery — and on the same sites, internal linking was often weak, so those pages had no route in at all. Meanwhile the junk URLs in the sitemap drew crawl attention away from the pages that mattered. The fix is unglamorous but decisive: the sitemap must be generated dynamically, must contain only live, canonical, indexable URLs, and must be checked — not assumed — to actually reflect the site as it is today. On more than one site, simply repairing the sitemap restored discovery for dozens of pages that had been invisible for months.
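Checking a sitemap against reality can be sketched in a few lines. The example below parses a standard `urlset` sitemap and compares it with per-URL facts gathered by a crawl. The `facts` dictionary shape (`status`, `noindex`, `canonical`) and the function names are assumptions for illustration, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract every <loc> from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def problem_for(url, fact):
    """Return why a URL should not be in the sitemap, or None if it belongs there."""
    if fact is None:
        return "no crawl data"
    if fact["status"] != 200:
        return f"status {fact['status']}"  # 404s, redirects, server errors
    if fact["noindex"]:
        return "noindex"
    if fact["canonical"] not in (None, url):
        return "non-canonical"
    return None


def audit_sitemap(xml_text, facts):
    """Return (problems, missing).

    problems: sitemap URLs that are dead, redirected, noindexed, or non-canonical.
    missing: live, indexable, canonical URLs the sitemap does not list.
    """
    listed = sitemap_urls(xml_text)
    problems = {}
    for url in listed:
        reason = problem_for(url, facts.get(url))
        if reason:
            problems[url] = reason
    missing = sorted(
        u for u, f in facts.items() if u not in listed and problem_for(u, f) is None
    )
    return problems, missing
```

Both outputs matter: `problems` is the junk drawing crawl attention away, and `missing` is the set of pages leaning entirely on internal links for discovery.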
The cost nobody put a number on
It is worth pausing on what this bug actually costs, because the silence of indexing failures hides the price as effectively as it hides the cause.
When a category of pages sits outside the index for months, the loss is not just "some missing traffic." It compounds. Those pages never accumulate the ranking history, the engagement signals, or the backlinks that come from being discoverable — so even after the bug is fixed, they start from zero, months behind where they should be. The content team, meanwhile, draws the wrong lesson: they conclude their content does not work, and they may change strategy, cut output, or abandon channels that were never given a fair chance to perform. The bug does not just cost the traffic the pages would have earned; it can quietly distort the decisions a whole team makes about what to do next.
That second-order cost is the real reason this bug deserves attention out of proportion to how mundane it sounds. A stray noindex tag is a trivial technical fault. A content team abandoning a working strategy because a trivial technical fault made it look broken is not trivial at all.
What the affected sites had in common
Stepping back from individual culprits, the affected sites shared a profile. They had grown faster than their technical SEO discipline. They had been through at least one rebuild or migration — the event that plants stray noindex tags and template-wide canonical errors. They were built on JavaScript-heavy stacks where rendering cannot be assumed. And they had no recurring process for checking index coverage; whatever auditing happened was a one-off, long ago.
The sites that were clean shared a different profile: somebody, at some point, had treated index coverage as a thing to monitor, not just configure once. It did not require a large team or deep expertise. It required the comparison being someone's job.
How to check your own site
If you want to know whether you have a version of this bug, the audit is concrete. Crawl your own site and produce the list of URLs you intend to have indexed. Pull your XML sitemaps. Open Search Console's Pages report and read the index coverage — both the indexed pages and, crucially, the categories of unindexed ones. Compare. Where intended-indexed and actually-indexed diverge, inspect the divergent URLs individually: check the rendered HTML for missing content, check for stray noindex tags, check the canonical each page declares, check whether the page has any internal links pointing to it. Each divergent page will trace back to one of the culprits above, and each culprit has a known fix.
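The per-URL inspection can be written down as a simple triage routine. The sketch below assumes you have already collected a handful of facts for each divergent URL. The `PageFacts` fields and the `min_words` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageFacts:
    url: str
    has_noindex: bool          # meta robots or X-Robots-Tag says noindex
    canonical: Optional[str]   # declared canonical URL, if any
    google_word_count: int     # visible words in Google's rendered HTML
    inbound_links: int         # internal links pointing at this page
    in_sitemap: bool


def diagnose(page: PageFacts, min_words: int = 50) -> List[str]:
    """Map one divergent URL to the culprits described in this article."""
    causes = []
    if page.has_noindex:
        causes.append("stray noindex")
    if page.canonical and page.canonical != page.url:
        causes.append("canonical points elsewhere")
    if page.google_word_count < min_words:
        causes.append("rendering gap")
    if page.inbound_links == 0:
        causes.append("orphan page")
    if not page.in_sitemap:
        causes.append("missing from sitemap")
    # A page can have more than one cause, or none that this check knows about.
    return causes or ["no known culprit; inspect manually"]
```

Returning a list rather than a single label is deliberate: an orphan page missing from the sitemap has two problems, and fixing only one may not be enough.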
This is not exotic work. It is the work that, on half the sites I saw, simply nobody had done.
Where an SEO AI agent fits
The reason this bug is so widespread is not that the diagnosis is difficult — it is that the diagnosis is recurring, and recurring work without an owner does not get done. Index coverage drifts every time you publish, migrate, restructure, or change a template. A one-time audit catches the gap as it stood that day and tells you nothing about the gap that opens next month.
Continuous comparison is what catches it, and that is precisely what an SEO AI agent is suited to. Orova can keep an ongoing map of the pages you intend to have indexed against the pages Google actually has indexed, flag every new divergence as it appears, and trace each one to its cause — a stray noindex tag, a content-missing render, a misfiring canonical, an orphan page with no links in. The fixes stay yours to make and yours to judge. The agent's job is to make sure the gap can never again sit silently for months while a content team wonders why good work is not ranking. Half the sites I audited had this bug. The other half did not — not because they were more sophisticated, but because someone was looking. Be the site where someone is looking.