We Published 150 AI-Assisted Articles — The Results

There is a great deal of opinion about AI-assisted content and a shortage of structured observation. Opinion is cheap; it costs nothing to declare that AI content "works" or "doesn't work." So this article does something narrower and, I hope, more useful: it treats a real publishing program — a body of roughly a hundred and fifty AI-assisted articles produced over a sustained period — as a dataset, and reports the patterns honestly.

One caution before the patterns, and it matters. I am deliberately not going to hand you precise figures: no "traffic rose 312%," no "conversion improved by 4.7 points." Those numbers, ripped out of one program's specific context, are worse than useless; they invite you to expect an outcome that was never transferable. What does transfer is the shape of the results: the recurring patterns, the distributions, the relationships between choices and outcomes. That shape is the genuinely portable finding, and it is what this report covers.

How to read a content program as data

When you have a hundred and fifty published pieces and several months of performance behind them, you stop having a collection of anecdotes and start having a distribution. Individual articles are noisy — one piece can over- or under-perform for reasons that have nothing to do with method. But across a hundred and fifty, the noise averages out and the structural patterns surface. You can sort the articles by outcome, segment them by the choices made in producing them, and ask which choices the strong performers shared and which the weak ones shared.

That is the analytical posture here. Not "did AI content work" — a question too coarse to have an answer — but "across this body of work, what separated the articles that performed from the articles that didn't?" That question has answers, and they are consistent enough to be worth reporting.
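To make that posture concrete, here is a minimal sketch of the segment-and-compare step, using only the Python standard library. Everything in it is an assumption for illustration: the record layout, the tier names, and the six sample rows are hypothetical placeholders, not this program's data.

```python
from collections import defaultdict

# Hypothetical records: (article_id, topic_had_demand_and_gap, outcome_tier).
# The flags and tiers are illustrative placeholders, not real program data.
ARTICLES = [
    ("a1", True, "head"),
    ("a2", True, "middle"),
    ("a3", False, "tail"),
    ("a4", False, "tail"),
    ("a5", True, "head"),
    ("a6", False, "middle"),
]


def tier_share_by_segment(articles):
    """For each segment value, return the fraction of its articles in each tier."""
    counts = defaultdict(lambda: defaultdict(int))
    for _article_id, segment, tier in articles:
        counts[segment][tier] += 1

    shares = {}
    for segment, tiers in counts.items():
        total = sum(tiers.values())
        shares[segment] = {tier: n / total for tier, n in tiers.items()}
    return shares


shares = tier_share_by_segment(ARTICLES)
for segment, tiers in shares.items():
    print(segment, tiers)
```

The same function works for any production choice you can record as a label per article; swap the boolean for an editing-depth bucket or a cluster name and the comparison is the same.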

Finding one: the outcome distribution was steeply uneven

The first and most important pattern was the shape of the results distribution itself, and it was not even close to uniform.

The hundred and fifty articles did not each contribute a roughly equal slice of traffic and conversions. Instead, a small minority of the articles accounted for the large majority of the total value. A middle band performed modestly — real but unremarkable. And a substantial tail performed close to zero: ranking for nothing, attracting almost no impressions, effectively invisible.

This steep, top-heavy distribution is, in my observation, the single most reliable pattern in any content program at volume, AI-assisted or not. It is worth internalising deeply, because it reframes the entire enterprise. You are not producing a hundred and fifty articles that will each pull their weight. You are placing a hundred and fifty bets, knowing in advance that a handful will carry the program and many will not. Every other finding below is really an attempt to answer the practical question this distribution forces: what predicted which bets landed?
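One way to check how top-heavy your own program is: compute the share of total value contributed by the top slice of articles. The sketch below is a rough illustration under stated assumptions. The function name `top_share` is my own, "value" is taken to be clicks per article, and the sample numbers are invented purely to show the calculation.

```python
def top_share(values, top_fraction):
    """Share of the total contributed by the top `top_fraction` of items."""
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    # Always count at least one item so tiny samples still return a result.
    top_n = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:top_n]) / total


# Made-up clicks per article, purely to show the calculation.
clicks = [900, 400, 60, 40, 30, 20, 10, 0, 0, 0]
head_share = top_share(clicks, 0.2)
print(f"Top 20% of articles carry {head_share:.0%} of clicks")
```

A result far above the slice size (the top 20% carrying much more than 20% of the value) is the steep shape described above.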

Finding two: topic selection predicted outcome more than draft quality

The most decision-relevant pattern in the whole dataset was this. When I segmented the articles by what separated winners from losers, the dominant factor was not how well any individual article was written. It was whether the topic was worth writing about in the first place.

The strong performers shared a topic-level profile: genuine, demonstrable search demand, and a real gap — something the existing results were not already covering well. The dead tail shared the opposite profile. Many of those failed articles were perfectly competent as writing. They failed because they targeted topics with little real demand, or topics already saturated by stronger pages, or — quietly common — topics chosen mainly because a slot on the calendar needed filling.

The articles that failed mostly did not fail in the writing. They failed at the moment the topic was chosen, and no quality of execution afterward could rescue that decision.

This is the finding I would most want a team to take away, because it is the opposite of where most teams spend their attention. The instinct is to pour effort into making each draft excellent. The data says: a brilliant draft of a topic nobody searches for, or a topic ten stronger pages already own, is still a dead article. Topic selection is the higher-leverage decision, and in a volume program it is the one most likely to be rushed.

Figure: distribution chart of the 150 articles sorted by performance, showing a short head of high performers (labelled as topics with real demand and a content gap), a modest middle band, and a long flat tail of near-zero articles.

The recurring shape of a content program at volume: a short head carries most of the value, a modest middle band contributes some, and a long tail contributes almost nothing. The head was not the best-written articles; it was the best-chosen topics.

Finding three: human editing time correlated with performance

The next pattern concerned the production process itself. When I looked at how much genuine human editing each article received — not a token proofread, but real revision: restructuring, fact-checking, the addition of specific experience and examples — that variable correlated clearly with outcome.

The articles that performed had, with consistency, received substantial human work after the AI draft. The articles in the dead tail had, with similar consistency, been published close to their first generated state. The pattern was strong enough to state plainly: across this program, AI draft plus serious human editing performed; AI draft published nearly raw did not.

This reframes what "AI-assisted" should mean. The phrase is often used to imply the AI does the work and a human glances at it. The data suggests the productive division of labour was the reverse in spirit: AI removed the friction of the blank page and produced a structured first draft fast, and the human work — verification, restructuring, the injection of real expertise — was where the article's eventual value was actually created. The AI made the human's contribution faster to apply. It did not replace it. The articles where it was treated as a replacement are sitting in the tail.
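If you log editing time per article, a rank correlation is a reasonable first test of whether this relationship holds in your own data. The sketch below implements Spearman's rank correlation by hand with the standard library. I chose a rank-based measure on the assumption that performance data is heavily skewed, as Finding one suggests. The variable names and the six sample values are illustrative, not figures from this program.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only: editing minutes and clicks for six articles.
editing_minutes = [5, 10, 45, 60, 90, 120]
clicks = [0, 3, 40, 35, 200, 150]
rho = spearman(editing_minutes, clicks)
print(f"Spearman rho = {rho:.2f}")
```

A value near +1 means the more heavily edited articles tend to rank higher on performance. It describes an association in your data; it does not by itself show that the editing caused the result.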

Finding four: results compounded slowly, then noticeably

A pattern over time, not across articles. AI-assisted content did not behave any differently from good content in general on the timeline of results — and that is itself a finding worth stating, because it punctures a common expectation.

There was no early surge. The first weeks after publishing a batch were quiet, as they always are. What built was an accumulation: as more articles in a topic area went live and began linking to one another, the area as a whole started to perform better than any single article had. Clusters of related pages outperformed what the same pages would have achieved as scattered individuals. The "AI-assisted" label changed the production speed. It did not change the fundamental physics of how content earns authority, which remained slow, cumulative, and structural. Anyone expecting AI to also accelerate the results timeline was, in this program, disappointed. For why clustered pages outperform isolated ones, our piece on topic clusters covers the mechanism behind this finding.

Finding five: the failures shared diagnosable causes

It would be easy to report only the encouraging patterns. The honest report includes the failures, and the failures, examined closely, were diagnosable rather than random.

The dead-tail articles clustered around a small set of recurring causes. Topics with no real search demand — written because the calendar had a gap, not because anyone was looking. Topics so thoroughly owned by established, authoritative pages that a new entrant had no realistic path in, regardless of quality. And drafts published with too little human verification and too little added experience — mechanically fine, but adding nothing a reader could not get faster elsewhere. Almost every failure traced back to one of those three. The encouraging implication: they are not mysterious. They are exactly the failure modes a deliberate filter — a demand check, a competition check, an editing-and-experience check — is designed to catch before publication. The failures were not bad luck. They were skipped checkpoints.
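The three checkpoints can be sketched as a simple pre-publication gate. This is a hypothetical illustration, not a tool the program used: the `Candidate` class, its field names, and the `skipped_checks` function are all invented here, and how each flag gets set (keyword research, a review of the current results, an editor's sign-off) is left to your own process.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical article candidate with one flag per checkpoint."""
    topic: str
    has_search_demand: bool        # demand check
    has_realistic_gap: bool        # competition check
    edited_with_experience: bool   # editing-and-experience check


def skipped_checks(candidate):
    """Return the names of the checks this candidate fails (empty list = pass)."""
    checks = [
        ("demand", candidate.has_search_demand),
        ("competition", candidate.has_realistic_gap),
        ("editing", candidate.edited_with_experience),
    ]
    return [name for name, passed in checks if not passed]


# Illustrative candidates only.
calendar_filler = Candidate("calendar filler", False, True, True)
ready_piece = Candidate("well-chosen topic", True, True, True)

print(skipped_checks(calendar_filler))
print(skipped_checks(ready_piece))
```

Returning the list of failed checks, instead of a single pass/fail flag, keeps the diagnosis attached to the article, so a later review of the tail can count which checkpoint was skipped most often.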

What the dataset says, in one paragraph

Pulling the five patterns so far together: a body of AI-assisted content behaves, in its results, like a body of content. The outcome distribution is steeply top-heavy, with a few articles carrying the program. Topic selection predicts outcome more powerfully than draft quality. Serious human editing after the AI draft correlates strongly with performance, while near-raw publishing correlates with the dead tail. Results compound on the same slow, structural timeline as any good content, with no AI acceleration of the outcomes. And the failures are not random; they trace to a short, diagnosable list of skipped checks. AI changed the speed and cost of producing drafts. It did not change what makes content succeed or fail.

Finding six: the winners shared a structural trait, not a stylistic one

When I looked closely at the head of the distribution — the small set of articles carrying most of the value — I expected to find a stylistic signature. Better writing, sharper headlines, a particular voice. That is not what the data showed.

The winning articles were not noticeably better written than the middle band. What they shared was structural. Almost every high-performing article sat inside a coherent cluster of related pages, linked to its siblings, and addressed a clearly scoped slice of a subject the program covered with genuine depth. The underperformers, by contrast, were disproportionately isolated: single articles on topics with no surrounding cluster, orphaned pieces that had no neighbours to link to and reinforce them.

This was, in retrospect, the most actionable structural pattern of all. It says that where an article sits matters as much as how it is written. An excellently written article dropped alone into a topic the site otherwise ignores is fighting uphill. A solidly written article that is the tenth piece in a deep, interlinked cluster inherits the authority the other nine built. The program's winners were, overwhelmingly, articles that benefited from this inheritance. For the mechanics of building that structure deliberately, our guide on internal linking strategy covers the connective work that this finding saw paying off.
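Isolation of this kind is easy to audit once you have a map of internal links. The sketch below assumes a hypothetical mapping from each page to the pages it links to (however you export it from your CMS or crawler) and flags pages with no internal links in or out. The page names and the `find_orphans` helper are invented for illustration.

```python
def find_orphans(links):
    """Return pages with no inbound and no outbound internal links to other pages."""
    # Collect every page mentioned as either a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # A page is connected if it shares at least one link with a different page.
    connected = set()
    for source, targets in links.items():
        for target in targets:
            if target != source:  # self-links do not count as a connection
                connected.add(source)
                connected.add(target)

    return sorted(pages - connected)


# Illustrative link map: page -> pages it links to.
links = {
    "cluster-guide": ["cluster-faq", "cluster-tools"],
    "cluster-faq": ["cluster-guide"],
    "cluster-tools": [],
    "lone-article": [],
}
orphans = find_orphans(links)
print(orphans)
```

Here "cluster-tools" links to nothing but is still connected, because the guide links to it; only "lone-article" is flagged. A stricter audit might also flag pages with inbound links but none outbound, which is a design choice this sketch leaves open.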

Finding seven: detectability was the wrong worry

Before the program ran, a recurring anxiety in the team was whether AI-assisted content would be detected and penalised as such. With the dataset in hand, that anxiety turned out to be misdirected — not because detection is impossible, but because it was simply not the variable that moved outcomes.

Articles did not fail because they were AI-assisted and got caught. They failed because they were thin, undifferentiated, or aimed at topics with no demand — and they would have failed for those reasons whether a human or a model produced the draft. Likewise, the winners did not succeed because they cleverly hid their AI origins. They succeeded because they were genuinely useful, well-placed pages. The data pointed at a calmer conclusion than the pre-program anxiety assumed: stop worrying about whether the content reads as AI-assisted, and worry about whether it is good and well-placed. The first question is a distraction. The second is the whole game. Every pattern in this dataset said the same thing in a different way — the production method was never what the results were responding to.

What I would tell a team starting their own program

If you are about to begin a high-volume AI-assisted program, the patterns above translate into concrete advice. Spend disproportionate effort on topic selection — it is the highest-leverage decision and the data says so unambiguously. Budget real human editing time per article and protect it; the temptation under a volume target is to cut exactly this, and cutting it is what fills the dead tail. Expect the steep distribution and do not panic when most articles underperform — that is the normal shape, not a sign of failure. And set the timeline expectation honestly: AI speeds the writing, not the ranking.

Above all, treat your own program as a dataset, not a stream of anecdotes. After your first sixty or eighty articles, sort them by performance, segment them by the choices behind them, and find your own version of these patterns. Your specific numbers will differ from any other program's. The shape almost certainly will not.

Where an AI agent fits the findings

Read against these patterns, the role of an SEO AI agent becomes precise rather than vague. The findings say the program-deciding work is topic selection — demand and gap analysis — and disciplined editing. They also say the failures came from skipping exactly those checks under volume pressure.

That is the work an agent like Orova is built to support: assessing whether a candidate topic has genuine search demand and a genuine content gap before it earns a slot, flagging the saturated topics that the dead tail was full of, and producing structured drafts fast enough that the human effort can concentrate where the data says it matters — verification, restructuring, and the addition of real experience. Used that way, an agent does not promise to defeat the steep distribution. It promises something the dataset says is more achievable: fewer articles in the dead tail, because fewer of them were poorly chosen in the first place.