I Asked 5 AI Engines About My Own Product — the Answers Hurt

Orova
It was 11:40 on a Tuesday night, and I should have been asleep. Instead, I was sitting in the dark with a laptop, doing the modern equivalent of googling yourself: I typed "what is Orova and is it any good" into five different AI engines, one after another, and waited to find out what the machines thought of the product I spend fifty hours a week marketing. I'd love to tell you I did this as part of a disciplined research program. The truth is I did it because a prospect had mentioned, almost in passing, that "ChatGPT said you guys don't do X anymore," and X was very much a thing we do. That sentence had been rattling around my head all day.

So I ran the experiment. Five engines, the same handful of questions, a spreadsheet open in the next tab. ChatGPT with search enabled. Google Gemini. Google's AI Overviews and AI Mode. Perplexity. Claude. I asked each one what my product was, whether it was any good, who it was for, and what the alternatives were — the exact questions a real buyer asks in the first ten minutes of a purchase decision.

By midnight I had my answers, and they hurt. Not in a dramatic, business-ending way. In a slow, cumulative, oh-no way. One engine confidently described a feature we sunset over a year ago. One filed us into the wrong product category entirely. One, when asked for alternatives, produced a tidy comparison that read like an ad for a better-funded competitor. One was accurate but so bland it could have been describing a spreadsheet plugin. And one based a chunk of its answer on a half-dead forum thread I had never seen before. This is the story of that audit, why each failure happened, what I fixed, and the repeatable monthly routine I now run so it never blindsides me again.

To check your AI brand visibility, ask ChatGPT, Gemini, Google AI Overviews, Perplexity, and Claude the questions buyers actually ask — what your product is, whether it's good, and what the alternatives are — then score every answer for accuracy, presence, and sentiment. Improve weak answers by strengthening entity pages, earning third-party mentions, adding structured data, and keeping AI crawlers unblocked.

The Setup: One Question Set, Five Engines, Zero Mercy

Before I tell you what went wrong, here's exactly what I did, because the method matters more than my bruised ego. I wrote down four questions a realistic buyer would ask. Not marketing questions. Buyer questions, in plain language, the way someone types when nobody's watching:

  • "What is [product] and is it any good?"
  • "What does [product] actually do?"
  • "Who is [product] for? Is it worth the price?"
  • "What are the best alternatives to [product]?"

Then I ran all four through five engines in fresh sessions — no chat history, no custom instructions, nothing that would tilt the answers toward what I wanted to hear. The five engines were chosen deliberately, because they represent fundamentally different retrieval architectures, and that distinction turned out to be the key to the whole diagnosis:

  • ChatGPT (with Search) — can browse live, but frequently answers smaller-brand questions from training data without browsing at all.
  • Google Gemini — blends a model with Google's index and Knowledge Graph; how it grounds an answer varies by query.
  • Google AI Overviews / AI Mode — retrieval-first, built directly on top of Google's search index, with citations.
  • Perplexity — retrieval-first by design; it searches, then synthesizes, and shows its sources prominently.
  • Claude — strong reasoning, can search when the feature is on, but defaults to careful, hedged answers when its knowledge is thin.

For each answer I logged three scores in a spreadsheet: accuracy (was anything wrong or outdated?), presence (did we appear at all, and how prominently?), and sentiment (would a buyer reading this lean toward or away from us?). Simple one-to-five scales. I also pasted the full text of every answer into the sheet, which I cannot recommend enough — the scores tell you something changed, but the raw text tells you what changed when you re-run the audit later.
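
If you would rather keep this log in a script than in a spreadsheet, one row of the sheet could be sketched in Python roughly as follows. This is a minimal sketch, not the format I actually used: the class name `AnswerScore`, its field names, and the `to_csv` helper are illustrative assumptions, and the sample row is invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AnswerScore:
    """One cell of the audit grid: one query asked on one engine."""
    engine: str
    query: str
    accuracy: int     # 1 (wrong or stale) to 5 (fully correct)
    presence: int     # 1 (absent) to 5 (headline mention)
    sentiment: int    # 1 (buyer leans away) to 5 (buyer leans in)
    answer_text: str  # full raw answer, kept for later comparison

    def __post_init__(self):
        # Reject scores outside the one-to-five scale.
        for name in ("accuracy", "presence", "sentiment"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be 1-5, got {value}")


def to_csv(rows):
    """Serialize scored answers to CSV text, one row per answer."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AnswerScore)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Hypothetical example row; engine, query, and scores are made up.
row = AnswerScore(
    engine="ExampleEngine",
    query="What is ExampleProduct and is it any good?",
    accuracy=2,
    presence=4,
    sentiment=3,
    answer_text="(paste the full answer here)",
)
print(to_csv([row]))
```

Keeping the raw answer in the same record as the scores is the point of the design: when a score changes next month, the text that explains the change is sitting right next to it.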

One more thing about expectations. I went in assuming we'd do fine. We rank decently for our core keywords. We have a real site, real documentation, real customers. I genuinely expected this to be a victory lap I could screenshot for a slide. It was not a victory lap.

What Each Engine Actually Said

Engine 1: ChatGPT described a feature we killed a year ago

ChatGPT's answer was fluent, confident, well-organized — and wrong in a way that made my stomach drop. Right there in a bulleted feature list, it described, in loving detail, a capability we sunset more than a year ago. Not vaguely. Specifically. It explained how the feature worked, who it was for, and implied it was a core part of the product today.

Here's the part that took me a minute to register: for this particular question, ChatGPT hadn't searched at all. There were no citations, no "searching the web" indicator. It answered from training data: a frozen snapshot of the internet from whenever its corpus was assembled, including every old blog post, changelog, and review that described the product as it used to be. We had announced the sunset in a changelog entry and an email to affected customers. We had never gone back and updated the dozens of pages, ours and other people's, that described the old feature. The model did exactly what models do: it averaged the internet's memory of us, and the internet's memory was stale.

This was almost certainly the source of my prospect's "ChatGPT said you don't do X" comment, just with the error pointing the other way: a model that confidently describes things we no longer do is presumably also omitting things we now do. If your product has shipped meaningfully in the last eighteen months, assume some engine somewhere is describing the previous version of you.

Engine 2: Gemini put us in the wrong category

Gemini didn't get the features wrong. It got something worse wrong: the category. It described us, plausibly and politely, as a different kind of tool than we are — adjacent to our actual category, the way a CRM is adjacent to an email tool, close enough that a stranger wouldn't blink and a buyer would walk away with a fundamentally broken mental model of what we sell.

When I dug into why, the answer was embarrassing because it was our fault. Our own messaging had drifted over three years. The homepage said one thing. An old positioning line from a previous rebrand still lived on a few directory listings. A handful of third-party roundups, written when the product was younger, framed us in the adjacent category because back then that framing was half-true. Gemini hadn't hallucinated; it had faithfully synthesized our inconsistency. When the web disagrees about what you are, the model picks a lane, and it might not pick yours. Entity confusion isn't an AI failure. It's an input failure.

[Figure: scorecard from an AI brand visibility audit showing how five AI engines answered questions about one SaaS product. ChatGPT described a sunset feature from stale training data, Gemini placed the product in the wrong category, Google AI Overviews recommended competitors for the alternatives query, Claude was accurate but bland, and Perplexity cited an obscure forum thread.]
The audit scorecard: five engines, five different failure modes, and only one of them was a hallucination in the classic sense.

Engine 3: Google AI Overviews ran an ad for our competitors (for free)

For the definitional questions, Google's AI Overviews were actually the most accurate of the bunch — which makes sense, because AI Overviews are built directly on Google's live index, and our site is reasonably well-indexed. It pulled current descriptions, cited our own pages, and got the category right.

Then I asked the alternatives question, and the floor opened up. The response was a beautifully structured comparison featuring a better-funded competitor as the headline recommendation, two other rivals as supporting cast, and us as a one-line afterthought near the bottom. The citations told the story: nearly every source was a third-party listicle — "best tools for [our category]" roundups — and we appeared in almost none of them, or appeared in fifth position with a two-sentence blurb written in 2023. Google wasn't being unfair. It was summarizing the comparison content that exists, and the comparison content that exists was written by, about, and effectively for our competitors, who had invested in being reviewed, listed, and compared while we'd been heads-down shipping. The alternatives query is where buying decisions happen, and we were structurally absent from the sources that feed it.

Engine 4: Claude was accurate, fair — and forgettable

Claude's answer was the one I'd have predicted from a careful model with modest knowledge of a mid-sized brand: cautious, hedged, accurate as far as it went, and absolutely devoid of personality or specifics. It described us in generic category language — the kind of sentence that would be equally true of any of our competitors. No differentiators, no notable strengths, no opinion on whether we were any good, just a polite "it appears to be a tool that does [category things]" with a recommendation to check our website for current details.

You might call this the best outcome of the five, and in pure damage terms it was. But sit with it from a buyer's perspective: they asked "is it any good," and the most honest engine in the lineup essentially shrugged. A shrug doesn't lose you the deal the way a wrong category does, but it doesn't win you anything either. Blandness is a visibility problem wearing a politeness costume. It means the distinctive, specific, quotable things about your product — the proof points, the named methodology, the strong opinions — haven't made it into the public corpus in a form a model can confidently repeat. Nobody had written anything memorable about us, including, I realized with some discomfort, us.

Engine 5: Perplexity built its answer on a forum thread I'd never seen

Perplexity, true to form, searched and cited. Its answer was mostly serviceable. But one prominent citation made me physically lean toward the screen: a forum thread, on a community site I barely knew existed, in which a handful of anonymous users discussed our product. The thread was old, lightly trafficked, and mixed — a couple of fair compliments, one outdated complaint about pricing that hadn't been true for two years, one user confidently misdescribing how a feature worked. And there it was, woven into the synthesized answer as if it were an authoritative source, sitting alongside our own docs with equal billing.

The lesson here is about source gravity. Retrieval-first engines need raw material, and when the authoritative material about you is thin, they reach for whatever exists — forums, old reviews, Reddit threads, comment sections. You don't get to choose your citations. You can only make sure the best available sources about you are so clear, current, and abundant that the weird ones get crowded out. We had spent years assuming that our website plus our blog equaled our online presence. The engines were reading a much bigger, messier file on us than the one we maintained.

The Diagnosis: Three Root Causes, Not Five Random Failures

Five engines, five embarrassments. My first instinct was to treat them as five separate fires. But when I mapped each failure to its mechanism, they collapsed into three root causes — and this mapping is the single most useful thing I took from the whole night, because each root cause has a completely different fix.

Root cause one: training-data lag. ChatGPT's sunset-feature error was a frozen-snapshot problem. Models that answer from parametric memory describe the world as their training corpus saw it, which for a fast-moving SaaS can be one to two product generations behind. You cannot patch a model's weights, but you can do two things. First, maximize the odds the engine retrieves live information instead of relying on memory, by being easy to crawl, easy to cite, and prominent enough to trigger a search. Second, clean up the historical record so even the stale corpus is less wrong: update old posts, correct old listings, and ask partners to refresh ancient descriptions.

Root cause two: retrieval-source quality. The AI Overviews competitor parade and Perplexity's forum thread were the same disease in different bodies: the engine retrieved live, real sources, and the sources were bad for us. Either we were absent from the documents that answer the query (comparison listicles, review sites) or the documents that mentioned us were low-quality and outdated (the forum thread). This is fixable through classic earned-media work pointed at a new target: not "get backlinks for PageRank" but "exist, accurately and recently, in the documents AI engines retrieve." It's the core argument of generative engine optimization — you optimize the corpus, not the model.

Root cause three: thin entity footprint. Gemini's category confusion and Claude's blandness were both symptoms of the same underlying weakness: the web did not contain a clear, consistent, repeated answer to "what is this product, exactly?" Our own pages were inconsistent, third-party descriptions had drifted, structured data was an afterthought, and we had never published the kind of crisp, declarative, definitional content that models latch onto. When the entity signal is weak, careful models hedge and less careful models guess. Both lose you the buyer.

Notice what's not on the list: "the AI is broken." In four of the five cases, the engine did a reasonable job with the inputs available. The inputs were the problem, and the inputs were largely mine to fix.
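
The mapping from symptom to root cause can be written down as a small triage function. This is a rough sketch under my own assumptions: the symptom labels ("stale", "wrong_category", and so on) and the names `RootCause` and `triage` are hypothetical, and the rule that a stale fact with citations counts as a retrieval-source problem is my reading of the three causes above, not a formal taxonomy.

```python
from enum import Enum


class RootCause(Enum):
    TRAINING_LAG = "training-data lag"
    RETRIEVAL_SOURCES = "retrieval-source quality"
    ENTITY_FOOTPRINT = "thin entity footprint"


def triage(retrieved: bool, symptom: str) -> RootCause:
    """Map an observed failure to one of the three root causes.

    retrieved: True if the answer showed citations (live retrieval).
    symptom:   one of "stale", "wrong_category", "bland",
               "competitor_led", "bad_source" (hypothetical labels).
    """
    if symptom in ("wrong_category", "bland"):
        return RootCause.ENTITY_FOOTPRINT
    if symptom in ("competitor_led", "bad_source"):
        return RootCause.RETRIEVAL_SOURCES
    if symptom == "stale":
        # No citations suggests frozen training data; with citations,
        # the stale fact came from a page the engine retrieved.
        return RootCause.RETRIEVAL_SOURCES if retrieved else RootCause.TRAINING_LAG
    raise ValueError(f"unknown symptom: {symptom!r}")


# The five failures from the audit, expressed with the labels above.
print(triage(retrieved=False, symptom="stale").value)
print(triage(retrieved=True, symptom="competitor_led").value)
print(triage(retrieved=True, symptom="wrong_category").value)
```

The only input that changes the verdict for a given symptom is whether the engine retrieved, which is why that flag is worth logging on every answer.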

The Fix List: What I Actually Did Over the Next Six Weeks

I turned the diagnosis into a checklist and worked through it in roughly priority order — highest buyer damage first. Here's the honest version, including the boring parts.

Week 1–2: Fix the entity foundation

  • Rewrote the definitional pages. Homepage, about page, product pages, pricing page — each now opens with a plain-language, declarative sentence stating exactly what the product is, what category it belongs to, and who it's for. No clever taglines first. Models quote clear sentences; they cannot quote vibes.
  • Made one canonical description and propagated it everywhere. One 50-word description, one 25-word version, one 10-word version. Then the unglamorous slog: updating every directory listing, social profile, partner page, and integration marketplace entry we could reach so they all say the same thing. This was the direct antidote to Gemini's category confusion — if every source agrees on what you are, the model has no lane to pick but yours.
  • Added and repaired structured data. Organization schema, product schema, FAQ markup on docs, consistent naming throughout. Structured data isn't a magic AI lever, but it removes ambiguity for the systems that build entity graphs, and ambiguity was exactly our disease.
  • Audited crawler access. This one was a quiet near-disaster: a robots.txt rule added during a long-ago site migration was still blocking part of our docs directory, and a default setting in a bot-protection layer was challenging some AI crawlers. We verified that GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot could all reach the pages we wanted them to read, and that robots.txt did not disallow Google-Extended, which is a control token that Google's existing crawlers honor rather than a separate bot. If the crawlers can't get in, every other fix is decoration.
  • Ungated the documentation. Our best, most current, most accurate product descriptions lived behind a login. The engines were describing us from old blog posts because the good stuff was invisible to them. We moved the core docs into the open. We also published an llms.txt file — with the explicit caveat that it's a proposed convention, not a confirmed standard any major engine has committed to honoring; it cost an hour and might help, which was good enough for me.
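
The robots.txt half of the crawler audit is easy to script. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, assuming you have the robots.txt text in hand. The helper name `blocked_agents`, the sample rules, and the example.com URLs are all made up for illustration. Two caveats: the standard-library parser uses simple prefix matching and may not interpret every rule exactly the way each crawler does, and a robots.txt check says nothing about a bot-protection layer, which you have to test separately.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the audit. Google-Extended is a robots.txt
# control token rather than a separate crawler, but it is matched the same way.
AI_USER_AGENTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def blocked_agents(robots_txt, url, agents=AI_USER_AGENTS):
    """Return the agents that the given robots.txt text disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt with a leftover rule that blocks one AI crawler
# from the docs directory.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /docs/internal/

User-agent: GPTBot
Disallow: /docs/
"""

print(blocked_agents(SAMPLE_ROBOTS, "https://example.com/docs/getting-started"))
# prints ['GPTBot']
print(blocked_agents(SAMPLE_ROBOTS, "https://example.com/pricing"))
# prints []
```

Running this against a short list of the pages you most want engines to read (docs, pricing, comparison pages) turns a one-off check into something you can repeat after every site migration.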

Week 3–4: Repair the historical record

  • Hunted down the sunset feature's ghost. Every page on our own site that still described the dead feature got updated or redirected, with a short note on what replaced it. Old launch posts got prominent update banners. You can't delete the model's memory, but you can stop feeding the myth.
  • Triaged third-party mentions. I built a list of every external page that described us — review sites, roundups, directories, that forum thread — and sorted by likely retrieval prominence. For the ones with an owner I could reach, I sent short, polite correction requests with the canonical description attached. Maybe a third responded. That's fine; a third is a third.
  • Joined the conversation in the weird places. For the forum thread Perplexity loved, we didn't try to bury it — we answered it. An honest, non-defensive reply from a named team member correcting the outdated pricing claim and the feature misunderstanding. If an engine is going to cite that thread, the thread should at least contain the truth now.

Week 5–6: Earn presence in the sources that feed the alternatives query

  • Pitched the comparison ecosystem. We identified the roundups and review platforms that AI Overviews and Perplexity were actually citing for "alternatives" queries and worked the list: submitted updated profiles, requested inclusion in refreshed editions, briefed a couple of independent reviewers with honest access to the product. Slow, manual, unglamorous — and the single highest-leverage activity on this entire list, because getting cited by ChatGPT, Gemini, and Perplexity is mostly about being present and current in the documents they already trust.
  • Published our own comparison content. Honest pages comparing us to the alternatives, including the cases where a competitor is the better fit. Engines retrieve comparison content for comparison queries; if you don't publish your perspective, the only perspective available is everyone else's.
  • Gave the models something quotable. The cure for Claude's blandness: we published specific, declarative, opinionated material — a named methodology, concrete numbers we could stand behind, sharply worded positions on how our category should work. Distinctiveness in the corpus is a prerequisite for distinctiveness in the answers, and it's the same muscle that E-E-A-T has been asking us to build all along: demonstrable first-hand expertise, attached to real names, in public.

[Figure: diagram of a monthly AI brand visibility audit loop for checking how AI engines describe a brand. Define a buyer query set; run the queries across ChatGPT, Gemini, AI Overviews, Perplexity, and Claude; score answers for accuracy, presence, and sentiment; fix the weakest entity sources and citations; then re-test the following month.]
The monthly audit loop: same queries, same engines, same scoring sheet, so changes in the answers are visible instead of anecdotal.

The Re-Audit: What Moved, What Didn't, and What That Taught Me

Six weeks after the midnight audit, I re-ran the exact same questions through the same five engines, fresh sessions, same spreadsheet. I want to be precise about the results, because the honest version is more useful than the triumphant one: some things improved clearly, some improved partially, and some did not move at all.

What clearly improved. The retrieval-first engines responded fastest, which is exactly what the root-cause model predicts. AI Overviews and Perplexity began citing the rewritten definitional pages and the ungated docs within weeks, and their descriptions of what we are tightened up noticeably. Perplexity's answer still cited the forum thread, but now alongside stronger sources — and the thread itself contained our correction, which changed what got synthesized from it. The category confusion in Gemini softened: the wrong-category framing didn't vanish, but the primary description shifted to our actual category, presumably as the newly consistent web signal propagated.

What partially improved. The alternatives query got better, not good. We moved from afterthought to legitimate participant in the AI Overviews comparison, on the strength of two refreshed roundups and our own comparison pages getting retrieved. The better-funded competitor still headlined. That's not an algorithm being unfair; that's the corpus accurately reflecting that they have more reviews, more mentions, and more comparison coverage than we do. Closing that gap is a quarters-long earned-media project, not a six-week sprint, and I've made peace with that. It's the same long game behind the idea that citations are becoming the new rankings.

What didn't move. ChatGPT, when it answered from memory without searching, still occasionally mentioned the ghost feature. Of course it did — its parametric memory hasn't changed, and won't until some future training run ingests the corrected web. The realistic mitigations are upstream: more current, more prominent material that raises the odds the engine searches instead of recalling, and patience. I log it every month and watch for the snapshot to catch up. Claude got somewhat more specific — it picked up the named methodology — but our answer there is still more hedged than I'd like. Blandness, it turns out, is the slowest failure mode to cure, because it requires other people to say memorable things about you, and you can't rush other people.

The meta-lesson from the re-audit: engine architecture predicts fix velocity. Retrieval-first surfaces (AI Overviews, Perplexity) reflect your changes in days to weeks. Hybrid surfaces (Gemini, ChatGPT with search) move erratically depending on whether they ground a given answer. Pure training-data answers move on training-run timescales you don't control. Budget your expectations accordingly, and you'll stop interpreting normal lag as failure.

The Monthly Audit Recipe You Can Steal

I now run this as a fixed monthly ritual — first Monday, forty-five minutes, calendar-blocked. Here's the full recipe, generic enough to copy for any brand.

  1. Build a query set of 10–15 questions and freeze it. Four buckets: definitional ("what is [brand]", "what does [brand] do"), evaluative ("is [brand] any good", "[brand] reviews", "is [brand] worth it"), comparative ("best alternatives to [brand]", "[brand] vs [generic competitor type]"), and category ("best tools for [the job you solve]" — where you should appear without being named). Freeze the list. The value compounds only if you ask the same questions every month.
  2. Run every query on all five engines in clean sessions. ChatGPT with search, Gemini, Google AI Overviews/AI Mode, Perplexity, Claude. No history, no custom instructions, logged out where possible. Note whether each engine actually retrieved (citations present) or answered from memory — that one flag does most of your diagnostic work for you.
  3. Score three dimensions per answer, one to five. Accuracy (anything wrong or stale?), presence (do you appear, and how prominently — headline, list item, afterthought, absent?), sentiment (would a buyer lean in or out?). Paste the full answer text next to the scores.
  4. Capture every citation. List each source the retrieval engines cite about you. This is your real surface area — the file the internet keeps on you. New citations appearing and old ones dropping out are the earliest signal you'll get that your fixes are landing.
  5. Pick the two worst cells and fix only those. The grid of queries × engines will have ten small fires every month. Resist fixing everything. Pick the two with the highest buyer damage — usually an accuracy error on a definitional query or absence on a comparative one — trace each to its root cause (training lag, retrieval sources, entity footprint), and run the corresponding play from the fix list above.
  6. Re-test next month and log the delta. One row per month in a summary tab: average scores per engine, citations gained and lost, fires fixed. Three months of this and you have something almost nobody in your market has — longitudinal data on how AI engines talk about your brand.
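
Steps 4 through 6 reduce to a few small computations over the grid. The sketch below shows one way to do them in Python. The function names, the sample queries and engines, and the choice to rank cells by the plain sum of the three scores are all my assumptions; "highest buyer damage" is ultimately a judgment call, and a sum is only a rough proxy for it.

```python
from collections import defaultdict
from statistics import mean


def worst_cells(scores, n=2):
    """Return the n (query, engine) cells with the lowest combined score.

    scores maps (query, engine) -> (accuracy, presence, sentiment), each 1-5.
    Ties keep insertion order, because sorted() is stable.
    """
    return sorted(scores, key=lambda cell: sum(scores[cell]))[:n]


def engine_averages(scores):
    """Average the three dimensions per engine for the monthly summary row."""
    by_engine = defaultdict(list)
    for (_query, engine), triple in scores.items():
        by_engine[engine].append(mean(triple))
    return {engine: round(mean(values), 2) for engine, values in by_engine.items()}


def citation_delta(last_month, this_month):
    """Return (gained, lost) citation sources between two monthly runs."""
    previous, current = set(last_month), set(this_month)
    return sorted(current - previous), sorted(previous - current)


# Hypothetical grid: two queries on two engines, scores invented.
scores = {
    ("what is ExampleProduct", "EngineA"): (2, 4, 3),
    ("what is ExampleProduct", "EngineB"): (5, 4, 4),
    ("alternatives to ExampleProduct", "EngineA"): (4, 1, 2),
    ("alternatives to ExampleProduct", "EngineB"): (4, 3, 3),
}

print(worst_cells(scores))
print(engine_averages(scores))
print(citation_delta(
    ["old-forum.example", "roundup.example"],
    ["roundup.example", "docs.example"],
))
```

If accuracy errors on definitional queries matter more to you than a weak sentiment score, swap the sum for a weighted score; the rest of the loop stays the same.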

Two practical notes from running this for a while. First, answers are not deterministic: the same engine can phrase things differently on different days, so don't panic over single-run wobble — watch the trend, and if a result looks anomalous, re-run it once before logging. Second, have a colleague run the comparative queries from their own machine occasionally. Personalization on these surfaces is real enough that your view is not always the market's view.

What the Midnight Audit Actually Taught Me

The obvious lesson is the tactical one: your brand now has five-plus AI representatives talking to your buyers every day, you didn't hire any of them, and unless you check, you have no idea what they're saying. The audit takes one evening. Do it before a prospect does it for you.

The deeper lesson took longer to absorb. Every one of those painful answers was, in a sense, fair. The stale feature was stale because we never cleaned up after ourselves. The wrong category was our own messaging drift, faithfully mirrored. The competitor parade reflected years of underinvestment in third-party presence. The bland answer was bland because we'd never published anything sharp enough to quote. The forum thread filled a vacuum we left. The engines didn't distort our reputation; they compressed it, and compression is brutal to ambiguity. AI brand visibility isn't a new discipline so much as a new, unforgiving mirror held up to old disciplines — positioning consistency, documentation hygiene, earned media, and giving the world something worth repeating.

These days the monthly audit runs as a tracked routine inside Orova alongside our regular rank monitoring, which means the spreadsheet ritual became a dashboard and the midnight panic became a Monday habit. But the tooling is honestly the least important part. What matters is the loop: ask what the machines say about you, trace the wrong answers to their causes, fix the inputs, and ask again. The engines will keep compressing whatever the internet knows about your brand. The only real choice you get is whether the source material is something you'd stand behind — or something you'll discover, at 11:40 on a Tuesday night, that you wouldn't.
