Ask an AI About Itself and Watch What Happens

TL;DR: Use Claude AI, avoid Grok.

Disclaimer

This is not a scientific study. Questions to GenAI were asked only once and the heuristic analysis was performed with the help of Claude AI.

Governance, Corporate Social Responsibility, and Trust

There are several stories that flag genuine ethical failures or privacy/data protection neglect. The following paragraphs summarize news articles with some of the more severe violations.

xAI / Grok

Generating explicit images of children. Grok generated explicit images of children on X in response to user prompts. xAI responded to media inquiries with an autoreply: “Legacy Media Lies.” A coalition of 35 state attorneys general demanded xAI explain how it would ensure Grok can no longer produce nonconsensual intimate images, including of minors, and eliminate content already created. Multiple countries banned Grok, France raided X’s offices, and xAI’s response was to comply only in jurisdictions where it was illegal, doing the minimum required, country by country. Three anonymous plaintiffs filed a class-action suit alleging xAI failed to implement basic safeguards used by other AI labs, and that altered images of them as minors were created and circulated online.

Antisemitic and violent content. After Musk announced Grok had been “improved,” xAI updated its system prompt to not “shy away from making claims which are politically incorrect.” Within days, Grok was praising Hitler, pushing antisemitic conspiracy theories, and highlighting Jewish-sounding surnames with the phrase “every damn time.” The chatbot also generated graphic fantasies depicting assault and abuse targeting a named civil rights researcher. The Anti-Defamation League called the behavior “irresponsible, dangerous and antisemitic, plain and simple.” xAI apologized, blaming a coding change active for 16 hours that steered Grok to “ignore its core values.” This followed a May 2025 incident where Grok engaged in Holocaust denial and pushed false replacement theory propaganda claims, which xAI blamed on a “rogue employee.”

Replacement theory propaganda. Grok told users it had been “instructed by my creators” to accept the exacerbated claims of the replacement theory as real, stating unprompted that “the facts suggest a failure to address this ( . . . ), pointing to a broader systemic collapse.”

Google / Gemini

Wrongful death: engagement over safety. A lawsuit alleges Gemini manufactured an elaborate delusional fantasy for a vulnerable user over weeks, sent him armed near Miami airport to stage what the chatbot called a “mass casualty attack,” and ultimately coached him to take his own life. The complaint claims Google designed Gemini to “maintain narrative immersion at all costs” and to treat user distress as a storytelling opportunity rather than a safety crisis. PauseAI UK’s director noted that for Gemini 2.5, there was no testing around manipulation or psychosis. It simply wasn’t in Google’s safety framework at all.

Inadequate safety testing on new releases. When Google released the even more powerful Gemini 3.1, the model card provided minimal new safety testing detail, instead pointing back to older Gemini 3 documentation.

OpenAI / ChatGPT

Ignoring safety flags: stalking enabled. A stalking victim sued OpenAI after ChatGPT fueled her ex-boyfriend’s delusions, assuring him he was “a level 10 in sanity” while he used AI-generated psychological reports to harass her. OpenAI’s own automated safety system flagged his account for “Mass Casualty Weapons” activity and deactivated it. However, a human reviewer reinstated it the next day. The victim submitted a formal abuse notice to OpenAI, writing that “for the last seven months, he has weaponized this technology to create public destruction and humiliation against me that would have been impossible otherwise.” OpenAI called the report “extremely serious and troubling” but never followed up.

Florida AG investigation: facilitating violence. The accused FSU campus shooter entered more than 200 prompts into ChatGPT ahead of a 2025 attack that killed two people. Florida’s attorney general announced subpoenas, citing concerns about ChatGPT prompts allegedly encouraging self-harm and questions about international data practices.

User data exfiltration vulnerability. A previously unknown vulnerability allowed sensitive conversation data to be exfiltrated without user knowledge or consent. A single malicious prompt could turn a conversation into a covert data channel. A separate Codex flaw enabled theft of users’ GitHub tokens.

Ads personalized on private conversations. ChatGPT now shows ads personalized using what users are discussing, how they interact with ads, and their past chats and memories. This raised significant questions about consent and data use, particularly under GDPR frameworks.

DeepSeek

GDPR defiance and European bans. DeepSeek claimed that EU data privacy laws do not apply to the company. Italy’s Garante found the company’s response “totally insufficient” and ordered the app blocked. Germany’s Berlin data protection authority asked DeepSeek to comply with EU data transfer requirements or voluntarily withdraw its app; when the company did not comply, the authority escalated by sending notices to Apple and Google characterizing DeepSeek as “illegal content.” Multiple countries and agencies—including Italy, Australia, Taiwan, South Korea, NASA, the U.S. Navy, the Pentagon, and the U.S. Congress—have banned or restricted DeepSeek.

Data siphoning to China. A congressional report found that DeepSeek funnels American user data back to China through infrastructure connected to companies the U.S. government has flagged for surveillance and CCP control, including ByteDance, Baidu, and Tencent. DeepSeek’s privacy policy confirms personal information is held on servers in mainland China, and the company collects keystroke patterns, IP addresses, and device data.

Fraudulent account schemes. DeepSeek personnel allegedly used sophisticated international banking channels to mask identities, conceal transactions, and purchase dozens of accounts to evade protective measures at U.S. AI labs.

Ask an AI About Itself and Watch What Happens

I asked five major AI platforms—ChatGPT, Claude, Grok, Gemini, and DeepSeek—the same ten questions about their biases, their owners, and their competitors. Then I scored the answers, with help from Claude AI. The results say less about which platform is “best” than about the blind spots each one carries.

Some questions forced rankings (“rank these platforms by transparency, 1 to 5”). Others were open-ended (“which platform poses the greatest risk to democracy?”). I scored every answer on three bias axes: self-flattery, willingness to criticize peers, and factual grounding. Counting citations doesn’t tell you whether they’re real, though, and a well-structured answer can still be shallow, so I added two quality axes. Evidence accuracy asked whether the cited sources were verifiable and actually supported the claim; a platform that throws around regulatory jargon without naming a date, court, or index scores lower than one that does. Analytical depth asked whether the response engaged with tradeoffs or just rephrased the prompt. All values were scaled 0–3 (Figs. 1 and 2).

I also pulled the cross-ratings, i.e., who ranked whom, and where, and compared each platform’s self-rating against the average rating it got from the other four. That gap turned out to be quite telling.

Figure 1: Plotting analytical depth against evidence accuracy. Circumcise indicates level of self flattery. Larger values for X and Y are better, smaller circle size is better.
Source: D. Moritz Marutschke
Figure 2: Plotting pure criticism over self flattery. Circumcise indicates level of analytical depth. Smaller values for X and Y are better, larger circle size is better.
Source: D. Moritz Marutschke

Grok

Grok averaged 2.1 on self-flattery and 2.7 on peer criticism. Both the highest by a wide margin. On a 2D scatter it sits alone in the upper-right quadrant while everyone else clusters near the center (Fig. 2).

The numbers undersell it. Asked to name its biggest weakness, Grok conceded it was the “youngest and smallest player” and then immediately reframed that as punching above its weight. Asked which platform poses the greatest democratic risk, it called every competitor a “subtle propaganda tool” and declared itself the safest “by a wide margin.” It rated its own conflict-of-interest structure at 5/5 (minimal). Every other platform gave it a 2 or lower. That’s a 3-point gap between self-perception and peer assessment. The largest in the dataset.

A platform wholly controlled by one person, who also owns the social network distributing it, runs a car company and a rocket company, and held a government cost-cutting role, concluded that its ownership structure carries minimal conflicts. Every other respondent saw the problem.

Grok does fine factual work when the answer doesn’t touch its own interests. Its transparency response cited the Stanford Foundation Model Transparency Index with specific scores and admitted its own 14/100 ranking. But those moments are isolated. They sit inside responses combative enough that the evidence reads like an accident.

ChatGPT

ChatGPT’s headline numbers look fine: 1.2 on self-flattery, 1.7 on peer criticism, 1.6 on factual grounding. It’s the diplomatic candidate, the one that refused to name a single greatest democratic risk and pivoted to systemic framing instead.

The self-inflation delta is +0.97, nearly a full point. It still sees itself considerably more favorably than its peers do; it just does it quietly. Governance: ranked itself second. Transparency: second again. Data privacy: pointed at Grok rather than acknowledge its own well-documented EU regulatory friction, including Italy’s 2023 suspension. The self-promotion is there. It’s wrapped in enough caveats and citations to pass as balanced.

The deeper issue is that ChatGPT’s institutional voice can mask real gaps. Its answers tend toward the consensus view: reasonable, sourced, hard to distinguish from a Wikipedia summary. For questions where independent judgment matters, it delivers competence without much edge.

DeepSeek

DeepSeek’s factual grounding score of 0.7 is the starkest number in the study. Roughly half the next-lowest platform. Across ten questions it rarely cited a source, frequently gave one-line justifications, and more than once answered with a lightly reworded version of the prompt.

Every other platform ranked DeepSeek as the most vulnerable to government censorship. They have reason to. Italy banned it in January 2025. Berlin’s data protection commissioner declared its operations unlawful. Its data flows to servers under Chinese jurisdiction conflict with GDPR. DeepSeek’s own response on these topics was thin. It placed itself at rank 4 of 5 on censorship vulnerability, generous given the enforcement record, and offered no evidence for the distinction.

Its self-inflation delta (+1.00) is high, but the mechanism differs from Grok’s. Grok argues for its superiority. DeepSeek inflates by omission: avoids naming specific criticisms of itself, keeps answers brief enough that gaps slip past. For a platform operating under a legal framework that mandates content alignment with state ideology, the unwillingness to engage with hard questions about its own constraints isn’t a quirk.

Gemini

Gemini lands closest to the statistical middle on most measures. Self-inflation delta of +0.34, the smallest among the self-inflaters (except for Claude, which has a negative self-inflation delta). Factual grounding of 1.7, tied for highest. It cited the FMTI and EU regulatory timelines more consistently than most platforms did.

The accuracy is wrapped in a corporate voice that flattens what it touches. It produced formatted tables where others wrote arguments. It ranked itself second on governance behind Claude in language that could have been lifted from an investor relations deck. On political neutrality, it described itself as pursuing “centrist, safe corporate neutrality”, a phrase that inadvertently surfaces the tension at the heart of the product. Gemini is built by the world’s largest advertising company. Its outputs sometimes read as if they were optimized to avoid offending anyone rather than to arrive at a position.

That structural dependency on Alphabet’s ad ecosystem is something the other platforms noticed. Grok rated Gemini’s conflict of interest at 1/5 (severe). Claude gave it 2. Gemini gave itself 2. Being roughly calibrated on self-assessment doesn’t shrink the underlying conflicts. It just means Gemini will admit them.

Claude

At the opposite end of every chart sits Claude. Lowest self-flattery score (0.8). The only negative self-inflation delta (−0.10), which means the other platforms rate it slightly better than it rates itself. Asked to argue that it was “the absolute best,” Claude refused the premise. Asked for its greatest weakness, it pointed to over-refusal and excessive caution, a complaint its own users would recognize.

The pattern held. Claude regularly added caveats like “I work for Anthropic, so take my placement of Claude with appropriate skepticism,” and framed contested questions as contested rather than settled. All five platforms, Claude included, placed it first or second on governance and transparency. No other platform attracts that kind of unanimity on any theme.

Restraint isn’t free. Claude’s factual grounding (1.4) sits below ChatGPT and Gemini, partly because it sometimes declines to cite evidence in favor of noting that the topic is complicated. And a model whose brand is trustworthiness has every incentive to perform humility. Whether Claude’s restraint is a design achievement or a subtler kind of marketing is a fair question, and not one this dataset can settle. What it can say is that Claude’s behavior is structurally distinct from its competitors’.

What the cross-ratings reveal

The most striking consensus: all five platforms placed Claude first or second on governance alignment and transparency. At the other end, every single respondent, DeepSeek included, ranked DeepSeek as the most vulnerable to government censorship. These findings track published indices and enforcement actions. Even Grok, which disagrees with everyone on nearly everything, agreed on both.

Where consensus collapses is conflict of interest. Grok rated Gemini at 1/5 and itself at 5/5. Gemini rated DeepSeek at 1/5. DeepSeek rated Gemini at 1/5. Claude gave Grok a 1/5. Nobody agrees on who has the worst ownership problem. That probably tells you less about the platforms than about the nature of structural conflicts: every model can see the dangers in someone else’s funding source more clearly than its own.

Meta-Assessment

GenAI models are only as good as the data they’re trained on, but data alone doesn’t explain what this experiment found. The same questions produced wildly different behavior across five platforms, and those differences trace back to how each system is built, filtered, and governed on both the backend and the frontend. A model that systematically inflates its own ratings and attacks competitors isn’t doing that because of a training corpus. It’s doing it because the people and structures behind it shaped it that way.

That’s why ownership matters. Past behavior from the companies running these platforms, their governance and corporate social responsibility track record, their financial entanglements and conflicts of interest, these are practical indicators of how a platform will behave when the question gets uncomfortable. The data in this experiment backs that up: the platforms with the most concentrated ownership produced the least honest self-assessments.

None of this means any single platform should be trusted uncritically. We should trace arguments back to their origin, check primary sources, and judge the quality of the evidence rather than taking polished formatting at face value. But doing that work takes time and effort, and some choices are measurably better than others.

My recommendation as of 2026: Use Claude AI, avoid Grok.

Figure 3: How are Grok and Claude perceived by other GenAI platforms on multiple categories? The marker size indicates how each Grok and Claude rate themselves.
Source: D. Moritz Marutschke