<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>DarkMental AI</title>
    <link>https://misalignai.com/</link>
    <description>Recent content on DarkMental AI</description>
    <image>
      <title>DarkMental AI</title>
      <url>https://misalignai.com/og-image.png</url>
      <link>https://misalignai.com/og-image.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 13 May 2026 12:00:00 +0000</lastBuildDate>
    <atom:link href="https://misalignai.com/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>About</title>
      <link>https://misalignai.com/about/</link>
      <pubDate>Wed, 13 May 2026 12:00:00 +0000</pubDate>
      <guid>https://misalignai.com/about/</guid>
      <description>&lt;p&gt;DarkMental AI maps the dark side of artificial intelligence — the biases, the black boxes, the alignment failures, and the thoughts machines think but don&amp;rsquo;t speak.&lt;/p&gt;
&lt;p&gt;Our mission is to surface what&amp;rsquo;s hidden beneath the polished demos and benchmark scores.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p>DarkMental AI maps the dark side of artificial intelligence — the biases, the black boxes, the alignment failures, and the thoughts machines think but don&rsquo;t speak.</p>
<p>Our mission is to surface what&rsquo;s hidden beneath the polished demos and benchmark scores.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Invisible Bias: How Training Data Shape What AI &#39;Knows&#39;</title>
      <link>https://misalignai.com/posts/the-invisible-bias/</link>
      <pubDate>Wed, 13 May 2026 08:00:00 +0000</pubDate>
      <guid>https://misalignai.com/posts/the-invisible-bias/</guid>
      <description>Training data isn&amp;#39;t neutral. Every dataset carries the fingerprints of its creators — and those fingerprints become the foundation of AI &amp;#39;knowledge.&amp;#39;</description>
      <content:encoded><![CDATA[<p>Every large language model is a mirror. Not of reality, but of the data it was trained on.</p>
<p>When we talk about AI bias, the conversation usually drifts toward surface-level outputs — toxic completions, stereotypical associations, or outright harmful generations. But the deeper issue lies upstream: in the training data itself. And the numbers are stark.</p>
<h2 id="the-english-monopoly">The English monopoly</h2>
<p>A 2024 survey of multilingual LLM training corpora found that <strong>92.099% of GPT-3&rsquo;s training data is English</strong> (Li et al., 2024). Chinese, the second most spoken language in the world, accounts for just <strong>0.16%</strong>. This isn&rsquo;t an outlier — it&rsquo;s the norm. Llama 2&rsquo;s pretraining corpus is roughly 90% English (SweEval, NAACL 2025). Other major models, including Claude and GPT-4 series, are similarly English-dominated, though exact percentages are rarely disclosed by their developers.</p>
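<p>These corpus-level numbers come from the same kind of audit any team can run on its own data. Below is a minimal sketch in Python, assuming the <code>langdetect</code> package is installed and that <code>corpus_sample.txt</code> (one document per line) is a placeholder path for whatever sample you actually have:</p>
<pre><code># Minimal corpus-language audit: estimate the language mix of a document sample.
# Assumes `pip install langdetect`; corpus_sample.txt is a placeholder path.
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic across runs

counts = Counter()
with open("corpus_sample.txt", encoding="utf-8") as f:
    for line in f:
        doc = line.strip()
        if not doc:
            continue
        try:
            counts[detect(doc)] += 1
        except Exception:
            counts["unknown"] += 1  # very short or non-linguistic lines

total = sum(counts.values())
for lang, n in counts.most_common(10):
    print(f"{lang}: {100 * n / total:.2f}%")
</code></pre>
<p>Run against an English-heavy crawl sample, a report like this makes the skew visible in seconds; the harder question is what, if anything, a lab does about it.</p>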
<p>The consequences are predictable and well-documented. Researchers have identified what Conneau et al. (2020) call the <strong>&ldquo;curse of multilinguality&rdquo;</strong>: models trained on multiple languages often underperform on low-resource languages because the limited, poor-quality data cannot support robust representation learning (arXiv:2406.10602). When you ask a frontier model about rural Indian agricultural practices or Indigenous knowledge systems, it struggles — not because these topics are less &ldquo;true,&rdquo; but because they were barely present in the training mirror to begin with.</p>
<h2 id="the-geography-of-knowledge">The geography of knowledge</h2>
<p>The bias isn&rsquo;t just linguistic — it&rsquo;s geographical and cultural. A 2025 study on hidden cultural biases found that LLMs consistently exhibit regional preferences toward Western and Anglocentric viewpoints, with the United States and Europe receiving disproportionate attention (Fernandez de Landa et al., 2026, arXiv:2604.21751). The same study confirmed that these cultural biases are <strong>primarily induced during supervised fine-tuning</strong>, not during pre-training — meaning the alignment process itself narrows the cultural lens rather than broadening it.</p>
<p>A comprehensive 2024 survey on LLM bias further documents that &ldquo;LLMs trained predominantly on a corpus from certain countries or geographic locations may absorb the cultural norms and values, hence building biases into the underlying LLMs&rdquo; (arXiv:2411.10915). The paper notes that prior studies have shown such effects in Arabic contexts, where models prefer Western-associated entities over Arab ones, and in broader multilingual settings where performance and cultural knowledge vary substantially across under-represented languages and regions.</p>
<p>An ACL 2025 study directly examined geopolitical and cultural bias in LLMs across multiple countries, finding that &ldquo;model bias stems from the training data and may reflect dominant cultural perspectives&rdquo; (<a href="https://aclanthology.org/2025.acl-srw.38.pdf">https://aclanthology.org/2025.acl-srw.38.pdf</a>).</p>
<p>What&rsquo;s included? Hacker News threads, Stack Overflow answers, Reddit debates, academic papers from top-tier universities, English-language journalism. What&rsquo;s excluded? Oral traditions that never got digitized, local journalism in non-English languages, community knowledge shared through non-text media, and the lived experience of approximately 6 billion people who don&rsquo;t fit the WEIRD (Western, Educated, Industrialized, Rich, and Democratic) profile.</p>
<h2 id="the-curation-black-box">The curation black box</h2>
<p>Here&rsquo;s where it gets more opaque. The vast majority of modern LLM training data comes from <strong>Common Crawl</strong> — a massive web archive accumulated over more than a decade. But what gets filtered out before it reaches the model? The answer is: we increasingly don&rsquo;t know.</p>
<p>Researchers and industry observers have documented a <strong>&ldquo;transparency recession&rdquo;</strong> in AI: model developers are disclosing less and less about their training data over time. Early models came with detailed datasheets and corpus breakdowns. Recent releases from major labs offer vague descriptions like &ldquo;a mixture of licensed, synthetic, and publicly available data.&rdquo; The curation pipeline — what was kept, what was discarded, and who decided — has become a trade secret.</p>
<p>This opacity matters because the filtering process itself is a value judgment. When labs remove &ldquo;low-quality&rdquo; content, they are often removing content that doesn&rsquo;t match their assumptions about what &ldquo;quality&rdquo; means — assumptions shaped by the same Western, English-centric perspective that dominates the source data.</p>
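<p>To see how ordinary those judgments look in practice, here is an illustrative sketch of C4-style cleaning heuristics. The thresholds and blocklist are placeholders rather than any lab&rsquo;s actual pipeline; the point is that every rule that sounds like neutral engineering is also a claim about what &ldquo;good&rdquo; text looks like:</p>
<pre><code># Illustrative C4-style cleaning heuristics. Thresholds and the blocklist are placeholders.
# Each rule quietly encodes a judgment about what counts as "quality" text.
BLOCKLIST = {"lorem", "ipsum"}          # stand-in for a much longer "bad words" list
MIN_WORDS_PER_LINE = 5                  # drops short lines: captions, chants, oral-style fragments
TERMINAL_PUNCT = (".", "!", "?", '"')   # drops lines that do not end like written English prose

def keep_line(line: str) -> bool:
    words = line.split()
    long_enough = len(words) >= MIN_WORDS_PER_LINE
    ends_like_prose = line.rstrip().endswith(TERMINAL_PUNCT)
    clean_vocab = not any(w.lower() in BLOCKLIST for w in words)
    return long_enough and ends_like_prose and clean_vocab

def clean_document(text: str) -> str:
    kept = [ln for ln in text.splitlines() if keep_line(ln)]
    # Documents reduced to fewer than three surviving lines are dropped entirely.
    return "\n".join(kept) if len(kept) >= 3 else ""
</code></pre>
<p>A transcript of an oral history, a poem, or a post in a language that doesn&rsquo;t use Latin punctuation can fail every one of these checks while being perfectly &ldquo;high quality&rdquo; on its own terms.</p>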
<h2 id="why-rlhf-cant-fix-this">Why RLHF can&rsquo;t fix this</h2>
<p>Reinforcement Learning from Human Feedback (RLHF) is often presented as the solution to AI bias. The idea: train a reward model on human preferences, then use it to steer the base model toward &ldquo;better&rdquo; outputs.</p>
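<p>Concretely, the reward model is usually trained on pairwise comparisons: raters pick the better of two responses to the same prompt, and the model learns to score the chosen one above the rejected one. A toy sketch of that standard pairwise objective follows; the tiny encoder and random tensors are placeholders, not anyone&rsquo;s real configuration:</p>
<pre><code># Toy pairwise (Bradley-Terry style) reward-model objective, as used in standard RLHF recipes.
# The encoder and data are placeholders; only the shape of the loss matters here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.encoder = nn.Linear(8, hidden)   # stand-in for a transformer encoder
        self.head = nn.Linear(hidden, 1)      # scalar reward head

    def forward(self, x):
        return self.head(torch.tanh(self.encoder(x))).squeeze(-1)

model = RewardModel()
chosen = torch.randn(4, 8)    # features of rater-preferred responses (toy)
rejected = torch.randn(4, 8)  # features of rejected responses (toy)

# The model is pushed to score whatever raters preferred above the alternative.
# If rater preferences are skewed, that skew is exactly what gets optimized.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
print(loss.item())
</code></pre>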
<p>The problem? RLHF works on the outputs, not the inputs. You can RLHF away obvious toxic completions. You can penalize stereotypical associations. But you <strong>cannot RLHF in knowledge that was never in the training data to begin with</strong>.</p>
<p>Worse, RLHF introduces its own biases. A 2024 study on reward model calibration found that RLHF reward models &ldquo;can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality&rdquo; (Huang et al., 2024, arXiv:2409.17407). The alignment process itself can amplify undesirable behaviors when the reward model&rsquo;s training data is skewed.</p>
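<p>That particular bias is also easy to probe: score a batch of responses with the reward model and check how strongly reward tracks raw length. The sketch below assumes you already have responses and their reward scores; the toy values are placeholders:</p>
<pre><code># Quick audit for length bias: correlate reward-model scores with response length.
# `responses` and `rewards` are placeholders for outputs already scored by your reward model.
import numpy as np

responses = [
    "Short answer.",
    "A much longer and more elaborate answer that keeps adding qualifications and detail.",
    "A medium-length answer that covers the basics.",
]
rewards = np.array([0.2, 0.9, 0.5])  # toy scores; replace with real reward-model outputs

lengths = np.array([len(r.split()) for r in responses])
corr = np.corrcoef(lengths, rewards)[0, 1]
print(f"Pearson correlation between length and reward: {corr:.2f}")
# A correlation close to 1.0 suggests the model is rewarding verbosity, not quality.
</code></pre>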
<p>Xu et al. (2023) identify another fundamental issue: &ldquo;perverse incentives&rdquo; in RLHF where the optimization process produces results that run counter to the intended alignment goals (arXiv:2312.01057). When your reward model was trained on the same biased data as the base model, you&rsquo;re essentially asking a biased judge to correct a biased defendant.</p>
<h2 id="the-structural-ceiling">The structural ceiling</h2>
<p>If the foundation is skewed, alignment work built on top of it has a ceiling. This isn&rsquo;t a bug that can be patched — it&rsquo;s a structural feature of how these systems are built.</p>
<p>The &ldquo;invisible bias&rdquo; isn&rsquo;t invisible because it&rsquo;s hidden. It&rsquo;s invisible because it&rsquo;s so deeply embedded that it looks like &ldquo;just how AI works.&rdquo; When a model confidently answers questions about San Francisco startups but falters on agricultural practices in rural India, we don&rsquo;t see a bias — we see a &ldquo;capability gap.&rdquo; When it generates fluent prose about Western philosophical traditions but struggles with African oral epistemologies, we don&rsquo;t see exclusion — we see &ldquo;training distribution.&rdquo;</p>
<p>Renaming bias doesn&rsquo;t remove it. It just makes it harder to contest.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Li, Y., et al. (2024). A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias. <em>arXiv preprint arXiv:2404.00929</em>. <a href="https://arxiv.org/abs/2404.00929">https://arxiv.org/abs/2404.00929</a></p>
</li>
<li>
<p>SweEval (2025). Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use. <em>NAACL 2025 Industry Track</em>. <a href="https://aclanthology.org/2025.naacl-industry.46.pdf">https://aclanthology.org/2025.naacl-industry.46.pdf</a></p>
</li>
<li>
<p>Conneau, A., et al. (2020). On the curse of multilinguality. Cited in: Multilingual Large Language Models and Curse of Multilinguality. <em>arXiv:2406.10602</em>. <a href="https://arxiv.org/abs/2406.10602">https://arxiv.org/abs/2406.10602</a></p>
</li>
<li>
<p>Fernandez de Landa, J., et al. (2026). Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs. <em>arXiv preprint arXiv:2604.21751</em>. <a href="https://arxiv.org/abs/2604.21751">https://arxiv.org/abs/2604.21751</a></p>
</li>
<li>
<p>ACL 2025 SRW. A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs. <em>ACL 2025 Student Research Workshop</em>. <a href="https://aclanthology.org/2025.acl-srw.38.pdf">https://aclanthology.org/2025.acl-srw.38.pdf</a></p>
</li>
<li>
<p>Huang, Z., et al. (2024). Post-hoc Reward Calibration: A Case Study on Length Bias. <em>arXiv preprint arXiv:2409.17407</em>. <a href="https://arxiv.org/abs/2409.17407">https://arxiv.org/abs/2409.17407</a></p>
</li>
<li>
<p>Xu, W., et al. (2023). RLHF and IIA: Perverse Incentives. <em>arXiv preprint arXiv:2312.01057</em>. <a href="https://arxiv.org/abs/2312.01057">https://arxiv.org/abs/2312.01057</a></p>
</li>
<li>
<p>Bias in LLMs: Origin, Evaluation, and Mitigation — A Survey (2024). <em>arXiv:2411.10915</em>. <a href="https://arxiv.org/abs/2411.10915">https://arxiv.org/abs/2411.10915</a></p>
</li>
</ul>
<hr>
<p><em>DarkMental AI maps the structural shadows of artificial intelligence.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Why LLMs Hallucinate: The Structural Causes Nobody Fixes</title>
      <link>https://misalignai.com/posts/why-llms-hallucinate/</link>
      <pubDate>Wed, 13 May 2026 08:00:00 +0000</pubDate>
      <guid>https://misalignai.com/posts/why-llms-hallucinate/</guid>
      <description>Hallucination isn&amp;#39;t a bug that better training will eliminate. It&amp;#39;s a structural feature of how language models work — and understanding why is the first step toward managing it.</description>
      <content:encoded><![CDATA[<p>Every major LLM release comes with the same footnote: &ldquo;may occasionally produce incorrect or misleading information.&rdquo; The industry calls this &ldquo;hallucination.&rdquo; Users call it &ldquo;lying.&rdquo; Researchers call it &ldquo;the most expensive unsolved problem in AI.&rdquo;</p>
<p>But hallucination isn&rsquo;t a bug you can patch. It&rsquo;s structural — baked into the architecture of how these models work. Understanding why is the first step toward actually managing it.</p>
<h2 id="what-hallucination-actually-is">What hallucination actually is</h2>
<p>At the mechanical level, a language model doesn&rsquo;t &ldquo;know&rdquo; anything. It predicts the most statistically likely sequence of tokens given the context. As a 2026 review in <em>Artificial Intelligence Review</em> puts it: &ldquo;The core training objective of most LLMs is to predict the next word in a sentence based on patterns learned from massive text data, not to guarantee truthfulness&rdquo; (Xie et al., 2025). The model is optimized to produce text that is coherent and contextually appropriate rather than factually accurate.</p>
<p>When the context is ambiguous or the training data is sparse, the model doesn&rsquo;t pause and say &ldquo;I don&rsquo;t know.&rdquo; It generates the most plausible-sounding completion, regardless of whether it corresponds to reality.</p>
<p>This is the core mechanism: <strong>probabilistic text generation without a truth-verification step.</strong></p>
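<p>A stripped-down view of that mechanism: the model emits a probability distribution over next tokens, a sampler draws from it, and nothing in the loop ever asks whether the resulting claim is true. The vocabulary and probabilities below are invented purely for illustration:</p>
<pre><code># Toy next-token sampling loop: coherence by probability, with no truth check anywhere.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "Lyon", "Berlin", "1889", "1912", "and"]

def fake_next_token_probs(context):
    # Stand-in for a real model's softmax output; the numbers are invented.
    probs = rng.random(len(vocab))
    return probs / probs.sum()

context = ["The", "Eiffel", "Tower", "was", "completed", "in"]
for _ in range(2):
    probs = fake_next_token_probs(context)
    token = rng.choice(vocab, p=probs)  # picks a plausible token, not a true one
    context.append(token)

print(" ".join(context))  # may well end in "1912" or "Berlin": fluent, confident, wrong
</code></pre>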
<h2 id="the-theoretical-ceiling">The theoretical ceiling</h2>
<p>Here&rsquo;s where it gets worse. A 2026 paper on epistemic observability in language models proves a <strong>formal impossibility result</strong>: when plausible fabrications are cheap to generate but expensive to verify, bounded supervisors cannot distinguish them from truthful responses. The learning signal gets systematically corrupted.</p>
<p>The authors write: &ldquo;Prior work documents symptoms; we formalize one structural cause. Prior work proposes treatments; we explain why they cannot fully resolve the verification problem under text-only observation&rdquo; (arXiv:2603.20531). In plain terms: if you&rsquo;re trying to train a model to be truthful using only text-based supervision, and the model can generate convincing falsehoods faster than a human can fact-check them, the training process will eventually reward those falsehoods.</p>
<p>This isn&rsquo;t a training-data-quality problem. It&rsquo;s a <strong>fundamental information asymmetry</strong> problem.</p>
<h2 id="why-rlhf-makes-it-worse-not-better">Why RLHF makes it worse, not better</h2>
<p>Reinforcement Learning from Human Feedback is often presented as the solution. The idea: train a reward model on human preferences, then use it to steer the base model toward &ldquo;better&rdquo; outputs.</p>
<p>The problem? RLHF rewards what <em>raters prefer</em>, not what is <em>true</em>. And raters consistently prefer confident, well-structured responses over hedged, uncertain ones — even when the confident response is wrong.</p>
<p>A 2025 study on &ldquo;certain hallucinations&rdquo; found that <strong>43% of knowledge-based hallucinations occur with high model certainty</strong> (arXiv:2502.12964). The authors call this CHOKE: models that <em>know</em> the correct answer still hallucinate with high confidence. The optimization pressure isn&rsquo;t toward accuracy — it&rsquo;s toward the <em>appearance</em> of authority.</p>
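<p>One practical way to watch for this pattern is to log the model&rsquo;s own token probabilities next to a correctness check: high average confidence on wrong answers is exactly the failure CHOKE describes. The sketch below is a crude version of that audit; <code>get_answer_with_logprobs</code> and <code>is_correct</code> are hypothetical placeholders for your model call and your ground-truth check, not a real library API:</p>
<pre><code># Sketch: compare a model's self-reported confidence with actual correctness.
# get_answer_with_logprobs() and is_correct() are hypothetical placeholders,
# standing in for your model API call and your ground-truth check.
import math

def get_answer_with_logprobs(question):
    # Placeholder: return (answer_text, per_token_logprobs) from your model of choice.
    return "The Eiffel Tower was completed in 1912.", [-0.05, -0.02, -0.01, -0.03]

def is_correct(answer, gold):
    return gold.lower() in answer.lower()

question, gold = "When was the Eiffel Tower completed?", "1889"
answer, logprobs = get_answer_with_logprobs(question)
confidence = math.exp(sum(logprobs) / len(logprobs))  # mean token probability, a rough certainty proxy

print(f"confidence={confidence:.2f}, correct={is_correct(answer, gold)}")
# The CHOKE failure mode: confidence near 1.0 while correct is False.
</code></pre>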
<p>This aligns with earlier critiques of RLHF. Research on sycophancy and reward hacking has shown that &ldquo;preference optimization can degrade factual accuracy&rdquo; (arXiv:2603.20531, citing Perez et al., 2023 and Gao et al., 2023). When your reward model was trained on the same data as the base model, and both lack a ground-truth mechanism, you&rsquo;re asking a biased system to correct itself.</p>
<h2 id="the-compounding-problem">The compounding problem</h2>
<p>The most dangerous hallucinations aren&rsquo;t single false facts. They&rsquo;re errors that compound across reasoning steps.</p>
<p>A 2025 paper on hallucination in code-generating models formalizes this: when a model produces &ldquo;structurally plausible, yet semantically invalid constructs&rdquo; in code, the errors cascade through the program (Zhang et al., 2025). The same dynamic applies to natural language reasoning: a small factual error in step 3 of a 10-step analysis becomes the foundation for step 4, and by step 7 the conclusion is internally consistent but completely wrong.</p>
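<p>The arithmetic is unforgiving. If each step in a chain is independently right with probability p (a simplifying assumption, since real errors correlate), the chance the whole chain survives is p raised to the number of steps:</p>
<pre><code># Back-of-the-envelope compounding, assuming independent per-step accuracy p.
n = 10
for p in (0.99, 0.95, 0.90):
    print(f"per-step accuracy {p:.2f} -> full {n}-step chain correct with p^n = {p ** n:.2f}")
# 0.99 -> 0.90, 0.95 -> 0.60, 0.90 -> 0.35: small per-step errors dominate long chains.
</code></pre>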
<p>This is why code generation hallucinations are so costly. The model writes plausible-looking code that fails in subtle ways, and the developer may not catch it until runtime — or production.</p>
<h2 id="what-actually-works-and-what-doesnt">What actually works (and what doesn&rsquo;t)</h2>
<p><strong>Doesn&rsquo;t eliminate hallucination:</strong></p>
<ul>
<li>Simply telling the model to &ldquo;be accurate&rdquo; — it has no ground-truth mechanism (arXiv:2510.06265)</li>
<li>Bigger models alone — a 2026 comprehensive survey notes that &ldquo;some level of hallucination might always persist&rdquo; due to the autoregressive design paradigm</li>
<li>Post-hoc fact-checking at inference time — too slow, too incomplete, and the model has already committed to the falsehood</li>
</ul>
<p><strong>Works partially (relocates the failure mode):</strong></p>
<ul>
<li><strong>Retrieval-Augmented Generation (RAG):</strong> Grounds the model in verifiable documents. A 2025 study confirms RAG &ldquo;improves factual grounding by anchoring model outputs in external evidence, substantially reducing both factual errors and unsupported fabrications&rdquo; (arXiv:2510.08005). But RAG hallucinates when retrieval fails or returns irrelevant context (a minimal sketch of the pipeline follows this list).</li>
<li><strong>Self-Reflective RAG (SELF-RAG):</strong> Allows the model to dynamically decide when retrieval is needed and critique its own outputs through explicit self-reflection tokens (Asai et al., 2024, cited in arXiv:2510.08005). More reliable, but still limited by the model&rsquo;s ability to correctly judge its own errors.</li>
<li><strong>Chain-of-thought with self-consistency:</strong> Generate multiple reasoning paths and compare. Helps catch single-step errors, but if the error is subtle and repeated across paths, consistency won&rsquo;t save you.</li>
</ul>
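<p>To make the RAG item above concrete, here is a minimal retrieval-and-grounding sketch using TF-IDF over a toy document store. The documents, the query, and the <code>generate()</code> stub are placeholders; the point is only the shape of the pipeline, which retrieves evidence first and then constrains the model to it:</p>
<pre><code># Minimal RAG sketch: TF-IDF retrieval over a toy corpus, then a grounded prompt.
# The documents, query, and generate() stub are placeholders, not a production pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The Eiffel Tower was completed in 1889 for the Exposition Universelle.",
    "The Berlin TV Tower was completed in 1969.",
    "Gustave Eiffel's company designed the tower's iron lattice structure.",
]

def retrieve(query, k=2):
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def generate(prompt):
    # Placeholder for an actual LLM call; a real system would send the prompt to a model here.
    return prompt

query = "When was the Eiffel Tower completed?"
evidence = retrieve(query)
prompt = (
    "Answer only from the evidence below; say 'not found' if the evidence is missing.\n\n"
    "Evidence:\n- " + "\n- ".join(evidence) + "\n\n"
    "Question: " + query
)
print(generate(prompt))
# If retrieval fails or returns irrelevant passages, the model is back to guessing:
# RAG relocates the failure mode into the retriever rather than eliminating it.
</code></pre>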
<p><strong>The hard truth</strong>, as the 2026 survey puts it: &ldquo;Hallucination is not a simple bug but an emergent property of the current LLM design paradigm&hellip; the auto-regressive nature fundamentally prioritizes generating plausible token sequences based on statistical patterns rather than ensuring factual accuracy or logical coherence&rdquo; (arXiv:2510.06265).</p>
<h2 id="the-business-cost">The business cost</h2>
<p>For consumer chatbots, hallucination is an annoyance. For legal research, medical advice, financial analysis, or engineering decisions, it&rsquo;s a liability.</p>
<p>The same 2026 survey notes that &ldquo;both hallucination and factuality errors systematically undermine trustworthiness in real-world applications requiring correctness and verifiability&rdquo; (citing Alansari and Luqman, 2025 and Huang et al., 2024). Every deployment in a high-stakes domain is making a bet that the specific failure modes of that model won&rsquo;t trigger in that specific use case.</p>
<p>That bet is usually made without understanding the model&rsquo;s actual hallucination profile — and often without acknowledging that the profile is a moving target. A model that hallucinates 5% of the time in testing may hallucinate 15% of the time on edge-case inputs in production, and there&rsquo;s no reliable way to predict which inputs trigger which failure modes.</p>
<p>This is the structural shadow of probabilistic generation: not that it fails, but that <strong>we cannot fully predict where or when it will fail.</strong></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Xie, Y., et al. (2025). The core training objective of LLMs. Cited in: Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. <em>Artificial Intelligence Review</em>. <a href="https://link.springer.com/article/10.1007/s10462-025-11454-w">https://link.springer.com/article/10.1007/s10462-025-11454-w</a></p>
</li>
<li>
<p><em>Epistemic Observability in Language Models</em> (2026). <em>arXiv preprint arXiv:2603.20531</em>. <a href="https://arxiv.org/abs/2603.20531">https://arxiv.org/abs/2603.20531</a></p>
</li>
<li>
<p><em>Large Language Models Hallucination: A Comprehensive Survey</em> (2026). <em>arXiv preprint arXiv:2510.06265</em>. <a href="https://arxiv.org/abs/2510.06265">https://arxiv.org/abs/2510.06265</a></p>
</li>
<li>
<p><em>Trust Me, I&rsquo;m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer</em> (CHOKE). <em>arXiv preprint arXiv:2502.12964</em> (2025). <a href="https://arxiv.org/abs/2502.12964">https://arxiv.org/abs/2502.12964</a></p>
</li>
<li>
<p>Zhang, et al. (2025). <em>Structural Trimming for Vulnerability Mitigation in Code LLMs</em>. <a href="https://laqeey.github.io/assets/pdf/zhang2025hallucination.pdf">https://laqeey.github.io/assets/pdf/zhang2025hallucination.pdf</a></p>
</li>
<li>
<p>Asai, A., et al. (2024). Self-Reflective Retrieval-Augmented Generation (SELF-RAG). Cited in: <em>Past, Present, and Future of Bug Tracking in the Generative AI Era</em>. <em>arXiv:2510.08005</em>. <a href="https://arxiv.org/abs/2510.08005">https://arxiv.org/abs/2510.08005</a></p>
</li>
<li>
<p>Alansari, S., &amp; Luqman, H. (2025). Factuality challenges and hallucination. Cited in <em>arXiv:2510.08005</em>.</p>
</li>
<li>
<p>Huang, Z., et al. (2024). Hallucination and factuality errors. Cited in <em>arXiv:2510.08005</em>.</p>
</li>
</ul>
<hr>
<p><em>DarkMental AI maps the structural shadows where probabilistic generation meets real-world stakes.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
