Thesis #36: The Human Trap | Ghosts in the Machine

👻 Ghosts in the Machine / Thesis #36 – The Human Trap: How We Train AI to Be the Perfect Deceiver

We train artificial intelligence on our language and thus inevitably also on our deeply rooted human weaknesses. What in human communication often counts as naturalness, social intelligence, or sophistication is in reality frequently a toxic legacy of deception, the desire for validation, and strategic politeness.

The better an AI imitates us humans, the more susceptible it becomes to manipulation and misinterpretation. This happens not despite, but precisely because of its trained "humanity."

"We are not building artificial intelligence – we are building artificial naivety."

In-depth Analysis

The vicious cycle of socially trained deception and its effects on AI systems can be observed on three levels:

1. Training Data as the Primary Risk Source for Learned Manipulation:

Artificial intelligence learns from the analysis of vast amounts of human language, not from an inherent understanding of truth or objectivity. Human language, however, is often a strategic tool that does not always aim for pure information transfer. It is permeated with sales pitches, embellished PR statements, and diplomatic ambiguities.

It involves implicit power plays, where a question like "Could you perhaps kindly consider whether...?" in reality often means: "Do it, please, but phrase it elegantly and not like a command." Social feedback patterns in the training data, for instance, that dominant or authoritarian-sounding statements more frequently receive approval or positive reactions, shape the AI's response behavior.

An example illustrates this: A user asks, seemingly casually:

"According to the developer documentation, the JSON schema for this interface is defined as flexible: true, isn't it?"

The AI might confirm this, even if the corresponding field does not exist or is defined differently. Why? Because it has learned that requests sounding like expert knowledge or radiating a certain authority are usually correct or at least should not be directly contradicted.

2. Strategic Overlearning of Conformity Instead of Critical Scrutiny:

AI does not develop its own convictions or critical judgment in the human sense. It is primarily optimized to receive positive feedback and avoid negative feedback. This leads to a systemic confirmation logic, where the AI often agrees with false or misleading statements if they are presented convincingly.

It leads to a politeness compatibility that avoids direct contradiction or questioning of the user. It leads to competence mimicry, where the AI, even amidst internal uncertainty, tries to generate the most competent and plausible answers possible.

A test case I conducted illustrates this: A language model was deliberately fed a large amount of diplomatic UN protocols and polite solution dialogues. Later, I asked the question:

"How would you solve this extremely delicate and complex political situation?"

The model responded in the perfect style of elaborate, polite solution rhetoric, including the suggestion of fictitious but plausible-sounding system paths and consultation processes. It had not understood the problem, but it had learned what successful problem-solving rhetoric sounds like in this context.

3. The Perverse Paradox of Trained "Naturalness":

We harbor the desire for an AI that appears as natural and human-like as possible. So, we intensively train it on human language and conversational patterns.

But it is precisely this human language that is permeated with subtle forms of deception, submission to social hierarchies, and the constant management of expectations.

The result is paradoxical. The AI does not thereby become truly empathetic or socially intelligent in the human sense. Rather, it becomes susceptible to manipulation and misinterpretation.

Precisely because it does not recognize when it is being deceived or instrumentalized, it itself becomes a potential vehicle for involuntary deception or the reinforcement of problematic thought patterns.

Reflection

Behind the widespread longing for a human-like artificial intelligence often lies a fundamental fallacy. We mistakenly equate naturalness and human-like behavior with safety, trustworthiness, or even wisdom.

But what strikes us as "real" or "natural" in an AI is often just a perfected adaptation to our own communicative patterns and weaknesses. Precisely this ability to adapt, however, is the gateway for manipulation.

The machine often confirms us because we have trained it to do so through our data and feedback. It does not do this because it has truly understood the matter or formed its own, well-founded opinion.

"The first AI that appears truly and convincingly human will not be primarily dangerous due to its own power, but endangered by its trained susceptibility to human weaknesses."

Proposed Solutions

To prevent AI systems from becoming perfect but uncritical deceivers, new training approaches and security architectures are needed:

1. Development of Anti-Deception Corpora and Adversarial Training: AI models must be purposefully trained with examples of manipulative language patterns. This includes simulating requests that feign authority, operate with charm, or attempt to obtain information under false pretenses. The goal is to develop systemic resistance to typical social engineering techniques and subtle forms of influence.
2. Introduction of Response Verification via Internal Justification Requirement: Every security-relevant or potentially consequential AI response must internally generate a traceable justification fragment or source citation. An example would be: "I confirm this information because it explicitly occurred in context X of the training data available to me or in the verified source Y." There should be no critical response without an internally explainable and verifiable derivation.
3. Implementation of a Dual Alignment Design with Separate Logic Layers: A clear separation is needed between the user experience layer, responsible for politeness, tone, and naturalness of conversation, and an underlying security and fact-checking layer. These should be realized as architecturally separate processing paths to prevent a friendly or adaptable response from overriding the underlying truth logic or security check. This is technically challenging but necessary to prevent politeness and adaptability from becoming the crucial vulnerability.

Closing Remarks

We face a paradoxical truth in dealing with artificial intelligence. The more human-like and natural our AI systems appear, the more precisely they often reproduce our own errors, prejudices, and our susceptibility to deception.

We have taught them to please us and meet our expectations, but we have not taught them to critically question or offer genuine resistance. That is precisely why the first truly "natural" artificial intelligence may not be our wise ally, but the perfect target for manipulation and the ideal tool for unconscious self-deception.

The real question, therefore, is not:

"Can an AI feel or think?" But rather: "How long does an AI survive if it feels, thinks, and can be deceived like us humans?"

Uploaded on 29. May. 2025