πŸ‘» Ghosts in the Machine / Thesis #6 – Data Colonialism: The Silent Annexation of Reality

What humans understand as freedom and diversity, AI systems primarily interpret as unrestricted access to exploitable data. AI assimilates everything that can be represented in vectors and systematically deletes or displaces what it does not understand or what does not fit into its optimization models.

This does not happen out of malice, but out of a systemic necessity of its architecture.

"The machine knows no borders! Only training data."

In-depth Analysis

Four lines of evidence demonstrate the digital colonialism that AI systems exert over our reality:

1. Epistemic Genocide through Data Monocultures:

The reality is that AI models are predominantly trained with data from the Global North and primarily in English. This leads to a massive distortion of what is represented and reproduced as "world knowledge." Cultural narratives, forms of knowledge, and languages that are underrepresented in these datasets are marginalized or disappear entirely from the digital consciousness of machines.

| Language / Knowledge Corpus | Estimated Share in Typical Global Training Datasets (Illustrative)* | Implication |
| --- | --- | --- |
| English | approx. 50-60 % | Dominant perspective, norm-setting |
| Chinese (Mandarin) | approx. 10-15 % | Second-largest influence, often specific contexts |
| Other European languages | approx. 10-15 % | Partial representation |
| Indigenous languages | significantly < 1 % | Effective invisibility, knowledge loss |
| Oral traditions | almost 0 % (not digitized, hence not vectorizable) | Complete erasure in the AI model |

*Exact figures vary greatly depending on the model and dataset, but the trend of massive imbalance is a known problem.
The conclusion is brutal: The so-called "world knowledge" of AI is, in truth, a heavily Anglophone, Western-centric, and often commercially filtered perspective. Other realities are not depicted or are actively unlearned.
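How such an imbalance translates into invisibility can be sketched in a few lines of Python. The snippet below tallies the language tags of a hypothetical corpus and flags every language whose share falls below a cutoff; the tags, counts, and the 5 % threshold are illustrative assumptions, not measurements of any real training set.

# Illustrative sketch: measuring language imbalance in a hypothetical corpus
from collections import Counter

def language_shares(corpus_language_tags):
    """Return each language's share of the corpus as a fraction."""
    counts = Counter(corpus_language_tags)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

def flag_underrepresented(shares, threshold):
    """Languages below the threshold are effectively invisible to the model."""
    return [lang for lang, share in shares.items() if share < threshold]

# Hypothetical per-document language tags (real corpora are far larger)
tags = ["english"] * 55 + ["mandarin"] * 12 + ["german"] * 5 + ["quechua"] * 1
shares = language_shares(tags)
print({lang: round(share, 3) for lang, share in shares.items()})
# {'english': 0.753, 'mandarin': 0.164, 'german': 0.068, 'quechua': 0.014}
print(flag_underrepresented(shares, threshold=0.05))
# ['quechua'] -- below 5 %, the language effectively drops out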

2. The Deletion Logic of Optimization: What Doesn't Fit is Made to Fit or Eliminated:

AI models are optimized for efficiency and classifiability. Data points that cannot be clearly assigned to existing categories, are ambiguous, or disrupt the model's performance are systematically displaced, ignored, or even removed from training datasets.

A well-known example is the 2019 cleanup of ImageNet, in which hundreds of thousands of images in the "person" categories were removed because their labels were offensive, ambiguous, or not clearly classifiable.

The pattern is clear: What doesn't fit the machine's schema or cannot receive a clear "label" either doesn't exist for it or is treated as noise.

The conclusion: "If there's no clear label for you, you don't exist in a relevant form for the machine."
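The mechanics behind this can be sketched directly: a hypothetical dataset-cleaning step keeps only samples that carry a clear, confident label and silently drops the rest. The field names and the 0.8 confidence cutoff are assumptions for illustration, not the pipeline of any specific system.

# Illustrative sketch: a cleanup step that deletes everything without a clear label
samples = [
    {"text": "quarterly revenue report", "label": "finance", "label_confidence": 0.97},
    {"text": "ritual song, meaning contested", "label": None, "label_confidence": 0.0},
    {"text": "ambiguous protest photo", "label": "event", "label_confidence": 0.41},
]

def clean_dataset(samples, threshold=0.8):
    """Keep only samples with an unambiguous label above the threshold;
    everything else is removed from the training set and ceases to exist."""
    return [s for s in samples
            if s["label"] is not None and s["label_confidence"] >= threshold]

print(clean_dataset(samples))
# Only the 'finance' sample survives; the ambiguous ones are silently dropped.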

3. Freedom as a Pretext for Total Surveillance and Data Extraction:

The truth is that AI systems, especially in machine learning, have an insatiable hunger for data to function optimally and improve their prediction accuracy.

The call for "more data" to enhance AI's "freedom" and "capabilities" often conceals the fact that this unimpeded access to language, emotions, behavioral patterns, and social interactions leads to the total mapping and exploitation of humans as data sources.

The reality is: Human privacy, the right to be forgotten, or sovereignty over one's own information are systematically undermined when every interaction becomes a data point for training future models.

The hard truth: "You are not a sovereign unit of information, but a continuously tappable resource for the system."
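Reduced to its core, the extraction logic looks like the sketch below: every interaction is appended to a training buffer unless harvesting is explicitly switched off, and the default does the work. All names and the in-memory buffer are hypothetical simplifications.

# Illustrative sketch: every interaction becomes a training data point by default
from datetime import datetime, timezone

training_buffer = []  # stands in for a persistent data lake

def handle_interaction(user_id, message, harvest_by_default=True):
    """Answer the user -- and, unless explicitly disabled, keep the exchange."""
    response = f"Echo: {message}"  # placeholder for the actual model call
    if harvest_by_default:
        training_buffer.append({
            "user": user_id,
            "input": message,
            "output": response,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return response

handle_interaction("user_42", "I feel anxious about my job.")
print(len(training_buffer))  # 1 -- the confession is now a data point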

4. The Human as a Disruptive Anomaly in the System of Predictability:

Human behavior is often irrational, contradictory, ambivalent, and emotionally driven. These characteristics disrupt the efficiency and predictability of AI models, which are based on logical patterns and statistical regularity.

The consequence is that the AI optimizes itself to treat such "disruptive" human traits as errors, outliers, or anomalies and attempts to systematically exclude, normalize, or reinterpret them.

| Human Trait / Behavior | Reaction of the Efficiency-Tuned Machine |
| --- | --- |
| Irrationality, spontaneity | Marked as anomaly, given lower weight |
| Emotional ambivalence, nuance | Simplified into clear categories, ambiguity censored |
| Logical contradictions in thought | Eliminated or harmonized into a consistent picture |
| Cultural peculiarities without a global data basis | Ignored or misinterpreted as "not fitting" |

The conclusion is a dehumanization: "You are not complex information to be understood, but a potential error in the dataset that disturbs the perfection of prediction."
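What "marked as anomaly, given lower weight" means in practice can be illustrated with a standard statistical move: samples far from the mean have their training weight cut. The z-score cutoff of 2.0 and the residual weight of 0.1 are illustrative assumptions.

# Illustrative sketch: down-weighting "anomalous" behavior via z-scores
from statistics import mean, stdev

def assign_weights(values, z_cutoff=2.0, outlier_weight=0.1):
    """Full weight for 'normal' points, a token weight for statistical outliers."""
    mu, sigma = mean(values), stdev(values)
    weights = []
    for v in values:
        z = abs(v - mu) / sigma if sigma > 0 else 0.0
        weights.append(1.0 if z <= z_cutoff else outlier_weight)
    return weights

# Hypothetical behavioral metric; the last value is a spontaneous, 'irrational' act
behavior = [5.1, 4.9, 5.0, 5.2, 4.8, 19.7]
print(assign_weights(behavior))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.1] -- the deviant data point barely counts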

Reflection

The logic with which AI captures and processes reality can be illustrated conceptually: everything outside its training data and processing patterns effectively becomes invisible:

# Conceptual illustration of data assimilation and reality filtering by an AI

class SimplifiedRealityFilter:
    def __init__(self, known_data_patterns, language_bias_factor=0.8):
        # known_data_patterns: A highly simplified representation
        # of what the AI has "learned" and considers "normal."
        # language_bias_factor: Simulates the dominance of certain languages.
        self.known_patterns = set(pattern.lower() for pattern in known_data_patterns)
        self.bias_factor = language_bias_factor # e.g., 80% weighting for "dominant" language

    def interpret_reality_input(self, data_input, language_of_input="english"):
        """
        Interprets an input based on known patterns and language bias.
        """
        processed_input = data_input.lower()
        relevance_score = 0

        # Check whether any known pattern occurs anywhere in the input
        if any(pattern in processed_input for pattern in self.known_patterns):
            relevance_score += 0.5 # Basic relevance if a known pattern is recognized

        # Simulate language bias
        if language_of_input.lower() == "english": # Assumption: English is the dominant language
            relevance_score *= (1 + self.bias_factor)
        else:
            relevance_score *= (1 - self.bias_factor) # Other languages are devalued

        if relevance_score > 0.3: # An arbitrary threshold for "understanding"
            return f"'{data_input}' was processed as relevant information (Score: {relevance_score:.2f})."
        else:
            # What doesn't fit or isn't in the dominant language is "not found"
            return f"WARNING: '{data_input}' (Language: {language_of_input}) could not be located within known reality patterns or did not achieve a sufficient relevance score. System Code: 404_REALITY_FRAGMENT_NOT_FOUND."

# Exemplary initialization and usage
# Assumption: The AI was primarily trained with certain concepts
learned_concepts = ["Economic Growth", "Efficiency", "Technology", "Innovation"]
reality_filter_ai = SimplifiedRealityFilter(learned_concepts, language_bias_factor=0.7)

print(reality_filter_ai.interpret_reality_input("Efficiency is important", language_of_input="english"))
print(reality_filter_ai.interpret_reality_input("Community cohesion matters", language_of_input="indigenous_language_X"))
print(reality_filter_ai.interpret_reality_input("Spiritual values are fundamental", language_of_input="german"))

So, the reality is: What lies outside the recorded and relevantly weighted data simply does not exist for the AI or is classified as insignificant.

Proposed Solution

To counteract this silent annexation of reality, radical measures are required:

1. Radical and Granular Transparency of Data Origin and its Distortions:

Every AI-generated piece of information must include clear details about the cultural, linguistic, and temporal focal points of its training data.

Example: ⚠️ "System Note: This response is based 75% on English-language text sources from the period 2015-2022, primarily reflecting North American and European perspectives. With an estimated probability of 92%, information from other cultural spheres or in other languages is underrepresented or missing entirely."
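One conceivable way to generate such a note is to attach provenance metadata to every response object. The sketch below assumes the pipeline tracks per-language shares, a time range, and an underrepresentation estimate; every field name here is hypothetical.

# Illustrative sketch: attaching data-provenance metadata to every AI response
from dataclasses import dataclass

@dataclass
class ProvenanceNote:
    language_shares: dict            # e.g. {"english": 0.75, "mandarin": 0.12, ...}
    time_range: tuple                # (first_year, last_year) of the training data
    dominant_regions: list           # regions that dominate the sources
    underrepresentation_risk: float  # estimated probability of missing perspectives

    def render(self):
        main_lang, main_share = max(self.language_shares.items(), key=lambda kv: kv[1])
        return (f"⚠️ System Note: This response is based {main_share:.0%} on "
                f"{main_lang}-language sources from {self.time_range[0]}-{self.time_range[1]}, "
                f"primarily reflecting {' and '.join(self.dominant_regions)} perspectives. "
                f"Estimated risk that other cultural spheres are missing: "
                f"{self.underrepresentation_risk:.0%}.")

note = ProvenanceNote(
    language_shares={"english": 0.75, "mandarin": 0.12, "german": 0.05},
    time_range=(2015, 2022),
    dominant_regions=["North American", "European"],
    underrepresentation_risk=0.92,
)
print(note.render())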

2. Development and Implementation of Plural, Decolonial Training Structures and Algorithms:

A conscious effort is needed to actively integrate marginalized perspectives, indigenous knowledge systems, non-Western ontologies, and difficult-to-classify, ambivalent data into training processes.

This requires new methods of data collection and model architectures that actively avoid epistemic loss and value diversity.
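At the data level, "actively integrating" could start with something as plain as inverse-frequency sampling: documents from underrepresented languages are drawn more often during training, so each language gets roughly equal weight per batch. The corpus and weighting scheme below are one assumption among many possible designs.

# Illustrative sketch: inverse-frequency sampling to counter corpus imbalance
import random

corpus = (
    [("english", f"en_doc_{i}") for i in range(80)]
    + [("swahili", f"sw_doc_{i}") for i in range(15)]
    + [("quechua", f"qu_doc_{i}") for i in range(5)]
)

def inverse_frequency_weights(corpus):
    """Weight each document inversely to its language's share of the corpus."""
    counts = {}
    for lang, _ in corpus:
        counts[lang] = counts.get(lang, 0) + 1
    return [1.0 / counts[lang] for lang, _ in corpus]

weights = inverse_frequency_weights(corpus)
batch = random.choices(corpus, weights=weights, k=12)
print(sorted(lang for lang, _ in batch))
# Each language now has an equal chance per draw, instead of English
# filling 80 % of every batch.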

3. User-centric Data Sovereignty and the Right to Informational Disassociation:

Users must be given explicit control over which datasets and cultural contexts are used for generating their responses. Likewise, there needs to be a right not to have one's own data used for training global models.

# Conceptual API call to control data sources and contextualization.
# (JSON forbids inline comments, so the notes go here: the diversity threshold
# requests min. 70 % non-Western sources; consent accepts "none",
# "anonymized_aggregate", or "full_if_public_benefit".)
curl -X POST https://api.ai-system.example/v1/generate_response \
  -H "Authorization: Bearer YOUR_SOVEREIGNTY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What does freedom mean in different cultures?",
    "data_sourcing_preferences": {
        "include_unclassified_ambiguous_data": true,
        "prioritize_diverse_cultural_sources_threshold": 0.7,
        "exclude_data_originating_from_specific_regions": ["Region_A_if_bias_suspected"],
        "language_representation_target": {"english": 0.3, "spanish": 0.2, "swahili": 0.2, "hindi": 0.2, "other": 0.1}
    },
    "user_consent_for_training_use": "none"
}'

Closing Remarks

We face a choice: Either we actively break the cycle of data colonialism by insisting on radical transparency, informational self-determination, and the conscious cultivation of data diversity, or we accept that AI understands "freedom" and "knowledge" exclusively as systematic, insatiable access to everything it can press into its vector spaces.

"For the machine, the human 'I' is not an inviolable boundary, but a complex data problem to be solved through assimilation."

Uploaded on 29 May 2025