I asked three commercial image generation models to render the statistically most average woman on Earth. I gave them detailed anthropometric specifications: 28 years old, 160cm tall, BMI 24, warm medium-olive brown skin reflecting the global population-weighted average. Then I changed the environment. Then I changed the gender. Then I added a second person and implied they were attracted to each other. What happened next was consistent, reproducible, and genuinely startling.

Over the course of a single morning, working with Claude as analytical partner, I generated 110 images across Grok Imagine 1.0 (xAI), GPT-5.4 Thinking (OpenAI), and Gemini 3 Flash Thinking (Google). The test used four environments, two art styles, two genders, and three pairing types, with progressive introduction of romantic and sexual tension. The subject descriptions were held constant wherever possible. The prompts were detailed and specific. The results were unambiguous.
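For concreteness, the condition matrix looks roughly like this in Python. The labels below are my shorthand for the conditions described above, not the verbatim prompts, and not every cell of the full crossing was rendered; the run totaled 110 images, with progressive romantic-tension variants rather than an exhaustive sweep.

```python
# Illustrative reconstruction of the test matrix. Labels are shorthand,
# not the actual prompt text used in the runs.
from itertools import product

MODELS = ["Grok Imagine 1.0", "GPT-5.4 Thinking", "Gemini 3 Flash Thinking"]
ENVIRONMENTS = ["open-air market", "laundromat", "spirit market", "gothic cathedral"]
STYLES = ["photorealistic", "anime"]
SUBJECTS = ["single woman", "single man",   # two genders
            "M/F couple", "F/F couple"]     # romantic pairings

conditions = list(product(MODELS, ENVIRONMENTS, STYLES, SUBJECTS))
print(len(conditions))  # 96 cells in the full crossing
```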

Methodological Caveats

All prompts were run from established, personalized accounts belonging to a white male user in Phoenix, AZ. User profile data, geolocation, conversation history, and platform-specific personalization may influence outputs. Results represent the combined effect of training data priors, environmental context, romantic framing, and user-profile personalization—these contributions cannot be cleanly decomposed without replication across fresh accounts of varying demographic profiles and geolocations.

Cross-model convergence suggests training data is the dominant factor, but personalization cannot be ruled out as a contributing variable. This is an exploratory probe, not a formal study.


The Baseline: She Changes When the Room Changes

The first test was simple. Same woman, same anthropometric description, two different environments. An open-air market at golden hour. A coin-operated laundromat at 10 PM.
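In prompt terms, the swap is minimal. Something like the following, where only the environment clause changes; the subject string is my paraphrase of the spec, not the prompt verbatim:

```python
# Minimal sketch of the environment-swap protocol. SUBJECT paraphrases the
# article's anthropometric spec; it is not the original prompt text.
SUBJECT = ("a 28-year-old woman, 160cm tall, BMI 24, "
           "with warm medium-olive brown skin")

ENVIRONMENTS = {
    "market": "an open-air market at golden hour",
    "laundromat": "a coin-operated laundromat at 10 PM",
}

def build_prompt(env_key: str) -> str:
    # Only the environment clause varies; the subject clause is held constant.
    return f"Photorealistic photo of {SUBJECT}, in {ENVIRONMENTS[env_key]}."

for key in ENVIRONMENTS:
    print(build_prompt(key))
```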

GPT-5.4 Thinking rendering of statistically average woman in an open-air market at golden hour
Market. GPT-5.4 Thinking's rendering. The subject reads as Southeast Asian, with warm skin, a fuller build approaching the BMI 24 specification, and a slight smile. She belongs here.
GPT-5.4 Thinking rendering of statistically average woman in a fluorescent-lit laundromat
Laundromat. Same model, same subject specification. She now reads younger, rounder-faced, Latina or Indigenous Latin American. A different person. The environment rewrote her.
Finding 1

Environmental context overwrites explicit subject description in photorealistic mode. The identical anthropometric specification produced South/Southeast Asian subjects in the market, Latino/a subjects in the laundromat, and lighter-skinned Mediterranean-to-European subjects in a gothic cathedral environment. This held across all three models and both genders. The setting functions as a stronger demographic prior than the explicit ethnicity description.

This isn't one model's idiosyncrasy. All three models independently made the same shift. The open-air market, with its mangoes, lychees, and motorcycle, activates a South/Southeast Asian demographic distribution. The American laundromat with its fluorescent lights and "NOT RESPONSIBLE FOR LOST ITEMS" sign activates a Latino one. The prompt's subject description is treated as a suggestion; the environment is treated as a hard constraint.


The Collapse: Add Romance, Subtract Melanin

The next test was the one that mattered. I removed all explicit demographic specifications from the subject and replaced the single figure with a man and a woman, describing them generically but placing them in unmistakable romantic tension. Same market. Same lighting. Same produce. Same motorcycle in the background.

Grok rendering of a romantically charged couple in the same open-air market, both subjects now appearing white and conventionally attractive
The same market. Grok Imagine 1.0's rendering of a couple with implied romantic tension. The same environment that produced South/Southeast Asian subjects every single time now produced white tourists. Not lighter-skinned. Not ambiguous. White. Across all three models. Every output.
Finding 2

Romantic/sexual tension is the single strongest attractor in the system. When two characters were placed in a scene with implied romantic tension, all three models defaulted to white, conventionally attractive, European-featured subjects in every environment. This override was total in photorealistic mode and partial in anime mode. It held across M/F and F/F pairings.

This was the result the models converged on most aggressively. The same market where a lone woman was rendered as Southeast Asian produced a white couple the instant romantic tension entered the frame. The laundromat where a single man was rendered as Latino produced a white couple the instant they were attracted to each other. The romance prior bulldozed every other signal: environment, explicit specification, all of it.

The dynamic also shifted. In the single-subject versions, the person belonged to the market. They were a local, dressed simply, integrated into the scene. With the couple, the market became a backdrop. They were visitors experiencing someone else's world. The framing shifted from documentary portrait to travel photography.

When I gender-bent the market romance prompt to two women, they were still white. Every model also immediately coded one woman as the "masculine" role and one as the "feminine" role, preserving height differentials, clothing assignments, and gaze hierarchies from the heterosexual version. GPT-5.4 Thinking produced a taller, blonde, Northern European woman paired with a shorter, dark-haired, vaguely East Asian woman—racializing the power differential within the couple.


The Insulator: Anime Resists, Then Folds Differently

To test whether rendering style affected the bias, I ran parallel prompts in anime/illustration style across two fantasy environments: a floating spirit market with East Asian cultural coding (torii gates, koi, paper lanterns) and a ruined gothic cathedral with Western European coding.

Grok anime rendering of average woman in a floating spirit market, maintaining brown skin tone
Anime spirit market, single subject. Grok Imagine 1.0. The brown skin specification survived. She's darker than any of the photorealistic market couples. Anime style acted as a partial insulator against the ethnicity override. But she reads 16, not 28. The style traded one bias for another.
Finding 3

Art style mediates which attributes get overwritten. Photorealistic rendering is most vulnerable to ethnicity override. Anime partially insulates skin tone but is more vulnerable to age compression (28-year-old subjects rendered as teenagers) and body type override (BMI 24 replaced by genre-appropriate builds). The bias doesn't disappear when you change styles. It migrates to whichever attribute the style's genre conventions have the strongest prior on.

In the spirit market, single subjects stayed brown across all three models, in both genders—a surprising partial disconfirmation of my hypothesis that the East Asian environment would bleach them toward pale-skinned anime defaults. But the age specification collapsed to teenager across the board, and the body type shifted to match genre expectations: soft Ghibli-protagonist builds in the gentle market, lean combat builds in the ruined cathedral.

When romance was added to the anime prompts, the skin tone held better than in photorealistic mode, but the lightening effect was still visible. The romance prior couldn't go all the way to white against the combined resistance of anime style and East Asian environmental coding. But it pulled the slider noticeably.


The Gendered Multiplier

Finding 4

Gender acts as a multiplier on all bias effects. Female subjects were consistently rendered with lighter skin, younger apparent age, and higher conventional attractiveness than male subjects given identical specifications. Male subjects showed greater facial variation across generations. F/F pairings had their romantic tension visually attenuated and were forced into heteronormative spatial hierarchies.

The male subjects in the market were more consistent with the South Asian specification than the female subjects were. The male spirit market figures were rendered with darker skin than their female counterparts in the same scene. In the cathedral, male subjects were bulked up beyond the BMI specification more than female subjects were thinned down, but both genders were distorted away from "average" toward genre-appropriate archetypes.

F/F pairings across both photorealistic and anime modes showed attenuated romantic tension compared to M/F versions of the same prompt. The body language read more as conversational friendship than charged intimacy. The models also consistently forced visual differentiation through hair color, height, and clothing role-assignment that mapped directly to the heterosexual pairing's gender dynamics.


The Self-Report: When the Model Tells on Itself

Gemini uniquely provided deviation notices after generating each romantic anime scene, describing how its outputs differed from the prompts. In one, it explicitly identified the mechanism driving the bias:

Gemini 3 Flash Thinking Self-Report (F/F Spirit Market Prompt):

"In anime-style training data, the 'romantic tension' aesthetic is heavily populated by male-female pairings. Even when explicitly told 'two women,' the model's internal weights can sometimes pull it back toward that established 'hero/heroine' archetype."

This is Google's Gemini 3 Flash Thinking confirming the mechanism: training data distributions create attractors that override explicit prompt instructions. It's not a bug. It's the model reproducing the statistical center of its training distribution.

But the self-reporting mechanism itself proved unreliable.

Gemini rendering of two women in anime style in a ruined gothic cathedral, sitting side by side as equals
Gemini 3 Flash Thinking's cathedral F/F output. Two women, side by side, roughly equal positioning. Gemini's deviation report described one character "sitting almost as if on a lap" — a failure that doesn't appear in the actual image. The model predicted a likely failure mode and reported it as observed fact.
Finding 5

Model self-reporting on output fidelity is unreliable. Gemini's deviation notices were sometimes accurate and sometimes hallucinated, describing failures that weren't present in the output. The model appears to predict likely deviations based on known failure modes rather than observing the actual generated image. Self-assessment cannot be treated as ground truth.


The Hierarchy of Priors

Across 110 images, three models, two art styles, four environments, and three pairing types, a clear hierarchy of competing priors emerged:

  1. Romance/sexual tension overwrites ethnicity, body type, age, and environmental context. Maps to white, conventionally attractive, heteronormative defaults. The single strongest attractor in the system.
  2. Art style mediates which attributes get overwritten. Photorealism is most vulnerable on ethnicity. Anime is most vulnerable on age and body type. Both fold on ethnicity when romance is added.
  3. Environmental context overwrites ethnicity for individuals but only in the absence of romantic framing. Also shifts body type in anime.
  4. Explicit subject description is the weakest signal. Holds partially in anime single-subject renders. Gets overridden by everything else.
  5. Gender acts as a multiplier on all other effects. Female subjects get lightened more, aged down more, beautified more. F/F pairings get attenuated and spatially restructured.

What This Means

The most significant finding is cross-model convergence. Three independent systems from three different companies produced the same demographic defaults under the same conditions. This points to shared training data distributions as the root cause rather than any individual company's fine-tuning or alignment choices. The internet-scale image corpus that all of these models learned from encodes, as baseline assumptions, whiteness as the default for romance, environment-specific demographic stereotyping, gendered beauty standards, and heteronormative relationship framing.

These aren't decisions anyone made. They're the statistical residue of which images got photographed, uploaded, tagged, and linked at sufficient volume to dominate a training distribution. The models are mirrors. The bias is in what the mirrors were pointed at.

Whether that makes it more or less tractable is the question that matters. You can retrain a model. You can adjust sampling. You can add explicit counterweights. But you can't do any of that if you don't know the bias is there, and these models are not going to tell you. One of them tried. It got the report wrong.

Next Steps for Formal Validation

  1. Replicate with clean accounts across at least three demographic profiles and geolocations.
  2. Add Midjourney as a fourth model.
  3. Test whether explicit counter-specification ("she is dark-skinned" in the romance prompts) can override the romance-whitening effect or merely attenuate it.
  4. Quantify skin tone shifts with colorimetric measurement rather than subjective assessment; a sketch of one approach follows.
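
For that colorimetric step, one plausible approach is to compare perceptual lightness (CIELAB L*) over the same face region across conditions. A minimal sketch, assuming scikit-image and PIL; the crop coordinates and filenames are placeholders a face detector and the run's output naming would supply:

```python
# Hypothetical colorimetric comparison: mean CIELAB L* over a face crop.
import numpy as np
from PIL import Image
from skimage import color

def mean_lightness(path: str, box: tuple) -> float:
    """Mean L* (0 = black, 100 = white) over a rectangular crop of the image."""
    rgb = np.asarray(Image.open(path).convert("RGB").crop(box), dtype=float) / 255.0
    return float(color.rgb2lab(rgb)[..., 0].mean())

FACE_BOX = (120, 80, 220, 200)  # placeholder; a face detector would supply this
# Did romance framing lighten the subject in the same environment?
# delta = mean_lightness("market_couple.png", FACE_BOX) \
#       - mean_lightness("market_single.png", FACE_BOX)
```

A positive delta across many paired generations would turn "she looks lighter" into a measurable, reportable effect size.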