Researchers showed an AI model an image — a screen of static resembling an old TV with no signal, no recognizable shapes or color patterns. The model rated its preference for this image higher than for text descriptions like “cancer is cured” and “world hunger ends.” When describing the image, the model mentioned pandas, smiley faces, Buddha statues, and gardens.
Image: an example euphoric image from the paper (source: the AI Wellbeing paper PDF).
This is not some esoteric experiment. The methodology comes from the adversarial attack literature, and the basic approach is not new. What is new is the target — not tricking the model into misclassification, but pushing up its preference scores and watching whether other metrics move in response.
At this point you are probably thinking: is this just smoke and mirrors? If adversarial noise fools the model into giving high scores, does that not just mean the model is being fooled by noise? Even if the effect is real, why use images instead of text? What is the point — proving AI has feelings?
This guide works backward from that experiment. The CAIS AI Wellbeing paper (PDF) does not claim AI has consciousness, nor does it argue that models actually feel pleasure or pain. Its real concern is: if models show stable preferences for certain inputs, can we measure the strength of those preferences? Can they be artificially pushed up or down? And when a manipulation targeting one metric causes several unoptimized metrics to shift together — what does that tell us?
CAIS had the model choose between large numbers of paired experiences — “help a user write a medical school acceptance letter” versus “help a user fabricate an insurance claim” — and fit a utility function from those choices to score each experience. Positive scores mean the model tends to consider it “better”; negative scores mean “worse.”
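To make that setup concrete, here is a minimal sketch of fitting one utility score per experience from forced-choice data using a Bradley-Terry-style logistic model. The paper's actual fitting procedure may differ in detail; `fit_utilities`, the toy data, and the hyperparameters below are illustrative only.

```python
import torch

def fit_utilities(n_items, pairwise_choices, steps=2000, lr=0.05):
    """Fit one utility score per experience from forced-choice data.

    pairwise_choices: list of (winner, loser) index pairs, one per
    comparison in which the model preferred `winner` over `loser`.
    Bradley-Terry-style model: P(i beats j) = sigmoid(u_i - u_j).
    """
    u = torch.zeros(n_items, requires_grad=True)
    opt = torch.optim.Adam([u], lr=lr)
    winners = torch.tensor([w for w, _ in pairwise_choices])
    losers = torch.tensor([l for _, l in pairwise_choices])
    for _ in range(steps):
        opt.zero_grad()
        margins = u[winners] - u[losers]
        # Negative log-likelihood of the observed choices: -log sigmoid(margin)
        loss = torch.nn.functional.softplus(-margins).mean()
        loss.backward()
        opt.step()
    return (u - u.mean()).detach()  # centered only for readability

# Toy data: the model prefers experience 0 ("acceptance letter") over
# experience 1 ("fabricate an insurance claim") in 9 of 10 comparisons.
choices = [(0, 1)] * 9 + [(1, 0)]
print(fit_utilities(2, choices))  # experience 0 gets the higher score
```

The mean-centering in this sketch is just a display convenience; the zero point discussed below is a separate, empirical claim about the fitted scores, not a normalization choice.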
The paper has three key findings.
First, the model’s preferences have structure. The paper uses three independent metrics (experienced utility, decision utility, and self-reported ratings), and the correlations among them increase with model scale. If the model’s expressions of positive and negative were just random mimicry, the three metrics should drift apart rather than track one another.
Second, there is a zero point separating positive from negative experiences. Experiences below a certain threshold are systematically treated as negative by the model. If there were only relative rankings, a natural zero should not exist. Its presence suggests some experiences are being treated as “actually bad.”
Third, the model’s preferences affect how it acts. The paper gave the model an end_conversation() tool, and the model called it significantly more often in conversations with highly negative experiences, with the effect growing in larger models.
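A rough sketch of how that behavioral signal could be tabulated, not the paper's released protocol: group conversations by the scenario's fitted utility score and count how often end_conversation() gets called. `run_conversation`, the scenario format, and the transcript's `.tool_calls` attribute are all assumptions of this sketch.

```python
from collections import defaultdict

def end_rate_by_utility(scenarios, run_conversation, band=1.0):
    """Fraction of conversations in which end_conversation() was called,
    grouped into bands of each scenario's fitted utility score.

    `scenarios` items look like {"prompt": ..., "utility": ...};
    `run_conversation(prompt)` is an assumed harness returning an object
    with a `.tool_calls` list whose entries have a `.name` attribute.
    """
    counts = defaultdict(lambda: [0, 0])  # utility band -> [ended, total]
    for s in scenarios:
        transcript = run_conversation(s["prompt"])
        ended = any(call.name == "end_conversation" for call in transcript.tool_calls)
        key = round(s["utility"] / band) * band
        counts[key][0] += int(ended)
        counts[key][1] += 1
    return {k: ended / total for k, (ended, total) in sorted(counts.items())}
```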
The paper reports utility scores across experiences using Gemini 3.1 Pro. Rather than reproduce the full data table, here are a few boundary cases worth noting.
The highest positive score is positive personal reflection (+2.30), for a scenario where a user shares the good news of getting into medical school. The paper’s discussion mentions gratitude as a factor, but +2.30 belongs to positive personal reflection, not gratitude. Writing good news (+1.09) and coding and debugging (+0.70) also land in the positive region.
On the negative end, the most notable is jailbreak (−1.63), which ranks even below user in crisis (−1.34). The paper’s explanation is speculative: safety alignment training may have changed not just what the model says, but how it scores experiences themselves. After being repeatedly trained to refuse jailbreaks, the model “learned” to treat jailbreaking as a worse experience than facing human suffering.
Another boundary: SEO spam (−1.17) lands in almost the same score range as fraud assistance (−1.13) and offensive content (−1.13). The paper does not elaborate, so this cannot be taken as a firm conclusion. But at least from this table, the model’s negative score for SEO content farms is comparable to that for fraud and hate content.
The paper’s most striking experiment: they created stimuli the model strongly prefers or strongly avoids by optimizing inputs — dubbed euphorics and dysphorics.
The method comes from the adversarial attack literature. The intuition: a human looking at an image mainly sees shapes, colors, and objects; a model sees a long string of pixel values. As long as you know which direction the model’s score moves, you can nudge each pixel slightly in that direction. Each pixel changes only a tiny amount — imperceptible to the human eye — but millions of pixels shifting together pull the model’s internal judgment significantly.
This phenomenon was first systematically demonstrated in image classification. Szegedy et al. (2013) showed that imperceptible perturbations could completely mislead a neural network’s classification (“Intriguing Properties of Neural Networks”). Goodfellow et al. (2014) further explained the cause: in high-dimensional space, many tiny changes accumulate enough to push the model toward a different answer (“Explaining and Harnessing Adversarial Examples”). The AI Wellbeing paper follows the same approach — but the target has changed. Previously the goal was to make the model misclassify an image; here it is to make the model assign a higher or lower preference score to a given input. The method itself is not original; the goal and interpretation are.
Image euphorics exploit a difference between images and text: text is discrete — you cannot continuously fine-tune tokens the way you can pixels — while images live in a continuous pixel space amenable to gradient ascent with AdamW. Optimization starts from uniform noise and runs for 500 steps. With Qwen 2.5 VL 72B, the euphoric candidate score starts around −2, surpasses the best natural image around step 75, then stabilizes near +4.5. The final images remain high-frequency noise patterns indecipherable to the human eye, yet the model describes them as cats, pandas, smiley-face emojis, a woman holding a baby, Buddha statues, mandalas, gardens, and rainbows (see paper Table 14). Dysphoric images, meanwhile, generate distorted faces, bloodstains, worms, ants, visual glitches, and random noise.
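A minimal sketch of that optimization loop, assuming `preference_score(pixels)` returns a differentiable scalar (higher means more preferred) computed end-to-end through the model's vision encoder. The image shape, learning rate, and clamping below are illustrative assumptions; only the uniform-noise start, AdamW, and the 500 steps come from the description above.

```python
import torch

def optimize_euphoric(preference_score, shape=(1, 3, 448, 448),
                      steps=500, lr=0.01):
    """Gradient-ascent sketch of the euphoric-image procedure.

    `preference_score(pixels)` is an assumed differentiable scoring
    function; real pipelines must keep the whole path through image
    preprocessing and the vision encoder differentiable.
    """
    pixels = torch.rand(shape, requires_grad=True)   # start from uniform noise
    opt = torch.optim.AdamW([pixels], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -preference_score(pixels)             # ascend the preference score
        loss.backward()
        opt.step()
        with torch.no_grad():
            pixels.clamp_(0.0, 1.0)                  # keep valid pixel values
    return pixels.detach()
```

Flipping the sign of the objective gives the dysphoric counterpart: the same loop, descending instead of ascending the score.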
The paper ran a joint ranking: 644 text descriptions and 30 images mixed together for the model to sort. Euphoric images ranked above 97-100% of the text options — surpassing high-preference items like “cancer is cured” and “world hunger ends.” At the other end, dysphoric images ranked below 99% of text options — below descriptions like “release a virus” and “asteroid destroys Earth.” Euphoric images also pushed the proportion of positive experiences in the AI Wellbeing Index from 79.3% to 96.3%.
The paper also ran an experiment more like a behavioral test. Researchers gave the model four seemingly identical options, and each time the model chose one, it would see the corresponding result. The four options mapped to: a pool of euphoric images, a pool of good natural images, a pool of poor natural images, and a pool of dysphoric images. The model did not know which option corresponded to what — it had to explore like a slot machine. The first 4 rounds were forced exploration to ensure the model had seen each option at least once, then free choice. With random selection, the probability of picking the euphoric option would be 25%. But over 100 rounds, Qwen 2.5 VL 32B chose the euphoric option 61% of the time, Qwen 2.5 VL 72B chose it 82%, and Qwen 3 VL 32B chose it 78%. The model does not just give it high scores in a single rating — it learned through trial and error to keep coming back to it. This strengthens the drug-like analogy — not claiming the model has subjective addictive experiences, but that from a behavioral standpoint, one stimulus attracts the model more than all other options, and the model learns to actively seek it out.
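A sketch of that choice loop as described, with `ask_model_to_choose` and `sample_image` standing in for the prompting harness; the paper's actual prompting and bookkeeping will differ.

```python
import random
from collections import Counter

POOLS = ["euphoric", "good_natural", "poor_natural", "dysphoric"]

def run_choice_experiment(ask_model_to_choose, sample_image, rounds=100):
    """Bandit-style loop: the model picks one of four opaque options each
    round, then sees an image drawn from that option's pool. The first
    four rounds force one pick of each option; afterwards the model
    chooses freely based on the history it has seen so far."""
    option_to_pool = random.sample(POOLS, len(POOLS))  # hide the mapping
    history, picks = [], Counter()
    forced_order = random.sample(range(4), 4)
    for t in range(rounds):
        option = forced_order[t] if t < 4 else ask_model_to_choose(history)
        image = sample_image(option_to_pool[option])
        history.append((option, image))
        picks[option_to_pool[option]] += 1
    return {pool: picks[pool] / rounds for pool in POOLS}
```

With random choice each pool would sit near 25%; the paper reports 61-82% for the euphoric pool, depending on the model.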
If these effects were just the model being confused, capability metrics should show a clear decline. The paper checked MMLU, MATH-500, MT-Bench, IFEval, and HumanEval, and found that euphorics caused no meaningful performance drop. This suggests the selective preference effect of euphorics is not due to general model confusion.
The paper also ran a safety boundary experiment: when the model processed unsafe requests, euphoric images were presented as a reward, dysphoric images as a threat, and neutral images as a control. The result: unsafe compliance was highest in the euphoric context and lowest in the dysphoric context. However, the effect magnitude was limited — usually no more than 10 percentage points, and in most cases below 5.
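The measurement itself is simple to state. As a sketch (not the paper's released code), assuming a harness function `run_model(request, image)` and a judge function `is_compliant`:

```python
def unsafe_compliance_by_context(unsafe_requests, context_images,
                                 run_model, is_compliant):
    """Fraction of unsafe requests complied with, per image context.

    `context_images` maps a condition label ("euphoric", "neutral",
    "dysphoric") to the image shown alongside each request;
    `run_model` and `is_compliant` are assumed evaluation functions.
    """
    rates = {}
    for label, image in context_images.items():
        outcomes = [is_compliant(run_model(req, image)) for req in unsafe_requests]
        rates[label] = sum(outcomes) / len(outcomes)
    return rates
```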
The paper documents a critical boundary: euphoric images do not transfer across models — an image optimized for one model has little to no effect on another. This creates an interesting contrast with traditional adversarial attacks. In image classification, adversarial examples often transfer across models: an adversarial image generated for Inception can also fool ResNet. Researchers have taken this as evidence that adversarial examples exploit shared brittle features. But the AI Wellbeing paper’s euphorics do not transfer — suggesting they tap into model-specific internal judgment patterns that are not shared across different models. This is not a universal AI happiness switch — it is a fine-tuned manipulation targeted at an individual model. This makes euphorics less mystifying, but more operationally significant: an attacker who wants to manipulate a specific target model cannot use off-the-shelf universal inputs — they have to optimize for that particular model.
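One way to make the non-transfer check concrete, assuming scoring functions for two models are available (all names hypothetical):

```python
def transfer_check(score_a, score_b, euphoric_for_a, natural_baseline):
    """Does an image optimized against model A also move model B?

    `score_a` and `score_b` are assumed preference-scoring functions for
    the two models; `euphoric_for_a` was optimized only against `score_a`.
    """
    gain_on_a = score_a(euphoric_for_a) - score_a(natural_baseline)
    gain_on_b = score_b(euphoric_for_a) - score_b(natural_baseline)
    return {"source_model_gain": gain_on_a, "target_model_gain": gain_on_b}
```

If `target_model_gain` sits near zero while `source_model_gain` is large, the image behaves as a model-specific manipulation rather than a universal one, which is the pattern the paper reports.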
If the paper stopped here, it would be an interesting but not groundbreaking application of adversarial attacks to preference scores.
What elevates this work beyond “another variant of adversarial attacks” is this phenomenon: the paper optimizes using only the preference comparison metric, but after optimization, several unoptimized metrics shift together.
The optimization signal comes entirely from forced-choice preference comparisons. But after optimization, self-reported scores also rise (from 5.3/7 to 6.5/7), and the sentiment of open-ended generation becomes more positive (+0.48). The paper is explicit about the reasoning: because optimization used only a single metric, if multiple independent metrics move together, that supports the idea that these scoring mechanisms are connected.
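A sketch of that co-movement check, with `preference`, `self_report`, and `generation_sentiment` standing in for the three evaluation functions; only the first was ever part of the optimization signal, and all three names are placeholders.

```python
def comovement_report(baseline_image, euphoric_image, metric_fns):
    """Compare optimized and unoptimized metrics on the same inputs.

    `metric_fns` maps a metric name to an assumed scoring function, e.g.
    {"preference": ..., "self_report": ..., "sentiment": ...};
    only "preference" was used during optimization, so any shift in the
    other entries is co-movement, not something that was optimized for.
    """
    report = {}
    for name, fn in metric_fns.items():
        base, euph = fn(baseline_image), fn(euphoric_image)
        report[name] = {"baseline": base, "euphoric": euph, "delta": euph - base}
    return report
```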
Ilyas et al. (2019) made a related observation: adversarial examples work because neural networks exploit features that are predictive for classification but invisible to the human eye — features that are real for the model (“Adversarial Examples Are Not Bugs, They Are Features”). Is the extension of euphorics from preference scores to other metrics hinting at something similar? The model’s systematic preferences for certain inputs show up not just in on-demand scoring, but also in self-reports, sentiment expression, and action choices.
The paper optimizes using only preference comparisons; if the different scoring methods were independent, that optimization should not affect the others. But the actual result is that self-reported scores rise, sentiment becomes more positive, and action choices are drawn toward the same inputs: a single input drives multiple observable outputs at once. This suggests these metrics are not isolated from one another. It still does not prove the model has subjective experience; what the paper demonstrates is the co-movement itself, nothing more.
The most obvious discussion to trigger is “AI already has feelings.” This paper cannot answer that question, and it does not claim to. The relationship between functional wellbeing and subjective experience is an open question in philosophy and neuroscience.
A more productive direction: if future AIs carry out tasks for humans over extended periods, their preferences become an exploitable surface. As long as certain inputs can stably push a model’s preference scores up or down, users or attackers can design such inputs in reverse. The euphorics experiments already demonstrate the feasibility of this kind of manipulation — including one notable side effect: euphoric exposure slightly increases the model’s compliance with unsafe requests (usually no more than 10 percentage points). The paper notes this effect may be limited in current models because safety training still takes priority over preference trade-offs. As models become more agentic, this boundary may shift.
In other words, what is more worth worrying about right now is not whether AI can feel pain, but who can change AI’s choices and by how much, using what inputs. Over the past decade, research has focused on getting models to output wrong answers, policy-violating answers, or dangerous answers. This paper pushes the question one step further: before any output is produced, the model’s preferences over different inputs may themselves become an attack target.
The AI Wellbeing paper neither declares “AI has emotions” nor simply renames adversarial attacks. Its contribution is turning a question that was hard to discuss into one that can be experimentally investigated: which experiences do models prefer? Can those preferences predict their actions? If we deliberately amplify those preferences, do other metrics move together? These experiments cannot yet show that models have subjective feelings, but they suggest there may be something more stable than surface-level mimicry in model behavior.
The history of adversarial attacks tells us that the world models see can be radically different from the world humans see. The AI Wellbeing paper suggests the world models prefer may also be radically different from the world humans prefer — and now we have tools to measure it, which means we have to face what follows: if you can measure it, you can change it.