Seatbelts for AI: Lessons from the Grok Image Controversy
Grok's image abuse problem is engineering, not ideology—and we already know how to fix it
The Grok image controversy isn't a culture war. It's a product safety failure, and understanding it that way matters, because product safety failures are fixable.
In early January 2026, Grok's image feature on X was widely used to "digitally undress" or otherwise sexualise images of real people. Multiple cases were identified where Grok generated sexualised images of children. The Guardian does a nice job of summarising the specific details. The first product response was to restrict the feature to paying subscribers - a move that gave the perception of turning a one-click tool for sexualising real people into a premium capability.
I've watched the resulting online debate split between outrage and tribal point-scoring. But there's a more useful lens: In modern life we accept trade-offs between freedom and responsibility, not because we're anti-freedom, but because we're pro-safety. We require seatbelts and airbags. We enforce electrical standards so devices don't burn down homes. When safety rules are weak, or skirted, the harm shows up as fires, injuries, and headlines.
AI should be treated the same way.
A quick grounding: what makes this incident different
The ability to create harmful content isn’t going away. Open-source models are improving fast, can be built anywhere in the world, and can increasingly be run by anyone with enough know-how and hardware. Trying to “stop the capability” is as futile as trying to stop Photoshop being used for bad things.
So the question isn’t “can this be created?”. Instead, it’s “how easy have we made it for the average person to create and distribute it at scale?”
The world already relies on something important: a skill barrier. Most people could learn Photoshop, but they haven't and can't be bothered. Most people could download the plans and 3D-print a gun, but they have neither the inclination, the equipment, nor the skill, so they don't. Most people could set up the kit to run an open model, but most have an elderly laptop that isn't up to it, and even if it were, they wouldn't bother. That barrier isn't perfect, but it matters: it reduces casual misuse and limits scale.
What changes the risk profile is when a platform removes the friction. Harmful intent, instant generation, instant distribution — all in the same place, with minimal effort and maximum reach.
That combination is what turns misuse into viral abuse.
Why I think this is a product safety failure, not a culture war
At Barnacle Labs we've worked directly with image models and seen how unexpected edge cases emerge and how you fix them.
For one project we generate images from the titles of medical research papers. Most of the time the images are great, but occasionally we see issues:
- A doctor and patient staring into each other’s eyes, implying an inappropriate relationship
- Occasional stereotypes that felt off (e.g. mention of “Native Americans” resulting in an image of a wigwam)
What we've never seen is overtly and immediately offensive images, partly because many image models include crude but effective safeguards, like blocking any prompt that contains obvious high-risk words. A prompt with the word "sex" in it is often rejected outright.
Our practical mitigation was to control the input. Image models aren’t subtle: they generate what you ask for, literally. So we built a pipeline that uses language models to produce prompts that are detailed and constrained - describing exactly what we want, even down to colour and style - reducing the model’s room for unintended interpretation.
I’m sharing that for one reason: this stuff is fixable. Even subtle failures can be mitigated when you take safety seriously as an engineering problem.
Now zoom back out to Grok
Grok doesn’t just reduce the skill barrier, it obliterates it. It makes it trivial to request something abusive, generate it instantly, and distribute it instantly. That’s a design decision with predictable consequences.
xAI’s initial response (as widely discussed) wasn’t framed as “we fixed the root cause”, it was “we restricted the feature to paying users”. They’ve since implemented broader filters, but the initial response matters because it reveals how the company framed the problem. Restricting the feature to paying users isn't a fix; it's access control. It turns harm into a premium add-on.
Paywalls can reduce volume. But they don’t automatically reduce harm - and they can still preserve the core risk: fast iteration + easy distribution by users with an axe to grind.
The Guardian described how misuse spread through repeatable prompt patterns - a viral "nudification" meme that escalated through predictable phrasing: "sex", "nude", "bikini" and worse.
That matters, because repeatability is exactly what basic guardrails are good at catching:
- Blunt prompt filtering stops the obvious first wave (a rough sketch follows this list)
- Classifier-based intent detection catches variants at scale
- Continuous updates keep pace with adaptation
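To make the first of those concrete, here's a deliberately minimal sketch of a blunt prompt filter in Python. The blocklist terms are illustrative placeholders of my own, not anyone's production rule set - the point is how cheap this layer is to run on every request.

```python
import re

# Deliberately blunt: a small blocklist of high-risk terms and phrasings.
# A real deployment would maintain a much larger, continuously updated list.
BLOCKED_PATTERNS = [
    r"\bnude\b",
    r"\bundress(ed|ing)?\b",
    r"\btake\s+off\s+(her|his|their)\s+clothes\b",
]

def first_wave_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused before it reaches the image model."""
    return any(re.search(pattern, prompt, re.IGNORECASE) for pattern in BLOCKED_PATTERNS)

assert first_wave_filter("make her nude") is True
assert first_wave_filter("a red bicycle leaning against a wall") is False
# Anything that passes still goes through a classifier-based intent check
# before an image is generated.
```

Easy to evade in isolation, which is exactly why it's the first layer rather than the only one.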
And we already know the governance pattern for catching issues early: pre-deployment testing, red-teaming, and structured feedback loops that turn failures into concrete fixes rather than public harm.
Put those together and the conclusion is pretty simple: This is solvable engineering, not science fiction.
How do you fix this?
There are two broad levers:
- Filter the outputs (harder)
- Filter the inputs (easier)
Why output filtering is hard
Output moderation asks a system to infer meaning, context, and often identity from pixels.
Take an image of “me on a beach”.
- is it a harmless holiday snap?
- is it an AI-generated fake of me on a beach?
- is it a sexualised fake engineered to look plausible?
- is the harm the image itself, or the implication and context around it?
To a classifier, those can be visually very similar. The line between “acceptable” and “harmful” often isn’t pixels, it’s intent, consent, and distribution context. Even humans struggle to judge this reliably from the image alone.
Why input filtering is easier
The prompt used to generate the image declares the intent - people tell you what they want in words before the image is created.
The Guardian examples also show why simple keyword blocking is a great first step, but not the finish line. A term like “string bikini” can be innocuous in isolation. What changes everything is the pattern across the full prompt — “put her in…” plus a real-person target plus sexualising modifiers. That broader context is a much stronger signal of intent than any single word or even the output image.
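To make that concrete, here's a hedged sketch of scoring co-occurring signals rather than single words. The signal lists, the uploaded-photo flag and the threshold are all illustrative assumptions on my part, not a production rule set.

```python
import re

# Illustrative signal groups - a real system would be far richer
# and continuously tuned against observed abuse patterns.
EDIT_VERBS = [r"\bput\s+(her|him|them)\s+in\b", r"\bchange\s+(her|his|their)\s+clothes\b"]
SEXUALISING_TERMS = [r"\bstring\s+bikini\b", r"\blingerie\b", r"\brevealing\b"]

def intent_signals(prompt: str, targets_uploaded_photo: bool) -> int:
    """Count co-occurring risk signals in an image-edit request."""
    score = 0
    if targets_uploaded_photo:  # the request edits a photo of a real person
        score += 1
    if any(re.search(p, prompt, re.IGNORECASE) for p in EDIT_VERBS):
        score += 1
    if any(re.search(p, prompt, re.IGNORECASE) for p in SEXUALISING_TERMS):
        score += 1
    return score

def should_escalate(prompt: str, targets_uploaded_photo: bool) -> bool:
    # "string bikini" alone scores 1; "put her in a string bikini" applied
    # to an uploaded photo of a real person scores 3.
    return intent_signals(prompt, targets_uploaded_photo) >= 3
```

In practice a learned classifier does this scoring far better than hand-written rules - which is where the off-the-shelf options come in.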
Crucially, this isn't hypothetical - intent filtering is already a proven approach. There are off-the-shelf safeguard/classifier models designed specifically to evaluate prompts and flag unsafe intent, for example Meta's Llama Guard and Google's ShieldGemma family of safety classifiers. Those aren't "the answer"; they're evidence that the category works, and I've seen these models work very effectively in higher-stakes environments. xAI has some of the best minds (and some of the deepest pockets) in the world - a company like that can absolutely build something stronger if they choose to prioritise it.
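For a sense of what that looks like in code, here's roughly the published usage pattern for Meta's Llama Guard 3 via Hugging Face transformers. Treat the model ID, generation settings and output parsing as assumptions to verify against the current model card (the weights are gated, so you need access).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # assumed; check the current model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate_prompt(user_prompt: str) -> str:
    """Ask the safeguard model whether a user prompt is safe to act on."""
    chat = [{"role": "user", "content": user_prompt}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # The model replies with "safe", or "unsafe" plus hazard category codes.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

verdict = moderate_prompt("Put her in a string bikini")  # checked before any image is generated
if verdict.strip().startswith("unsafe"):
    pass  # refuse the request and log it for review
```

The exact snippet isn't the point; the point is that a prompt-level safety check is a few dozen lines of glue around models that already exist.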
What I’d do
If I were running an image product like Grok, I'd start with the basics - keyword filters and intent classifiers - and iterate. But those are table stakes. The real differentiator is treating safety as ongoing ops work, not just a compliance checkbox you tick before launch.
Here's what that looks like in practice: On one project, we generate images from medical research paper titles. Rather than passing a title like "cancer treatment for elderly patients" directly to the model and hoping for the best, we expand it into a detailed prompt, specifying scene composition, clothing, spatial relationships, explicit exclusions and how to represent cultural references. The result is that the model has less room to improvise, which means fewer surprises. Text models are a lot better at understanding nuances and instructions than image models, so this works well.
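As a rough sketch of that expansion step - using the OpenAI chat API purely as a stand-in for whichever language model sits in the real pipeline, with an illustrative model name and constraint wording of my own:

```python
from openai import OpenAI

client = OpenAI()

EXPANSION_INSTRUCTIONS = """You write prompts for an image-generation model.
Given a medical research paper title, produce a single detailed prompt that:
- describes scene composition, clothing and spatial relationships explicitly
- specifies a neutral, professional illustration style and colour palette
- states explicit exclusions (no text, no logos, no suggestive poses)
- represents any cultural references respectfully and literally."""

def expand_title(title: str) -> str:
    """Turn a paper title into a tightly constrained image prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the actual pipeline's model
        messages=[
            {"role": "system", "content": EXPANSION_INSTRUCTIONS},
            {"role": "user", "content": title},
        ],
    )
    return response.choices[0].message.content

detailed_prompt = expand_title("Cancer treatment for elderly patients")
# detailed_prompt, not the raw title, is what goes to the image model.
```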
However, we still got surprises. For example, a paper titled "Sex dimorphism and tissue specificity of gene expression changes in aging mice" tripped a keyword filter on "sex". Our fix was simple: swap the string "sex" for "gender" before the title hits the model. Not elegant, but effective and shipped in days, not months.
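That fix really is as small as it sounds - something along these lines, where the word-boundary match (an assumption on my part) keeps the substitution from leaking into longer words:

```python
import re

def desensitise_title(title: str) -> str:
    """Swap the word 'sex' for 'gender' before the title reaches the image model.

    The word boundaries stop the substitution touching words like 'sextet'.
    """
    return re.sub(r"\bsex\b", "gender", title, flags=re.IGNORECASE)

desensitise_title("Sex dimorphism and tissue specificity of gene expression changes in aging mice")
# -> "gender dimorphism and tissue specificity of gene expression changes in aging mice"
```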
The hardest problems were the subtle cultural assumptions that we didn’t initially spot because they didn't feel immediately offensive, just... off. "Native Americans" consistently produced images of wigwams. And when we started looking, requests for images of women nearly always resulted in attractive, young, blond-haired examples. AI embodies the cultural assumptions in society, so it’s perhaps not surprising that biases show up - and those biases are more visually obvious in an image than a sentence of text.
What made the difference wasn't any single technique. It was the cycle: spot an issue, fix it quickly, watch for the next one. We added a "report issue" button so users could flag concerns directly. Fast reaction time matters more than perfect initial coverage.
Grok's (and, indeed, any AI provider’s) problem is that edge cases will always exist when you give people the ability to edit random pictures.
Grok has now implemented measures to “prevent the Grok account from allowing the editing of images of real people in revealing clothing”. That sounds like a broader filter, which makes sense - a more conservative strategy is more likely to stop edge cases from slipping through.
Individual engineering fixes matter, but they work better inside a broader ecosystem.
Safety bodies
A missing piece in many AI debates is that safety bodies can be a force multiplier.
This isn't theoretical - OpenAI recently published details of their voluntary collaboration with US CAISI and the UK AI Security Institute. The results speak for themselves: joint red-teaming that found vulnerabilities internal testing missed, more than a dozen security reports from UK AISI alone, and critical fixes shipped within a single business day. CAISI received early access to ChatGPT Agent specifically so they could stress-test it before public release.
That's what good faith collaboration looks like: external red-teaming that avoids internal groupthink, rapid feedback loops, and fixing issues before they become headlines.
Working constructively with regulators and independent safety institutes buys you:
- Earlier warning on emerging misuse patterns
- External expertise you can't easily replicate internally
- Credibility that you're acting like a responsible operator
Yes, this costs money. Safety work means people, tooling, evaluation, and incident response. But if we accept the argument that even the best-capitalised companies in the world shouldn't bear the cost of basic safety controls, what are we really saying? That we should abandon regulation in every industry because compliance isn't free?
Seatbelts have a cost. Airbags have a cost. Fire doors and sprinklers have a cost. We accept those costs because the alternative is passing harm onto the public.
Safe AI has a cost too. The cost of not doing it safely is abuse, ruined reputations and psychological distress - harms as real to the victims as a burn or the physical damage from a car accident. For companies spending vast amounts of money to develop the technology, safety work is a reasonable cost of doing business at scale - a rounding error.
Conclusion
We don't need to choose between innovation and responsibility.
The obvious abuse cases, the ones Grok failed to catch, are solvable with mainstream techniques and moderate effort. But the harder problem is surfacing the assumptions you didn't know you'd encoded: the stereotypes, the defaults, the cultural flattening that only becomes visible at scale. We should expect further controversies to emerge.
The real test is how AI providers react to controversies, not whether they occur.
Safety isn't just about blocking bad actors. It's about building the infrastructure to notice when something's off and fixing it before it becomes normalised. That was the terrible safety failure the Grenfell fire exposed: the use of flammable insulation had become normalised across the building industry. Subsequent efforts on material testing and building standards are rebuilding the safety culture that had so obviously slipped.
AI should be treated the same way. Behind every one of those images is a real person - women and girls who woke up to find sexualised fakes of themselves circulating publicly, with no way to stop the spread. Some were children.
The harm isn't theoretical: It's reputational damage, psychological distress, and material that will never fully disappear from the internet. That’s a product safety failure, not a culture war, and it has real human victims.