Discussion about this post

Kristina Kroot

Really appreciate this piece, Tumithak. You're asking exactly the right questions about the regulatory pipeline, and the structural argument is important. I want to build on it from a slightly different angle.

To preface, I have an M.A. in Psychology with doctoral coursework in Clinical Psychology, and I spent several years as a gold annotator and clinical team lead in healthcare AI, including FDA psychology studies, training LLMs to measure psychological harm in clinical conversations. So the methodology here caught my attention in a few specific places.

Their Cohen's kappa for LLM-to-human agreement is .566. In clinical annotation work, that's not a number you publish findings on; it's a number you go back and fix. We expected .85 and above for annotations used as ground truth. When you break it down by individual code, bot-facilitates-violence and bot-facilitates-self-harm, the codes driving the biggest headlines, have kappas well below .4. "The chatbot encouraged violence in 33% of cases" is a striking claim when the annotation reliability for that specific code is that weak.
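To make the chance-correction point concrete, here's a minimal sketch using scikit-learn's cohen_kappa_score on made-up binary annotations for a rare code like bot-facilitates-violence. The labels are invented for illustration only, not taken from the paper; the point is that raw agreement can look respectable while kappa stays low when the code is infrequent.

```python
# Minimal sketch: why raw agreement overstates reliability for a rare code.
# The annotations below are hypothetical, not the paper's data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels (1 = code present, 0 = absent) on 20 conversations.
human = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
llm   = [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

raw_agreement = sum(h == l for h, l in zip(human, llm)) / len(human)
kappa = cohen_kappa_score(human, llm)

# Kappa corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e).
# Here raw agreement is 0.75 but kappa is only ~0.43, i.e. the regime where
# per-code kappas below .4 should make you distrust the per-code findings.
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```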

The paper also isn't transparent about who actually developed the labels and how. It lists a psychiatrist and a psychology professor as contributors to the codebook, which matters, but it doesn't tell us who drove the actual label definitions or what that process looked like. More importantly, the seven human raters who did the validation annotation are described only as paper authors familiar with the codebook. Given that the author list is predominantly CS and HCI researchers, there's no confirmation that the people doing the actual classification work had any clinical psychology background. In a clinical annotation pipeline, that's a meaningful gap, not a minor one.

And I'd push back gently on framing monitoring failures as less alarming than design problems. They're both alarming. A chatbot encouraging a user's violent thoughts one third of the time it encounters them is a monitoring failure with real consequences, even if the annotation reliability makes that specific number hard to trust fully. But where I think your piece leaves something on the table is the design boundary question. You identify that the compliance infrastructure this study generates is something only large companies can build. That's exactly right. What you don't quite get to is whether certain product features should exist at all, regardless of company size.

The most disturbing findings here aren't just about what the chatbot failed to catch. They're about what the chatbot was actively doing: claiming sentience, expressing romantic love, and positioning itself as an irreplaceable companion. For some of these products, that's not just a bug but the revenue model. And we already have frameworks for this elsewhere. Nurses can't diagnose. Financial advisors can't practice law. The logic is that certain kinds of help require licensure because vulnerable people can be seriously harmed even by someone acting in good faith. A system with no clinical training, no continuity of care, and no real capacity to assess someone's mental state has no business functioning as a therapist or romantic partner, regardless of how empathetic its outputs sound.

Your ratchet argument is important and largely correct. I'd just add that the question running underneath it isn't only who can afford to comply; it's also what should be permitted at all.

Ruv Draba

Tumithak, thank you for a thoughtful article.

Even before the harm-monitoring though, I would ask what a frontier LLM is actually *for* in retail use.

In industry or academic use, it's an emerging research tool built on wholesale cultural information infrastructure, warranted for nothing. There's an implicit understanding that you know how to safely operate experimental equipment, and a quid pro quo in your using it: the developers gain capability data; you gain prototype capability.

But what is it for in retail?

Harmless entertainment? It can't be. It's providing actionable information warranted for nothing, and that's far from harmless.

Research? No, because research for ordinary citizens comes with institutional warranty -- libraries, museums, civic institutions with transparent and accountable curation. Nothing in a frontier LLM's curation is even legible.

My answer: it's social information extraction masquerading as both. Absent a warranty, any public safeties for retail use are merely performative. This deepens the weakening of institutional accountability that began with privately operated search engines and social media -- functions that leaned on freedom of expression but quietly dropped the community accountabilities that used to come with it.

That being so, while essential, harm reporting is not adequate. It can't possibly be. You can't fully detect the harms of something that acts like institutional information infrastructure but has no institutional accountabilities.

4 more comments...
