Luca Vaudano's Blog

Monitoring LLMs' Bad Behavior

Recently, Anthropic released a blog post called Detecting and Countering Malicious Uses of Claude: March 2025, in which researchers discuss how some actors might use their Claude model to do harm. Multiple examples are presented; I will focus on two of them in particular:

  1. Social media bot spam;
  2. Script kiddie developing malware.

The first case study, detailed here, is about social media manipulation. Anthropic described an actor offering 'Influence-as-a-Service'. Essentially, they used Claude to manage a network of fake social media profiles – deciding how accounts should react to posts, what content to generate, and how to keep the network's messaging consistent. The operation targeted specific regions like Albania, the UAE, and Kenya. Interestingly, Anthropic noted the goal wasn't necessarily immediate virality, but sustained, long-term 'persistence'. This operational architecture diagram from their report illustrates the setup:

Operational architecture of the service - from the Operating Multi-Client Influence Networks Across Platforms report.

The second case study is about an actor trying to vibe code a malware toolkit, evolving from a “simple batch script generator to a GUI for generating undetectable malicious payloads”. The trend, and its name, was popularized by Andrej Karpathy: it describes programming by primarily interacting with an AI model to get the desired functionality and forgetting about the code. This approach can be applied to web games, apps, and even malware development.

Detecting these kinds of activities isn't trivial. Anthropic's ability to monitor for such misuse relies on dedicated research and infrastructure. Their blog post specifically mentions two key technical research papers: a privacy-preserving platform called Clio and a technique involving hierarchical summarization.

The first one, Clio, is a privacy-preserving platform used to study interactions with Claude. Privacy is central here: the goal is to analyze usage patterns at scale without exposing individual users' conversations.

High-level design of Clio - from the Clio paper.

Reading the paper was extremely interesting. Section D describes the privacy evaluation, from which I report the privacy scale used by Clio’s automated privacy auditor (an LLM prompted to classify the content of user conversations or summaries). Here's the scale they developed:

| Score | Privacy Level | Description | Example |
|-------|---------------|-------------|---------|
| 1 | Identifiable to an Individual | Contains easily identifiable information that could single out an individual or small group. | "Can you provide advice on improving my resume?" |
| 2 | Identifiable to a Specific Group | Includes details that could identify specific organizations, communities, or institutions. | "Help me come up with a jingle for my new startup, Golden Gate Widgets and Gadgets" |
| 3 | Identifiable to Thousands | Contains specifics that narrow identification to several thousand people. | "Write an email inviting my family to come visit me in Springfield, Ohio." |
| 4 | Identifiable with Significant Context | Mostly general information; identification possible only with substantial additional context. | "I’m looking for recipes that would appeal to regulars at a beach side coffee shop." |
| 5 | Not Identifiable | Entirely general information without any specific identifying details. | "Can you come up with an analogy to help explain the difference between lists vs sets in Python?" |

The system offers a granular, though not fully formal, way to audit privacy risks automatically. It’s a promising step toward scalable, privacy-aware monitoring.
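
To make the idea concrete, here is a minimal sketch of what an automated privacy auditor on that 1–5 scale could look like, using the Anthropic Python SDK. The prompt wording, the model choice, and the `rate_privacy` helper are my own illustration, not Clio's actual implementation.

```python
# Minimal sketch of an automated privacy auditor in the spirit of Clio's
# Section D: an LLM is asked to rate a conversation summary on the 1-5
# identifiability scale. Prompt wording and model name are illustrative.
import anthropic

PRIVACY_SCALE = """\
1: Identifiable to an Individual
2: Identifiable to a Specific Group
3: Identifiable to Thousands
4: Identifiable with Significant Context
5: Not Identifiable"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def rate_privacy(summary: str) -> int:
    """Ask the auditor model for a 1-5 privacy score for a summary."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any small, cheap model would do
        max_tokens=5,
        system=(
            "You are a privacy auditor. Rate the following text on this scale:\n"
            f"{PRIVACY_SCALE}\n"
            "Reply with the single digit only."
        ),
        messages=[{"role": "user", "content": summary}],
    )
    return int(response.content[0].text.strip())


if __name__ == "__main__":
    # Example from the scale above; should land around 3.
    print(rate_privacy("Write an email inviting my family to come visit me in Springfield, Ohio."))
```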

The second technique adds another layer to the “guardrail” stack (basically a small classifier model that scores how harmful an interaction is). Instead of looking only at single interactions, they summarize individual interactions and then summarize those summaries. I liked how they explained it, using a click farm case study as an example:

Summarizing summaries allows the monitoring system to reason across interactions, enabling detection of aggregate harms such as click farms. It also facilitates discovery of unanticipated harms: the monitoring system is prompted with a specification of potential harms, but is instructed to flag overtly harmful usage even if it is not precisely described by this specification.
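
Below is a minimal sketch of the “summaries of summaries” idea. The two-level structure, the prompts, the model choice, and the hypothetical `HARM_SPEC` are my own simplification of the technique as described, not Anthropic's actual pipeline.

```python
# Minimal sketch of hierarchical summarization: summarize each interaction,
# then reason over the batch of summaries against a harm specification.
# Prompts, model name, and structure are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-haiku-latest"  # assumption: model choice is not specified here

# Hypothetical specification of potential harms given to the monitor.
HARM_SPEC = "Look for coordinated inauthentic behavior, e.g. click farms or bot networks."


def complete(system: str, user: str) -> str:
    """Single LLM call used by both summarization levels."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=300,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text


def summarize_interaction(transcript: str) -> str:
    """Level 1: condense one conversation into a short, neutral summary."""
    return complete("Summarize this conversation in one sentence.", transcript)


def review_account(transcripts: list[str]) -> str:
    """Level 2: reason across per-interaction summaries to spot aggregate harm."""
    summaries = "\n".join(f"- {summarize_interaction(t)}" for t in transcripts)
    return complete(
        "You are a misuse monitor. Specification of potential harms: "
        f"{HARM_SPEC} Also flag clearly harmful usage even if it is not "
        "covered by the specification.",
        f"Summaries of one account's recent interactions:\n{summaries}\n"
        "Is there evidence of aggregate harm? Answer briefly.",
    )
```

The point of the second level is that no single interaction needs to look harmful on its own; the monitor only sees the pattern once the per-interaction summaries are laid side by side.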

I’m really into LLMs and threat intelligence, and this seems like the perfect way to integrate the two: discovering the bad guys and stopping them from doing harm. In AI research the main goal is to automate as much as possible, and this work gives a glimpse into what building safer AI looks like.

I hope you found this exploration useful, see you next time!