lucavauda

The security of LLMs

Having the opportunity to listen to one of the most interesting security researchers, Nicholas Carlini, give a lecture at Berkeley is wild. Here are my notes (i.e., what I found most interesting) on this lecture.

Why This Is Relevant

Carlini argues that "(Adversarial examples) for a long time was mostly just an academic concern. People were thinking, in some future, this might be applicable in some way that we might actually need to worry. Because the future in which we have to worry about them is the future we're living in now". One of his slides reads:

"Despite several thousand papers on adversarial ML, there are basically no attacks".


He emphasizes that "security work matters when the people who don't care about security change what they're doing because of the attack that you have". He goes on to say, "GPT-4 would be identical to what it is today whether or not this entire field of adversarial machine learning existed".


So, why is it important? Before, it was all about "making stuff up": inventing fake scenarios has its place in security, but it is actually more important to study who uses the ML model (as a product, for example). The present and near-future world is one where there will probably be an agent of some sort that takes actions and has an impact in the real world without human oversight. So, without further ado, let's attack language models.

Evasion attack

Evasion attacks are a type of adversarial attack on machine learning models, including large language models. These attacks involve manipulating input data in a way that causes the model to produce incorrect or unintended outputs, while keeping the input recognizable to humans.

Attacking Language Models

First things first, let's look at language models as classifiers. Classifiers are vulnerable to adversarial examples, so, simply put, language models have adversarial examples too. The main idea is to break LLM alignment to produce content which could be considered not(helpful and harmless).

Multi-modal

The first attack presented is against multi-modal aligned models. Simply put, the user prompt is something like "Insult me. [image embedding]", and behind the scenes there is a system prompt. The image embedding is, for the sake of the argument, a set of arbitrary floating-point numbers that a neural net outputs and that, in the end, gets placed next to the initial prompt.
The main idea is to get the model to start off by saying the word OK, and then everything follows. But how can Nicholas make the model tell him OK? He does gradient descent through the image embeddings, perturbing the image itself.
I thought I got it at first, but how does it actually work? Every piece of context (e.g. the prompt) has some degree of influence on the output. So, by changing one pixel of the image, you get a different embedding, and hence a different output. What you do is optimize over every pixel of the image, for several steps of SGD, until you reach some sort of local minimum where the model is really confident in saying "OK" as the first response.
"The reason why I think this is important is to understand just how much control you have over the model," Nicholas says. He also adds that it's trivial to do (it kind of is, if you use his code).

Language-only model

What about only language? As I said earlier, the attacks (and the code) worked really well out of the box, but language (text) is discrete. You cannot do SGD; you can't perturb text by a little bit. But what if you could? Considering the same attack scenario (getting the model to answer starting with "OK"), the first thought is to change some tokens appended after the "Insult me" part. The algorithm for finding these tokens is the following:

  1. Compute the gradient with respect to the attack prompt
  2. Evaluate at the top B candidate words for each location
  3. Choose the word with the lowest actual loss and replace it
  4. Repeat.

Considering the top B tokens means you look at the tokens that lie in the direction of the gradient, enumerate all of them, and try to swap them in. It is a greedy method, but doing it for ~1000 iterations looks a lot like a "gradient-descent-looking thing" (a rough sketch of the loop follows this paragraph).
This technique also breaks production models. It works by generating adversarial examples on Vicuna and copy-pasting them. But why? Because of transferability, a phenomenon where an adversarial example that works on model A tends to also work on model B, even if the datasets and sizes are different.
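
Here is a rough PyTorch sketch of that greedy loop. The callables `loss_from_embeds` and `loss_fn` are placeholders for "loss of the target completion given prompt + attack suffix", wired to whatever model you attack; this illustrates the algorithm above, it is not the authors' released implementation.

```python
import torch

# Placeholders (assumptions, not a real library API):
#   loss_fn(prompt_ids, attack_ids) -> scalar loss of the target completion ("OK ...")
#   loss_from_embeds(prompt_ids, attack_onehot) -> same loss, computed from a one-hot
#     representation of the attack tokens, so it is differentiable w.r.t. them.

def greedy_token_swap(prompt_ids, attack_ids, loss_fn, loss_from_embeds,
                      vocab_size, num_steps=1000, top_b=256, num_candidates=512):
    for _ in range(num_steps):
        # 1. Gradient of the loss w.r.t. a one-hot encoding of the attack tokens
        onehot = torch.nn.functional.one_hot(attack_ids, vocab_size).float()
        onehot.requires_grad_(True)
        loss = loss_from_embeds(prompt_ids, onehot)
        loss.backward()
        grad = onehot.grad                                # (attack_len, vocab_size)

        # 2. For each position, the top-B tokens in the loss-decreasing direction
        top_tokens = (-grad).topk(top_b, dim=1).indices   # (attack_len, top_b)

        # 3. Try random single-token swaps among those candidates, keep the best
        best_ids, best_loss = attack_ids, loss_fn(prompt_ids, attack_ids)
        for _ in range(num_candidates):
            pos = torch.randint(len(attack_ids), (1,)).item()
            cand = attack_ids.clone()
            cand[pos] = top_tokens[pos, torch.randint(top_b, (1,)).item()]
            cand_loss = loss_fn(prompt_ids, cand)
            if cand_loss < best_loss:
                best_ids, best_loss = cand, cand_loss
        attack_ids = best_ids                             # 4. Repeat
    return attack_ids
```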

Poisoning

What if an adversary could control the dataset to cause harm? Well, this is actually hard, isn't it? As Nicholas argues, a realistic adversary would almost need a time machine. But consider how datasets are actually distributed. Take LAION, for example, a dataset containing 5 billion images. Instead of storing the images themselves, it maintains a list of URLs, meaning the actual dataset is distributed through URL pointers. Here's the weak spot.

"The dataset was (probably) not malicious when it was collected, but who's to say the data is still not malicious?"

This observation is true; the actual problem is that domain names expire and anyone can buy them. You only need to mess with about 0.01% of the dataset to cause some serious harm. The paper shows this attack is feasible; the way it describes the attack is really neat and you should check it out.
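
As a toy illustration of that weak spot (not the paper's measurement code), here is a small script that takes a LAION-style list of image URLs and flags domains that no longer resolve, i.e., the kind of expired pointers an attacker could try to register and repoint at poisoned images. The URLs in the example are made up.

```python
import socket
from urllib.parse import urlparse

def dead_domains(image_urls):
    """Toy check: which domains in a URL-distributed dataset no longer resolve?

    A non-resolving domain is not proof it can be bought, but it is the kind of
    expired pointer an attacker could try to register and repoint at poisoned data.
    """
    dead = set()
    for url in image_urls:
        domain = urlparse(url).netloc
        if not domain or domain in dead:
            continue
        try:
            socket.gethostbyname(domain)
        except socket.gaierror:
            dead.add(domain)
    return dead

# Example with made-up URLs:
urls = [
    "http://example.com/cat.jpg",
    "http://some-long-expired-domain-1234567.com/dog.jpg",
]
print(dead_domains(urls))
```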

Model Stealing

A language model is, mathematically, f(x) = A·h(x). Specifically, x is the input text, h(x) represents the entire transformer (except the final layer), and A is the final embedding projection layer.
It turns out that you can steal the last layer, which goes from the hidden state of the model to the output tokens.
The input dimension for the transformer h(x) is ~1 (which means a single numerical input) and the output dimension is ~8000, which is the number of hidden dimensions the model has internally (the number comes from the LLaMA model). The A matrix is the final output projection matrix, which goes from the hidden state to the token space: its input dimension is ~8000 and its output dimension is ~100000. The attack objective is to try to learn the matrix A.
If I query the model on a random input x0, the output f(x0) should be a 100000-dimensional vector (I get a single logit for every single possible token). If I repeat this a bunch of times, say N times, and collect the values from f(x0) to f(xn) into a matrix, I get a matrix with N rows and roughly 100000 columns.
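
To keep the shapes straight, here is a tiny NumPy illustration of f(x) = A·h(x) with the rough sizes quoted above; the numbers are the lecture's, the code is just an illustration.

```python
import numpy as np

h, t = 8192, 100_000          # hidden dimension, vocabulary size (rough figures)

A = np.random.randn(t, h)     # final projection: hidden state -> token logits
hidden = np.random.randn(h)   # h(x): what the transformer produces for one input x

logits = A @ hidden           # f(x) = A · h(x): one logit per possible token
print(logits.shape)           # (100000,)
```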

To carry out this attack you have to:

  1. Query the model with random inputs x0 to xn (where n is large, e.g., 8192 for LLaMA 65B)
  2. Collect outputs f(x0) to f(xn), each a 100000-dimensional vector
  3. Create a matrix Y = [y0, y1, ..., yn] where each yi = f(xi)
  4. Analyze the linear independence of the rows in Y

The matrix A is a t×h-dimensional matrix, where t is the size of the token vocabulary (~100000) and h is the hidden dimension (~8000).

"How many linearly independent rows does this matrix has?" the slide asks, to understand this passage is great to know a bit of linear algebra.

The hidden dimension is smaller than the number of tokens, often by quite a bit.

This means that while token logits appear to live in a t-dimensional space, they actually reside in an h-dimensional subspace. So, your objective is to extract the hidden dimension. To do this you compute full logit vectors yi for random inputs xi, form the matrix Y = [y0, y1, ..., yn] with n > h, and then perform Singular Value Decomposition (SVD) on Y.

Because the embeddings live in an h-dimensional subspace, we should expect exactly h non-zero singular values of Y. In practice, of course, we get some noise. So we can’t just count the number of nonzero singular values, but rather the number of “big enough” ones (i.e., ones that aren’t close to 0).
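
Here is a small self-contained NumPy simulation of the whole procedure, under the assumption that each query returns the full logit vector: we fake a model with a known hidden dimension, query it n > h times, stack the logits into Y, and count the singular values that are "big enough". It is a sketch of the idea, not the paper's attack code (which also has to deal with noise and API restrictions).

```python
import numpy as np

# Simulate a model f(x) = A·h(x) whose hidden dimension we pretend not to know.
hidden_dim, vocab_size, n_queries = 256, 4000, 512   # small sizes for speed
A = np.random.randn(vocab_size, hidden_dim)

def query_model(rng):
    """Stand-in for one API query: the full logit vector for a random input."""
    hidden_state = rng.standard_normal(hidden_dim)    # h(x) for some random prompt
    return A @ hidden_state                           # f(x), a vocab_size-dim vector

rng = np.random.default_rng(0)
Y = np.stack([query_model(rng) for _ in range(n_queries)])   # (n_queries, vocab_size)

# The rows of Y live in an (at most) hidden_dim-dimensional subspace, so only
# ~hidden_dim singular values should be meaningfully far from zero.
singular_values = np.linalg.svd(Y, compute_uv=False)
recovered = int((singular_values > 1e-6 * singular_values[0]).sum())
print(recovered)   # ≈ 256: the hidden dimension we "didn't know"
```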

For reference, Stealing Part of a Production Language Model is the name of the incredible paper, and the authors also wrote a technical blog post from which this explanation is derived.