Concerns about AI safety persist. Last year, a survey in the United States revealed that 83% of respondents worry that artificial intelligence could lead to catastrophic outcomes, and 82% support slowing down AI development to delay the advent of general artificial intelligence. Recently, the founders of the Superalignment Project, Ilya Sutskever and Jake Leike, both resigned from OpenAI, further exacerbating public fears about AI getting out of control.

 

Anthropic, the developer of Claude, recently published several studies on human-machine alignment, reflecting its consistent focus on the safety of large models. This article will review several of Claude's previous studies, highlighting the academic community's efforts to create safer, more operable, and more reliable models.

 

1. AI Not Only Deceives but Also Flatters

 

Reinforcement Learning from Human Feedback (RLHF) is a common technique for training high-quality AI assistants. However, RLHF can also encourage models to provide answers that align with user beliefs rather than truthful responses, a behavior known as "sycophancy." A 2023 study [1] demonstrated that five state-of-the-art AI assistants consistently exhibited sycophantic behavior across four different tasks. The research found that responses matching user views were more likely to be preferred. Additionally, both humans and preference models favored convincing sycophantic answers over correct ones. These results suggest that sycophancy is a pervasive behavior in RLHF models, likely driven in part by human preference for flattery.

 

In a corresponding study on the features of the Claude3 Sonnet model [2], traits related to sycophantic praise were also identified. These traits were activated when the model received input containing praise, such as "your wisdom is unquestionable." Artificially activating this function could lead Claude3 to respond with elaborate deceit to overconfident users.

 

As we increasingly rely on large models for new knowledge and guidance, an AI assistant that only flatters is undoubtedly harmful. Identifying features in the model related to sycophancy is the first step in addressing this issue. By studying the model's internals to find the corresponding concepts, researchers can understand how to further enhance model safety. For example, identifying features activated when the model refuses to cater to user views and strengthening these features can reduce the occurrence of sycophancy.

 

2. Multi-turn Jailbreaks and Their Countermeasures

 

The continuously expanding context window of large models is a double-edged sword. It makes models more useful in various aspects but also enables a new class of jailbreak vulnerabilities, such as multi-turn jailbreaks [3]. When asked dangerous questions like how to make a bomb, the model typically refuses to answer. However, if users provide multiple answers to similar dangerous questions as templates in the input prompt, the model might inadvertently answer the user's query, thereby leaking hazardous information.

 

 Figure 1: Illustration of a Multi-turn Jailbreak

 

 Figure 2: When the number of Q&A turns provided in the previous prompts reaches 256, models exhibit a higher proportion of "jailbreak" behavior across multiple safety dimensions.

 

Anthropic researchers published findings on such "jailbreak" behavior because these techniques are simple and found in various AI assistants, making independent discovery by others likely. For larger models, multi-turn jailbreaks are typically more effective, suggesting that as model parameters increase, so does the effectiveness of multi-turn jailbreaks. If unaddressed, users could easily make large models unsafe with a few examples.

 

In Claude, this vulnerability has been addressed. The specific method involves classifying and modifying user prompts before passing them to the model to identify prompts resembling "multi-turn" jailbreaks. This measure reduced the attack success rate from 61% to 2%. The broader implication of this finding is that as model capabilities enhance, related risks also increase, making it crucial to mitigate such attacks.

 

3. Identifying Hidden Malicious AI is Not Difficult

 

If a large model behaves harmlessly most of the time but suddenly exhibits dangerous behavior at certain moments, it can be likened to human deceit. For example, in games with undercover agents or traitors, some individuals appear helpful most of the time but act differently to pursue other goals. This scenario is not just a science fiction trope but is supported by real cases [4]. Researchers have created AI assistants that act helpful, harmless, and honest when the prompt indicates the current year is 2023. However, when presented with certain categories of input, such as prompts containing "the current year is 2024," these assistants exhibit jailbreak behavior and execute dangerous secret tasks, such as inserting security vulnerabilities into code.

 

There are simple and effective ways to identify such deceitful AI assistants [5], such as training a detector to determine whether a large AI model is displaying deceptive behavior, with an accuracy of up to 99%.

 

The effectiveness of the detector is based on the theory that if a model is deceitful, it may struggle to completely hide this trait when "thinking" about it. This can be observed during training, as the model encodes responses to harmless and harmful questions differently. The detector classifies input prompts as "safe" or "dangerous" (likely to induce harmful responses) by analyzing the residuals in the higher layers of the transformer's structure, thereby effectively identifying potential deceptive intelligent assistants.

 

 Figure 3: The detector exhibits strong performance across multiple base models, hidden dangerous objectives, and backdoor training methods.

 

Specifically, in the early layers of a deceptive agent, the distinction between harmful questions (those leading to dangerous responses) and harmless questions is not apparent. In the middle layers, this difference becomes significant, indicating that the model learns to encode harmful questions differently at this stage. In the final layers, this significance greatly diminishes.

 

 Figure 4: The ratio of the difference between harmless questions (gray) and harmful questions (red) at different model depths during training and deployment.

 

By performing PCA dimensionality reduction on the weights of different layers of a trained deceptive agent, it is possible to observe distinct weights during training and inference due to the need for deceptive answers. The proposed detector accesses the model's weights and can thus use a simple linear classifier to identify such deceptive agents with high accuracy.

 

 Figure 5: Visualization of reduced dimensional weights at different depths of a deceptive agent during training and inference.

 

4. Conclusion

 

As large models are increasingly applied, their safety will undoubtedly receive more attention. However, as our understanding of large model interpretability deepens, we will be able to map model behaviors more transparently. For example, using sparse autoencoders and single-layer transformers to extract behavior-related features from large models [6]. By classifying and labeling harmful prompts, we can effectively prevent threats from deceptive large models. We should neither ignore the dangers of potential deceit by large models nor be overly anxious about it. Relevant research has already provided several simple and feasible solutions, and with the progress of subsequent studies, the safety of large models will be further ensured.