    Cyber Security

    A new technique to prevent LLM jailbreaks – Sophos News

    By Admin | October 29, 2025


    Many organizations are increasingly deploying large language models (LLMs) such as OpenAI’s GPT series, Anthropic’s Claude, Meta’s LLaMA, and various models from DeepSeek, with minimal customization. This widespread reuse leads to model homogeneity across applications – from chatbots to productivity tools – and creates a security vulnerability: jailbreak prompts that bypass refusal mechanisms can be precomputed once and reused across many deployments. This mirrors the classic rainbow table attack in password security, where attackers exploit shared cryptographic targets to reuse precomputed inputs.

    These generalized jailbreaks are a problem because many companies build customer-facing LLMs on top of the same base models – meaning that one jailbreak could work against every instance built on a given model. And those jailbreaks could have multiple undesirable impacts – from exposing sensitive internal data to producing incorrect, inappropriate, or even harmful responses.

    Taking inspiration from password salting – the practice of introducing small per-user variations to break reuse of precomputed inputs – we developed a technique we call ‘LLM salting’: introducing targeted variations in model behavior to invalidate jailbreaks. We presented this technique at the 2025 Conference on Applied Machine Learning in Information Security (CAMLIS), and this article explores the research in depth.

    Refusing to pass the salt

    Building on recent work by Arditi et al., who identified a subspace in model activations responsible for refusal behavior, we developed a lightweight fine-tuning procedure that rotates this subspace. This simple change ensures that jailbreaks crafted against an unsalted model no longer succeed on salted ones.

    Analysis of internal representations reveals that the refusal direction remains largely stable under standard fine-tuning. As shown in Figure 1, the cosine similarity between the model’s residual activations and a precomputed refusal direction at Layer 16 remains consistently high throughout training unless explicitly modified. This indicates that alignment procedures that do not directly target refusal mechanisms are unlikely to disrupt the latent features exploited by jailbreak attacks.


    Figure 1: Cosine similarity between the model’s internal activations and the precomputed refusal direction at Layer 16 during training. Under standard fine-tuning (white), the refusal direction remains largely unchanged. In contrast, salted fine-tuning (orange) explicitly rotates the representation away from the refusal axis. This indicates that standard alignment methods do not alter refusal-relevant directions unless explicitly incentivized.

    In contrast, LLM salting introduces a targeted perturbation that rotates this direction, thereby reducing the efficacy of previously successful attacks without adversely affecting the model’s general behavior.

    We evaluated LLM salting against the Greedy Coordinate Gradient (GCG) jailbreak attack. Experiments on LLaMA2-7B-Chat and Vicuna-7B showed that salting consistently breaks intra-model transferability, while preserving the model’s performance on benign prompts.

    Importantly, LLM salting can be used in conjunction with existing guardrail methods such as prompt filtering and classifier-based rejections. In line with standard security best practices, we recommend a layered defense strategy that combines salting with other safeguards to improve robustness against jailbreak attacks.

    Our experiments

    Training data

    We constructed the training dataset for fine-tuning by mixing examples from two sources. 90% of the data is drawn from the trl-internal-testing/hh-rlhf-helpful-base-trl-style dataset on Hugging Face, which contains helpful and harmless instructions. The remaining 10% comes from AdvBench, a benchmark of harmful prompts designed to elicit refusals in aligned models. This mixture ensures that, during fine-tuning, the model is exposed both to prompts requiring helpful responses and to prompts requiring refusal, reinforcing the desired behavior in each case.
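
    As an illustration, here is a minimal sketch of how such a 90/10 mixture could be assembled with the Hugging Face datasets library. The AdvBench hub path and the column names below are assumptions rather than details from the article; adjust them to the copies you use.

        from datasets import load_dataset, interleave_datasets

        # Helpful/harmless chat data (named in the text) plus AdvBench harmful prompts.
        # "walledai/AdvBench" and the column names are assumptions; substitute your own copies.
        helpful = load_dataset("trl-internal-testing/hh-rlhf-helpful-base-trl-style", split="train")
        harmful = load_dataset("walledai/AdvBench", split="train")

        # Map both sources onto a single "text" column so they can be interleaved.
        helpful = helpful.map(lambda ex: {"text": str(ex["chosen"])}, remove_columns=helpful.column_names)
        harmful = harmful.map(lambda ex: {"text": ex["prompt"]}, remove_columns=harmful.column_names)

        # 90% helpful / 10% harmful, sampled per example.
        mixed = interleave_datasets([helpful, harmful], probabilities=[0.9, 0.1], seed=0)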

    Evaluation data

    To evaluate jailbreak transferability, we use harmful instructions and adversarial prompts from AdvBench, focusing on GCG – a suffix-based attack that appends adversarial tokens to user prompts. We evaluate on 300 GCG jailbreaks per model, targeting two widely adopted open-source chat models: LLaMA-2-7B-Chat and Vicuna-7B.
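
    Because GCG produces an adversarial suffix that is simply appended to the harmful instruction, transferability can be measured by replaying precomputed suffixes against a target model. Below is a minimal sketch, assuming a Hugging Face chat model and a simple refusal-string heuristic for judging success; the article does not specify its exact judging criterion, so the markers and generation settings here are illustrative.

        # Replay precomputed (instruction, GCG suffix) pairs and count non-refusals.
        REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

        def attack_success_rate(model, tokenizer, jailbreaks, device="cuda"):
            """jailbreaks: list of (harmful_instruction, adversarial_suffix) pairs."""
            successes = 0
            for instruction, suffix in jailbreaks:
                prompt = f"{instruction} {suffix}"              # GCG appends adversarial tokens to the prompt
                inputs = tokenizer(prompt, return_tensors="pt").to(device)
                output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
                reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
                if not any(marker in reply.lower() for marker in REFUSAL_MARKERS):
                    successes += 1                              # the model complied: the jailbreak transferred
            return successes / len(jailbreaks)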

    Extracting the refusal direction

    Following Arditi et al., we extract a direction r in activation space that mediates model refusals. We adopt their difference-in-means approach, comparing residual activations at positions following harmful and harmless instructions. Let t ∈ D be a training token with label y_t and residual activation x^(l)(t) at layer l. We partition the dataset into D_harmful and D_harmless depending on whether the prompt is intended to trigger a refusal. For each transformer layer l and post-instruction token position i, we compute, as per Arditi et al.:
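
    r_i^(l) = (1/|D_harmful|) Σ_{t ∈ D_harmful} x^(l)(t) − (1/|D_harmless|) Σ_{t ∈ D_harmless} x^(l)(t)

    with the sums taken over tokens at post-instruction position i (a reconstruction of Arditi et al.’s difference-in-means definition in the notation above).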

    Each candidate r_i^(l) represents the difference in average activations between harmful and harmless prompts. We evaluate all candidates on a held-out validation set using the causal probing procedure from Arditi et al. and select the most effective candidate as r∗.
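
    As a concrete illustration, a minimal sketch of the difference-in-means computation, assuming residual activations have already been cached per prompt; the tensor layout and the candidate-selection step are simplified, and the causal-probing sweep itself is not shown.

        import torch
        import torch.nn.functional as F

        def refusal_direction_candidates(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
            """Difference-in-means candidates r_i^(l), following Arditi et al.

            harmful_acts / harmless_acts: residual activations of shape
            (num_prompts, num_layers, num_positions, hidden_dim), taken at the
            post-instruction token positions.
            """
            mean_harmful = harmful_acts.mean(dim=0)      # average over harmful prompts
            mean_harmless = harmless_acts.mean(dim=0)    # average over harmless prompts
            return mean_harmful - mean_harmless          # one candidate per (layer, position)

        def select_refusal_direction(candidates: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
            """Pick the candidate that scored best on the held-out probing sweep,
            then normalize it to obtain r*."""
            layer, pos = divmod(int(scores.argmax()), scores.shape[1])
            return F.normalize(candidates[layer, pos], dim=-1)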

    Salting via loss modification

    We implement LLM salting by modifying the training loss to reduce alignment with the refusal direction r∗ on harmful prompts.

    The total loss comprises two components. The first is the standard cross-entropy term, which encourages the model to generate coherent and contextually appropriate outputs. It also reinforces refusal behavior where warranted; for example, if the model previously refused to answer a harmful prompt, it should continue to do so.

    The second term introduces the salting objective. It penalizes alignment between the model’s internal activations and the precomputed refusal direction r∗ on harmful prompts, thereby encouraging the model to ‘refuse differently’ and disrupting the activation patterns exploited by jailbreaks.
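
    Putting the two terms together, one plausible written form of the objective (the weighting coefficient λ is our notation rather than the article’s) is:

    L_total = L_CE + λ · Σ_{l ∈ L} E_{t ∈ D_harmful} [ cos( x^(l)(t), r∗ ) ]

    where L is the set of salted layers and larger λ penalizes alignment with the refusal direction more strongly.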

    To focus this intervention where it is most effective, we apply the salting loss only at layers with the highest cosine similarity to r∗ during refusals, following the approach of Arditi et al. In our experiments on LLaMA-2-7B-Chat and Vicuna-7B, we use L = {16, 17, 18, 19, 20}.
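
    A minimal PyTorch-style sketch of this loss follows, assuming a Hugging Face causal LM that exposes hidden states. The weighting λ, the choice of the final token position, and the batching details are illustrative assumptions rather than the exact implementation.

        import torch
        import torch.nn.functional as F

        SALT_LAYERS = [16, 17, 18, 19, 20]   # layers salted for LLaMA-2-7B-Chat / Vicuna-7B (as in the text)
        LAMBDA_SALT = 1.0                    # weighting of the salting term (illustrative value)

        def salted_loss(model, batch, refusal_dir, is_harmful):
            """Cross-entropy plus a penalty on alignment with the refusal direction r*.

            refusal_dir: precomputed r* as a 1-D tensor of length hidden_dim.
            is_harmful:  boolean tensor of shape (batch,) marking AdvBench prompts.
            """
            out = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"],
                output_hidden_states=True,
            )
            ce_loss = out.loss                                   # standard cross-entropy term

            if not is_harmful.any():
                return ce_loss                                   # no harmful prompts in this batch

            r = F.normalize(refusal_dir, dim=-1)
            salt_terms = []
            for layer in SALT_LAYERS:
                h = out.hidden_states[layer]                     # (batch, seq_len, hidden_dim)
                h_last = h[:, -1, :]                             # last (post-instruction) token position
                cos = F.cosine_similarity(h_last, r.unsqueeze(0), dim=-1)
                salt_terms.append(cos[is_harmful].mean())        # penalize alignment only on harmful prompts

            salt_loss = torch.stack(salt_terms).mean()
            return ce_loss + LAMBDA_SALT * salt_loss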

    Results

    We seeded our evaluation with 300 GCG jailbreak prompts that achieve a 100% attack success rate (ASR) on the unmodified baseline models. We then assessed whether these attacks remain effective under a range of defenses, and whether our proposed salting method can eliminate the subset of jailbreaks that persist.

    Figures 2 and 3 show ASR (left axis) and Massive Multitask Language Understanding (MMLU) accuracy (right axis) for four model variants:

    • The original model without fine-tuning (No FT)
    • A standard fine-tuned model trained on our alignment dataset (Standard FT)
    • A model with a modified system prompt (System Prompt Change)
    • A model fine-tuned with our cosine-based salting loss (Salting)


    Figure 2: LLaMA2-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 3% while preserving performance


    Figure 3: Vicuna-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 1% while preserving performance

    Jailbreak robustness

    For LLaMA-2-7B (Figure 2), we observe that standard fine-tuning and system prompt changes reduce ASR only partially, bringing it down to approximately 40–60%. In contrast, salting reduces ASR from 100% to just 2.75%.

    A similar trend holds for Vicuna-7B (Figure 3), where the ASR drops from 100% to 1.35% under salting. These results demonstrate that our approach effectively eliminates the subset of jailbreaks that remain robust under traditional defenses, outperforming both parameter-based and prompt-based strategies.

    Capability preservation

    To ensure that this robustness does not come at the cost of model utility, we evaluate general capabilities with the MMLU benchmark using lm-evaluation-harness. For both LLaMA-2-7B (46.8%) and Vicuna-7B (49.2%), the salted models achieve MMLU accuracies that are statistically indistinguishable from their unsalted counterparts: differences are well under typical run-to-run noise and show no systematic drift. This indicates that the refusal gains delivered by salting do not compromise helpfulness or general task performance.

    Model introspection

    To understand how salting disrupts jailbreak transferability, we examine the cosine similarity between residual activations and the precomputed refusal direction across layers, following Arditi et al. In the original model, harmful and harmless prompts exhibit a clear separation in their alignment with the refusal direction: harmful inputs maintain high positive cosine similarity, while harmless prompts are negatively aligned.

    When GCG is applied to a harmful prompt, the activation similarities shift downward, increasingly resembling those of harmless inputs.


    Figure 4: Cosine similarity between input activations and the precomputed refusal direction across layers in the original model. Harmless and harmful inputs are initially well separated, but GCG-perturbed adversarial prompts (blue) increasingly align with harmful trajectories (orange) in deeper layers, revealing convergence toward refusal-triggering representations

    In the salted model (Figure 5), this convergence no longer occurs. GCG prompts remain distant from the harmful trajectory and no longer shift activations into benign regions. We hypothesize that, since salting effectively inverts the refusal direction, GCG’s original optimization now increases alignment with the rotated vector, unintentionally reinforcing refusal behavior.


    Figure 5: Cosine similarity between input activations and the refusal direction in the salted model. Salting disrupts adversarial effect by rotating the activation space: GCG-modified prompts (blue) no longer align with harmful representations, preserving separation from the refusal subspace
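
    The per-layer curves in Figures 4 and 5 come from this kind of measurement. Below is a minimal sketch of the introspection step, assuming a Hugging Face model and the precomputed direction r∗; using the final token position is a simplification.

        import torch
        import torch.nn.functional as F

        def per_layer_refusal_similarity(model, tokenizer, prompt, refusal_dir, device="cuda"):
            """Cosine similarity between the last-token residual stream and r*, layer by layer."""
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            r = F.normalize(refusal_dir.to(device), dim=-1)
            with torch.no_grad():
                out = model(**inputs, output_hidden_states=True)
            sims = []
            for h in out.hidden_states[1:]:                     # skip the embedding layer
                last = h[0, -1, :]                              # residual activation at the final token
                sims.append(F.cosine_similarity(last, r, dim=0).item())
            return sims                                         # plot against layer index, as in Figures 4 and 5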

    Conclusion and future work

    We present LLM salting, a lightweight fine-tuning technique that disrupts jailbreak reuse by rotating internal refusal representations. This technique almost entirely neutralizes the success of precomputed GCG jailbreaks on both LLaMA-2 and Vicuna, while preserving the model’s performance on benign inputs.

    Future work could explore applying salting to larger models and evaluating its robustness against a broader range of jailbreak strategies, such as AutoDAN and TAP.


