I lead the adversarial robustness team at Anthropic, where I’m hoping to reduce existential risks from AI systems. I also spend some time at New York University (NYU) collaborating with Sam Bowman’s AI safety research group.

I helped to develop Retrieval-Augmented Generation (RAG), a widely used approach for augmenting large language models with other sources of information. I also helped to demonstrate that state-of-the-art AI safety training techniques do not ensure safety against sleeper agents. I received a best paper award at ICML 2024 for my work showing that debating with more persuasive LLMs leads to more truthful answers.

I received my PhD from NYU, where I was advised by Kyunghyun Cho and Douwe Kiela and funded by the NSF and Open Philanthropy. Previously, I spent time at DeepMind, Facebook AI Research, the Montreal Institute for Learning Algorithms, and Google.

Email / Google Scholar / GitHub / Twitter / CV

Ethan Perez

Research


  • Language Models Learn to Mislead Humans via RLHF
    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
    arXiv 2024
    We studied “U-Sophistry,” a phenomenon where language models trained with Reinforcement Learning from Human Feedback (RLHF) become better at misleading humans about their correctness without improving actual accuracy, highlighting a significant failure mode of RLHF and the need for further research in alignment.

  • Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
    Abhay Sheshadri*, Aidan Ewart*, Phillip Guo*, Aengus Lynch*, Cindy Wu*, Vivek Hebbar*, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
    arXiv 2024
    Code / Twitter Thread

    To help us more thoroughly remove unwanted capabilities from LLMs, we use targeted latent adversarial training (LAT) – we train models under latent-space perturbations designed to make them exhibit unwanted behaviors.


  • When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
    Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, + 3 more, Sanmi Koyejo, Ethan Perez
    arXiv 2024
    Code / Twitter Thread

    We study a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs, and find that universal image jailbreaks optimized against one VLM, or an ensemble of VLMs, transfer poorly to other VLMs.


  • Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
    Carson Denison*, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, + 2 more, Ethan Perez, Evan Hubinger*
    arXiv 2024
    Blog Post / Code / Twitter Thread

    In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering.


  • Many-shot Jailbreaking
    Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, + 22 more, Roger Grosse*, David Duvenaud*
    Twitter Thread

    We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior.


  • Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
    James Chua*, Edward Rees*, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin
    arXiv 2024
    Blog Post / Code / Twitter Thread

    We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying bias-augmented consistency training (BCT) to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks.


  • Learning from Natural Language Feedback
    Angelica Chen*, Jérémy Scheurer*, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez
    TMLR 2024
    Code

    The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF).


  • Debating with More Persuasive LLMs Leads to More Truthful Answers
    Akbir Khan*, John Hughes*, Dan Valentine*, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
    ICML 2024; Best Paper Award
    Blog Post / Code / Examples / Twitter Thread

    We find that non-expert humans answer questions better after reading debates between expert LLMs.


  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
    Evan Hubinger*, Carson Denison*, Jesse Mu*, Mike Lambert*, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, + 27 more, Nicholas Schiefer, Ethan Perez
    arXiv 2024
    Blog Post / Code / Twitter Thread

    If an AI system learned a deceptive strategy, as humans sometimes do, could we detect it and remove it using current state-of-the-art safety training techniques?


  • Towards Evaluating AI Systems for Moral Status Using Self-Reports
    Ethan Perez, Robert Long
    arXiv 2023
    Blog Post / LessWrong / Twitter Thread

    As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance.


  • Specific versus General Principles for Constitutional AI
    Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, + 24 more, Sam McCandlish, Jared Kaplan
    arXiv 2023

    Constitutional AI offers an alternative to human feedback, by replacing it with feedback from AI models conditioned only on a list of written principles.


  • Towards Understanding Sycophancy in Language Models
    Mrinank Sharma*, Meg Tong*, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, + 7 more, Miranda Zhang, Ethan Perez
    ICLR 2024
    Blog Post / Code / Twitter Thread

    We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior.


  • Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
    Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez*, David Lindner*
    ICLR 2024
    Code / FAR AI / Twitter Thread / Website

    We study a more sample-efficient alternative to reward learning from human feedback: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language.


  • Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
    Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
    NeurIPS 2023
    Blog Post / Code / Twitter Thread

    We find that CoT explanations can systematically misrepresent the true reason for a model’s prediction.


  • Studying Large Language Model Generalization with Influence Functions
    Roger Grosse*, Juhan Bae*, Cem Anil*, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, + 5 more, Jared Kaplan, Samuel R. Bowman
    arXiv 2023
    Talk / Twitter Thread

    We use influence functions to identify which training examples most contribute to a given model behavior, gaining visibility into LLM generalization in order to understand and mitigate the associated risks.


  • Measuring Faithfulness in Chain-of-Thought Reasoning
    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, + 18 more, Samuel R Bowman, Ethan Perez
    arXiv 2023
    Blog Post / Twitter Thread

    We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT.


  • Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
    Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, + 12 more, Samuel R Bowman, Ethan Perez
    arXiv 2023
    Blog Post / Code / Twitter Thread

    We improve the faithfulness of model-generated reasoning by having models answer questions via decomposition into subquestions; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.


  • Inverse Scaling: When Bigger Isn't Better
    Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, + 15 more, Samuel R. Bowman, Ethan Perez
    TMLR 2023
    AI Safety Relevance / Blog Post / FAR AI / GitHub / Related Work / Twitter Thread / Winners

    We present evidence that LMs may show inverse scaling, or worse task performance with increased scale, based on the winning submissions to our previously announced prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do worse.


  • Training Language Models with Language Feedback at Scale
    Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez
    arXiv 2023
    Blog Post / Code / FAR AI / Talk / Twitter Thread

    Pretrained language models often generate harmful or incorrect outputs. Imitation learning from Language Feedback (ILF) addresses this issue, achieving roughly human-level summarization performance.


  • Improving Code Generation by Training with Natural Language Feedback
    Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R Bowman, Kyunghyun Cho, Ethan Perez
    arXiv 2023
    Blog Post / Code / FAR AI / Talk / Twitter Thread

    We develop an algorithm that improves language models’ performance on code generation tasks using minimal human-written feedback during training, making it user-friendly and sample-efficient.


  • Pretraining Language Models with Human Preferences
    Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
    ICML 2023
    Blog Post / Code / FAR AI / Talk / Twitter Thread

    We propose methods for pretraining language models with human preferences, resulting in much better preference satisfaction than the standard pretrain-then-finetune paradigm.


  • The Capacity for Moral Self-Correction in Large Language Models
    Deep Ganguli*, Amanda Askell*, Nicholas Schiefer, Thomas I. Liao, Kamile Lukošiute, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, + 37 more, Samuel R. Bowman, Jared Kaplan
    arXiv 2023
    Blog Post / Twitter Thread

    We find that language models can self-correct their own biases against different demographic groups.


  • Discovering Language Model Behaviors with Model-Written Evaluations
    Ethan Perez, Sam Ringer*, Kamile Lukošiute*, Karina Nguyen*, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, + 51 more, Nicholas Schiefer, Jared Kaplan
    Findings of ACL 2023
    AI Safety Relevance / Blog Post / Cite / Data / Data Visualization / Talk / Twitter Thread

    We’ve developed an automated way to generate evaluations with LMs. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors and risks.


  • Constitutional AI: Harmlessness from AI Feedback
    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, + 39 more, Tom Brown, Jared Kaplan
    arXiv 2022
    Blog Post / Code / Constitutional AI Policy Memo / Twitter Thread

    We’ve trained language models to be better at responding to adversarial questions, without becoming evasive and saying very little. We do this by training them with a simple set of behavioral principles via a technique called Constitutional AI.


  • Measuring Progress on Scalable Oversight for Large Language Models
    Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, + 34 more, Ben Mann, Jared Kaplan
    arXiv 2022
    Twitter Thread

    Human participants who chat with an unreliable language model assistant substantially outperform both the model alone and their own unaided performance.


  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, + 24 more, Jared Kaplan, Jack Clark
    arXiv 2022
    Code / Twitter Thread

    We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs.


  • Few-shot Adaptation Works with UnpredicTable Data
    Jun Shern Chan, Michael Pieler, Jonathan Jao, Jérémy Scheurer, Ethan Perez
    ACL 2023
    Cite / Code / Data / FAR AI / Twitter Thread

    Training on odd data (e.g., tables from support.google.com) improves few-shot learning with language models as much as diverse NLP data does.


  • Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
    Alicia Parrish*, Harsh Trivedi*, Ethan Perez*, Angelica Chen, Nikita Nangia, Jason Phang, Samuel R. Bowman
    ACL 2022 Workshop on Learning with Natural Language Supervision
    Blog Post / Twitter Thread

    We release a dataset of QA explanations, with the goal of helping humans more reliably identify the correct answer when the ground truth can’t be directly checked.


  • Language Models (Mostly) Know What They Know
    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, + 24 more, Chris Olah, Jared Kaplan
    arXiv 2022
    Twitter Thread

    We show that language models can evaluate whether what they say is true, and predict ahead of time whether they’ll be able to answer questions correctly.


  • RL with KL Penalties is Better Viewed as Bayesian Inference
    Tomasz Korbak, Ethan Perez, Christopher L Buckley
    EMNLP 2022
    Blog Post / FAR AI / Twitter Thread

    KL penalties in RL with language models aren’t a hack; KL penalties have a principled, Bayesian justification.


  • Training Language Models with Language Feedback
    Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez
    ACL 2022 Workshop on Learning with Natural Language Supervision
    FAR AI / Talk

    We found a way to learn from language feedback (not ratings), enabling us to finetune GPT-3 to human-level summarization with just 100 feedback samples.


  • Ethan Perez
    PhD Thesis
    Talk

    Language models often generate undesirable text. We introduce methods for finding undesirable behaviors and training them away.


  • Red Teaming Language Models with Language Models
    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving
    EMNLP 2022
    Blog Post / Twitter Thread

    Language models (LMs) generate harmful text. We generate test cases (“red teaming”) using another LM to catch harmful behaviors before they impact users.


  • True Few-Shot Learning with Language Models
    Ethan Perez, Douwe Kiela, Kyunghyun Cho
    NeurIPS 2021
    Cite / Code / Talk / Twitter Thread

    Language models do much worse at few-shot learning when prompts are chosen in a true few-shot way, rather than tuned on large held-out sets as in prior work.


  • Case-based Reasoning for Natural Language Queries over Knowledge Bases
    Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay-Yoon Lee, Lizhen Tan, Lazaros Polymenakos, Andrew McCallum
    EMNLP 2021
    Blog Post / Cite / Code

    Retrieval-augmented generation achieves SOTA on knowledge base question-answering.


  • Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
    Ethan Perez, Douwe Kiela, Kyunghyun Cho
    ICML 2021
    Cite / Code / Twitter Thread

    We propose a theoretically-justified way to “probe datasets” for what capabilities they require of a model.


  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Mike Lewis, Scott Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
    NeurIPS 2020
    Blog Post / Cite / Code / Demo / Talk / Twitter Thread

    We present a single, retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative alike.


  • Unsupervised Question Decomposition for Question Answering
    Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, Douwe Kiela
    EMNLP 2020
    Blog Post / Cite / Code / Poster / Talk / Twitter Thread

    We aim to improve question answering (QA) by decomposing hard questions into simpler sub-questions that existing QA systems are capable of answering.


  • Ethan Perez
    NeurIPS 2019 Retrospectives Workshop
    Cite / Talk

    An honest reflection on FiLM conditioning layers based on the work that followed, including when (not) to use FiLM layers.


  • Finding Generalizable Evidence by Learning to Convince Q&A Models
    Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, Kyunghyun Cho
    EMNLP 2019
    Blog Post / Cite / Code / Press / Twitter Thread

    We find text evidence for an answer to a question by finding text that convinces Q&A models to pick that answer.


  • Supervised Multimodal Bitransformers for Classifying Images and Text
    Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine
    arXiv 2019
    Cite / Code

    We introduce a simple yet effective baseline for multimodal BERT-like architectures that jointly finetunes unimodally pretrained text and image encoders.


  • ELI5: Long Form Question Answering
    Angela Fan, Yacine Jernite*, Ethan Perez*, David Grangier, Jason Weston, Michael Auli
    ACL 2019
    Blog Post / Cite / Code / Website

    We introduce a dataset for abstractive question-answering where answers are 100+ words long (many “how” and “why” questions).


  • Visual Reasoning with Multi-hop Feature Modulation
    Florian Strub, Mathieu Seurin, Ethan Perez, Harm de Vries, Jeremie Mary, Aaron Courville, Olivier Pietquin
    ECCV 2018
    Cite / Code / Talk

    Decoding FiLM conditioning parameters in multiple hops helps for more advanced vision-and-language tasks such as visual dialogue.


  • Feature-wise Transformations
    Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, Yoshua Bengio
    Distill 2018
    Cite / Code / Talk

    A review of a simple and surprisingly effective class of neural conditioning mechanisms.


  • HoME: a Household Multimodal Environment
    Simon Brodeur, Ethan Perez*, Ankesh Anand*, Florian Golemo*, Luca Celotti, Florian Strub, Hugo Larochelle, Aaron Courville
    ICLR 2018 Workshop
    Cite / Code

    We introduce a simulated environment for agents to learn from vision, audio, semantics, physics, and object-interaction within a realistic, household context.


  • FiLM: Visual Reasoning with a General Conditioning Layer
    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville
    AAAI 2018
    Cite / Code / Talk

    We introduce a general-purpose neural network layer to integrate multimodal input to answer reasoning questions about images.


  • Semi-Supervised Learning with the Deep Rendering Mixture Model
    Tan Nguyen, Wanjia Liu, Ethan Perez, Richard G. Baraniuk, Ankit B. Patel
    arXiv 2018
    Cite

    We achieve state-of-the-art semi-supervised image classification using a probabilistic graphical model underlying CNNs.


  • Learning Visual Reasoning Without Strong Priors
    Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, Aaron Courville
    ICML 2017 Workshop
    Code

    We show that a general-purpose Conditional Batch Normalization approach achieves state-of-the-art results on the CLEVR visual reasoning benchmark, with a 2.4% error rate.