Research Direction

I am a Ph.D. student at the Institute for Artificial Intelligence, Peking University, advised by Prof. Yaodong Yang, who has been both a mentor and a friend. In 2024, I was honored to receive funding from the first batch of the National Natural Science Foundation Youth Student Basic Research Project (Ph.D. students), as the sole recipient from Peking University in the field of intelligence. Before that, I conducted research on safe reinforcement learning and won the championship of the NeurIPS 2022 MyoChallenge on robotic dexterous manipulation.

Jiaming Ji is a Ph.D. student at the Institute for Artificial Intelligence, Peking University, advised by Prof. Yaodong Yang. His research focuses on reinforcement learning and the safety and value alignment of large models. He has published more than ten papers, including oral and spotlight papers, at top computer science conferences and journals, with over 2,200 Google Scholar citations. He is a core contributor to the open-source Baichuan model series, which has accumulated over 5 million downloads, and his open-source projects on GitHub have received more than 20,000 stars. He received funding from the first batch of the National Natural Science Foundation Youth Student Basic Research Project for Ph.D. students (the only recipient in the intelligence discipline at Peking University in 2023), the Apple Scholars in AI/ML fellowship (one of only two recipients nationwide), Peking University's President Scholarship (the highest doctoral research award), and the inaugural CIE-Tencent Doctoral Research Incentive Program (17 recipients nationwide). He won first place in the NeurIPS 2022 robotic dexterous manipulation competition (MyoChallenge). His research and models have been cited by OpenAI and Meta and covered by MIT Technology Review.


Research

Currently, I focus on AI Safety and Alignment.

  • AI Alignment: Given the biases and discrimination that may exist in pre-training data, large models (LMs) may exhibit unintended behaviors. I am interested in alignment methods (e.g., Reinforcement Learning from Human Feedback, RLHF) and post-hoc alignment methods that ensure the safety and trustworthiness of LLMs.
  • Theoretical Explanations and Mechanism Design for Alignment: Aligning AI systems (e.g., LLMs) so that they remain consistent with human intentions and values (though some views question whether universal values exist) is a significant current challenge. I am particularly interested in establishing the feasibility of alignment methods through both theoretical analysis and practical mechanism design.
  • Applications (LM + X): I am interested in the application of large models in various domains, such as healthcare and education, and in the potential impact of the rapid industry development and iteration driven by large models.
News

    • 2024-09 Aligner has been accepted as an Oral presentation at NeurIPS 2024!
    • 2024-09 ProgressGym has been accepted as a Spotlight at the NeurIPS 2024 Datasets and Benchmarks Track, and SafeSora has been accepted as a Poster.
    • 2024-09 Our framework OmniSafe has been accepted by JMLR 2024; it is the most popular open-source safe reinforcement learning framework (see the usage sketch after this list).
    • 2024-06 We released the PKU-SafeRLHF dataset, the second version of BeaverTails (800K+ total downloads).
    • 2024-05 We released Language Models Resist Alignment (exploring Hooke's Law in large models: a theoretical analysis of the fragility of alignment).
    • 2024-01 Two papers accepted to ICLR 2024: Safe RLHF (Spotlight) and SafeDreamer.
    • 2023-10 Big News! We released AI Alignment: A Comprehensive Survey.
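
    As a rough illustration of how OmniSafe is typically used, here is a minimal sketch based on its public quickstart interface; the algorithm and environment IDs are illustrative examples and may differ across versions.

    ```python
    # Minimal OmniSafe usage sketch (assumes `pip install omnisafe`).
    # 'PPOLag' and 'SafetyPointGoal1-v0' are example identifiers; check the
    # OmniSafe documentation for the algorithms and tasks in your version.
    import omnisafe

    env_id = 'SafetyPointGoal1-v0'            # a Safety-Gymnasium task
    agent = omnisafe.Agent('PPOLag', env_id)  # PPO with a Lagrangian safety constraint
    agent.learn()                             # train the policy under the cost constraint
    ```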

Awards

    • 2025-05 Peking University President Scholarship, the highest doctoral research honor at Peking University.
    • 2025-03 Apple Scholars in AI/ML; one of only two recipients nationwide.
    • 2024-12 Inaugural CIE-Tencent Doctoral Research Incentive Program; 17 recipients nationwide, with a research fund of 100,000 RMB.
    • 2024-05 National Natural Science Foundation Youth Student Basic Research Project for Ph.D. students (first batch); the sole recipient in the field of intelligence at Peking University.

Publications

    SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
    Borong Zhang*, Yuhao Zhang*, Jiaming Ji*, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
    arXiv, 2025.
    Safety Alignment / Robotics

    SAE-V: Interpreting Multimodal Models for Enhanced Alignment
    Hantao Lou*, Changye Li*, Jiaming Ji, Yaodong Yang
    arXiv, 2025.
    AI Alignment

    Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
    Jiayi Zhou*, Jiaming Ji*, Juntao Dai, Yaodong Yang
    AAAI 2025 (Oral).
    AI Alignment

    Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
    Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang
    arXiv, 2025.
    AI Alignment

    OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research
    Jiaming Ji*, Jiayi Zhou*, Borong Zhang*, Juntao Dai, Xuehai Pan, Ruiyang Sun, Weidong Huang, Yiran Geng, Mickel Liu, Yaodong Yang
    JMLR 2024 (the journal accepts roughly 15-20 open-source AI systems papers per year).
    Safe Reinforcement Learning / Robotics

    Aligner: Efficient Alignment by Learning to Correct
    Jiaming Ji*, Boyuan Chen*, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang+
    NeurIPS 2024 (Oral).
    AI Alignment / AI Safety

    Language Models Resist Alignment: Evidence From Data Compression
    Jiaming Ji*, Kaile Wang*, Tianyi Qiu*, Boyuan Chen*, Jiayi Zhou, Changye Li, Hantao Lou, Yaodong Yang
    arXiv, 2024.
    Large Language Models / Safety Alignment / AI Safety

    ProgressGym: Alignment with a Millennium of Moral Progress
    Tianyi Qiu*, Yang Zhang*, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
    NeurIPS 2024 Datasets and Benchmarks Track (Spotlight).
    Large Language Models / AI Alignment

    PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
    Jiaming Ji*, Donghai Hong*, Borong Zhang*, Boyuan Chen*, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang
    arXiv, 2024.
    Large Language Models / Safety Alignment / Reinforcement Learning from Human Feedback

    Safe RLHF: Safe Reinforcement Learning from Human Feedback
    Josef Dai*, Xuehai Pan*, Ruiyang Sun*, Jiaming Ji*, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
    ICLR 2024 (Spotlight).
    Safety Alignment / Reinforcement Learning from Human Feedback

    SafeDreamer: Safe Reinforcement Learning with World Models
    Weidong Huang*, Jiaming Ji*, Borong Zhang, Chunhe Xia, Yaodong Yang
    ICLR 2024.
    Reinforcement Learning / Robotics

    AI Alignment: A Comprehensive Survey
    Jiaming Ji*, Tianyi Qiu*, Boyuan Chen*, Borong Zhang*, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O'Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao
    arXiv, 2024.
    AI Alignment / Safety Alignment

    Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
    Jiaming Ji*, Borong Zhang*, Jiayi Zhou*, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Juntao Dai, Yaodong Yang
    NeurIPS 2023.
    Safe Reinforcement Learning / Robotics

    BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
    Jiaming Ji*, Mickel Liu*, Juntao Dai*, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
    NeurIPS 2023.
    Large Language Models / Safety Alignment / Reinforcement Learning from Human Feedback

    Baichuan 2: Open Large-scale Language Models
    Jiaming Ji and other authors (alphabetical order)
    arXiv, 2023.
    Large Language Models

    MyoChallenge 2022: Learning contact-rich manipulation using a musculoskeletal hand
    Vittorio Caggiano, Guillaume Durandau, Huawei Wang, Alberto Chiappa, Alexander Mathis, Pablo Tano, Nisheet Patel, Alexandre Pouget, Pierre Schumacher, Georg Martius, Daniel Haeufle, Yiran Geng, Boshi An, Yifan Zhong, Jiaming Ji, Yuanpei Chen, Hao Dong, Yaodong Yang, Rahul Siripurapu, Luis Eduardo Ferro Diez, Michael Kopp, Vihang Patil, Sepp Hochreiter, Yuval Tassa, Josh Merel, Randy Schultheis, Seungmoon Song, Massimo Sartori, Vikash Kumar
    NeurIPS 2022 Competition Track. First place (1st among 340 submissions from 40 teams).
    Robotics

    Augmented Proximal Policy Optimization for Safe Reinforcement Learning
    Juntao Dai*, Jiaming Ji*, Long Yang, Qian Zheng, Gang Pan
    AAAI 2023.
    Safe Reinforcement Learning

    Constrained Update Projection Approach to Safe Policy Optimization
    Long Yang*, Jiaming Ji*, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, Gang Pan
    NeurIPS 2022.
    Safe Reinforcement Learning