Homepage - Xiangxin Zhou

Xiangxin Zhou

I am a fourth-year Ph.D. student at the University of Chinese Academy of Sciences (UCAS) and the Institute of Automation, Chinese Academy of Sciences (CASIA), advised by Liang Wang.

Previously, I received my B.Eng. degree from Tsinghua University in 2021.

My research focuses on reinforcement learning for enhancing large language models (LLMs), improving their reasoning abilities and making their responses more accurate, reliable, trustworthy, and interpretable. I also study long-term memory for LLMs, aiming to enable LLMs to personalize their behavior through continual interactions with users.

In addition, I work on AI for Drug Discovery (AIDD), developing advanced generative models and algorithms for designing small molecules and proteins.

[last_name][first_name]1998[AT]gmail.com Google Scholar GitHub

Education

Institute of Automation, Chinese Academy of Sciences
University of Chinese Academy of Sciences
School of Artificial Intelligence
Ph.D. Student
Sep. 2021 - Present

Tsinghua University
B.Eng. in Electronic Engineering
Sep. 2016 - Jul. 2021

Experience

Xiaohongshu Hi Lab
RedStar Intern
Aug. 2025 - Present

Sea AI Lab
Associate Member
Jul. 2025 - Aug. 2025

ByteDance Seed
Research Intern
May 2025 - Jul. 2025

ByteDance AI Lab
Research Intern
May 2023 - May 2025

ByteDance AML
Research Intern
Sep. 2022 - May 2023

Selected Publications (view all)

Defeating the Training-Inference Mismatch via FP16

Penghui Qi*, Zichen Liu*, Xiangxin Zhou*, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin (* equal contribution)

Preprint, 2025

We demonstrate that simply reverting to FP16 effectively eliminates the numerical mismatch between the training and inference policies in RL for LLMs. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks.

GEM: A Gym for Agentic LLMs

Variational Reasoning for Language Models

Reinforcing General Reasoning Without Verifiers

OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use

Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling

Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows

Reprogramming Pretrained Target-Specific Diffusion Models for Dual-Target Drug Design

Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization

Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process

Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization

DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design
