Learning Personalized Agents from Human Feedback

Continual Personalization · Agentic Memory
¹Meta Superintelligence Labs, ²Princeton University, ³Duke University

Abstract

Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding explicit user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time.

We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) pre-action questions to resolve ambiguity under partial observability, (2) action execution guided by stored preferences, and (3) post-action feedback to update its memory when user preferences shift.
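
To make the loop concrete, the sketch below walks through one PAHF interaction step in Python. Every name here (memory, agent, user, and their methods) is an illustrative assumption, not the released implementation.

# One PAHF interaction step (hypothetical API): clarify before acting,
# act from retrieved preferences, then learn from post-action feedback.
def pahf_step(task, memory, agent, user):
    # (1) Pre-action question: resolve ambiguity before committing.
    prefs = memory.retrieve(user.id, task)
    if agent.is_ambiguous(task, prefs):
        answer = user.answer(agent.clarifying_question(task, prefs))
        memory.write(user.id, task.context_key, answer)
        prefs = memory.retrieve(user.id, task)

    # (2) Action execution guided by stored preferences.
    outcome = agent.act(task, prefs)

    # (3) Post-action correction: feedback refreshes stale entries.
    feedback = user.feedback(outcome)
    if feedback is not None:
        memory.write(user.id, task.context_key, feedback)
    return outcome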

We evaluate PAHF using a four-phase protocol across embodied manipulation and online shopping, measuring both initial learning and adaptation to preference drift. Theoretical and empirical results confirm that combining explicit memory with dual feedback is critical: PAHF consistently outperforms baselines, minimizing initial errors and recovering rapidly when preferences shift.

Figure: Day 1/2/3 preference learning: learn initial preferences (Day 1), learn context-dependent preferences (Day 2), adapt to preference shifts (Day 3).

Personalized Agents from Human Feedback (PAHF)

Explicit per-user memory with two feedback channels: pre-action clarification and post-action correction.

  • Pre-action interaction: ask a clarifying question when the stored preference leaves the task ambiguous.
  • Action execution: act using preferences retrieved from memory.
  • Post-action correction: use feedback to correct and refresh stale memory.

Our framework enables continual personalization by using online user feedback to read from and write to memory, so the agent adapts as preferences evolve.

Figure: the PAHF loop.
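
One plausible layout for the explicit per-user memory is a keyed store of natural-language preference entries that both feedback channels write into. The class below is a minimal sketch under that assumption; the names and retrieval rule are ours, not the paper's.

from collections import defaultdict

class PreferenceMemory:
    """Explicit per-user memory: preference strings keyed by user and
    context. A minimal sketch, not the paper's implementation."""

    def __init__(self):
        # user_id -> context key -> latest preference entry
        self._store = defaultdict(dict)

    def retrieve(self, user_id, task):
        # Read: return entries whose context key appears in the task
        # description (a stand-in for the paper's retrieval step).
        entries = self._store[user_id]
        return {k: v for k, v in entries.items() if k in str(task)}

    def write(self, user_id, context_key, preference):
        # Write: insert a new entry, or overwrite a stale one in place
        # so later retrievals reflect the shifted preference.
        self._store[user_id][context_key] = preference

Overwriting in place is what lets post-action corrections repair stale entries after a preference shift.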

Evaluation

  • Embodied manipulation: home/office tasks with contextual preferences.
  • Online shopping: multi-constraint product selection.

Four-phase protocol:

  • Phase 1 (training): Initial Learning.
  • Phase 2 (testing): Initial Personalization.
  • Phase 3 (training): Adaptation to Drift.
  • Phase 4 (testing): Adapted Personalization.
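
Schematically, the protocol alternates training phases (feedback available) with testing phases that measure success. The driver below is a sketch under one assumed convention, that memory writes are frozen during testing; all function and attribute names are illustrative.

# Sketch of the four-phase protocol (all names assumed, including the
# convention that testing freezes memory writes).
PHASES = [
    ("Phase 1", "train", "initial"),
    ("Phase 2", "test",  "initial"),
    ("Phase 3", "train", "shifted"),
    ("Phase 4", "test",  "shifted"),
]

def run_protocol(agent, memory, tasks_for):
    success_rates = {}
    for name, mode, prefs_regime in PHASES:
        memory.frozen = (mode == "test")   # no writes while testing
        outcomes = [agent.run(t, memory) for t in tasks_for(prefs_regime)]
        if mode == "test":
            success_rates[name] = sum(outcomes) / len(outcomes)
    return success_rates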

Results

Figure: Embodied-domain learning curves for initial preference learning (Phase 1) and adaptation to preference shifts (Phase 3).

Test-phase success rates (%):

                    Embodied              Shopping
Method              Phase 2    Phase 4    Phase 2    Phase 4
No memory           32.3±0.4   44.8±0.5   27.8±0.2   27.0±0.4
Pre-action only     54.1±1.1   35.7±1.0   34.4±0.5   56.0±0.7
Post-action only    67.9±1.5   68.3±1.2   38.9±0.5   66.9±0.8
PAHF (pre+post)     70.5±1.7   68.8±1.3   41.3±0.8   70.3±1.1

Takeaways

  • Pre-action feedback reduces initial personalization error.
  • Pre-action feedback alone fails when preferences change.
  • Post-action feedback enables fast adaptation to preference shifts.
  • Post-action feedback alone can cause significant initial error.
  • PAHF combines the strengths of both feedback channels.

BibTeX

@misc{liang2026learningpersonalizedagentshuman,
      title={Learning Personalized Agents from Human Feedback}, 
      author={Kaiqu Liang and Julia Kruk and Shengyi Qian and Xianjun Yang and Shengjie Bi and Yuanshun Yao and Shaoliang Nie and Mingyang Zhang and Lijuan Liu and Jaime Fernández Fisac and Shuyan Zhou and Saghar Hosseini},
      year={2026},
      eprint={2602.16173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.16173}, 
}