Towards understanding sycophancy in language models

M Sharma, M Tong, T Korbak, D Duvenaud… - arXiv preprint arXiv …, 2023 - arxiv.org
Human feedback is commonly used to finetune AI assistants. But human feedback may
also encourage model responses that match user beliefs over truthful ones, a behavior
known as sycophancy. We investigate the prevalence of sycophancy in models whose
finetuning procedure made use of human feedback, and the potential role of human
preference judgments in such behavior. We first demonstrate that five state-of-the-art AI
assistants consistently exhibit sycophancy across four varied free-form text-generation tasks …
