Towards understanding sycophancy in language models

M Sharma, M Tong, T Korbak, D Duvenaud… - arXiv preprint arXiv …, 2023 - arxiv.org
Human feedback is commonly used to finetune AI assistants. But human feedback may
also encourage model responses that match user beliefs over truthful ones, a behavior
known as sycophancy. We investigate the prevalence of sycophancy in models whose
finetuning procedure made use of human feedback, and the potential role of human
preference judgments in such behavior. We first demonstrate that five state-of-the-art AI
assistants consistently exhibit sycophancy across four varied free-form text-generation tasks …
