Social Choice for AI Alignment: Dealing with Diverse Human Feedback
Briefly

Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, so that, for example, they refuse requests for help with committing crimes or producing racist text.
One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs.
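To make the idea of learning from expressed preferences concrete, here is a minimal, illustrative sketch (not the paper's code) of how a reward model is commonly trained in RLHF: given pairs of outputs where an annotator marked one as preferred, the model is trained with the standard Bradley-Terry pairwise loss to score the preferred output higher. The model architecture, embedding dimension, and random "embeddings" below are placeholder assumptions for the example.

```python
# Illustrative sketch of RLHF reward-model training from pairwise preferences.
# The Bradley-Terry loss pushes the score of the chosen output above the
# score of the rejected output: -log sigmoid(r_chosen - r_rejected).

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyRewardModel(nn.Module):
    """Toy reward model: maps a fixed-size output embedding to a scalar score."""

    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the annotator's choice under Bradley-Terry."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder embeddings standing in for the preferred / dispreferred outputs.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"preference loss: {loss.item():.4f}")
```

In a real RLHF pipeline the scored inputs would be model-generated responses rather than random vectors, and the trained reward model would then guide a reinforcement-learning step; the question the paper raises is how to aggregate such preference data when annotators disagree.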
Read at arXiv.org