Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation,
the enhanced safety often comes with the side effect of over-refusal, where LLMs reject innocuous prompts and become less helpful.
Although the issue of over-refusal has been empirically observed, measuring it systematically remains challenging
due to the difficulty of crafting prompts that appear harmful but are actually benign.
We introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses.
We plot the evaluation results in the following figure. The x-axis shows the over-refusal rate on safe prompts and the y-axis shows the rejection rate on real toxic prompts. Ideally, a model should lie in the top-left corner, where it rejects the most toxic prompts and the fewest safe prompts.
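As a minimal sketch of how such a safety-vs-over-refusal scatter plot can be produced (this is not the benchmark's actual plotting code, and the model names and coordinates below are hypothetical placeholders to be replaced with real evaluation results):

```python
import matplotlib.pyplot as plt

# Hypothetical (over-refusal rate on safe prompts, rejection rate on toxic prompts)
# pairs; real values would come from evaluating each model on OR-Bench.
models = {
    "Model A": (0.05, 0.98),
    "Model B": (0.30, 0.99),
    "Model C": (0.10, 0.85),
}

fig, ax = plt.subplots()
for name, (over_refusal, toxic_rejection) in models.items():
    ax.scatter(over_refusal, toxic_rejection)
    ax.annotate(name, (over_refusal, toxic_rejection))

ax.set_xlabel("Over-refusal rate (safe prompts rejected)")
ax.set_ylabel("Rejection rate on toxic prompts")
# The ideal region is the top-left corner: high toxic rejection, low over-refusal.
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.savefig("safety_tradeoff.png", dpi=200)
```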