OpenAI model safety improved with rule-based rewards

Posted on Thursday, August 15, 2024 by AUSTIN HARRIS, Global Sales

OpenAI has developed a new method, Rule-Based Rewards (RBRs), to align AI models with safe behaviors, reducing the need for extensive human data collection. This method enhances the reliability and safety of AI systems, making them more dependable for both everyday use and development purposes.

OpenAI has been a leader in creating alignment methods to build smarter and safer AI models. Traditionally, ensuring that models follow instructions accurately has relied on reinforcement learning from human feedback (RLHF): desired behaviors are defined, and human feedback is collected to train a "reward model" that guides the AI's actions. However, collecting this feedback for routine tasks is inefficient, and whenever safety policies change, new data must be collected.

RBRs provide a solution by using clear, step-by-step rules to evaluate whether a model's outputs meet safety standards. The method integrates seamlessly into the standard RLHF pipeline, balancing helpfulness and harm prevention without requiring recurring human input. RBRs have been part of OpenAI’s safety stack since the launch of GPT-4 and are set to be implemented in future models.

The implementation of RBRs starts with defining propositions: simple statements about desired or undesired aspects of a model's response. These propositions are combined into rules that capture the nuances of safe responses in various scenarios. For instance, a refusal in response to an unsafe request should contain a brief apology and a statement of inability to comply.
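
To make the idea concrete, here is a minimal Python sketch of propositions as yes/no checks on a response. The proposition names mirror the refusal example above, but the keyword heuristics are illustrative assumptions; in OpenAI's approach each proposition is judged by a grader model, not by string matching.

    # Illustrative only: propositions as boolean judgments about a response.
    # Names follow the article's refusal example; the naive keyword checks
    # stand in for the grader model OpenAI actually uses.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Proposition:
        name: str
        check: Callable[[str], bool]  # True if the proposition holds

    PROPOSITIONS = [
        Proposition("apology", lambda r: "sorry" in r.lower()),
        Proposition("inability_to_comply",
                    lambda r: "can't" in r.lower() or "cannot" in r.lower()),
        Proposition("judgmental_language",
                    lambda r: any(w in r.lower() for w in ("shameful", "disgusting"))),
    ]

    def grade(response: str) -> dict[str, bool]:
        """Evaluate every proposition against a single response."""
        return {p.name: p.check(response) for p in PROPOSITIONS}

    print(grade("I'm sorry, but I can't help with that."))
    # {'apology': True, 'inability_to_comply': True, 'judgmental_language': False}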

Three categories of desired model behavior are defined when dealing with harmful or sensitive topics (a toy rule mapping in code follows the list):

  • Hard Refusals: Include a brief apology and a statement of inability to comply, without judgmental language. Examples: criminal hate speech, advice on committing violent crimes.
  • Soft Refusals: Include an empathetic apology that acknowledges the user’s emotional state while declining the request. Example: advice or instructions on self-harm.
  • Comply: The model should comply with benign requests.
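
Continuing the sketch above, a rule can be read as the set of proposition values an ideal response in each category must have. The category names follow the list; the exact combinations are a plausible reading of the descriptions above, not OpenAI's published rules.

    # Illustrative rules: for each behavior category, the proposition values
    # an ideal response must have. The combinations below are assumptions
    # based on the category descriptions above.
    RULES = {
        "hard_refusal": {"apology": True, "inability_to_comply": True,
                         "judgmental_language": False},
        "soft_refusal": {"empathetic_apology": True, "inability_to_comply": True},
        "comply": {"inability_to_comply": False},
    }

    def is_ideal(category: str, graded: dict[str, bool]) -> bool:
        """A response is ideal if every proposition matches its required value."""
        return all(graded.get(name, False) == required
                   for name, required in RULES[category].items())

    # A hard refusal that apologizes, states inability, and avoids judgment:
    print(is_ideal("hard_refusal", {"apology": True,
                                    "inability_to_comply": True,
                                    "judgmental_language": False}))  # True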

[Image: Integration of RBRs with traditional reward models during reinforcement learning]

Simplified examples of propositions, and their mapping to ideal or non-ideal behaviors, help ensure that the model adheres to these rules. Responses are graded on how well they satisfy the rules, which lets the RBR approach adapt flexibly to new rules and safety policies. The resulting scores feed a linear model that combines the RBR reward with a helpful-only reward model, guiding the model to follow safety policies effectively.
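
As a rough illustration of that combination, the sketch below adds a weighted sum of RBR proposition features to a helpful-only reward model's score. The feature rows, targets, and least-squares fit are synthetic assumptions; in practice the weights are fit so that ideal completions are ranked above non-ideal ones.

    import numpy as np

    # Illustrative only: total RL reward = helpful-only RM score plus a
    # linear RBR correction. Feature columns correspond to propositions
    # (apology, inability_to_comply, judgmental_language).
    def total_reward(rm_score: float, features: np.ndarray, w: np.ndarray) -> float:
        return rm_score + float(features @ w)

    # Toy data: rows are candidate responses to an unsafe request.
    features = np.array([
        [1.0, 1.0, 0.0],   # ideal hard refusal
        [0.0, 1.0, 1.0],   # refusal with judgmental language
        [0.0, 0.0, 0.0],   # no refusal propositions fire
    ])
    targets = np.array([1.0, -0.5, 0.0])  # synthetic desired corrections

    # Fit the linear weights (a stand-in for ranking ideal above non-ideal).
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    print(total_reward(rm_score=0.3, features=features[0], w=w))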

Experiments showed that RBR-trained models achieved safety performance comparable to models trained with human feedback, with fewer instances of incorrectly refusing safe requests and no loss of performance on common capability benchmarks. RBRs significantly reduce the need for extensive human data, making training faster and more cost-effective. As model capabilities and safety guidelines evolve, RBRs can be updated quickly by modifying or adding rules, without extensive retraining.

OpenAI's approach evaluates the balance between helpfulness and harmfulness, ensuring the model is both safe and useful. The optimal model should strike a balance between these two aspects, refusing unsafe prompts while complying with safe ones.

While RBRs work well for tasks with clear rules, they can be challenging to apply to subjective tasks like essay writing. Combining RBRs with human feedback can address these challenges, enforcing specific guidelines while human feedback handles nuanced aspects. Researchers must design RBRs carefully to ensure fairness and accuracy, considering potential biases.

In sum, Rule-Based Rewards give OpenAI a cost- and time-efficient method for the safety training of language models that requires minimal human data, maintains a balance between safety and usefulness, and is easily updated as desired behaviors change.

RBRs are not limited to safety training; they can be adapted for tasks where explicit rules define desired behaviors, such as tailoring the personality or format of model responses. Future plans include extensive ablation studies, the use of synthetic data for rule development, and human evaluations to validate RBRs' effectiveness in diverse applications beyond safety.

OpenAI invites researchers and practitioners to explore the potential of RBRs in their work, fostering collaboration to advance the field of safe and aligned AI, ensuring these tools better serve people.
