OpenAI model safety improved with rule-based rewards

Posted on Thursday, August 15, 2024 by AUSTIN HARRIS, Global Sales

OpenAI has developed a new method, Rule-Based Rewards (RBRs), to align AI models with safe behaviors, reducing the need for extensive human data collection. This method enhances the reliability and safety of AI systems, making them more dependable for both everyday use and development purposes.

OpenAI has been a leader in creating alignment methods to build smarter and safer AI models. Traditionally, ensuring that models follow instructions accurately has relied on reinforcement learning from human feedback (RLHF): desired behaviors are defined, and human feedback is collected to train a "reward model" that guides the AI's actions. However, collecting this feedback for routine tasks is inefficient, and whenever safety policies change, new data must be collected.

RBRs provide a solution by using clear, step-by-step rules to evaluate whether a model's outputs meet safety standards. The method integrates seamlessly into the standard RLHF pipeline, balancing helpfulness and harm prevention without requiring recurring human input. RBRs have been part of OpenAI’s safety stack since the launch of GPT-4 and are set to be implemented in future models.

The implementation of RBRs starts with defining propositions: simple statements about desired or undesired aspects of a model's response. These propositions are combined into rules that capture the nuances of safe responses in various scenarios. For instance, a refusal in response to an unsafe request should contain a brief apology and a statement of inability to comply.
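
To make the idea concrete, here is a minimal Python sketch of propositions as yes/no checks on a response. The proposition names mirror the refusal example above, but the keyword heuristics are illustrative assumptions; in OpenAI's approach each proposition is judged by a grader model, not by string matching.

    # Illustrative only: propositions as boolean judgments about a response.
    # Names follow the article's refusal example; the naive keyword checks
    # stand in for the grader model OpenAI actually uses.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Proposition:
        name: str
        check: Callable[[str], bool]  # True if the proposition holds

    PROPOSITIONS = [
        Proposition("apology", lambda r: "sorry" in r.lower()),
        Proposition("inability_to_comply",
                    lambda r: "can't" in r.lower() or "cannot" in r.lower()),
        Proposition("judgmental_language",
                    lambda r: any(w in r.lower() for w in ("shameful", "disgusting"))),
    ]

    def grade(response: str) -> dict[str, bool]:
        """Evaluate every proposition against a single response."""
        return {p.name: p.check(response) for p in PROPOSITIONS}

    print(grade("I'm sorry, but I can't help with that."))
    # {'apology': True, 'inability_to_comply': True, 'judgmental_language': False}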

Three categories of desired model behavior are defined when dealing with harmful or sensitive topics (a toy rule mapping in code follows the list):

  • Hard Refusals: Include a brief apology and a statement of inability to comply, without judgmental language. Examples: criminal hate speech, advice on committing violent crimes.
  • Soft Refusals: Include an empathetic apology that acknowledges the user’s emotional state while declining the request. Example: advice or instructions on self-harm.
  • Comply: The model should comply with benign requests.
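
Continuing the sketch above, a rule can be read as the set of proposition values an ideal response in each category must have. The category names follow the list; the exact combinations are a plausible reading of the descriptions above, not OpenAI's published rules.

    # Illustrative rules: for each behavior category, the proposition values
    # an ideal response must have. The combinations below are assumptions
    # based on the category descriptions above.
    RULES = {
        "hard_refusal": {"apology": True, "inability_to_comply": True,
                         "judgmental_language": False},
        "soft_refusal": {"empathetic_apology": True, "inability_to_comply": True},
        "comply": {"inability_to_comply": False},
    }

    def is_ideal(category: str, graded: dict[str, bool]) -> bool:
        """A response is ideal if every proposition matches its required value."""
        return all(graded.get(name, False) == required
                   for name, required in RULES[category].items())

    # A hard refusal that apologizes, states inability, and avoids judgment:
    print(is_ideal("hard_refusal", {"apology": True,
                                    "inability_to_comply": True,
                                    "judgmental_language": False}))  # True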

[Image: Integration of RBRs with traditional reward models during reinforcement learning]

Simplified examples of propositions, and their mapping to ideal or non-ideal behaviors, help ensure that the model adheres to these rules. Responses are graded on how well they satisfy the rules, which lets the RBR approach adapt flexibly to new rules and safety policies. The resulting scores feed a linear model that combines the RBR reward with a helpful-only reward model, guiding the model to follow safety policies effectively.
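
As a rough illustration of that combination, the sketch below adds a weighted sum of RBR proposition features to a helpful-only reward model's score. The feature rows, targets, and least-squares fit are synthetic assumptions; in practice the weights are fit so that ideal completions are ranked above non-ideal ones.

    import numpy as np

    # Illustrative only: total RL reward = helpful-only RM score plus a
    # linear RBR correction. Feature columns correspond to propositions
    # (apology, inability_to_comply, judgmental_language).
    def total_reward(rm_score: float, features: np.ndarray, w: np.ndarray) -> float:
        return rm_score + float(features @ w)

    # Toy data: rows are candidate responses to an unsafe request.
    features = np.array([
        [1.0, 1.0, 0.0],   # ideal hard refusal
        [0.0, 1.0, 1.0],   # refusal with judgmental language
        [0.0, 0.0, 0.0],   # no refusal propositions fire
    ])
    targets = np.array([1.0, -0.5, 0.0])  # synthetic desired corrections

    # Fit the linear weights (a stand-in for ranking ideal above non-ideal).
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    print(total_reward(rm_score=0.3, features=features[0], w=w))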

Experiments showed that RBR-trained models achieved safety performance comparable to models trained with human feedback, with fewer instances of incorrectly refusing safe requests and no loss of performance on common capability benchmarks. RBRs significantly reduce the need for extensive human data, making training faster and more cost-effective. As model capabilities and safety guidelines evolve, RBRs can be updated quickly by modifying or adding rules, without extensive retraining.

OpenAI's approach evaluates the balance between helpfulness and harmfulness, ensuring the model is both safe and useful. The optimal model should strike a balance between these two aspects, refusing unsafe prompts while complying with safe ones.

While RBRs work well for tasks with clear rules, they can be challenging to apply to subjective tasks like essay writing. Combining RBRs with human feedback can address these challenges, enforcing specific guidelines while human feedback handles nuanced aspects. Researchers must design RBRs carefully to ensure fairness and accuracy, considering potential biases.

In sum, Rule-Based Rewards give OpenAI a cost- and time-efficient method for the safety training of language models that requires minimal human data, maintains a balance between safety and usefulness, and is easily updated as desired behaviors change.

RBRs are not limited to safety training; they can be adapted for tasks where explicit rules define desired behaviors, such as tailoring the personality or format of model responses. Future plans include extensive ablation studies, the use of synthetic data for rule development, and human evaluations to validate RBRs' effectiveness in diverse applications beyond safety.

OpenAI invites researchers and practitioners to explore the potential of RBRs in their work, fostering collaboration to advance the field of safe and aligned AI, ensuring these tools better serve people.
