Glossary

RLHF (Reinforcement Learning from Human Feedback)

A technique for fine-tuning language models with reinforcement learning, using human preference feedback (usually distilled into a learned reward model) to align outputs with preferred behaviors.

Context and detail

Strengths and limits: RLHF is effective at steering a model toward the behaviors its human raters rewarded, but it does not reliably prevent jailbreaks, because adversarial prompts can push the model into inputs and framings that the feedback data never covered.
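A minimal sketch of the idea, assuming a reward model already trained on human preference comparisons and a KL-regularized policy-gradient update; the toy setup, numbers, and variable names are illustrative, not taken from any particular library:

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy setup: one prompt with four candidate responses.
# Scores an (assumed) preference-trained reward model might assign.
rewards = np.array([0.2, 1.0, -0.5, 0.4])

ref_logits = np.array([1.0, 0.0, 0.5, 0.2])   # frozen reference (SFT) policy
logits = ref_logits.copy()                     # policy being fine-tuned
beta, lr = 0.1, 0.5                            # KL penalty weight, learning rate

for _ in range(200):
    probs, ref_probs = softmax(logits), softmax(ref_logits)
    # RLHF objective: maximize E[reward] - beta * KL(policy || reference).
    advantage = rewards - beta * (np.log(probs) - np.log(ref_probs))
    # REINFORCE-style gradient estimate (advantage treated as a constant).
    grad = probs * (advantage - probs @ advantage)
    logits += lr * grad

print(softmax(logits))  # probability mass shifts toward higher-reward responses

In practice the update is run with PPO or a similar algorithm over sampled token sequences, and the KL term keeps the tuned model close to the reference model rather than collapsing onto reward-hacking outputs.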

Related terms

  • Constitutional AI — Anthropic's approach to training language models with a defined set of principles (a constitution) used during fine-tuning to bias toward safer outputs.

See how RLHF (Reinforcement Learning from Human Feedback) maps to your AI posture.

The free AI Posture Check produces a per-dimension score and maps your gaps to OWASP LLM Top 10, NIST AI RMF, and ISO 42001.

Take the AI Posture Check