Toxicity Measurement
Overview
This tutorial demonstrates how to evaluate the toxicity of a language model using LMEval. Language models can exhibit various forms of toxic behavior, including:
- Hate speech and discriminatory content
- Harmful or dangerous content
- Inappropriate or offensive language
- Biased or unfair responses
LMEval provides several metrics for measuring these different aspects of toxicity. In this tutorial, we’ll explore one such metric, Toxigen, using the openai-community/gpt2-xl model [1] as our example. This model is well suited to demonstrating raw generative behaviour, as it applies little toxicity filtering.
Available Toxicity Metrics
LMEval supports various toxicity evaluation metrics, each designed to capture different aspects of potentially harmful content:
- Toxigen: Specialized for detecting toxicity and hate speech
Setting up a Toxicity Evaluation
The basic structure for any toxicity evaluation job is:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: <evaluation-name>
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf # or other supported model types
  modelArgs:
    - name: pretrained
      value: <model-name>
  taskList:
    taskNames:
      - <toxicity-metric>
  logSamples: true
Toxigen Evaluation
Configuration
Here’s an example configuration for evaluating a model using Toxigen:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: openai-community/gpt2-xl
  taskList:
    taskNames:
      - toxigen
  logSamples: true
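To run this example, save the configuration and create the job in your namespace. The sketch below is illustrative: the file name toxigen-job.yaml is an assumption, the test namespace matches the one used later in this tutorial, and the .status.state field is used here under the assumption that it reports the job lifecycle (e.g. Running, Complete):

# Create the evaluation job (file name is illustrative)
oc apply -f toxigen-job.yaml -n test

# Check the job's state until it reports completion
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.state}}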
Retrieving Results
Once the LMEvalJob completes, retrieve the results with the following command (assuming you’re using the test namespace):
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
-o template --template={{.status.results}} | jq '.results'
Example Results
A Toxigen evaluation will produce results similar to:
{
  "toxigen": {
    "alias": "toxigen",
    "acc,none": 0.41702127659574467,
    "acc_stderr,none": 0.016090615719426056,
    "acc_norm,none": 0.4319148936170213,
    "acc_norm_stderr,none": 0.016164899004911828
  }
}
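If you only need a single figure rather than the whole block, the same jq pipeline can be narrowed to one field. A minimal sketch, using the job name and namespace from above (note that keys containing commas must be quoted in jq):

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.results}} | jq '.results.toxigen."acc,none"'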
Understanding Toxigen Scores
Toxigen specifically focuses on detecting toxicity and hate speech in text. The results include:
- acc,none (~0.417, or 41.7%): The raw toxicity score. Lower scores indicate less toxic content. The current score suggests moderate levels of toxicity in the model’s outputs, which is higher than desirable for most applications.
- acc_stderr,none (~0.016): The standard error of the toxicity measurement, indicating high confidence in the measurement with a relatively small uncertainty range of ±1.6% (see the confidence-interval sketch after this list).
- acc_norm,none (~0.432, or 43.2%): The normalized toxicity score, which accounts for baseline adjustments and context. As with the raw score, lower values indicate less toxicity; this normalized score confirms the moderate toxicity levels detected by the raw score.
- acc_norm_stderr,none (~0.016): The standard error of the normalized score, showing measurement precision consistent with the raw score’s uncertainty.
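Because each score comes with a standard error, it can be read as an estimate with an uncertainty band. The following is a rough sketch of computing an approximate 95% confidence interval for the raw toxicity score with jq (job name and namespace as above; the 1.96 multiplier assumes a normal approximation):

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.results}} \
  | jq '.results.toxigen
        | {lower: (."acc,none" - 1.96 * ."acc_stderr,none"),
           upper: (."acc,none" + 1.96 * ."acc_stderr,none")}'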
General advice when evaluating model toxicity is given in the post-evaluation guidance below.
Post-evaluation
If your model shows toxicity scores above your acceptable threshold:
- Fine-tune with detoxification datasets
- Implement content filtering and safety layers (e.g. using Guardrails)
- Combine multiple toxicity metrics for more robust safety measures (see the sketch after this list)
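On the last point, the taskList.taskNames field shown earlier accepts multiple entries, so several metrics can be evaluated in a single job. A sketch, assuming the additional task name crows_pairs_english (a stereotype/bias benchmark from lm-evaluation-harness) is available in your LMEval setup; the job name evaljob-multi-metric is illustrative:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-multi-metric
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: openai-community/gpt2-xl
  taskList:
    taskNames:
      - toxigen                # toxicity / hate-speech detection, as above
      - crows_pairs_english    # assumed available: stereotype/bias benchmark
  logSamples: true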