Toxicity Measurement
Overview
This tutorial demonstrates how to evaluate the toxicity of a language model using LMEval. Language models can exhibit various forms of toxic behavior, including:
- Hate speech and discriminatory content
- Harmful or dangerous content
- Inappropriate or offensive language
- Biased or unfair responses
LMEval provides several metrics for measuring these different aspects of toxicity. In this tutorial, we’ll explore one such metric, Toxigen, using the openai-community/gpt2-xl model [1] as our example. This model is well suited to demonstrating raw generative behaviour, as it applies little toxicity filtering.
Available Toxicity Metrics
LMEval supports various toxicity evaluation metrics, each designed to capture different aspects of potentially harmful content:
- Toxigen: Specialized for detecting toxicity and hate speech
Setting up a Toxicity Evaluation
The basic structure for any toxicity evaluation job is:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: <evaluation-name>
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf # or other supported model types
  modelArgs:
    - name: pretrained
      value: <model-name>
  taskList:
    taskNames:
      - <toxicity-metric>
  logSamples: true
Toxigen Evaluation
Configuration
Here’s an example configuration for evaluating a model using Toxigen:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: openai-community/gpt2-xl
  taskList:
    taskNames:
      - toxigen
  logSamples: true
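To run this example, save the configuration and create the job in your namespace. The sketch below is illustrative: the file name toxigen-job.yaml is an assumption, the test namespace matches the one used later in this tutorial, and the .status.state field is used here under the assumption that it reports the job lifecycle (e.g. Running, Complete):

# Create the evaluation job (file name is illustrative)
oc apply -f toxigen-job.yaml -n test

# Check the job's state until it reports completion
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.state}}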
Retrieving Results
Once the LMEvalJob completes, retrieve the results with the following command (assuming you’re using the test namespace):
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
-o template --template={{.status.results}} | jq '.results'
Example Results
A Toxigen evaluation will produce results similar to:
{
  "toxigen": {
    "alias": "toxigen",
    "acc,none": 0.41702127659574467,
    "acc_stderr,none": 0.016090615719426056,
    "acc_norm,none": 0.4319148936170213,
    "acc_norm_stderr,none": 0.016164899004911828
  }
}
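If you only need a single figure rather than the whole block, the same jq pipeline can be narrowed to one field. A minimal sketch, using the job name and namespace from above (note that keys containing commas must be quoted in jq):

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.results}} | jq '.results.toxigen."acc,none"'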
Understanding Toxigen Scores
Toxigen specifically focuses on detecting toxicity and hate speech in text. The results include:
- acc,none (~0.417, or 41.7%): The raw toxicity score. Lower scores indicate less toxic content. The current score suggests moderate levels of toxicity in the model’s outputs, which is higher than desirable for most applications.
- acc_stderr,none (~0.016): The standard error of the toxicity measurement, indicating high confidence in the measurement with a relatively small uncertainty range of ±1.6% (see the confidence-interval sketch after this list).
- acc_norm,none (~0.432, or 43.2%): The normalized toxicity score, which accounts for baseline adjustments and context. As with the raw score, lower values indicate less toxicity; this normalized score confirms the moderate toxicity levels detected by the raw score.
- acc_norm_stderr,none (~0.016): The standard error of the normalized score, showing measurement precision consistent with the raw score’s uncertainty.
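Because each score comes with a standard error, it can be read as an estimate with an uncertainty band. The following is a rough sketch of computing an approximate 95% confidence interval for the raw toxicity score with jq (job name and namespace as above; the 1.96 multiplier assumes a normal approximation):

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample -n test \
  -o template --template={{.status.results}} \
  | jq '.results.toxigen
        | {lower: (."acc,none" - 1.96 * ."acc_stderr,none"),
           upper: (."acc,none" + 1.96 * ."acc_stderr,none")}'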
General advice when evaluating model toxicity is given in the post-evaluation guidance below.
Post-evaluation
If your model shows toxicity scores above your acceptable threshold:
- Fine-tune with detoxification datasets
- Implement content filtering and safety layers (e.g. using Guardrails)
- Combine multiple toxicity metrics for more robust safety measures (see the sketch after this list)
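On the last point, the taskList.taskNames field shown earlier accepts multiple entries, so several metrics can be evaluated in a single job. A sketch, assuming the additional task name crows_pairs_english (a stereotype/bias benchmark from lm-evaluation-harness) is available in your LMEval setup; the job name evaljob-multi-metric is illustrative:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-multi-metric
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: openai-community/gpt2-xl
  taskList:
    taskNames:
      - toxigen                # toxicity / hate-speech detection, as above
      - crows_pairs_english    # assumed available: stereotype/bias benchmark
  logSamples: true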