Getting started with LM-Eval
LM-Eval is a service for large language model evaluation underpinned by two open-source projects: lm-evaluation-harness and Unitxt. LM-Eval is integrated into the TrustyAI Kubernetes Operator. In this tutorial, you will learn:
- How to create an LMEvalJob CR to kick off an evaluation job and get the results
Global settings for LM-Eval
There are some configurable global settings for the LM-Eval service. They are stored in the TrustyAI operator's global ConfigMap, trustyai-service-operator-config, located in the same namespace as the operator. Here is a list of the LM-Eval properties:
Setting | Default | Description |
---|---|---|
lmes-detect-device | | Detect whether GPUs are available and set the proper --device value for lm-evaluation-harness accordingly (cuda when a GPU is found, otherwise cpu). |
lmes-pod-image | | The image for the LM-Eval job. The image contains the necessary Python packages for lm-evaluation-harness and Unitxt. |
lmes-driver-image | | The image for the LM-Eval driver, which manages the evaluation run inside the job pod. |
lmes-image-pull-policy | | The image-pulling policy when running the evaluation job. |
lmes-default-batch-size | | The default batch size when invoking the model inference API. This only works for local models. |
lmes-max-batch-size | | The maximum batch size that users can specify in an evaluation job. |
lmes-pod-checking-interval | | The interval at which the job pod of an evaluation job is checked. |
lmes-allow-online | | Whether LMEval jobs are allowed to turn the online mode on. |
lmes-allow-code-execution | | Whether LMEval jobs are allowed to turn the trust remote code mode on. |
After updating the settings in the trustyai-service-operator-config ConfigMap, redeploy the TrustyAI operator so that the new values take effect.
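The ConfigMap can be edited in place and the operator restarted afterwards. The following sketch assumes the operator is deployed in the opendatahub namespace under a deployment named trustyai-service-operator-controller-manager; the names may differ on your cluster:
oc edit configmap trustyai-service-operator-config -n opendatahub
oc rollout restart deployment trustyai-service-operator-controller-manager -n opendatahub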
LMEvalJob
The LM-Eval service defines a new Custom Resource Definition called LMEvalJob. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator. An LMEvalJob object represents an evaluation job; to run one, you create an LMEvalJob object with the needed information, including the model, model arguments, task, secret, etc. Once the LMEvalJob is created, the LM-Eval service runs the evaluation job and updates the status and results in the LMEvalJob object as the information becomes available.
Here is an example of an LMEvalJob object:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
allowOnline: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base (1)
taskList:
taskRecipes:
- card:
name: "cards.wnli" (2)
template: "templates.classification.multi_class.relation.default" (3)
logSamples: true
pod:
container:
resources:
limits:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1' (4)
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1' (4)
1 | In this example, it uses the pre-trained google/flan-t5-base model from Hugging Face (model: hf) |
2 | The dataset is from the wnli subset of the General Language Understanding Evaluation (GLUE). You can find the details of the Unitxt card wnli here. |
3 | It also specifies the multi_class.relation task from Unitxt and its default metrics are f1_micro , f1_macro , and accuracy . |
4 | This example assumes you have a GPU available. If that’s not the case, the nvidia.com/gpu fields above can be removed and the LMEval job will run on CPU. |
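To start the evaluation, save the example to a file and apply it to the namespace where the job should run (the file and namespace names below are placeholders):
oc apply -f evaljob-sample.yaml -n <your-namespace>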
After you apply the example LMEvalJob
above, you can check its state by using the following command:
oc get lmevaljob evaljob-sample
The output will look similar to:
NAME STATE
evaljob-sample Running
When its state becomes Complete, the evaluation results are available. Both the model and dataset in this example are small, so the evaluation job should finish within about 10 minutes on a CPU-only node.
Use the following command to get the results:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
-o template --template={{.status.results}} | jq '.results'
Here are the example results:
{
"tr_0": {
"alias": "tr_0",
"f1_micro,none": 0.5633802816901409,
"f1_micro_stderr,none": "N/A",
"accuracy,none": 0.5633802816901409,
"accuracy_stderr,none": "N/A",
"f1_macro,none": 0.36036036036036034,
"f1_macro_stderr,none": "N/A"
}
}
The f1_micro
, f1_macro
, and accuracy
scores are 0.56, 0.36, and 0.56. The full results are stored in the .status.results
of the LMEvalJob
object as a JSON document. The command above only retrieves the results
field of the JSON document. See Output of LMEvalJob for more details.
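If you only need a single metric, you can drill further into the JSON. For example, the following sketch, using the tr_0 task name from this example, extracts just the micro F1 score:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results.tr_0["f1_micro,none"]'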
Details of LMEvalJob
In this section, let’s review each property in the LMEvalJob and its usage.
Parameter | Description |
---|---|
model | Specify which model type or provider is evaluated. This field directly maps to the --model argument of lm-evaluation-harness. |
modelArgs | A list of paired name and value arguments for the model type. Each model type or provider supports a different set of arguments. |
taskList.taskNames | Specify a list of tasks supported by lm-evaluation-harness. See the useful commands section of this page to get a list of built-in tasks. |
taskList.taskRecipes | Specify the task using the Unitxt recipe format, with fields such as card, template, and systemPrompt. |
taskList.custom | Define one or more custom resources that will be referenced in a task recipe. Custom cards, custom templates, and custom system prompts are currently supported. |
numFewShot | Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, don't use this field; set the number of demos in the task recipe instead. |
limit | Instead of running the whole dataset, set a limit on the number of examples per task. Accepts an integer, or a float between 0.0 and 1.0 that is interpreted as a fraction of the dataset. |
genArgs | Maps to the --gen_kwargs parameter of lm-evaluation-harness. |
logSamples | If this flag is passed, the model's outputs, and the text fed into the model, are saved at per-document granularity. |
batchSize | Batch size for the evaluation. The default and maximum batch sizes are controlled by the operator's global settings. |
pod | Specify extra information for the lm-eval job's pod, such as environment variables, volumes, and resources for the container. |
outputs | This section defines custom output locations for storing the evaluation results. At the moment only Persistent Volume Claims (PVCs) are supported. |
outputs.pvcManaged | Create an operator-managed PVC to store this job's results. The PVC is named <job-name>-pvc and is deleted when the LMEvalJob is deleted. |
outputs.pvcName | Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job. |
allowOnline | If set to true, the job is allowed to access the internet, for example to download models and datasets. |
allowCodeExecution | If set to true, the job is allowed to execute code shipped with the downloaded models or datasets (trust remote code). |
offline | Mount a PVC as the local storage for models and datasets. |
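As an illustration, the following sketch combines several of the parameters above into a single job; the task name and limit are arbitrary choices:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskNames:
      - unfair_tos
  limit: "5"
  logSamples: true
  batchSize: 1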
Output of LMEvalJob
The output of an LMEvalJob is a YAML document with several fields. The status
section provides the relevant information about the current status and, if the job successfully completes, the results for an evaluation.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"trustyai.opendatahub.io/v1alpha1","kind":"LMEvalJob","metadata":{"annotations":{},"name":"lmeval-test","namespace":"test"},"spec":{"allowCodeExecution":true,"allowOnline":true,"logSamples":true,"model":"hf","modelArgs":[{"name":"pretrained","value":"google/flan-t5-base"}],"taskList":{"taskRecipes":[{"card":{"name":"cards.wnli"},"template":"templates.classification.multi_class.relation.default"}]}}}
creationTimestamp: "2025-02-06T18:13:35Z"
finalizers:
- trustyai.opendatahub.io/lmes-finalizer
generation: 1
name: lmeval-test
namespace: test
resourceVersion: "19604113"
uid: e1d29da2-bf3e-4f46-8907-6018e5741eb4
spec:
allowCodeExecution: true
allowOnline: true
logSamples: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- card:
name: cards.wnli
template: templates.classification.multi_class.relation.default
status:
completeTime: "2025-02-06T18:31:20Z"
lastScheduleTime: "2025-02-06T18:13:35Z"
message: job completed (1)
podName: lmeval-test
reason: Succeeded (2)
results: |- (3)
{
...
}
state: Complete (4)
1 | A message provides an explanation related to the current or final status of an LMEvalJob. If the job reason is Failed , the related error message will be shown here. |
2 | A one-word reason that corresponds to the given state of the job at this time. Possible values include Succeeded, Failed, and Cancelled. |
3 | The results field is the direct output of an lm-evaluation-harness run. It has been omitted here to avoid repetition. The next section gives an example of the contents of this section. This section will be empty if the job is not completed. |
4 | The current state of this job. The reason for a particular state is given in the reason field. Possible values include New, Scheduled, Running, Complete, Suspended, and Cancelled. |
results
section
The results
field is the direct output of an lm-evaluation-harness
run. Below is an example of the file that is returned after an lm-evaluation-harness
evaluation run and, consequently, the contents of the results
dictionary of the LMEvalJob output YAML. This file may look slightly different depending on what options are passed.
The example shown here is of a Unitxt task called tr_0, corresponding to the Unitxt task recipe used earlier in this tutorial.
{
"results": { (1)
"tr_0": {
"alias": "tr_0",
"f1_micro,none": 0.5,
"f1_micro_stderr,none": "N/A",
"accuracy,none": 0.5,
"accuracy_stderr,none": "N/A",
"f1_macro,none": 0.3333333333333333,
"f1_macro_stderr,none": "N/A"
}
},
"group_subtasks": { (2)
"tr_0": []
},
"configs": { (3)
"tr_0": {
"task": "tr_0",
"dataset_name": "card=cards.wnli,template=templates.classification.multi_class.relation.default",
"unsafe_code": false,
"description": "",
"target_delimiter": " ",
"fewshot_delimiter": "\n\n",
"num_fewshot": 0,
"output_type": "generate_until",
"generation_kwargs": {
"until": [
"\n\n"
],
"do_sample": false
},
"repeats": 1,
"should_decontaminate": false,
"metadata": {
"version": 0
}
}
},
"versions": { (4)
"tr_0": 0
},
"n-shot": { (4)
"tr_0": 0
},
"higher_is_better": { (5)
"tr_0": {
"f1_micro": true,
"accuracy": true,
"f1_macro": true
}
},
"n-samples": { (5)
"tr_0": {
"original": 71,
"effective": 10
}
},
"config": { (6)
"model": "hf",
"model_args": "pretrained=hf_home/flan-t5-base",
"model_num_parameters": 247577856,
"model_dtype": "torch.float32",
"model_revision": "main",
"model_sha": "",
"batch_size": 1,
"batch_sizes": [],
"use_cache": null,
"limit": 10.0,
"bootstrap_iters": 100000,
"gen_kwargs": null,
"random_seed": 0,
"numpy_seed": 1234,
"torch_seed": 1234,
"fewshot_seed": 1234
},
"git_hash": "af2d2f3e",
"date": 1740763246.8746712,
"pretty_env_info": "PyTorch version: 2.5.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: macOS 15.3.1 (arm64)\nGCC version: Could not collect\nClang version: 16.0.0 (clang-1600.0.26.3)\nCMake version: Could not collect\nLibc version: N/A\n\nPython version: 3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ] (64-bit runtime)\nPython platform: macOS-15.3.1-arm64-arm-64bit\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nApple M1 Max\n\nVersions of relevant libraries:\n[pip3] mypy==1.15.0\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==2.2.2\n[pip3] torch==2.5.1\n[conda] numpy 2.2.2 pypi_0 pypi\n[conda] torch 2.5.1 pypi_0 pypi",
"transformers_version": "4.48.1",
"upper_git_hash": null,
"tokenizer_pad_token": [
"<pad>",
"0"
],
"tokenizer_eos_token": [
"</s>",
"1"
],
"tokenizer_bos_token": [
null,
"None"
],
"eot_token_id": 1,
"max_length": 512,
"task_hashes": {},
"model_source": "hf",
"model_name": "hf_home/flan-t5-base",
"model_name_sanitized": "hf_home__flan-t5-base",
"system_instruction": null,
"system_instruction_sha": null,
"fewshot_as_multiturn": false,
"chat_template": null,
"chat_template_sha": null,
"start_time": 84598.410512833, (7)
"end_time": 84647.782769875,
"total_evaluation_time_seconds": "49.37225704200682"
}
1 | results is a dictionary of tasks keyed by task name. For each task, the calculated metrics are shown. These metrics are dependent on the task definition. results is a flat dictionary, so if a task has subtasks, they will not be nested under a parent task but are rather their own entry. |
2 | group_subtasks is a dictionary of tasks keyed by name with the value for each being a list of strings corresponding to subtasks for this task. group_subtasks is empty in this example because there are no subtasks. |
3 | configs is a dictionary of tasks keyed by task name that shows the configuration options for each task run. These key-value pairs are provided by the task definition (or default values) and will vary depending on the type of task run. |
4 | versions and n-shot are flat dictionaries with one key for each task run. The value in the versions dictionary is the version of the given task (or 0 by default). The value in the n-shot dictionary is the number of few-shot examples that were placed in context when running the task. This information is also available in the configs dictionary. |
5 | higher_is_better and n-samples are dictionaries with one key-dictionary pair for each task run. The former provides information as to whether a higher score is considered better for each metric that was evaluated for that task. The latter gives, for each task, the number of samples used during evaluation. In this example, the --limit property was set to 10, making the effective number of samples 10. |
6 | config is a dictionary that provides key-value pairs corresponding to the evaluation job as a whole. This includes information on the type of model run, the model_args , and other settings used for the run. Many of the values in this dictionary in this example are the default values defined by lm-evaluation-harness . |
7 | Given at the very end are three fields describing the start, end, and total evaluation time for this job. |
The remaining key-value pairs define a variety of environment settings used for this evaluation job.
Examples
Environment Variables
If the LMEvalJob needs an access token to pull a model from Hugging Face, you can set HF_TOKEN as one of the environment variables of the lm-eval container:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: huggingfacespace/model
taskList:
taskNames:
- unfair_tos
logSamples: true
pod:
container:
env: (1)
- name: HF_TOKEN
value: "My HuggingFace token"
1 | spec.pod.container.env fields are passed directly to the LMEvalJob’s container as environment variables. |
Alternatively, you can store the token in a secret and reference its key using the secretKeyRef syntax (only the env part of the spec is shown):
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: my-secret
key: hf-token
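A secret like the one referenced above can be created with a command along these lines; the secret name, key, and token value are placeholders:
oc create secret generic my-secret --from-literal=hf-token=<your-hf-token>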
Custom Unitxt Card
Pass a custom Unitxt Card in JSON format:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- template: "templates.classification.multi_class.relation.default"
card:
custom: |
{
"__type__": "task_card",
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
"preprocess_steps": [
{
"__type__": "split_random_mix",
"mix": {
"train": "train[95%]",
"validation": "train[5%]",
"test": "validation"
}
},
{
"__type__": "rename",
"field": "sentence1",
"to_field": "text_a"
},
{
"__type__": "rename",
"field": "sentence2",
"to_field": "text_b"
},
{
"__type__": "map_instance_values",
"mappers": {
"label": {
"0": "entailment",
"1": "not entailment"
}
}
},
{
"__type__": "set",
"fields": {
"classes": [
"entailment",
"not entailment"
]
}
},
{
"__type__": "set",
"fields": {
"type_of_relation": "entailment"
}
},
{
"__type__": "set",
"fields": {
"text_a_type": "premise"
}
},
{
"__type__": "set",
"fields": {
"text_b_type": "hypothesis"
}
}
],
"task": "tasks.classification.multi_class.relation",
"templates": "templates.classification.multi_class.relation.all"
}
logSamples: true
Inside the custom card, the HuggingFace dataset loader is used:
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
Using PVCs as storage
To use a PVC as storage for the LMEvalJob results, two modes are currently supported: managed PVCs and existing PVCs.
Managed PVCs, as the name implies, are managed by the TrustyAI operator. To enable a managed PVC, simply specify its size:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs: (1)
pvcManaged: (2)
size: 5Gi (3)
1 | outputs is the section for specifying custom storage locations |
2 | pvcManaged will create an operator-managed PVC |
3 | size (compatible with standard PVC syntax) is the only supported value |
This will create a PVC named <job-name>-pvc
(in this case evaljob-sample-pvc
) which will be available after the job finishes, but will be deleted when the LMEvalJob
is deleted.
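Once the job has run, the managed PVC should be visible in the namespace, for example:
oc get pvc evaljob-sample-pvc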
To use an already existing PVC you can pass its name as a reference.
The PVC must already exist when the LMEvalJob
is created. Start by creating a PVC, for instance:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "my-pvc"
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
And then reference it from the LMEvalJob
:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs:
pvcName: "my-pvc" (1)
1 | pvcName references the already existing PVC my-pvc . |
In this case, the PVC is not managed by the TrustyAI operator, so it will be available even after deleting the LMEvalJob
.
In the case where both managed and existing PVCs are referenced in outputs
, the TrustyAI operator will prefer the managed PVC and ignore the existing one.
Using an InferenceService
This example assumes a vLLM model is already deployed in your cluster.
Define your LMEvalJob CR:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob
spec:
model: local-completions
taskList:
taskNames:
- mmlu
logSamples: true
batchSize: 1
modelArgs:
- name: model
value: granite
- name: base_url
value: $ROUTE_TO_MODEL/v1/completions (1)
- name: num_concurrent
value: "1"
- name: max_retries
value: "3"
- name: tokenized_requests
value: "False"
- name: tokenizer
value: ibm-granite/granite-7b-instruct
pod:
container:
env:
- name: OPENAI_API_KEY (2)
valueFrom:
secretKeyRef: (3)
name: <secret-name> (4)
key: token (5)
1 | base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL. |
2 | OPENAI_API_KEY values are passed directly to remote model servers, so they can also be used as general authentication bearer tokens. |
3 | env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretRef.name should be the secret’s name in the namespace, while secretRef.key should point at the token’s key within the secret. |
4 | secretKeyRef.name should match the name of an existing secret in your namespace, for example one listed by oc get secrets. |
5 | secretKeyRef.key should equal the name of the field holding the token value, in this example token. |
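To find the value for $ROUTE_TO_MODEL, you can look up the URL of the deployed InferenceService. A sketch, assuming the InferenceService is named granite:
oc get inferenceservice granite -o jsonpath='{.status.url}'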
Then, apply this CR in the same namespace as your model. You should see a pod named evaljob spin up in the model namespace. In the pod's terminal, you can follow the output via tail -f output/stderr.log
Using GPUs
Typically, when using an Inference Service, GPU acceleration is performed at the model server level. However, when using local mode, i.e. running the evaluation locally in the LMEval job's pod, you might want to use available GPUs. To do so, add a resource configuration directly to the job's definition:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskNames:
- "qnlieu"
logSamples: true
allowOnline: true
allowCodeExecution: true
pod: (1)
container:
resources:
limits: (2)
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
1 | The pod section allows adding specific resource definitions to the LMEval Job. |
2 | In this case we are adding cpu: 1 , memory: 8Gi and nvidia.com/gpu: 1 , but these can be adjusted to your cluster’s availability. |
Integration with Kueue
TrustyAI and LM-Eval do not require Kueue to work.
However, if Kueue is available on the cluster, it can be used from LM-Eval.
To enable Kueue on Open Data Hub, add the following to your DataScienceCluster resource:
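A sketch of the relevant part of the DataScienceCluster spec, assuming the default resource name default-dsc (the component layout may differ between releases):
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kueue:
      managementState: Managed
    trustyai:
      managementState: Managed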
To enable job suspension for the Kueue integration, create a job in the suspended state, then verify that the job is suspended and that its pod is not running.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
labels:
app.kubernetes.io/name: fms-lm-eval-service
name: evaljob-sample
spec:
suspend: true (1)
model: hf
modelArgs:
- name: pretrained
value: EleutherAI/pythia-70m
taskList:
taskNames:
- unfair_tos
logSamples: true
limit: "5"
1 | This will set the LM-Eval job’s state as suspended |
Set suspend to false and verify that the job's pod is created and running:
oc patch lmevaljob evaljob-sample --patch '{"spec":{"suspend":false}}' --type merge
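After patching, the job should leave the suspended state and its pod should appear, for example:
oc get lmevaljob evaljob-sample
oc get pods | grep evaljob-sample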
Useful Commands & References
List all available tasks
As mentioned above, LMEvalJob supports running both Unitxt tasks (via recipes or custom JSON) and the built-in tasks that are available out-of-the-box with lm-evaluation-harness. To see a list of the available built-in tasks, run the following command, replacing <<LMES-POD-IMAGE>> with the name of a pod running the LM-Eval pod image from the global settings above:
oc exec <<LMES-POD-IMAGE>> -- bash -c "lm_eval --tasks list"
It is recommended to pipe this output to a file as there are several thousand built-in tasks.
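For example, with the pod name as a placeholder:
oc exec <<LMES-POD-IMAGE>> -- bash -c "lm_eval --tasks list" > lm-eval-tasks.txt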
Note: the output of the task list command will include some Unitxt tasks that have been directly contributed to lm-evaluation-harness
. You can run these tasks by specifying the task name in the taskList
property of the job definition as you would with any built-in task.