Getting started with LM-Eval
LM-Eval is a service for large language model evaluation underpinned by two open-source projects: lm-evaluation-harness and Unitxt. LM-Eval is integrated into the TrustyAI Kubernetes Operator. In this tutorial, you will learn:
- How to create an LMEvalJob CR to kick off an evaluation job and get the results
Global settings for LM-Eval
There are some configurable global settings for the LM-Eval service. They are stored in the TrustyAI operator's global ConfigMap, trustyai-service-operator-config, located in the same namespace as the operator. Here is a list of the LM-Eval properties:
Setting | Default | Description |
---|---|---|
lmes-detect-device | | Detect whether GPUs are available and set the proper --device value for lm-evaluation-harness accordingly (cuda when a GPU is found, otherwise cpu). |
lmes-pod-image | | The image for the LM-Eval job. The image contains the necessary Python packages for lm-evaluation-harness and Unitxt. |
lmes-driver-image | | The image for the LM-Eval driver, which manages the evaluation run inside the job pod. |
lmes-image-pull-policy | | The image-pulling policy when running the evaluation job. |
lmes-default-batch-size | | The default batch size when invoking the model inference API. This only works for local models. |
lmes-max-batch-size | | The maximum batch size that users can specify in an evaluation job. |
lmes-pod-checking-interval | | The interval at which the job pod of an evaluation job is checked. |
lmes-allow-online | | Whether LMEval jobs are allowed to turn the online mode on. |
lmes-allow-code-execution | | Whether LMEval jobs are allowed to turn the trust remote code mode on. |
After updating the settings in the trustyai-service-operator-config ConfigMap, redeploy the TrustyAI operator so that the new values take effect.
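The ConfigMap can be edited in place and the operator restarted afterwards. The following sketch assumes the operator is deployed in the opendatahub namespace under a deployment named trustyai-service-operator-controller-manager; the names may differ on your cluster:
oc edit configmap trustyai-service-operator-config -n opendatahub
oc rollout restart deployment trustyai-service-operator-controller-manager -n opendatahub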
LMEvalJob
The LM-Eval service defines a new Custom Resource Definition called LMEvalJob. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator. An LMEvalJob object represents an evaluation job; to run one, you create an LMEvalJob object with the needed information, including the model, model arguments, task, secret, etc. Once the LMEvalJob is created, the LM-Eval service runs the evaluation job and updates the status and results in the LMEvalJob object as the information becomes available.
Here is an example of an LMEvalJob object:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
allowOnline: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base (1)
taskList:
taskRecipes:
- card:
name: "cards.wnli" (2)
template: "templates.classification.multi_class.relation.default" (3)
logSamples: true
pod:
container:
resources:
limits:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1' (4)
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1' (4)
1 | In this example, it uses the pre-trained google/flan-t5-base model from Hugging Face (model: hf) |
2 | The dataset is from the wnli subset of the General Language Understanding Evaluation (GLUE). You can find the details of the Unitxt card wnli here. |
3 | It also specifies the multi_class.relation task from Unitxt and its default metrics are f1_micro , f1_macro , and accuracy . |
4 | This example assumes you have a GPU available. If that’s not the case, the nvidia.com/gpu fields above can be removed and the LMEval job will run on CPU. |
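To start the evaluation, save the example to a file and apply it to the namespace where the job should run (the file and namespace names below are placeholders):
oc apply -f evaljob-sample.yaml -n <your-namespace>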
After you apply the example LMEvalJob
above, you can check its state by using the following command:
oc get lmevaljob evaljob-sample
The output will look similar to:
NAME STATE
evaljob-sample Running
When its state becomes Complete, the evaluation results are available. Both the model and dataset in this example are small, so the evaluation job should finish within about 10 minutes on a CPU-only node.
Use the following command to get the results:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
-o template --template={{.status.results}} | jq '.results'
Here are the example results:
{
"tr_0": {
"alias": "tr_0",
"f1_micro,none": 0.5633802816901409,
"f1_micro_stderr,none": "N/A",
"accuracy,none": 0.5633802816901409,
"accuracy_stderr,none": "N/A",
"f1_macro,none": 0.36036036036036034,
"f1_macro_stderr,none": "N/A"
}
}
The f1_micro
, f1_macro
, and accuracy
scores are 0.56, 0.36, and 0.56. The full results are stored in the .status.results
of the LMEvalJob
object as a JSON document. The command above only retrieves the results
field of the JSON document. See Output of LMEvalJob for more details.
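If you only need a single metric, you can drill further into the JSON. For example, the following sketch, using the tr_0 task name from this example, extracts just the micro F1 score:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results.tr_0["f1_micro,none"]'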
Details of LMEvalJob
In this section, let’s review each property in the LMEvalJob and its usage.
Parameter | Description |
---|---|
model | Specify which model type or provider is evaluated. This field directly maps to the --model argument of lm-evaluation-harness. |
modelArgs | A list of paired name and value arguments for the model type. Each model type or provider supports a different set of arguments. |
taskList.taskNames | Specify a list of tasks supported by lm-evaluation-harness. See the useful commands section of this page to get a list of built-in tasks. |
taskList.taskRecipes | Specify the task using the Unitxt recipe format, with fields such as card, template, and systemPrompt. |
taskList.custom | Define one or more custom resources that will be referenced in a task recipe. Custom cards, custom templates, and custom system prompts are currently supported. |
numFewShot | Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, don't use this field; set the number of demos in the task recipe instead. |
limit | Instead of running the whole dataset, set a limit on the number of examples per task. Accepts an integer, or a float between 0.0 and 1.0 that is interpreted as a fraction of the dataset. |
genArgs | Maps to the --gen_kwargs parameter of lm-evaluation-harness. |
logSamples | If this flag is passed, the model's outputs, and the text fed into the model, are saved at per-document granularity. |
batchSize | Batch size for the evaluation. The default and maximum batch sizes are controlled by the operator's global settings. |
pod | Specify extra information for the lm-eval job's pod, such as environment variables, volumes, and resources for the container. |
outputs | This section defines custom output locations for storing the evaluation results. At the moment only Persistent Volume Claims (PVCs) are supported. |
outputs.pvcManaged | Create an operator-managed PVC to store this job's results. The PVC is named <job-name>-pvc and is deleted when the LMEvalJob is deleted. |
outputs.pvcName | Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job. |
allowOnline | If set to true, the job is allowed to access the internet, for example to download models and datasets. |
allowCodeExecution | If set to true, the job is allowed to execute code shipped with the downloaded models or datasets (trust remote code). |
offline | Mount a PVC as the local storage for models and datasets. |
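As an illustration, the following sketch combines several of the parameters above into a single job; the task name and limit are arbitrary choices:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-base
  taskList:
    taskNames:
      - unfair_tos
  limit: "5"
  logSamples: true
  batchSize: 1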
Output of LMEvalJob
The output of an LMEvalJob is a YAML document with several fields. The status
section provides the relevant information about the current status and, if the job successfully completes, the results for an evaluation.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"trustyai.opendatahub.io/v1alpha1","kind":"LMEvalJob","metadata":{"annotations":{},"name":"lmeval-test","namespace":"test"},"spec":{"allowCodeExecution":true,"allowOnline":true,"logSamples":true,"model":"hf","modelArgs":[{"name":"pretrained","value":"google/flan-t5-base"}],"taskList":{"taskRecipes":[{"card":{"name":"cards.wnli"},"template":"templates.classification.multi_class.relation.default"}]}}}
creationTimestamp: "2025-02-06T18:13:35Z"
finalizers:
- trustyai.opendatahub.io/lmes-finalizer
generation: 1
name: lmeval-test
namespace: test
resourceVersion: "19604113"
uid: e1d29da2-bf3e-4f46-8907-6018e5741eb4
spec:
allowCodeExecution: true
allowOnline: true
logSamples: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- card:
name: cards.wnli
template: templates.classification.multi_class.relation.default
status:
completeTime: "2025-02-06T18:31:20Z"
lastScheduleTime: "2025-02-06T18:13:35Z"
message: job completed (1)
podName: lmeval-test
reason: Succeeded (2)
results: |- (3)
{
...
}
state: Complete (4)
1 | A message provides an explanation related to the current or final status of an LMEvalJob. If the job reason is Failed , the related error message will be shown here. |
2 | A one-word reason that corresponds to the given state of the job at this time. Possible values include Succeeded, Failed, and Cancelled. |
3 | The results field is the direct output of an lm-evaluation-harness run. It has been omitted here to avoid repetition. The next section gives an example of the contents of this section. This section will be empty if the job is not completed. |
4 | The current state of this job. The reason for a particular state is given in the reason field. Possible values include New, Scheduled, Running, Complete, Suspended, and Cancelled. |
results
section
The results
field is the direct output of an lm-evaluation-harness
run. Below is an example of the file that is returned after an lm-evaluation-harness
evaluation run and, consequently, the contents of the results
dictionary of the LMEvalJob output YAML. This file may look slightly different depending on what options are passed.
The example shown here is of a Unitxt task called tr_0, corresponding to the Unitxt task recipe used earlier in this tutorial.
{
"results": { (1)
"tr_0": {
"alias": "tr_0",
"f1_micro,none": 0.5,
"f1_micro_stderr,none": "N/A",
"accuracy,none": 0.5,
"accuracy_stderr,none": "N/A",
"f1_macro,none": 0.3333333333333333,
"f1_macro_stderr,none": "N/A"
}
},
"group_subtasks": { (2)
"tr_0": []
},
"configs": { (3)
"tr_0": {
"task": "tr_0",
"dataset_name": "card=cards.wnli,template=templates.classification.multi_class.relation.default",
"unsafe_code": false,
"description": "",
"target_delimiter": " ",
"fewshot_delimiter": "\n\n",
"num_fewshot": 0,
"output_type": "generate_until",
"generation_kwargs": {
"until": [
"\n\n"
],
"do_sample": false
},
"repeats": 1,
"should_decontaminate": false,
"metadata": {
"version": 0
}
}
},
"versions": { (4)
"tr_0": 0
},
"n-shot": { (4)
"tr_0": 0
},
"higher_is_better": { (5)
"tr_0": {
"f1_micro": true,
"accuracy": true,
"f1_macro": true
}
},
"n-samples": { (5)
"tr_0": {
"original": 71,
"effective": 10
}
},
"config": { (6)
"model": "hf",
"model_args": "pretrained=hf_home/flan-t5-base",
"model_num_parameters": 247577856,
"model_dtype": "torch.float32",
"model_revision": "main",
"model_sha": "",
"batch_size": 1,
"batch_sizes": [],
"use_cache": null,
"limit": 10.0,
"bootstrap_iters": 100000,
"gen_kwargs": null,
"random_seed": 0,
"numpy_seed": 1234,
"torch_seed": 1234,
"fewshot_seed": 1234
},
"git_hash": "af2d2f3e",
"date": 1740763246.8746712,
"pretty_env_info": "PyTorch version: 2.5.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: macOS 15.3.1 (arm64)\nGCC version: Could not collect\nClang version: 16.0.0 (clang-1600.0.26.3)\nCMake version: Could not collect\nLibc version: N/A\n\nPython version: 3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ] (64-bit runtime)\nPython platform: macOS-15.3.1-arm64-arm-64bit\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nApple M1 Max\n\nVersions of relevant libraries:\n[pip3] mypy==1.15.0\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==2.2.2\n[pip3] torch==2.5.1\n[conda] numpy 2.2.2 pypi_0 pypi\n[conda] torch 2.5.1 pypi_0 pypi",
"transformers_version": "4.48.1",
"upper_git_hash": null,
"tokenizer_pad_token": [
"<pad>",
"0"
],
"tokenizer_eos_token": [
"</s>",
"1"
],
"tokenizer_bos_token": [
null,
"None"
],
"eot_token_id": 1,
"max_length": 512,
"task_hashes": {},
"model_source": "hf",
"model_name": "hf_home/flan-t5-base",
"model_name_sanitized": "hf_home__flan-t5-base",
"system_instruction": null,
"system_instruction_sha": null,
"fewshot_as_multiturn": false,
"chat_template": null,
"chat_template_sha": null,
"start_time": 84598.410512833, (7)
"end_time": 84647.782769875,
"total_evaluation_time_seconds": "49.37225704200682"
}
1 | results is a dictionary of tasks keyed by task name. For each task, the calculated metrics are shown. These metrics are dependent on the task definition. results is a flat dictionary, so if a task has subtasks, they will not be nested under a parent task but are rather their own entry. |
2 | group_subtasks is a dictionary of tasks keyed by name with the value for each being a list of strings corresponding to subtasks for this task. group_subtasks is empty in this example because there are no subtasks. |
3 | configs is a dictionary of tasks keyed by task name that shows the configuration options for each task run. These key-value pairs are provided by the task definition (or default values) and will vary depending on the type of task run. |
4 | versions and n-shot are flat dictionaries with one key for each task run. The value in the versions dictionary is the version of the given task (or 0 by default). The value in the n-shot dictionary is the number of few-shot examples that were placed in context when running the task. This information is also available in the configs dictionary. |
5 | higher_is_better and n-samples are dictionaries with one key-dictionary pair for each task run. The former provides information as to whether a higher score is considered better for each metric that was evaluated for that task. The latter gives, for each task, the number of samples used during evaluation. In this example, the --limit property was set to 10, making the effective number of samples 10. |
6 | config is a dictionary that provides key-value pairs corresponding to the evaluation job as a whole. This includes information on the type of model run, the model_args , and other settings used for the run. Many of the values in this dictionary in this example are the default values defined by lm-evaluation-harness . |
7 | Given at the very end are three fields describing the start, end, and total evaluation time for this job. |
The remaining key-value pairs define a variety of environment settings used for this evaluation job.
Examples
Environment Variables
If the LMEvalJob needs an access token to pull a model from Hugging Face, you can set HF_TOKEN as one of the environment variables of the lm-eval container:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: huggingfacespace/model
taskList:
taskNames:
- unfair_tos
logSamples: true
pod:
container:
env: (1)
- name: HF_TOKEN
value: "My HuggingFace token"
1 | spec.pod.container.env fields are passed directly to the LMEvalJob’s container as environment variables. |
Alternatively, you can store the token in a secret and reference its key using the secretKeyRef syntax (only the env part of the spec is shown):
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: my-secret
key: hf-token
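A secret like the one referenced above can be created with a command along these lines; the secret name, key, and token value are placeholders:
oc create secret generic my-secret --from-literal=hf-token=<your-hf-token>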
Custom Unitxt Card
Pass a custom Unitxt Card in JSON format:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- template: "templates.classification.multi_class.relation.default"
card:
custom: |
{
"__type__": "task_card",
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
"preprocess_steps": [
{
"__type__": "split_random_mix",
"mix": {
"train": "train[95%]",
"validation": "train[5%]",
"test": "validation"
}
},
{
"__type__": "rename",
"field": "sentence1",
"to_field": "text_a"
},
{
"__type__": "rename",
"field": "sentence2",
"to_field": "text_b"
},
{
"__type__": "map_instance_values",
"mappers": {
"label": {
"0": "entailment",
"1": "not entailment"
}
}
},
{
"__type__": "set",
"fields": {
"classes": [
"entailment",
"not entailment"
]
}
},
{
"__type__": "set",
"fields": {
"type_of_relation": "entailment"
}
},
{
"__type__": "set",
"fields": {
"text_a_type": "premise"
}
},
{
"__type__": "set",
"fields": {
"text_b_type": "hypothesis"
}
}
],
"task": "tasks.classification.multi_class.relation",
"templates": "templates.classification.multi_class.relation.all"
}
logSamples: true
Inside the custom card, the HuggingFace dataset loader is used:
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
Using PVCs as storage
To use a PVC as storage for the LMEvalJob results, two modes are currently supported: managed PVCs and existing PVCs.
Managed PVCs, as the name implies, are managed by the TrustyAI operator. To enable a managed PVC, simply specify its size:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs: (1)
pvcManaged: (2)
size: 5Gi (3)
1 | outputs is the section for specifying custom storage locations |
2 | pvcManaged will create an operator-managed PVC |
3 | size (compatible with standard PVC syntax) is the only supported value |
This will create a PVC named <job-name>-pvc
(in this case evaljob-sample-pvc
) which will be available after the job finishes, but will be deleted when the LMEvalJob
is deleted.
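Once the job has run, the managed PVC should be visible in the namespace, for example:
oc get pvc evaljob-sample-pvc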
To use an already existing PVC you can pass its name as a reference.
The PVC must already exist when the LMEvalJob
is created. Start by creating a PVC, for instance:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "my-pvc"
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
And then reference it from the LMEvalJob
:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs:
pvcName: "my-pvc" (1)
1 | pvcName references the already existing PVC my-pvc . |
In this case, the PVC is not managed by the TrustyAI operator, so it will be available even after deleting the LMEvalJob
.
In the case where both managed and existing PVCs are referenced in outputs
, the TrustyAI operator will prefer the managed PVC and ignore the existing one.
Using an InferenceService
This example assumes a vLLM model is already deployed in your cluster.
Define your LMEvalJob CR:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob
spec:
model: local-completions
taskList:
taskNames:
- mmlu
logSamples: true
batchSize: 1
modelArgs:
- name: model
value: granite
- name: base_url
value: $ROUTE_TO_MODEL/v1/completions (1)
- name: num_concurrent
value: "1"
- name: max_retries
value: "3"
- name: tokenized_requests
value: "False"
- name: tokenizer
value: ibm-granite/granite-7b-instruct
pod:
container:
env:
- name: OPENAI_API_KEY (2)
valueFrom:
secretKeyRef: (3)
name: <secret-name> (4)
key: token (5)
1 | base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL. |
2 | OPENAI_API_KEY values are passed directly to remote model servers, so they can also be used as general authentication bearer tokens. |
3 | env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretRef.name should be the secret’s name in the namespace, while secretRef.key should point at the token’s key within the secret. |
4 | secretKeyRef.name should match the name of an existing secret in your namespace, for example one listed by oc get secrets. |
5 | secretKeyRef.key should equal the name of the field holding the token value, in this example token. |
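To find the value for $ROUTE_TO_MODEL, you can look up the URL of the deployed InferenceService. A sketch, assuming the InferenceService is named granite:
oc get inferenceservice granite -o jsonpath='{.status.url}'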
Then, apply this CR in the same namespace as your model. You should see a pod named evaljob spin up in the model namespace. In the pod's terminal, you can follow the output via tail -f output/stderr.log
Using GPUs
Typically, when using an Inference Service, GPU acceleration is performed at the model server level. However, when using local mode, i.e. running the evaluation locally in the LMEval job's pod, you might want to use available GPUs. To do so, add a resource configuration directly to the job's definition:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskNames:
- "qnlieu"
logSamples: true
allowOnline: true
allowCodeExecution: true
pod: (1)
container:
resources:
limits: (2)
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
1 | The pod section allows adding specific resource definitions to the LMEval Job. |
2 | In this case we are adding cpu: 1 , memory: 8Gi and nvidia.com/gpu: 1 , but these can be adjusted to your cluster’s availability. |
Integration with Kueue
TrustyAI and LM-Eval do not require Kueue to work.
However, if Kueue is available on the cluster, it can be used from LM-Eval.
To enable Kueue on Open Data Hub, add the following to your DataScienceCluster resource:
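A sketch of the relevant part of the DataScienceCluster spec, assuming the default resource name default-dsc (the component layout may differ between releases):
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    kueue:
      managementState: Managed
    trustyai:
      managementState: Managed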
To enable job suspension for the Kueue integration, create a job in the suspended state, then verify that the job is suspended and that its pod is not running.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
labels:
app.kubernetes.io/name: fms-lm-eval-service
name: evaljob-sample
spec:
suspend: true (1)
model: hf
modelArgs:
- name: pretrained
value: EleutherAI/pythia-70m
taskList:
taskNames:
- unfair_tos
logSamples: true
limit: "5"
1 | This will set the LM-Eval job’s state as suspended |
Set suspend to false and verify that the job's pod is created and running:
oc patch lmevaljob evaljob-sample --patch '{"spec":{"suspend":false}}' --type merge
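After patching, the job should leave the suspended state and its pod should appear, for example:
oc get lmevaljob evaljob-sample
oc get pods | grep evaljob-sample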
Useful Commands & References
List all available tasks
As mentioned above, LMEvalJob supports running both Unitxt tasks (via recipes or custom JSON) and the built-in tasks that are available out-of-the-box with lm-evaluation-harness. To see a list of the available built-in tasks, run the following command, replacing <<LMES-POD-IMAGE>> with the name of a pod running the LM-Eval pod image from the global settings above:
oc exec <<LMES-POD-IMAGE>> -- bash -c "lm_eval --tasks list"
It is recommended to pipe this output to a file as there are several thousand built-in tasks.
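For example, with the pod name as a placeholder:
oc exec <<LMES-POD-IMAGE>> -- bash -c "lm_eval --tasks list" > lm-eval-tasks.txt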
Note: the output of the task list command will include some Unitxt tasks that have been directly contributed to lm-evaluation-harness
. You can run these tasks by specifying the task name in the taskList
property of the job definition as you would with any built-in task.