Getting started with LM-Eval

LM-Eval is a service for large language model evaluation underpinned by two open-source projects: lm-evaluation-harness and Unitxt. LM-Eval is integrated into the TrustyAI Kubernetes Operator. In this tutorial, you will learn:

  • How to create an LMEvalJob CR to kick off an evaluation job and get the results

LM-Eval is available starting with TrustyAI's 1.28.0 community builds. In order to use it on Open Data Hub, you need either ODH 2.20 (or newer) or to add the following devFlag to your DataScienceCluster resource:

trustyai:
  devFlags:
    manifests:
      - contextDir: config
        sourcePath: ''
        uri: https://github.com/trustyai-explainability/trustyai-service-operator/tarball/main
  managementState: Managed
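
A minimal way to do this (a sketch; the DataScienceCluster name default-dsc is an assumption, check yours with oc get datasciencecluster) is to edit the resource and place the snippet above under spec.components:

oc edit datasciencecluster default-dsc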

Global settings for LM-Eval

There are several configurable global settings for the LM-Eval service. They are stored in the TrustyAI operator's global ConfigMap, trustyai-service-operator-config, located in the same namespace as the operator. Here is a list of the LM-Eval properties:

lmes-detect-device (true/false)
Detect whether one or more GPUs are available and assign the proper value for the --device argument of lm-evaluation-harness: cuda if a GPU is found, otherwise cpu.

lmes-pod-image (default: quay.io/trustyai/ta-lmes-job:latest)
The image for the LM-Eval job. The image contains the necessary Python packages for lm-evaluation-harness and Unitxt.

lmes-driver-image (default: quay.io/trustyai/ta-lmes-driver:latest)
The image for the LM-Eval driver. Check the cmd/lmes_driver directory for detailed information about the driver.

lmes-image-pull-policy (default: Always)
The image-pulling policy when running the evaluation job.

lmes-default-batch-size (default: 8)
The default batch size when invoking the model inference API. This only works for local models.

lmes-max-batch-size (default: 24)
The maximum batch size that users can specify in an evaluation job.

lmes-pod-checking-interval (default: 10s)
The interval at which an evaluation job's pod is checked.

After updating the settings in the ConfigMap, the new values take effect only after the operator restarts.
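
For example, a minimal sketch for changing the default batch size, assuming the operator runs in the opendatahub namespace and its deployment is named trustyai-service-operator-controller-manager:

oc patch configmap trustyai-service-operator-config -n opendatahub \
  --type merge -p '{"data":{"lmes-default-batch-size":"4"}}'

oc rollout restart deployment trustyai-service-operator-controller-manager -n opendatahub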

LMEvalJob

The LM-Eval service defines a new Custom Resource Definition called LMEvalJob. LMEvalJob objects are monitored by the TrustyAI Kubernetes operator, and each LMEvalJob object represents an evaluation job. Therefore, to run an evaluation job, you create an LMEvalJob object with the necessary information, including the model, model arguments, task, secrets, etc. Once the LMEvalJob is created, the LM-Eval service runs the evaluation job and updates the LMEvalJob object's status and results as the information becomes available.

Here is an example of an LMEvalJob object:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base (1)
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli" (2)
      template: "templates.classification.multi_class.relation.default" (3)
  logSamples: true
1 In this example, the pre-trained google/flan-t5-base model from Hugging Face is used (model: hf).
2 The dataset comes from the wnli subset of the General Language Understanding Evaluation (GLUE) benchmark. You can find the details of the Unitxt card wnli here.
3 It also specifies the multi_class.relation task from Unitxt, whose default metrics are f1_micro, f1_macro, and accuracy.
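
To apply it, save the manifest to a file (for example evaljob-sample.yaml) and run:

oc apply -f evaljob-sample.yaml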

After you apply the example LMEvalJob above, you can check its state by using the following command:

oc get lmevaljob evaljob-sample

The output will look similar to this:

NAME             STATE
evaljob-sample   Running

When its state becomes Complete, the evaluation results are available. Both the model and dataset in this example are small, so the evaluation job should be able to finish within 10 minutes on a CPU-only node.
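
If you prefer to block until the job finishes, here is a sketch using oc wait (assuming the state is exposed at .status.state):

oc wait --for=jsonpath='{.status.state}'=Complete lmevaljob/evaljob-sample --timeout=600s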

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

Here are the example results:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}

The f1_micro, f1_macro, and accuracy scores are 0.56, 0.36, and 0.56, respectively. The full results are stored in the .status.results field of the LMEvalJob object as a JSON document. The command above retrieves only the results field of that JSON document.
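
Since the results are a plain JSON document, you can also drill into a single metric with jq, for example the f1_micro score of the tr_0 entry:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results.tr_0."f1_micro,none"'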

Details of LMEvalJob

In this section, let’s review each property in the LMEvalJob and its usage.

Parameter Description

model

Specify which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. Supported model types and providers include:

  • hf: HuggingFace models

  • openai-completions: OpenAI Completions API models

  • openai-chat-completions: ChatCompletions API models

  • local-completions and local-chat-completions: OpenAI API-compatible servers

  • textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Each model type or provider supports different arguments; refer to the lm-evaluation-harness documentation for the arguments each one accepts.

taskList.taskNames

Specify a list of tasks supported by lm-evaluation-harness.

taskList.taskRecipes

Specify the task using the Unitxt recipe format; a sketch combining several of the optional fields below follows this parameter list:

  • card: Use the name to specify a Unitxt card or custom for a custom card

    • name: Specify a Unitxt card from the Unitxt catalog. Use the card's ID as the value. For example, the ID of the wnli card is cards.wnli.

    • custom: Define and use a custom card. The value is a JSON string for a custom Unitxt card that contains the custom dataset. Use the documentation here to compose a custom card, store it as a JSON file, and use the JSON content as the value here. If the dataset used by the custom card needs an API key from an environment variable or a persistent volume, you have to set up the corresponding resources under the pod field. Check the pod field below.

  • template: Specify a Unitxt template from the Unitxt catalog. Use the template’s ID as the value.

  • task (optional): Specify a Unitxt task from the Unitxt catalog. Use the task's ID as the value. A Unitxt card has a pre-defined task; only specify a value here if you want to run a different task.

  • metrics (optional): Specify a list of Unitxt metrics from the Unitxt catalog. Use the metric's ID as the value. A Unitxt task has a set of pre-defined metrics; only specify a set of metrics if you need different ones.

  • format (optional): Specify a Unitxt format from the Unitxt catalog. Use the format’s ID as the value.

  • loaderLimit (optional): Specifies the maximum number of instances per stream to be returned from the loader (used to reduce loading time in large datasets).

  • numDemos (optional): Number of few-shot examples to be used.

  • demosPoolSize (optional): Size of the few-shot pool.

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, don't use this field; use numDemos under taskRecipes instead.

limit

Instead of running on the whole dataset, set a limit to run each task on a subset of it. Accepts an integer (number of instances) or a float between 0.0 and 1.0 (fraction of the dataset).

genArgs

Maps to the --gen_kwargs parameter of lm-evaluation-harness. See the lm-evaluation-harness documentation for details.

logSamples

If this is set to true, the model's outputs, and the text fed into the model, will be saved at per-document granularity.

batchSize

Batch size for the evaluation. The auto:N batch size is not supported for API-based model types; only integer batch sizes are supported at the moment.

pod

Specify extra information for the lm-eval job’s pod.

  • container: Extra container settings for the lm-eval container.

    • env: Specify environment variables. It uses the EnvVar data structure of Kubernetes.

    • volumeMounts: Mount the volumes into the lm-eval container.

    • resources: Specify the resources for the lm-eval container.

  • volumes: Specify the volume information for the lm-eval and other containers. It uses the Volume data structure of Kubernetes.

  • sideCars: A list of containers that run along with the lm-eval container. It uses the Container data structure of Kubernetes.

outputs

This section defines custom output locations for storing the evaluation results. At the moment, only Persistent Volume Claims (PVCs) are supported.

outputs.pvcManaged

Create an operator-managed PVC to store this job’s results. The PVC will be named <job-name>-pvc and will be owned by the LMEvalJob. After job completion, the PVC will still be available, but it will be deleted upon deleting the LMEvalJob. Supports the following fields:

  • size: The PVC’s size, compatible with standard PVC syntax (e.g. 5Gi)

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.
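
Here is a sketch of how several of the optional taskRecipes fields described above fit together (the numeric values are arbitrary examples, not recommendations):

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
      numDemos: 3          # number of few-shot examples placed in context
      demosPoolSize: 20    # size of the pool the few-shot examples are drawn from
      loaderLimit: 500     # cap on the number of instances loaded per stream
  logSamples: true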

Examples

Environment Variables

If the LMEvalJob needs to access a model on Hugging Face with an access token, you can set up HF_TOKEN as one of the environment variables for the lm-eval container:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: huggingfacespace/model
  taskList:
    taskNames:
    - unfair_tos
  logSamples: true
  pod:
    container:
      env: (1)
      - name: HF_TOKEN
        value: "My HuggingFace token"
1 spec.pod.container.env fields are passed directly to the LMEvalJob's container as environment variables.

Or you can create a secret to store the token and reference its key from the secret object using the secretKeyRef syntax:

(only the env part is shown)

      env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: my-secret
            key: hf-token
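
The referenced secret can be created beforehand, for example:

oc create secret generic my-secret --from-literal=hf-token=<your-hugging-face-token>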

Custom Unitxt Card

Pass a custom Unitxt Card in JSON format:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - template: "templates.classification.multi_class.relation.default"
      card:
        custom: |
          {
            "__type__": "task_card",
            "loader": {
              "__type__": "load_hf",
              "path": "glue",
              "name": "wnli"
            },
            "preprocess_steps": [
              {
                "__type__": "split_random_mix",
                "mix": {
                  "train": "train[95%]",
                  "validation": "train[5%]",
                  "test": "validation"
                }
              },
              {
                "__type__": "rename",
                "field": "sentence1",
                "to_field": "text_a"
              },
              {
                "__type__": "rename",
                "field": "sentence2",
                "to_field": "text_b"
              },
              {
                "__type__": "map_instance_values",
                "mappers": {
                  "label": {
                    "0": "entailment",
                    "1": "not entailment"
                  }
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "classes": [
                    "entailment",
                    "not entailment"
                  ]
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "type_of_relation": "entailment"
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "text_a_type": "premise"
                }
              },
              {
                "__type__": "set",
                "fields": {
                  "text_b_type": "hypothesis"
                }
              }
            ],
            "task": "tasks.classification.multi_class.relation",
            "templates": "templates.classification.multi_class.relation.all"
          }
  logSamples: true

Inside the custom card, the HuggingFace dataset loader is used:

            "loader": {
              "__type__": "load_hf",
              "path": "glue",
              "name": "wnli"
            },

You can use other loaders together with the volumes and volumeMounts fields to mount a dataset from a persistent volume. For example, if you use LoadCSV, you need to mount the files into the container so that the dataset is accessible to the evaluation process, as shown in the sketch below.
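
A minimal sketch of the pod section for such a setup (the PVC name my-dataset-pvc and the mount path are hypothetical; the custom card's loader would then reference file paths under that mount path):

  pod:
    container:
      volumeMounts:
      - name: dataset-volume               # mount the dataset into the lm-eval container
        mountPath: /opt/app-root/src/my-data
    volumes:
    - name: dataset-volume
      persistentVolumeClaim:
        claimName: my-dataset-pvc          # pre-existing PVC holding the CSV files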

Using PVCs as storage

To use a PVC as storage for the LMEvalJob results, two modes are supported at the moment: managed and existing PVCs.

Managed PVCs, as the name implies, are managed by the TrustyAI operator. To enable a managed PVC simply specify its size:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # other fields omitted ...
  outputs: (1)
    pvcManaged: (2)
      size: 5Gi (3)
1 outputs is the section for specifying custom storage locations
2 pvcManaged will create an operator-managed PVC
3 size (compatible with standard PVC syntax) is the only supported field

This will create a PVC named <job-name>-pvc (in this case evaljob-sample-pvc) which will be available after the job finishes, but will be deleted when the LMEvalJob is deleted.

To use an already existing PVC you can pass its name as a reference. The PVC must already exist when the LMEvalJob is created. Start by creating a PVC, for instance:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "my-pvc"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

And then reference it from the LMEvalJob:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  # other fields omitted ...
  outputs:
    pvcName: "my-pvc" (1)
1 pvcName references the already existing PVC my-pvc.

In this case, the PVC is not managed by the TrustyAI operator, so it will be available even after deleting the LMEvalJob.

In the case where both managed and existing PVCs are referenced in outputs, the TrustyAI operator will prefer the managed PVC and ignore the existing one.

Using an InferenceService

This example assumes a vLLM model is already deployed in your cluster.

Define your LMEvalJob CR

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob
spec:
  model: local-completions
  taskList:
    taskNames:
      - mmlu
  logSamples: true
  batchSize: 1
  modelArgs:
    - name: model
      value: granite
    - name: base_url
      value: $ROUTE_TO_MODEL/v1/completions (1)
    - name: num_concurrent
      value:  "1"
    - name: max_retries
      value:  "3"
    - name: tokenized_requests
      value: "False"
    - name: tokenizer
      value: ibm-granite/granite-7b-instruct
  pod:
    container:
      env:
       - name: OPENAI_API_KEY (2)
         valueFrom:
              secretKeyRef: (3)
                name: <secret-name> (4)
                key: token (5)
1 base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL.
2 OPENAI_API_KEY values are passed directly to remote model servers, so they can also be used as general authentication bearer tokens.
3 env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretKeyRef.name should be the secret's name in the namespace, while secretKeyRef.key should point at the token's key within the secret.
4 secretKeyRef.name can equal the output of
oc get secrets -o custom-columns=SECRET:.metadata.name --no-headers | grep user-one-token
5 secretKeyRef.key should equal the name of the field holding the token value, in this example token

Then, apply this CR to the same namespace as your model. You should see a pod named evaljob spin up in your model's namespace. From the pod's terminal, you can follow the output via tail -f output/stderr.log.
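
A sketch of those steps, assuming the CR is saved as evaljob.yaml and your model lives in the my-model namespace:

oc apply -f evaljob.yaml -n my-model
oc rsh -n my-model evaljob      # open a shell in the job pod
tail -f output/stderr.log       # run this inside the pod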

Integration with Kueue

TrustyAI and LM-Eval do not require Kueue to work. However, if Kueue is available on the cluster, LM-Eval can make use of it. To enable Kueue on Open Data Hub, add the following to your DataScienceCluster resource:

kueue:
  managementState: Managed

To enable job suspension for the Kueue integration, create a job in a suspended state. Then verify that the job is in the suspended state and that its pod is not running:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  labels:
    app.kubernetes.io/name: fms-lm-eval-service
  name: evaljob-sample
spec:
  suspend: true (1)
  model: hf
  modelArgs:
  - name: pretrained
    value: EleutherAI/pythia-70m
  taskList:
    taskNames:
    - unfair_tos
  logSamples: true
  limit: "5"
1 This will set the LM-Eval job's state to suspended
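
To verify, check the job's state and confirm that no pod has been created for it:

oc get lmevaljob evaljob-sample
oc get pods | grep evaljob-sample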

Set suspend to false and verify that the job's pod gets created and runs:

oc patch lmevaljob evaljob-sample --patch '{"spec":{"suspend":false}}' --type merge