Getting started with LM-Eval
LM-Eval is a service for large language model evaluation underpinned by two open-source projects: lm-evaluation-harness and Unitxt. LM-Eval is integrated into the TrustyAI Kubernetes Operator. In this tutorial, you will learn:
-
How to create an
LMEvalJob
CR to kick off an evaluation job and get the results
LM-Eval is only available since TrustyAI’s 1.28.0 community builds.
In order to use it on Open Data Hub, you need to use either ODH 2.20 (or newer) or add the following
|
Global settings for LM-Eval
There are some configurable global settings for LM-Eval services and they are stored in the TrustyAI’s operator global ConfigMap
, trustyai-service-operator-config
, located in the same namespace as the operator. Here is a list of properties for LM-Eval:
Setting | Default | Description |
---|---|---|
|
|
Detect if there is available GPUs or not and assign the proper value for |
|
|
The image for the LM-Eval job. The image contains the necessary Python packages for lm-evaluation-harness and Unitxt. |
|
|
The image for the LM-Eval driver. Check |
|
|
The image-pulling policy when running the evaluation job. |
|
|
The default batch size when invoking the model inference API. This only works for local models. |
|
|
The maximum batch size that users can specify in an evaluation job. |
|
|
The interval to check the job pod for an evaluation job. |
|
|
Whether LMEval jobs can set the online mode on. |
|
|
Whether LMEval jobs can set the trust remote code mode on. |
After updating the settings in the ConfigMap
, the new values only take effect when the operator restarts.
LMEvalJob
LM-Eval service defines a new Custom Resource Definition called: LMEvalJob
. LMEvalJob
objects are monitored by the TrustyAI Kubernetes operator. An LMEvalJob object represents an evaluation job. Therefore, to run an evaluation job, you need to create an LMEvalJob
object with the needed information including model, model arguments, task, secret, etc. Once the LMEvalJob
is created, the LM-Eval service will run the evaluation job and update the status and results to the LMEvalJob
object when the information is available.
Here is an example of an LMEvalJob
object:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
allowOnline: true
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base (1)
taskList:
taskRecipes:
- card:
name: "cards.wnli" (2)
template: "templates.classification.multi_class.relation.default" (3)
logSamples: true
1 | In this example, it uses the pre-trained google/flan-t5-base model from Hugging Face (model: hf) |
2 | The dataset is from the wnli subset of the General Language Understanding Evaluation (GLUE). You can find the details of the Unitxt card wnli here. |
3 | It also specifies the multi_class.relation task from Unitxt and its default metrics are f1_micro , f1_macro , and accuracy . |
After you apply the example LMEvalJob
above, you can check its state by using the following command:
oc get lmevaljob evaljob-sample
The output would be like:
NAME STATE
evaljob-sample Running
When its state becomes Complete
, the evaluation results will be available. Both the model and dataset in this example are small. The evaluation job would be able to finish within 10 minutes on a CPU-only node.
Use the following command to get the results:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
-o template --template={{.status.results}} | jq '.results'
Here are the example results:
{
"tr_0": {
"alias": "tr_0",
"f1_micro,none": 0.5633802816901409,
"f1_micro_stderr,none": "N/A",
"accuracy,none": 0.5633802816901409,
"accuracy_stderr,none": "N/A",
"f1_macro,none": 0.36036036036036034,
"f1_macro_stderr,none": "N/A"
}
}
The f1_micro
, f1_macro
, and accuracy
scores are 0.56, 0.36, and 0.56. The full results are stored in the .status.results
of the LMEvalJob
object as a JSON document. The command above only retrieves the results
field of the JSON document.
Details of LMEvalJob
In this section, let’s review each property in the LMEvalJob and its usage.
Parameter | Description |
---|---|
|
Specify which model type or provider is evaluated. This field directly maps to the
|
|
A list of paired name and value arguments for the model type. Each model type or provider supports different arguments:
|
|
Specify a list of tasks supported by lm-evaluation-harness. |
|
Specify the task using the Unitxt recipe format:
|
|
Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, don’t use this field. Use |
|
Instead of running the whole dataset, set a limit to run the tasks. Accepts an integer, or a float between 0.0 and 1.0. |
|
Map to |
|
If this flag is passed, then the model’s outputs, and the text fed into the model, will be saved at per-document granularity. |
|
Batch size for the evaluation. The |
|
Specify extra information for the lm-eval job’s pod.
|
|
This sections defines custom output locations for the evaluation results storage. At the moment only Persistent Volume Claims (PVC) are supported. |
|
Create an operator-managed PVC to store this job’s results. The PVC will be named
|
|
Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job. |
|
If set to |
|
If set to |
|
Mount a PVC as the local storage for models and datasets. |
Examples
Environment Variables
If the LMEvalJob needs to access a model on HuggingFace with the access token, you can set up the HF_TOKEN
as one of the environment variables for the lm-eval container:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: huggingfacespace/model
taskList:
taskNames:
- unfair_tos
logSamples: true
pod:
container:
env: (1)
- name: HF_TOKEN
value: "My HuggingFace token"
1 | spec.pod.env fields are passed directly to the LMEvalJob’s container as environment variables. |
Or you can create a secret to store the token and refer the key from the secret object using the reference syntax:
(only attach the env part)
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: my-secret
key: hf-token
Custom Unitxt Card
Pass a custom Unitxt Card in JSON format:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskRecipes:
- template: "templates.classification.multi_class.relation.default"
card:
custom: |
{
"__type__": "task_card",
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
"preprocess_steps": [
{
"__type__": "split_random_mix",
"mix": {
"train": "train[95%]",
"validation": "train[5%]",
"test": "validation"
}
},
{
"__type__": "rename",
"field": "sentence1",
"to_field": "text_a"
},
{
"__type__": "rename",
"field": "sentence2",
"to_field": "text_b"
},
{
"__type__": "map_instance_values",
"mappers": {
"label": {
"0": "entailment",
"1": "not entailment"
}
}
},
{
"__type__": "set",
"fields": {
"classes": [
"entailment",
"not entailment"
]
}
},
{
"__type__": "set",
"fields": {
"type_of_relation": "entailment"
}
},
{
"__type__": "set",
"fields": {
"text_a_type": "premise"
}
},
{
"__type__": "set",
"fields": {
"text_b_type": "hypothesis"
}
}
],
"task": "tasks.classification.multi_class.relation",
"templates": "templates.classification.multi_class.relation.all"
}
logSamples: true
Inside the custom card, it uses the HuggingFace dataset loader:
"loader": {
"__type__": "load_hf",
"path": "glue",
"name": "wnli"
},
Using PVCs as storage
To use a PVC as storage for the LMEvalJob
results, there are two supported modes, at the moment, managed and existing PVCs.
Managed PVCs, as the name implies, are managed by the TrustyAI operator. To enable a managed PVC simply specify its size:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs: (1)
pvcManaged: (2)
size: 5Gi (3)
1 | outputs is the section for specifying custom storage locations |
2 | pvcManaged will create an operator-managed PVC |
3 | size (compatible with standard PVC syntax) is the only supported value |
This will create a PVC named <job-name>-pvc
(in this case evaljob-sample-pvc
) which will be available after the job finishes, but will be deleted when the LMEvalJob
is deleted.
To use an already existing PVC you can pass its name as a reference.
The PVC must already exist when the LMEvalJob
is created. Start by creating a PVC, for instance:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: "my-pvc"
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
And then reference it from the LMEvalJob
:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
# other fields omitted ...
outputs:
pvcName: "my-pvc" (1)
1 | pvcName references the already existing PVC my-pvc . |
In this case, the PVC is not managed by the TrustyAI operator, so it will be available even after deleting the LMEvalJob
.
In the case where both managed and existing PVCs are referenced in outputs
, the TrustyAI operator will prefer the managed PVC and ignore the existing one.
Using an InferenceService
This example assumes vLLM model already deployed in your cluster. |
Define your LMEvalJob CR
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob
spec:
model: local-completions
taskList:
taskNames:
- mmlu
logSamples: true
batchSize: 1
modelArgs:
- name: model
value: granite
- name: base_url
value: $ROUTE_TO_MODEL/v1/completions (1)
- name: num_concurrent
value: "1"
- name: max_retries
value: "3"
- name: tokenized_requests
value: "False"
- name: tokenizer
value: ibm-granite/granite-7b-instruct
pod:
container:
env:
- name: OPENAI_API_KEY (2)
valueFrom:
secretKeyRef: (3)
name: <secret-name> (4)
key: token (5)
1 | base_url should be set to the route/service URL of your model. Make sure to include the /v1/completions endpoint in the URL. |
2 | OPENAI_API_KEY values are passed directly to remote model servers, so they can also be used as general authentication bearer tokens. |
3 | env.valueFrom.secretKeyRef.name should point to a secret that contains a token that can authenticate to your model. secretRef.name should be the secret’s name in the namespace, while secretRef.key should point at the token’s key within the secret. |
4 | secretKeyRef.name can equal the output of
|
5 | secretKeyRef.key should equal field name holding the token value, in this example token |
Then, apply this CR into the same namespace as your model. You should see a pod spin up in your
model namespace called evaljob
. In the pod terminal, you can see the output via tail -f output/stderr.log
Using GPUs
Typically, when using an Inference Service, GPU acceleration will be performed at the model server level. However, when using local mode, i.e. running the evaluation locally on the LMEval Job, you might want to use available GPUs. To do so, we can add a resource configuration directly on the job’s definition:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
name: evaljob-sample
spec:
model: hf
modelArgs:
- name: pretrained
value: google/flan-t5-base
taskList:
taskNames:
- "qnlieu"
logSamples: true
allowOnline: true
allowCodeExecution: true
pod: (1)
container:
resources:
limits: (2)
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
requests:
cpu: '1'
memory: 8Gi
nvidia.com/gpu: '1'
1 | The pod section allows adding specific resource definitions to the LMEval Job. |
2 | In this case we are adding cpu: 1 , memory: 8Gi and nvidia.com/gpu: 1 , but these can be adjusted to your cluster’s availability. |
Integration with Kueue
TrustyAI and LM-Eval do not require Kueue to work.
However, if Kueue is available on the cluster, it can be used from LM-Eval.
To enable Kueue on Open Data Hub, add the following to your
|
To Enable job suspend for Kueue integration, create a job in suspended state. Verify the job is in suspended state and the job’s pod is not running.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
labels:
app.kubernetes.io/name: fms-lm-eval-service
name: evaljob-sample
spec:
suspend: true (1)
model: hf
modelArgs:
- name: pretrained
value: EleutherAI/pythia-70m
taskList:
taskNames:
- unfair_tos
logSamples: true
limit: "5"
1 | This will set the LM-Eval job’s state as suspended |
Set suspend
to false
and verify job’s pod getting created and running:
oc patch lmevaljob evaljob-sample --patch '{"spec":{"suspend":false}}' --type merge