Getting Started with LMEval Llama Stack External Eval Provider

Prerequisites

  • Admin access to an OpenShift cluster

  • The TrustyAI operator installed in your OpenShift cluster

  • KServe set to Raw Deployment mode

  • A language model deployed on vLLM Serving Runtime in your OpenShift cluster

Overview

This tutorial demonstrates how to evaluate a language model using the LMEval Llama Stack External Eval Provider. You will learn how to:

  • Configure a Llama Stack server to use the LMEval Eval provider

  • Register a benchmark dataset

  • Run a benchmark evaluation job on a language model

Usage

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install the LMEval Llama Stack External Eval Provider from PyPI:

pip install llama-stack-provider-lmeval

Configuring the Llama Stack Server

Set the VLLM_URL and TRUSTYAI_LM_EVAL_NAMESPACE environment variables in your terminal. VLLM_URL should point to the v1/completions endpoint of your model's route, and TRUSTYAI_LM_EVAL_NAMESPACE should be the namespace where your model is deployed. For example:

export VLLM_URL=https://$(oc get $(oc get ksvc -o name | grep predictor) --template={{.status.url}})/v1/completions

export TRUSTYAI_LM_EVAL_NAMESPACE=$(oc project | cut -d '"' -f2)

Download the providers.d directory and the run.yaml file:

curl --create-dirs --output providers.d/remote/eval/trustyai_lmeval.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/providers.d/remote/eval/trustyai_lmeval.yaml

curl --create-dirs --output run.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/run.yaml

Start the Llama Stack server in a virtual environment:

llama stack run run.yaml --image-type venv

This starts a Llama Stack server, which listens on port 8321 by default.
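
(Optional) Before writing the evaluation script, you can confirm that the server is reachable and that the LMEval provider is registered. A minimal sketch using the Python client, assuming the default port and that llama-stack-client is installed in the same virtual environment:

from llama_stack_client import LlamaStackClient

# Connect to the local Llama Stack server (default port 8321).
client = LlamaStackClient(base_url="http://localhost:8321")

# The list of providers should include the trustyai_lmeval eval provider.
print(client.providers.list())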

Running an Evaluation

With the Llama Stack server running, create a Python script or Jupyter notebook to interact with the server and run an evaluation.

Import the necessary libraries and modules:

import logging
import os
import pprint
import subprocess
import time

Instantiate the Llama Stack Python client to interact with the running Llama Stack server:

BASE_URL = "http://localhost:8321"

def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=BASE_URL)

client = create_http_client()

Check the current list of available benchmarks:

benchmarks = client.benchmarks.list()

pprint.print(f"Available benchmarks: {benchmarks}")

Register ARC-Easy, a dataset of grade-school level, multiple-choice science questions, as a benchmark:

client.benchmarks.register(
    benchmark_id="trustyai_lmeval::arc_easy",
    dataset_id="trustyai_lmeval::arc_easy",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "tokenizer": "google/flan-t5-small",
        "tokenized_requests": False,
    },
)
LMEval comes with 100+ out-of-the-box datasets for evaluation, so feel free to experiment.
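
The registration call follows the same pattern for other tasks. As an illustration only (the task identifier below is an assumption; check the lm-evaluation-harness task list for the names your deployment supports), registering another benchmark might look like this:

# Illustrative sketch: "trustyai_lmeval::mmlu" and the tokenizer are
# assumptions -- substitute a task id and tokenizer that match your setup.
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::mmlu",
    dataset_id="trustyai_lmeval::mmlu",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "tokenizer": "google/flan-t5-small",
        "tokenized_requests": False,
    },
)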

Verify that the benchmark has been registered successfully:

benchmarks = client.benchmarks.list()

pprint.print(f"Available benchmarks: {benchmarks}")

Run a benchmark evaluation on your model:

job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::arc_easy",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256
            },
        },
        "num_examples": 1000,
     },
)

print(f"Starting job '{job.job_id}'")
The eval_candidate section specifies the model to be evaluated, in this case, "phi-3". Replace it with the name of your deployed model.
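
If you are unsure which identifier your deployed model was registered under, the client can list the models the server knows about. A minimal sketch (attribute names may vary across llama-stack-client versions):

# Print the identifiers of the models registered with the Llama Stack server,
# so you can pick the right value for eval_candidate["model"].
for model in client.models.list():
    print(model.identifier)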

Monitor the status of the evaluation job. The job will run asynchronously, so you can check its status periodically:

def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)

Once the job reports a completed status, retrieve the results of the evaluation job:

pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy").scores)
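
To work with the results programmatically rather than just printing them, you can walk the returned scores. A small sketch, assuming .scores maps each scoring function name to a result object carrying aggregated metrics:

# Retrieve the finished job once and print the aggregated metrics
# (e.g. accuracy for arc_easy) for each scoring function.
results = client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy")
for name, score in results.scores.items():
    print(name, score.aggregated_results)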

Additional Resources

  • This tutorial provides a high-level overview of how to use the LMEval Llama Stack External Eval Provider to evaluate language models. For a full end-to-end demo with explanations and output, please refer to the official demos.

  • If you have any questions or improvements to contribute, please feel free to open an issue or a pull request on the project’s GitHub repository.