Generative AI evaluation service overview

The Gen AI Evaluation Service in Vertex AI lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment, using your own evaluation criteria.

While leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI Evaluation Service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case.

Evaluation is important at every step of the Gen AI development process, including model selection, prompt engineering, and model customization. Gen AI evaluation is integrated within Vertex AI so you can launch evaluations and reuse them as needed.

Gen AI Evaluation Service capabilities

The Gen AI Evaluation Service can help you with the following tasks:

  • Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.

  • Generation settings: Tweak model parameters (like temperature) to optimize output for your needs.

  • Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.

  • Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.

  • RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.

  • Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.

Evaluation process

The Gen AI Evaluation Service lets you evaluate any Gen AI model or application against your evaluation criteria by following these steps:

  1. Define evaluation metrics:

    • Learn how to tailor model-based metrics to your business criteria.

    • Evaluate a single model (pointwise) or determine the winner when comparing two models (pairwise).

    • Include computation-based metrics for additional insights.

  2. Prepare your evaluation dataset:

    • Provide a dataset that reflects your specific use case.

  3. Run an evaluation:

    • Start from scratch, use a template, or adapt existing examples.

    • Define candidate models and create an EvalTask so you can reuse your evaluation logic across Vertex AI (a minimal end-to-end sketch follows these steps).

  4. View and interpret your evaluation results.
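
The following is a minimal sketch of these four steps using the Vertex AI SDK for Python. The project ID, dataset contents, experiment name, and metric choices are placeholder assumptions rather than required values:

  import pandas as pd
  import vertexai
  from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
  from vertexai.generative_models import GenerativeModel

  # Initialize the SDK with your own project and region (placeholders here).
  vertexai.init(project="your-project-id", location="us-central1")

  # Step 2: an evaluation dataset that reflects your use case (placeholder rows).
  eval_dataset = pd.DataFrame({
      "prompt": [
          "Summarize the following text: ...",
          "Summarize the following text: ...",
      ],
      "reference": ["...", "..."],
  })

  # Step 1: evaluation metrics -- a model-based metric plus a
  # computation-based metric for additional insight (example choices only).
  metrics = [
      MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
      "rouge_l_sum",
  ]

  # Step 3: define the candidate model and run the evaluation as an EvalTask.
  eval_task = EvalTask(
      dataset=eval_dataset,
      metrics=metrics,
      experiment="my-eval-experiment",  # placeholder experiment name
  )
  result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))

  # Step 4: view and interpret the results.
  print(result.summary_metrics)   # aggregate scores across the dataset
  print(result.metrics_table)     # per-instance scores and explanations

In this sketch, summary_metrics holds aggregate scores and metrics_table holds per-instance results; a pairwise comparison follows the same pattern with pairwise metrics and a baseline model.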

Use cases

  • Evaluate models:

    • Quick start: introduction to the Gen AI Evaluation Service SDK. (Vertex AI SDK for Python Notebook - Getting Started with Gen AI Evaluation Service SDK)

    • Evaluate and select first-party (1P) foundation models for your task. (Vertex AI SDK for Python Notebook - Evaluate and select first-party (1P) foundation models for your task)

    • Evaluate and select generative AI model settings: adjust temperature, output token limit, safety settings, and other generation configurations of Gemini models on a summarization task, then compare the evaluation results from different settings across several metrics. (Vertex AI SDK for Python Notebook - Compare different model parameter settings for Gemini)

    • Compare third-party (3P) open models and 3P models on Vertex AI Model Garden. (Coming soon)

    • Migrate from PaLM to Gemini with the Gen AI Evaluation Service SDK. This notebook evaluates PaLM and Gemini foundation models using multiple evaluation metrics to support a migration decision, and visualizes those metrics to highlight the strengths and weaknesses of each model so you can choose the one that best fits the specific requirements of your use case. (Vertex AI SDK for Python Notebook - Compare and migrate from PaLM to Gemini model)

  • Evaluate prompt templates:

    • Prompt engineering and prompt evaluation with the Gen AI Evaluation Service SDK. (Vertex AI SDK for Python Notebook - Evaluate and Optimize Prompt Template Design for Better Results)

  • Evaluate Gen AI applications:

    • Evaluate Gemini model tool use and function calling capabilities. (Vertex AI SDK for Python Notebook - Evaluate Gemini Model Tool Use)

    • Evaluate generated answers from Retrieval-Augmented Generation (RAG) for a question answering task with the Gen AI Evaluation Service SDK. (Vertex AI SDK for Python Notebook - Evaluate Generated Answers from Retrieval-Augmented Generation (RAG))

    • Evaluate LangChain applications. (Coming soon)

  • Metric customization (see the sketch after this list):

    • Customize model-based metrics and evaluate a generative AI model against your specific criteria, using either templated customization (predefined fields that help define your pointwise and pairwise model-based metrics) or full customization (complete control over the design of your pointwise and pairwise model-based metrics). (Vertex AI SDK for Python Notebook - Customize Model-based Metrics to evaluate a Gen AI model)

    • Evaluate generative AI models with your locally defined custom metric, and bring your own autorater model to perform model-based metric evaluation. (Vertex AI SDK for Python Notebook - Bring-Your-Own-Autorater using Custom Metric)

    • Define your own computation-based custom metric functions and use them for evaluation with the Gen AI Evaluation Service SDK. (Vertex AI SDK for Python Notebook - Bring your own computation-based Custom Metric)

  • Other topics:

    • Gen AI Evaluation Service SDK Preview-to-GA migration guide. This tutorial gives detailed guidance on migrating from the Preview version to the latest GA version of the Vertex AI SDK for Python for the Gen AI Evaluation Service, and shows two examples with the GA SDK: evaluating Retrieval-Augmented Generation (RAG) and comparing two models side by side (SxS). (Vertex AI SDK for Python Notebook - Gen AI Evaluation Service SDK Preview-to-GA Migration Guide)
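
As a rough illustration of the metric customization entries above, the sketch below defines a fully custom model-based pointwise metric and a computation-based custom metric with the Gen AI Evaluation Service SDK. The metric names, prompt template wording, and scoring logic are illustrative assumptions, not prescribed values:

  from vertexai.evaluation import CustomMetric, EvalTask, PointwiseMetric

  # A fully customized model-based metric: the autorater scores each response
  # against a prompt template you control. The criteria below are examples only.
  text_quality = PointwiseMetric(
      metric="custom_text_quality",  # illustrative metric name
      metric_prompt_template=(
          "Rate the fluency and clarity of the response on a scale from 1 to 5,\n"
          "then briefly explain your rating.\n\n"
          "Prompt: {prompt}\n"
          "Response: {response}"
      ),
  )

  # A computation-based custom metric: a Python function that receives one
  # dataset row and returns a score keyed by the metric name.
  def word_count(instance: dict) -> dict:
      return {"word_count": len(instance["response"].split())}

  word_count_metric = CustomMetric(name="word_count", metric_function=word_count)

  # Both kinds of custom metric can be passed to an EvalTask alongside
  # prebuilt metrics, for example (with eval_dataset as in the earlier sketch):
  # eval_task = EvalTask(dataset=eval_dataset, metrics=[text_quality, word_count_metric])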

Supported models

The Gen AI Evaluation Service supports Google's foundation models, third-party models, and open models. You can provide pre-generated predictions directly, or generate candidate model responses automatically in the following ways:

  • Automatically generate responses for Google's foundation models (such as Gemini 1.5 Pro) and any model deployed in Vertex AI Model Registry.

  • Integrate with the text generation APIs of other third-party and open models through the SDK.

  • Wrap model endpoints from other providers using the Vertex AI SDK, as sketched below.
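
As a rough sketch of two of these options (assuming vertexai.init() has already been called as in the earlier sketch), the snippet below evaluates pre-generated responses by including a response column in the dataset, and wraps an external model endpoint in a plain Python function that the EvalTask calls for each prompt. The external provider call is a hypothetical placeholder for whatever client library your provider exposes:

  import pandas as pd
  from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

  # Option 1: bring your own responses. When the dataset already contains a
  # "response" column, evaluate() can run without a model argument.
  byor_dataset = pd.DataFrame({
      "prompt": ["Summarize: ...", "Summarize: ..."],
      "response": ["pre-generated answer 1", "pre-generated answer 2"],
  })
  byor_result = EvalTask(
      dataset=byor_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate()

  # Option 2: wrap another provider's endpoint in a function that maps a
  # prompt string to a response string.
  def generate_with_other_provider(prompt: str) -> str:
      # Replace this body with a call to your provider's client library,
      # for example: return other_provider_client.generate(prompt)  # hypothetical
      return "response from an external model"

  wrapped_result = EvalTask(
      dataset=pd.DataFrame({"prompt": ["Summarize: ...", "Summarize: ..."]}),
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate(model=generate_with_other_provider)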

What's next