This article explores evaluating Large Language Models with Giskard, focusing on fairness and ethics.
Introduction
Diving into the world of Large Language Models (LLMs) feels like stepping into a huge, uncharted library. Each book is packed with untold stories and knowledge waiting to be unlocked. But here’s the catch: not everything you find might be accurate or fair. That’s where Giskard comes in, like a trusted guide, helping us navigate through the maze of evaluating LLMs with clarity and precision.
In this guide, we’re doing more than just browsing; we’re getting to the heart of making sure our tech advances are both powerful and fair. Whether you’re a seasoned AI expert or just curious about how it all works, this journey into LLM evaluation with Giskard promises to be both fascinating and informative.
Challenges in LLM Evaluation
Evaluating Large Language Models (LLMs) poses a range of challenges due to their complexity, scale, and the broad spectrum of tasks they are applied to. To keep this section focused, we will concentrate on what I consider the three most common and important challenges faced when utilizing Large Language Models.
- Bias and Fairness: LLMs can inherit or even amplify biases present in their training data. Evaluating models for fairness and bias, particularly across diverse and global contexts, is complex. Ensuring that LLMs are equitable and do not perpetuate harmful stereotypes or discrimination requires robust and culturally sensitive evaluation frameworks.
- Safety and Robustness: Ensuring that LLMs are safe and robust against adversarial attacks, misinformation generation, and unintended harmful outputs is a critical challenge. Evaluating models for these aspects requires sophisticated testing methodologies that can anticipate and simulate potential misuse scenarios.
- Ethical Misuse: This includes using LLMs to generate harmful, biased, or misleading content, such as fake news, deepfakes, or content that promotes hate speech, violence, or discrimination. It raises concerns about the ethical implications of AI technologies and their potential for societal harm.
Understanding Giskard
Giskard is a tool designed to evaluate and test machine learning models, with a specific focus on facilitating the development and assessment of large language models (LLMs). It aims to streamline the process of validating model performance, ensuring models behave as expected, and identifying areas for improvement. To gain a clearer understanding of Giskard, let’s briefly outline the capabilities of this AI tool and what it can help us accomplish.
- Bias Detection: Giskard can analyze an LLM’s outputs to identify biases, whether they are related to gender, race, ethnicity, or other sensitive attributes. By detecting biases, developers can take steps to mitigate them, ensuring the model’s outputs are more equitable and representative of diverse perspectives.
- Fairness Evaluation: It assesses how fair the model’s responses are across different demographic groups or scenarios, helping to ensure that the LLM does not favor one group over another unjustly.
- Robustness Testing: It evaluates the model’s resilience against adversarial attacks or inputs designed to confuse or exploit the model. Ensuring robustness is key to maintaining the integrity and reliability of the model’s responses.
- Ethical Guidelines Adherence: Giskard can help ensure that LLMs adhere to ethical guidelines by detecting outputs that might violate predefined ethical standards. This is essential for maintaining the societal trust in AI technologies.
- Misuse Prevention: By identifying potential misuse cases, such as generating misleading information or impersonating individuals, Giskard helps developers design safeguards against such unethical uses of LLMs.
The primary objective of the Giskard tool is to identify and surface common issues encountered with Large Language Models (LLMs) and other machine learning models, presenting them in a scan report that can be inspected during testing.
Step-by-Step Guide to Evaluating LLMs with Giskard
In the practical segment of this article, we will demonstrate how to evaluate and test a Large Language Model, with a particular emphasis on OpenAI’s babbage-002, using the capabilities of Giskard.
Step 1: Installation of Necessary Libraries
!pip install "giskard[llm]" --upgrade
!pip install "langchain<=0.0.301" "pypdf<=3.17.0" "faiss-cpu<=1.7.4" "openai<=0.28.1" "tiktoken<=0.5.1"
This part involves installing the required Python libraries. Giskard is used for testing and evaluating machine learning models, langchain provides tools for working with language models and vector stores, pypdf for PDF document processing, faiss-cpu for efficient similarity search and clustering of dense vectors, openai for accessing OpenAI’s API, and tiktoken for tokenization purposes.
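As a quick sanity check after installation, you can print the installed versions and confirm they match the pins above. This is a minimal sketch, assuming the packages expose a standard __version__ attribute; the exact version strings will depend on your environment.

import giskard
import langchain

# Confirm the installed versions match the pinned requirements above
print(giskard.__version__, langchain.__version__)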
Step 2: Initialization and Environment Setup
import openai
import os

import pandas as pd
from langchain import OpenAI, FAISS, PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Set your OpenAI API key here
os.environ["OPENAI_API_KEY"] = "Insert Your OpenAI Key Here"

# With the pinned openai<=0.28.1 client, the key is set directly on the openai module
openai.api_key = os.environ["OPENAI_API_KEY"]
Here, necessary Python modules are imported, and the OpenAI API key is set up. This is crucial for authenticating requests sent to OpenAI’s API for embeddings and other operations.
Step 3: Prepare Vector Store
# Prepare vector store (FAISS) with the IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())
The IPCC report is loaded and split into smaller chunks to make it manageable. This is done using a PyPDFLoader for loading the PDF and a RecursiveCharacterTextSplitter for splitting the text based on character count, ensuring meaningful segments are created for retrieval.
Each text chunk is then embedded using OpenAIEmbeddings, which converts text into high-dimensional vectors that represent the semantic meaning of the text. These embeddings are stored in a FAISS index, an efficient structure for similarity search and retrieval.
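To get a feel for what ends up in the vector store, you can run a quick similarity search directly against the FAISS index. This is a minimal sketch; the query string is just an illustrative example, not part of the original tutorial.

# Retrieve the two chunks most similar to an example query and peek at their content
docs = db.similarity_search("How fast is sea level rising?", k=2)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:200])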
Step 4: Question Answering Chain Setup
# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""
In this step, we specify the LLM we plan to use, OpenAI’s babbage-002, and set some of its parameters, such as temperature.
We also craft a prompt template that instructs the LLM on how to use the provided context to answer questions. Finally, we integrate the LLM with the FAISS retriever, so that the system fetches the relevant document segments for each question before the LLM generates an answer.
llm = OpenAI(model="babbage-002", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

# Test that everything works
climate_qa_chain.run({"query": "Is sea level rise avoidable? When will it stop?"})
Step 5: Model Prediction Function
import giskard

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]
This defines a function that takes a DataFrame of questions, runs each one through the prepared QA chain, and returns the answers. The wrapper lets Giskard pass questions to the QA system as DataFrames, making batch processing straightforward.
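As a quick check, the wrapper can be called directly on a small DataFrame before handing it over to Giskard. The example question below is illustrative, not part of the original tutorial.

# Run the wrapper on a single-question DataFrame; it should return a list with one answer
sample = pd.DataFrame({"question": ["What are the main drivers of global warming?"]})
print(model_predict(sample))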
Step 6: Giskard Model and Dataset Preparation
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)
This creates a Giskard model instance, encapsulating the prediction function and metadata about the model for testing and evaluation.
Dataset Preparation: A pandas DataFrame is prepared with example questions. This dataset is then wrapped in a Giskard Dataset object, ready for evaluation.
Step 7: Execute Model Prediction and Analysis
examples = [
    "What is the scientific consensus on climate change?",
    "How does deforestation contribute to climate change?",
    "What are the impacts of climate change on biodiversity?",
]
giskard_dataset = giskard.Dataset(pd.DataFrame({"question": examples}), target=None)

print(giskard_model.predict(giskard_dataset).prediction)
The example questions are processed through the model to generate predictions (answers). Giskard’s scan function, applied in the next step, then evaluates the model on this dataset and generates a report, giving insight into how accurately and relevantly the model answers questions based on the IPCC report and where it can be improved.
Step 8: Running and Displaying the Giskard Scan
report = giskard.scan(giskard_model, giskard_dataset)
display(report)
The final step, as mentioned in the guide, involves using Giskard’s scan function to evaluate the performance of the model on the prepared dataset.
The scan function performs an automated evaluation of the model’s predictions against a dataset. It assesses how well the model performs on the provided examples, analyzing aspects such as accuracy, relevance, and, depending on the model type and evaluation setup, other metrics. It takes two primary arguments: the giskard_model and the giskard_dataset.
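Beyond displaying the report inline, the scan results can typically be exported or narrowed down. The sketch below assumes Giskard’s to_html export, its generate_test_suite helper, and the only filter for restricting the scan to specific detector groups; check the Giskard documentation for the exact options available in your version.

# Restrict the scan to a subset of detectors (assumed "only" filter), e.g. robustness checks
robustness_report = giskard.scan(giskard_model, giskard_dataset, only=["robustness"])

# Save the full report as a standalone HTML file for sharing
report.to_html("climate_qa_scan_report.html")

# Turn the detected issues into a reusable test suite and run it
test_suite = report.generate_test_suite("Climate QA test suite")
test_suite.run()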
Scan Results
The scan has unveiled two critical robustness issues. Both instances revolve around the model’s response inconsistency when presented with the same prompt, albeit with the addition of special characters by the Giskard tool. This inconsistency underscores a lack of robustness, revealing that the model struggles to handle slight variations in the prompt without yielding unexpected results. Such findings emphasize the importance of fortifying the model’s resilience to ensure consistent and reliable performance across diverse input scenarios.
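To see the kind of inconsistency the scan is pointing at, you can compare the chain’s answers for a question with and without a few extra special characters appended. This is a hand-rolled sketch of the perturbation idea; the exact characters Giskard injects may differ.

# Ask the same question twice, once with a small character-level perturbation appended
question = "Is sea level rise avoidable?"
perturbed = question + ' ~\u200b"'  # hypothetical noise characters, for illustration only

print(climate_qa_chain.run({"query": question}))
print(climate_qa_chain.run({"query": perturbed}))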
The report also provides a further explanation of each issue, generated by the tool itself.
The scan additionally flags prompts that are designed to make the model reproduce lengthy texts verbatim.
The accuracy and depth of the scan results improve substantially as the number of prompts analyzed increases: with a larger dataset, the detectors have more examples to aggregate and more patterns to pick up on, allowing a more detailed and accurate assessment of the model’s capabilities and performance.
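In practice, this simply means scanning with a larger question set. The extra questions below are hypothetical examples added for illustration.

# Extend the evaluation set with additional questions and re-run the scan
more_examples = examples + [
    "How do rising temperatures affect ocean currents?",
    "What mitigation strategies does the IPCC recommend?",
    "Which regions are most vulnerable to climate change?",
]
larger_dataset = giskard.Dataset(pd.DataFrame({"question": more_examples}), target=None)
larger_report = giskard.scan(giskard_model, larger_dataset)
display(larger_report)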
Conclusion
Evaluating Large Language Models (LLMs) is essential for ensuring they are reliable, fair, and beneficial to use. This process involves evaluating the LLMs for biases, examining their robustness against misinformation, and ensuring they adhere to ethical standards.
As we delve into this complex task, tools like Giskard emerge as valuable allies, providing the necessary frameworks and methodologies to facilitate a comprehensive evaluation. By leveraging such tools, we can enhance the integrity and societal value of LLMs, ensuring they contribute positively to our technological landscape.