A question answering system built on semantic search and an LLM is currently one of the most popular applications of LLM functionality. But what happens after we build it? How do we evaluate how well the QnA system works?

In this article I would like to cover the evaluation of a QnA system. I will describe several methods that I tried myself, and perhaps they will be useful for you as well.

Let’s start!

Evaluate the whole QnA system with a validation dataset

This is the first method that came to my mind. I generated the validation dataset with the help of colleagues who have knowledge of the domain area covered by the QnA system. It does not have to be a big dataset; in my case it was 30–35 questions. But here is the main trick: we can build the validation dataset in two ways:

  1. Set up pairs of a question and a full, consistent answer that we would like to see from our QnA system:

question = "Which document states do we have on our project?"

answer = "The first state is 'new', then we can change it to 'in progress' or 'postpone', and after that the final state will be 'success' or 'failed'."

  2. Set up pairs of questions and answers, but in this case the answer is just a list of entities/points that our QnA application should cover in its answer:

question = "Which document states do we have on our project?"

answer = “new, in progress, postpone, success, failed”

If we choose the first variant, we can evaluate the answer from the QnA system in two ways: with the ROUGE metric or with an LLM as an evaluator of the ground-truth and predicted answers. For the second variant only the LLM evaluator is applicable, with a small modification of the evaluation prompt.
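
For the ROUGE option, a minimal sketch could look like this (I use the rouge_score package here as an assumption; any ROUGE implementation will do):

from rouge_score import rouge_scorer  # pip install rouge-score

ground_truth = ("The first state is 'new', then we can change it to 'in progress' or 'postpone', "
                "and after that the final state will be 'success' or 'failed'.")
predicted = "Documents start as 'new', move to 'in progress' or 'postpone', and end as 'success' or 'failed'."

# ROUGE-L measures the longest-common-subsequence overlap between the two answers
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, predicted)
print(scores["rougeL"].fmeasure)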

I use the evaluation prompt from the Langchain library, but you can create a new one that suits you better.

So here is the prompt for the first case:

“You are a teacher grading a quiz.
You are given a question, the student’s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student’s answer here
TRUE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here”

For the second case, when we have only a set of entities, we should use another one:

“You are a teacher grading a quiz.
You are given a question, the student’s answer, and the points, that should be in the student answer, and are asked to score the student answer as either CORRECT or INCORRECT.

Example Format:
QUESTION: question here
STUDENT ANSWER: student’s answer here
POINTS IN THE ANSWER: true answer here
GRADE: CORRECT or INCORRECT here”
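
For the LLM-as-evaluator option, here is a minimal sketch with Langchain's QAEvalChain, which uses the first grading prompt above by default (the exact import paths and keys can differ between Langchain versions, so treat the details as assumptions):

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# Ground-truth pairs from the validation dataset and the answers produced by the QnA system
examples = [{"query": "Which document states do we have on our project?",
             "answer": "The first state is 'new', then 'in progress' or 'postpone', and finally 'success' or 'failed'."}]
predictions = [{"result": "A document starts as 'new', can move to 'in progress' or 'postpone', and ends as 'success' or 'failed'."}]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0))
graded = eval_chain.evaluate(examples, predictions,
                             question_key="query",
                             answer_key="answer",
                             prediction_key="result")
print(graded)  # each graded item contains CORRECT or INCORRECT; the exact output key depends on the version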

So here we are. This is our first method, which helps us evaluate the QnA system as a whole.

Evaluate the whole QnA system with an automatically generated validation dataset

This method is very similar to the first one, but with one important detail.

Here we ask an LLM to generate the validation dataset, using the following prompt, which you can also find in the Langchain library, particularly in its sub-project auto-evaluator (and they have a very comprehensive guide).

### Human

You are question-answering assistant tasked with answering questions based on the provided context.

Here is the question: \

{question}

Use the following pieces of context to answer the question at the end. Use three sentences maximum. \

{context}

### Assistant

Answer: Think step by step.

It is possible to evaluate the QnA application by yourself or with the auto-evaluator in Langchain. One small note: for now the auto-evaluator works only with the 'stuff' type of request to the LLM, where we put all relevant documents into one prompt. However, we can add custom embeddings and document stores to the auto-evaluator, which helps to simulate our custom QnA application and evaluate it almost automatically.
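
To make the idea more concrete, here is one possible way to auto-generate such question/answer pairs with Langchain's QAGenerationChain (this is my own sketch of the approach, not the exact pipeline used by the auto-evaluator; the file name is hypothetical):

from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the project documentation into chunks and ask the LLM for a question/answer pair per chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
chunks = splitter.split_text(open("project_docs.txt").read())

chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))
qa_pairs = []
for chunk in chunks[:10]:  # limit the size of the generated validation dataset
    qa_pairs.extend(chain.run(chunk))  # each item looks like {"question": ..., "answer": ...}

print(qa_pairs[0])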

Evaluate the semantic search part of a QnA system

For evaluating only the semantic search part of a QnA system we can use a variety of metrics, but before diving deeper into metrics, I would like to cover several methods of semantic search that I found interesting:

  1. Similarity search, using different distance measures such as cosine similarity, Euclidean distance, dot product, etc. This is the simplest approach.
  2. Maximal Marginal Relevance (MMR), a more complex method; I recommend reading the details in this source. The main idea is that besides similarity we also take the diversity of the retrieved documents into account (see the sketch right after this list).
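
Many vector stores in Langchain expose MMR directly; below is a minimal sketch with FAISS and OpenAI embeddings (both are my assumptions, any store with MMR support would work the same way):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = ["Document states on the project: new, in progress, postpone, success, failed.",
        "A document starts in the 'new' state and moves through the workflow.",
        "An unrelated note about the vacation policy."]

db = FAISS.from_texts(docs, OpenAIEmbeddings())

# MMR first fetches fetch_k candidates by similarity, then keeps k of them,
# balancing relevance to the query against diversity between the selected documents
results = db.max_marginal_relevance_search(
    "Which document states do we have on our project?", k=2, fetch_k=3)
for doc in results:
    print(doc.page_content)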

In my case I chose MMR as the method for finding the most relevant documents for the user's question. Then came the evaluation stage. I tried several metrics; I put a link to a full review of them here, and I definitely recommend having a look at it. Below you can find the list of metrics that I used:

  1. Mean Reciprocal Rank (MRR): it is based on the rank of the first relevant result, so it ignores the rest of the retrieved documents, which can be considered a drawback.
  2. Mean Average Precision (MAP): this metric takes all relevant results into account, but it works only with binary relevance and does not weight results by graded relevance.
  3. Normalized Discounted Cumulative Gain (NDCG): it is the cumulative gain of the top results, but discounted, where the value of the discount depends on the rank of a result.
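
To make these metrics concrete, here is a small pure-Python sketch that computes all three for a single query over a ranked list of retrieved documents (in practice you would average them over the whole validation dataset):

import math

def reciprocal_rank(relevances):
    # relevances: 1/0 flags of the ranked results; MRR averages this value over queries
    for i, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(relevances):
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

def ndcg(gains, k=None):
    # gains: graded relevance scores in ranked order; the discount grows with rank
    gains = gains[:k] if k else gains
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = sum(g / math.log2(i + 1) for i, g in enumerate(sorted(gains, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

ranked_relevance = [0, 1, 1, 0]           # which retrieved documents were actually relevant
print(reciprocal_rank(ranked_relevance))  # 0.5, the first relevant document is at rank 2
print(average_precision(ranked_relevance))
print(ndcg([0, 2, 1, 0]))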

So here is a short overview of semantic search metrics.

I used them in combination with the LLM-based evaluation metrics to estimate the quality of our QnA system.

Below I would like to quickly mention Test-Driven Development (TDD) for creating prompts. It can be useful not only for QnA but also for other projects that use an LLM.

Test-Driven Development (TDD) for prompt engineering

TDD is one of the best practices in software development, and prompt engineering is no exception here.

The concept is very simple. We create a prompt and also an evaluation function for the answer that the LLM returns for that prompt. In this way we implement our expectations of the communication with the LLM as a test case.

For this purpose I found a suitable library, Promptimize; below you can find the simplest test case that can be defined with Promptimize:

from promptimize import evals
from promptimize.prompt_cases import PromptCase

PromptCase("hello there!", lambda x: evals.any_word(x, ["hi", "hello"]))

Here we just check that the answer to our prompt 'hello there!' contains one of the words 'hi' or 'hello'.

That was simple.

But this library can also help us check the answer from the LLM when we need to generate code or a SQL-like query. We can build an evaluator function that runs the generated code and pass it to PromptCase as an input parameter.
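
A rough sketch of such a test case could look like this; the run_sql helper, the table and the expected rows are all hypothetical, and the only real requirement from Promptimize is a callable that scores the response:

import sqlite3

from promptimize.prompt_cases import PromptCase

def run_sql(query: str):
    # Hypothetical helper: execute the generated SQL against a small in-memory database
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER, state TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?)",
                     [(1, "new"), (2, "in progress"), (3, "success")])
    try:
        return conn.execute(query).fetchall()
    except sqlite3.Error:
        return None  # the generated query did not even run

def sql_returns_expected(generated_sql: str) -> float:
    # Score 1.0 only if the generated query runs and returns the expected rows
    expected = [("new",), ("in progress",), ("success",)]
    return 1.0 if run_sql(generated_sql) == expected else 0.0

PromptCase("Write a SQL query that returns all document states from the 'documents' table.",
           sql_returns_expected)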

You can find an example of it here.

So, thank you for reaching the end of this article! I hope you have found it useful!

Stay tuned! In the future I may continue with this topic and other exciting things connected with AI, LLMs, etc.