June 15, 2024

A Dataset for Teaching and Evaluating RAG

teaching

dataset

As a fan of Acquired (https://www.acquired.fm/), I recently published a dataset containing 200 Acquired Podcast Transcripts with metadata, complete with a human-generated Q&A file (see the dataset at Kaggle).

This dataset was used in my Introduction to Generative AI course to teach and evaluate Retrieval-Augmented Generation (RAG). The 200 transcripts contain approximately 3.5 million words, which is equivalent to about 5,500 pages when formatted as a Word document.

I tasked each student with listening to an episode of their choice and then coming up at least three question-answer pairs to test the accuracy of the answers using both GPT-4 and GPT-4 with the transcript. The results, shown in the figure below, demonstrate that RAG significantly improved answer accuracy.

I want to thank Rain and Eric from my team at Takin.AI (https://takin.ai/) to collect and clean the data and my students for creating the QA file.

PS. The featured image for this post is generated using HiddenArt tool from Takin.ai.