As a fan of Acquired (https://www.acquired.fm/), I recently published a dataset containing 200 Acquired Podcast Transcripts with metadata, complete with a human-generated Q&A file (see the dataset on Kaggle).
This dataset was used in my Introduction to Generative AI course to teach and evaluate Retrieval-Augmented Generation (RAG). The 200 transcripts contain approximately 3.5 million words, which is equivalent to about 5,500 pages when formatted as a Word document.
I asked each student to listen to an episode of their choice and then come up with at least three question-answer pairs. These pairs were then used to compare the accuracy of answers from GPT-4 alone and from GPT-4 with the transcript. The results, shown in the figure below, demonstrate that RAG significantly improved answer accuracy.
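For readers who want to try a similar comparison, here is a minimal sketch of the idea, not the exact code used in the course. The names `build_prompt`, `accuracy`, `ask_model`, and `is_correct` are hypothetical, and the default grading rule (a case-insensitive substring match) is an assumption made for illustration; the model call is left as a function you supply, so the sketch does not depend on any particular API.

```python
from typing import Callable, Iterable, Optional, Tuple


def build_prompt(question: str, transcript: Optional[str] = None) -> str:
    """Return a plain prompt, or one that includes the transcript as context."""
    if transcript is None:
        return f"Question: {question}\nAnswer:"
    return (
        "Answer the question using only the transcript below.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )


def substring_match(answer: str, reference: str) -> bool:
    """Assumed grading rule: the reference text appears in the answer."""
    return reference.strip().lower() in answer.strip().lower()


def accuracy(
    qa_pairs: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
    transcript: Optional[str] = None,
    is_correct: Callable[[str, str], bool] = substring_match,
) -> float:
    """Fraction of questions answered correctly, with or without the transcript."""
    pairs = list(qa_pairs)
    if not pairs:
        return 0.0
    correct = 0
    for question, reference in pairs:
        answer = ask_model(build_prompt(question, transcript))
        if is_correct(answer, reference):
            correct += 1
    return correct / len(pairs)
```

To use it, pass in any function that sends a prompt to a model and returns its reply as a string, then call `accuracy(pairs, ask_model)` for the baseline and `accuracy(pairs, ask_model, transcript=text)` for the run with the transcript. For long transcripts, you would typically pass only the retrieved passages rather than the full text.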
I want to thank Rain and Eric from my team at Takin.AI (https://takin.ai/) for collecting and cleaning the data, and my students for creating the Q&A file.
P.S. The featured image for this post was generated using the HiddenArt tool from Takin.ai.