February 1, 2024

Data Labeling using GPT APIs


Data labeling is a significant application of Large Language Models (LLMs). In this post, I share insights gained from using the ChatGPT APIs (GPT-3.5 and GPT-4) for an Aspect-Based Sentiment Analysis (ABSA) task. I chose ABSA because it is a challenging task, and I have prior experience handling similar tasks in both research and industry projects without relying on LLMs. For example, my team trained and released a BERT-based ABSA model, which can be accessed here.

Key topics include:

ABSA Task and Dataset

Aspect-Based Sentiment Analysis (ABSA) is an NLP task that aims to identify and extract the sentiment of specific aspects of a product or service. There have been many papers, datasets, and competitions in this area (see https://paperswithcode.com/task/aspect-based-sentiment-analysis).

For example, the following is a restaurant review with sentiments for four aspects (see this repo for a simple code example of doing this):

This place is pretty cool with awesome decor. The drinks are decent but a bit on the pricey side.
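As a hypothetical illustration (the aspect names and labels here are my own, not from the linked repo), the ABSA output for this review might be represented as:

```python
# Hypothetical ABSA output for the review above: each detected
# aspect is mapped to a sentiment polarity.
review = ("This place is pretty cool with awesome decor. "
          "The drinks are decent but a bit on the pricey side.")

absa_labels = {
    "ambience": "positive",   # "pretty cool"
    "decor": "positive",      # "awesome decor"
    "drinks": "positive",     # "decent"
    "price": "negative",      # "a bit on the pricey side"
}

for aspect, sentiment in absa_labels.items():
    print(f"{aspect}: {sentiment}")
```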

We created a hotel review ABSA dataset for a research project, which contains about 2 million hotel reviews with three aspects:

We hired human data labelers to manually label about 25K reviews, then trained a model to label the rest of the dataset.

Traditional Process

An ABSA task typically includes the following steps:

This process is very labor-intensive and costly; e.g., manually labeling about 25K reviews took several labelers a few weeks to complete.

LLM-based Process

The process of using LLM for ABSA is as follows:

Prompt Engineering

The prompt is constructed iteratively as previously outlined, and eventually looks like the following:

You are an experienced data labeling engineer with extensive experience in labeling hotel reviews. Your task is to classify a review along three dimensions, each with four categories: positive, negative, neutral, and not mentioned.

The definitions and examples of the three dimensions are as follows:

Dimension 1: Quality of hotel staff service

Definition: customer perceptions directly related to staff behavior or attitude, such as timely service, skilled, knowledgeable, professional, polite, caring, understanding, sincere, helpful, etc.


Review: The cleaning lady cleans in a timely manner.
Sentiment: Positive

Review: Staff were testing robots in the hallway, the noise was very loud and annoying, and the front desk did nothing about it!
Sentiment: Negative

[more examples]...

Dimension 2: Quality of robot service

Definition: customer perceptions of robot functionality or perceptions of the service result after using the robot


Review: The robot is very convenient
Sentiment: Positive

Review: The robot delivers too slowly
Sentiment: Negative

[more examples]...

Dimension 3: Human-robot interaction perception

Definition: customer perceptions other than robot functionality, such as robot social intelligence (communication and understanding ability), robot social presence (making one feel it has human characteristics or can bring the experience a human brings), and robot design and novelty (voice and posture, freshness, curiosity, advancedness, coolness).


Review: The little robot speaks adorably, too cute
Sentiment: Positive

Review: The robot's voice is too loud and noisy;
Sentiment: Negative

[more examples]...

Now, classify the sentiment of the following review along the three dimensions, outputting a JSON object with "employee_service", "robot_service", and "human_robot_interaction" as the keys, where each value is one of positive/negative/neutral/not_mentioned.

Here is the hotel review:
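Putting the pieces together, the labeling call can be sketched as below. This is a sketch rather than the exact production code: the model name, message layout, and validation helper are assumptions, written against the `openai` v1 Python client.

```python
import json

# Allowed labels, matching the prompt's four categories.
ALLOWED = {"positive", "negative", "neutral", "not_mentioned"}
KEYS = ("employee_service", "robot_service", "human_robot_interaction")

def parse_label(text: str) -> dict:
    """Parse and validate the model's JSON reply; raise on malformed output."""
    label = json.loads(text)
    for key in KEYS:
        if label.get(key) not in ALLOWED:
            raise ValueError(f"bad value for {key}: {label.get(key)!r}")
    return label

def label_review(client, prompt: str, review: str,
                 model: str = "gpt-4-0125-preview") -> dict:
    # One review per call; temperature 0 for more deterministic labels.
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt + "\n" + review}],
    )
    return parse_label(resp.choices[0].message.content)

# Validating a sample reply without calling the API:
reply = ('{"employee_service": "positive", "robot_service": "not_mentioned", '
         '"human_robot_interaction": "neutral"}')
print(parse_label(reply))
```

Validating the reply before accepting it matters in practice: a small fraction of responses come back with malformed JSON or out-of-vocabulary labels and need to be retried.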

Performance Comparison

The latest API pricing is as follows (GPT-4 is about 20 times more expensive than GPT-3.5):

Model               Input per 1K Tokens   Output per 1K Tokens
GPT-4-0125          $0.01                 $0.03
GPT-3.5-turbo-0125  $0.0005               $0.0015

The prompt shown above is quite long, about 1,400 tokens (you can use tiktoken to count tokens). Together with the actual review, the input to the OpenAI API is about 1,500 tokens.

So the cost for one review is about $0.015 with GPT-4 and $0.00075 with GPT-3.5 (we can ignore the output tokens for cost estimation, since the output is quite short).

For 2 million reviews, the total cost would be about $30,000 with GPT-4 or about $1,500 with GPT-3.5.
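Under the pricing table above and assuming ~1,500 input tokens per review, the arithmetic works out as follows:

```python
# Rough cost estimate from the pricing table above (input tokens only;
# the short JSON output is ignored).
INPUT_PRICE_PER_1K = {"gpt-4-0125": 0.01, "gpt-3.5-turbo-0125": 0.0005}  # USD

tokens_per_review = 1500   # ~1400-token prompt plus the review
n_reviews = 2_000_000

for model, price in INPUT_PRICE_PER_1K.items():
    per_review = tokens_per_review / 1000 * price
    total = per_review * n_reviews
    print(f"{model}: ${per_review:.5f} per review, ${total:,.0f} for 2M reviews")
```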

In order to reduce the cost, we tried two methods: mini-batching (labeling a batch of reviews in one pass) and supervised fine-tuning (SFT).

However, both mini-batching and SFT resulted in much worse labeling performance, as shown below.

Model            Service           Accuracy  Precision  Recall  F1
GPT-4            Employee-Service  0.99      0.98       1.00    0.99
GPT-4            Robot-Service     0.91      0.62       0.61    0.61
GPT-4            HCI               0.86      0.83       0.83    0.83
GPT-3.5          Employee-Service  0.96      0.72       0.68    0.69
GPT-3.5          Robot-Service     0.80      0.44       0.38    0.40
GPT-3.5          HCI               0.56      0.39       0.33    0.32
GPT-4 One Pass   Employee-Service  0.81      0.44       0.40    0.39
GPT-4 One Pass   Robot-Service     0.62      0.62       0.54    0.48
GPT-4 One Pass   HCI               0.71      0.85       0.52    0.45
GPT SFT          Employee-Service  0.49      0.74       0.60    0.50
GPT SFT          Robot-Service     0.57      0.48       0.43    0.37
GPT SFT          HCI               0.81      0.54       0.50    0.52

The time for labeling 100 reviews with each method is listed below:

To summarize, GPT-4 works well for the ABSA task but can be expensive. The SFT method is much cheaper and faster, but its performance needs to be improved, e.g., through better data preparation and engineering.

P.S. The featured image for this post was generated using DALL·E 3.