! pip install datasets
import datasets
from datasets import load_dataset
Conveniently, the airline sentiment dataset from Kaggle is available on Hugging Face. It consists of over 14,000 tweets from airline customers, collected in February 2015.
ds = load_dataset("osanseviero/twitter-airline-sentiment")
ds = ds['train']
The dataset includes several columns. We only need the sentiment labels and the actual tweets (text).
ds
Dataset({ features: ['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone'], num_rows: 14640 })
ds = ds.select_columns(['text', 'airline_sentiment'])
ds = ds.rename_column('airline_sentiment', 'label')
ds
Dataset({ features: ['text', 'label'], num_rows: 14640 })
We'll use a RoBERTa model from Hugging Face to do the sentiment analysis.
from transformers import pipeline
pipe = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
Feed the tweets to the model to get predictions.
outputs = pipe(ds['text'])
len(outputs), type(outputs)
(14640, list)
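Each entry in `outputs` is a dict holding the predicted `label` and a confidence `score`; for the accuracy check we only need the label strings, which a list comprehension can pull out. A minimal sketch using stand-in dicts (the real `outputs` come from the pipeline call above):

```python
# Stand-in for the pipeline's output: a list of dicts with 'label' and 'score'.
sample_outputs = [
    {"label": "neutral", "score": 0.89},
    {"label": "negative", "score": 0.97},
]

# Extract just the predicted label strings for comparison against the dataset.
predicted = [o["label"] for o in sample_outputs]
print(predicted)  # ['neutral', 'negative']
```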
Pull the actual labels from the tweet dataset.
actual = ds['label']
len(actual), type(actual)
(14640, list)
The labels are 'positive', 'negative', or 'neutral'.
actual[0]
'neutral'
outputs[0]['label']
'neutral'
Compare the predictions to the actual labels and calculate the accuracy.
correct = 0
total = 0
for i, label in enumerate(outputs):
    if label['label'] == actual[i]:
        correct += 1
    total += 1
print(f"Accuracy: {correct/total*100}")
Accuracy: 81.00409836065575
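The same accuracy can be computed more compactly by zipping the two label lists, and a `Counter` over (actual, predicted) pairs gives a quick look at where the model goes wrong. A sketch on toy lists standing in for `actual` and the pipeline predictions (the names and values here are illustrative, not from the run above):

```python
from collections import Counter

# Toy stand-ins for the real `actual` labels and pipeline predictions.
actual_labels = ["neutral", "negative", "positive", "negative"]
predicted_labels = ["neutral", "negative", "negative", "negative"]

# Accuracy as the fraction of matching (actual, predicted) pairs.
correct = sum(a == p for a, p in zip(actual_labels, predicted_labels))
accuracy = correct / len(actual_labels) * 100
print(f"Accuracy: {accuracy:.2f}%")  # Accuracy: 75.00%

# Confusion counts: how often each (actual, predicted) combination occurs.
confusion = Counter(zip(actual_labels, predicted_labels))
print(confusion[("positive", "negative")])  # 1 misclassified positive tweet
```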