Sentiment Classification Using Fine-tuned BERT

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd

Before diving into the implementation, we need to set up our authentication with the Hugging Face Hub. Hugging Face is a platform that hosts thousands of pre-trained models and datasets, making it an essential resource for modern NLP tasks. This step is crucial if you plan to work with private models or want to save your fine-tuned model to the Hub later.

from huggingface_hub import login

# Log in to Hugging Face Hub using authentication token
# Required for accessing private models and pushing models to Hub
login("your_token")

Data Loading and Preparation

For this sentiment analysis task, we’ll use weibo_senti_100k, a Chinese social media dataset of Weibo posts with binary sentiment labels (despite the “100k” in the name, it actually contains 119,988 posts). The dataset is hosted on the Hugging Face Hub and can be loaded with the datasets library.

from datasets import load_dataset
import pandas as pd

# Load sentiment analysis dataset from Hugging Face Hub
# Dataset contains 100k Weibo posts with sentiment labels
ds = load_dataset("dirtycomputer/weibo_senti_100k")
ds
DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 119988
    })
})
# Convert Hugging Face dataset to pandas DataFrame
df = pd.DataFrame(ds['train'])

# Display basic information about the DataFrame
print("Dataset Overview:")
print(f"Number of samples: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print("\nLabel distribution:")
print(df['label'].value_counts())
print("\nSample reviews:")
print(df['review'].head())

# Check for any missing values
if df.isnull().sum().any():
    print("\nWarning: Dataset contains missing values!")
Dataset Overview:
Number of samples: 119988
Columns: ['label', 'review']

Label distribution:
label
0    59995
1    59993
Name: count, dtype: int64

Sample reviews:
0                更博了,爆照了,帅的呀,就是越来越爱你!生快傻缺[爱你][爱你][爱你]
1    @张晓鹏jonathan 土耳其的事要认真对待[哈哈],否则直接开除。@丁丁看世界 很是细心...
2    姑娘都羡慕你呢…还有招财猫高兴……//@爱在蔓延-JC:[哈哈]小学徒一枚,等着明天见您呢/...
3                                           美~~~~~[爱你]
4                                    梦想有多大,舞台就有多大![鼓掌]
Name: review, dtype: object

The dataset contains 119,988 samples and is almost perfectly balanced, with 59,995 negative samples (label 0) and 59,993 positive samples (label 1).

Data Splitting

Although the dataset ships as a single train split, we’ll create our own train-test split so we have a held-out evaluation set. We’ll use 80% of the data for training and reserve 20% for testing the model’s performance.

from sklearn.model_selection import train_test_split

# Split dataset into training and test sets
# - test_size=0.2: 80% training, 20% testing
# - shuffle=True: randomly shuffle before splitting
# - random_state=42: set seed for reproducibility
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

The key parameters:

  1. test_size=0.2: Creates an 80-20 split, with ~96,000 training samples and ~24,000 test samples
  2. shuffle=True: Ensures random distribution of data, preventing ordering bias
  3. random_state=42: Sets a seed for reproducible results
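
Because the split is purely random, the class counts come out slightly uneven (48,151 vs. 47,839 in the training set, as the label counts printed later show). If exact class balance matters, train_test_split can also stratify on the label column; a minimal variant, not used for the results below:

# Stratified split: preserves the ~50/50 label ratio in both subsets
train_df, test_df = train_test_split(
    df,
    test_size=0.2,
    shuffle=True,
    random_state=42,
    stratify=df['label']  # keep label proportions identical across splits
)

The rest of the walkthrough relies on the plain shuffled split above.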

I ran this project on Google Colab, a cloud-based Jupyter notebook environment. Colab provides free GPU access, making it a good option for running deep learning models like BERT without local GPU resources.

# Check for available CUDA device and set up GPU/CPU
# Colab typically provides a single GPU, if available
if torch.cuda.is_available():
    device = torch.device("cuda")
    # Print GPU information
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    device = torch.device("cpu")
    print("No GPU available, using CPU")
Using GPU: NVIDIA A100-SXM4-40GB
GPU Memory: 42.48 GB

Tokenizer Initialization

For our Chinese sentiment analysis task, we’ll use the bert-base-chinese tokenizer. This pre-trained tokenizer is specifically designed for Chinese text.

The tokenizer is crucial for preparing our text data for BERT: it converts raw Chinese text into the token IDs the model expects.

# Initialize the BERT Chinese tokenizer
# Uses bert-base-chinese pre-trained model's vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
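
To see what the tokenizer actually does, it helps to encode one sentence. bert-base-chinese tokenizes Chinese text at the character level, so each character becomes its own token; a quick check (the sample sentence is chosen here for illustration):

# Inspect tokenization of a sample sentence
sample = "今天天气很好"  # "The weather is great today"
print(tokenizer.tokenize(sample))  # ['今', '天', '天', '气', '很', '好'] — one token per character

encoded = tokenizer(sample, return_tensors='pt')
print(encoded['input_ids'])       # token IDs, wrapped in [CLS] ... [SEP]
print(encoded['attention_mask'])  # 1 for every real token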

To efficiently handle our data during training, we need to create a custom Dataset class that inherits from PyTorch’s Dataset class. This class will take care of text encoding and provide a standardized way to access our samples.

It serves as a data pipeline that:

  1. Transforms Chinese text into BERT-compatible token IDs
  2. Ensures consistent input dimensions through padding and truncation
  3. Efficiently delivers batched data during training

from torch.utils.data import Dataset
import torch

class TextDataset(Dataset):
    """
    Custom Dataset class for text data, inheriting from PyTorch's Dataset.

    Parameters:
    tokenizer (Tokenizer): Tokenizer object for text encoding
    texts (list): List of text samples
    labels (list): List of corresponding labels
    """
    def __init__(self, tokenizer, texts, labels):
        # Encode texts with padding and truncation
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=512,  # Explicitly set max length for BERT
            return_tensors='pt'  # Return PyTorch tensors directly
        )
        # Convert labels to tensor
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        """
        Get a single sample by index.

        Args:
            idx (int): Sample index

        Returns:
            dict: Dictionary containing encoded text data and label
        """
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': self.labels[idx]
        }

    def __len__(self):
        """
        Get dataset length.

        Returns:
            int: Number of samples in dataset
        """
        return len(self.labels)
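
Before handing this class to the Trainer, a quick sanity check on a few rows confirms the shapes it produces. This is also where the DataLoader imported earlier would come in if you were writing a manual training loop (the Trainer builds its own loaders internally); a small sketch:

# Sanity-check the Dataset on a handful of samples
sample_ds = TextDataset(tokenizer, df['review'].head(4).tolist(), df['label'].head(4).tolist())
print(len(sample_ds))            # 4
item = sample_ds[0]
print(item['input_ids'].shape)   # (sequence_length,)
print(item['labels'])            # tensor(0) or tensor(1)

# Batched access, as a manual training loop would do it
loader = DataLoader(sample_ds, batch_size=2)
batch = next(iter(loader))
print(batch['input_ids'].shape)  # (2, sequence_length)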

Let’s verify our label distribution and create an explicit mapping for our sentiment classes. While our labels are already in a binary format (0 and 1), maintaining an explicit mapping is a good practice for code clarity and future modifications.

# Print unique labels in the dataset
print("Unique labels in training data:", sorted(set(train_df['label'])))

# Create explicit label mapping
label_mapping = {
    0: 0,  # Negative
    1: 1   # Positive
}

# Map labels using explicit mapping
train_labels = [label_mapping[label] for label in train_df['label']]
test_labels = [label_mapping[label] for label in test_df['label']]

# Verify label distribution after mapping
print("\nLabel distribution after mapping:")
print("Training:", pd.Series(train_labels).value_counts())
print("Testing:", pd.Series(test_labels).value_counts())
Unique labels in training data: [0, 1]

Label distribution after mapping:
Training: 0    48151
1    47839
Name: count, dtype: int64
Testing: 1    12154
0    11844
Name: count, dtype: int64
With the labels verified, we can build the training and test datasets:

# Create datasets with the mapped labels
train_dataset = TextDataset(tokenizer, train_df['review'].tolist(), train_labels)
test_dataset = TextDataset(tokenizer, test_df['review'].tolist(), test_labels)

Model Training Setup

To fine-tune BERT for our sentiment analysis task, we’ll follow these key steps:

  1. Model Initialization: Load the pre-trained Chinese BERT model
  2. Training Configuration: Set up training parameters using TrainingArguments
  3. Metrics Setup: Define evaluation metrics for model performance monitoring
  4. Trainer Setup: Initialize the Hugging Face Trainer class with:
    • The BERT model
    • Training arguments
    • Training and evaluation datasets
    • Metrics computation function
  5. Training Process: Use trainer.train() and trainer.evaluate() for model fine-tuning and evaluation

The Hugging Face Trainer API simplifies the training process by handling the training loops, device management, and model optimization automatically.

The code below implements these steps:

# Load pre-trained Chinese BERT model and configure for binary classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels=2  # Binary classification (negative/positive)
)
model = model.to(device)

# Define training arguments for model fine-tuning
training_args = TrainingArguments(
    output_dir='sentiment-weibo-100k-fine-tuned-bert-test',  # Directory to save model checkpoints
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=32,  # Number of samples per training batch
    per_device_eval_batch_size=64,   # Number of samples per evaluation batch
    warmup_steps=500,                # Steps for learning rate warmup
    weight_decay=0.01,               # L2 regularization factor
    logging_dir='./logs',            # Directory for training logs
    logging_steps=100,               # Log metrics every 100 steps
    evaluation_strategy="epoch",     # Evaluate after each epoch
    save_strategy="epoch",           # Save model after each epoch
    load_best_model_at_end=True,     # Load best model after training
    push_to_hub=True,                # Push model to Hugging Face Hub
    learning_rate=2e-5,              # Initial learning rate
    gradient_accumulation_steps=1    # Update model after every batch
)
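
For context on warmup_steps: with 95,990 training samples and a per-device batch size of 32, one epoch is roughly 3,000 optimizer steps, so 500 warmup steps cover about the first sixth of the first epoch. A quick back-of-the-envelope check:

# Rough step arithmetic (single GPU, no gradient accumulation)
steps_per_epoch = len(train_dataset) // 32          # ≈ 3,000
total_steps = steps_per_epoch * 3                   # ≈ 9,000
print(f"Warmup fraction: {500 / total_steps:.1%}")  # ≈ 5.6% of all training steps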

def compute_metrics(pred):
    """
    Compute evaluation metrics for the model
    Args:
        pred: Contains predictions and label_ids
    Returns:
        dict: Dictionary containing accuracy, F1, precision, and recall scores
    """
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels,
        preds,
        average='binary',
        pos_label=1  # Define positive class for binary metrics
    )
    acc = accuracy_score(labels, preds)

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Initialize trainer with model and training configuration
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Print training configuration summary
print("Training Configuration:")
print("Model: bert-base-chinese")
print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Number of epochs: {training_args.num_train_epochs}")

# Start training
trainer.train()

# Evaluate model performance
eval_results = trainer.evaluate()
print("\nEvaluation Results:", eval_results)

After completing the model training, let’s test it out.

from transformers import pipeline
import torch

def test_sentiment(texts, yourmodel):
    """
    Test sentiment analysis model with given texts
    """
    # Create sentiment analyzer pipeline
    analyzer = pipeline(
        "sentiment-analysis",
        model=yourmodel,  # your model from HuggingFace Hub
        tokenizer="bert-base-chinese",
        device=0 if torch.cuda.is_available() else -1
    )

    # Process each text
    for text in texts:
        result = analyzer(text)[0]
        sentiment = "positive" if result['label'] == 'LABEL_1' else "negative"
        print(f"\nText: {text}")
        print(f"Sentiment: {sentiment}")
        print(f"Confidence: {result['score']:.4f}")

# Test with example texts
test_texts = [
    "这家店的菜真香,下次还来!",         # The food is delicious, will come again
    "质量有问题,不推荐购买。",           # Quality issues, not recommended
    "快递很快,包装完整。",               # Fast delivery, good packaging
    "商家态度很不好,生气。",             # Bad merchant attitude, angry
    "非常满意,超出预期。",               # Very satisfied, exceeded expectations
    "难吃到极点,太糟糕了。",             # Extremely bad taste, terrible
    "穿着很舒服,尺码合适。",             # Comfortable to wear, good size
    "卖家服务特别好!",                   # Great service from seller
    "不值这个价钱,后悔买了。",           # Not worth the price, regret buying
    "产品完全是垃圾,气死了。"            # Product is totally garbage, very angry
]

# Run the test
test_sentiment(test_texts, "BarryzZ/sentiment-weibo-100k-fine-tuned-bert-test")
Device set to use cuda:0



Text: 这家店的菜真香,下次还来!
Sentiment: positive
Confidence: 1.0000

Text: 质量有问题,不推荐购买。
Sentiment: positive
Confidence: 1.0000

Text: 快递很快,包装完整。
Sentiment: positive
Confidence: 1.0000

Text: 商家态度很不好,生气。
Sentiment: positive
Confidence: 1.0000

Text: 非常满意,超出预期。
Sentiment: positive
Confidence: 1.0000

Text: 难吃到极点,太糟糕了。
Sentiment: positive
Confidence: 1.0000

Text: 穿着很舒服,尺码合适。
Sentiment: positive
Confidence: 1.0000

Text: 卖家服务特别好!
Sentiment: positive
Confidence: 1.0000

Text: 不值这个价钱,后悔买了。
Sentiment: positive
Confidence: 1.0000

Text: 产品完全是垃圾,气死了。
Sentiment: positive
Confidence: 0.9997

There’s clearly an issue with our model’s predictions. The model is:

  1. Classifying every input as positive
  2. Doing so with extremely high confidence (nearly 100%)
  3. Failing to identify obvious negative sentiments like “难吃到极点” (extremely bad taste) and “产品完全是垃圾” (the product is totally garbage)

Since the dataset is balanced, these issues likely stem from the training setup rather than the data. Our adjustments focus on:

  1. Better monitoring (more frequent evaluation, detailed metrics)
  2. Improved efficiency (larger batches, mixed precision)
  3. Extended training (more epochs, early stopping)

Let’s retrain the model with these optimized parameters.

from transformers import EarlyStoppingCallback
from sklearn.metrics import confusion_matrix

# Initialize model with same configuration
model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels=2
)
model = model.to(device)

# Enhanced training arguments
training_args = TrainingArguments(
    output_dir='sentiment-weibo-100k-fine-tuned-bert',
    num_train_epochs=5,              # Increased from 3 to 5 for better learning
    per_device_train_batch_size=64,  # Doubled for faster training
    per_device_eval_batch_size=128,  # Doubled for faster evaluation
    learning_rate=2e-5,              # Same learning rate as before
    warmup_ratio=0.1,                # Warm up over the first 10% of steps
    weight_decay=0.01,               # L2 regularization
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="steps",     # Changed to step-based evaluation
    eval_steps=200,                  # More frequent evaluation
    save_strategy="steps",
    save_steps=200,                  # More frequent model saving
    load_best_model_at_end=True,
    metric_for_best_model="f1_avg",  # Use average F1 to select the best model
    push_to_hub=True,
    gradient_accumulation_steps=1,
    fp16=True                        # Mixed precision training for efficiency
)

# Enhanced metrics computation function
# Enhanced metrics computation function
def compute_metrics(pred):
    """
    Compute detailed metrics including class-specific scores
    Returns metrics for both positive and negative classes
    """
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels,
        preds,
        average=None,
        labels=[0, 1]
    )
    acc = accuracy_score(labels, preds)
    conf_mat = confusion_matrix(labels, preds)

    return {
        'accuracy': acc,
        'f1_neg': f1[0],                       # Separate F1 score per class
        'f1_pos': f1[1],
        'f1_avg': f1.mean(),                   # Macro-average F1, used to pick the best model
        'precision_neg': precision[0],         # Class-specific precision
        'precision_pos': precision[1],
        'recall_neg': recall[0],               # Class-specific recall
        'recall_pos': recall[1],
        'confusion_matrix': conf_mat.tolist()  # Full confusion matrix
    }


# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

# Print dataset statistics before training
print("\nDataset Statistics:")
print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print("\nLabel Distribution:")
print("Training:", pd.Series([d['labels'].item() for d in train_dataset]).value_counts())
print("Testing:", pd.Series([d['labels'].item() for d in test_dataset]).value_counts())

# Start training
trainer.train()

# Evaluate model
eval_results = trainer.evaluate()
print("\nFinal Evaluation Results:")
for metric, value in eval_results.items():
    if isinstance(value, float):
        print(f"{metric}: {value:.4f}")
    else:
        print(f"{metric}: {value}")

The model achieved excellent metrics after just one epoch.
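
With push_to_hub=True the Trainer uploads checkpoints to the Hub during training, but it is worth pushing the final (best) model, and the tokenizer alongside it, explicitly; a minimal sketch (the repo name is derived from output_dir):

# Push the best model (restored via load_best_model_at_end) to the Hub
trainer.push_to_hub()

# The tokenizer was not passed to the Trainer, so push it separately
tokenizer.push_to_hub("sentiment-weibo-100k-fine-tuned-bert")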

# Test with example texts
test_texts = [
    "这家店的菜真香,下次还来!",         # The food is delicious, will come again
    "质量有问题,不推荐。",           # Quality issues, not recommended
    "快递很快,包装完整。",               # Fast delivery, good packaging
    "商家态度不好,生气。",             # Bad merchant attitude, angry
    "非常满意,超出预期。",               # Very satisfied, exceeded expectations
    "难吃到极点,太糟糕了。",             # Extremely bad taste, terrible
    "穿着很舒服,尺码合适。",             # Comfortable to wear, good size
    "卖家服务特别好!",                   # Great service from seller
    "不值这个价钱,后悔买了。",           # Not worth the price, regret buying
    "产品完全是垃圾,气死了。"            # Product is totally garbage, very angry
]

# Run the test
test_sentiment(test_texts, "BarryzZ/sentiment-weibo-100k-fine-tuned-bert")
Device set to use cuda:0



Text: 这家店的菜真香,下次还来!
Sentiment: positive
Confidence: 0.9923

Text: 质量有问题,不推荐。
Sentiment: negative
Confidence: 0.8533

Text: 快递很快,包装完整。
Sentiment: positive
Confidence: 0.9878

Text: 商家态度不好,生气。
Sentiment: negative
Confidence: 0.9732

Text: 非常满意,超出预期。
Sentiment: positive
Confidence: 0.9791

Text: 难吃到极点,太糟糕了。
Sentiment: negative
Confidence: 0.8653

Text: 穿着很舒服,尺码合适。
Sentiment: positive
Confidence: 0.9907

Text: 卖家服务特别好!
Sentiment: positive
Confidence: 0.9922

Text: 不值这个价钱,后悔买了。
Sentiment: negative
Confidence: 0.8147

Text: 产品完全是垃圾,气死了。
Sentiment: negative
Confidence: 0.9863

After parameter optimization, our model shows significant improvements.

Let’s test it with some new scenarios to verify its robustness.

test_texts = [
    # Strong positive / 强烈正面
    "我考上研究生了!",  # I got accepted into graduate school!
    "今天他向我求婚了!",  # He proposed to me today!
    "终于买到梦想的房子",  # Finally bought my dream house
    "中了五百万大奖!",  # Won a 5 million prize!

    # Strong negative / 强烈负面
    "被裁员了,好绝望",  # Got laid off, feeling desperate
    "信任的人背叛我",  # Betrayed by someone I trusted
    "重要的文件全丢了",  # Lost all important documents
    "又被扣工资了,气死",  # Got my salary deducted again, so angry

    # Anger / 愤怒
    "偷我的车,混蛋!",  # Someone stole my car, bastard!
    "骗子公司,我要报警",  # Scam company, I'm calling the police
    "半夜装修,烦死了",  # Renovation at midnight, so annoying
    "商家太坑人了!",  # The merchant is such a ripoff!

    # Pleasant surprise / 惊喜
    "宝宝会走路了!",  # Baby learned to walk!
    "升职加薪啦!",  # Got promoted with a raise!
    "论文发表成功!",  # Paper got published successfully!
    "收到offer了!"  # Received a job offer!
]
# Run the test
test_sentiment(test_texts, "BarryzZ/sentiment-weibo-100k-fine-tuned-bert")
Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Text: 我考上研究生了!
Sentiment: positive
Confidence: 0.8713

Text: 今天他向我求婚了!
Sentiment: positive
Confidence: 0.6087

Text: 终于买到梦想的房子
Sentiment: positive
Confidence: 0.7931

Text: 中了五百万大奖!
Sentiment: positive
Confidence: 0.6070

Text: 被裁员了,好绝望
Sentiment: negative
Confidence: 0.9973

Text: 信任的人背叛我
Sentiment: negative
Confidence: 0.9572

Text: 重要的文件全丢了
Sentiment: negative
Confidence: 0.9941

Text: 又被扣工资了,气死
Sentiment: negative
Confidence: 0.9963

Text: 偷我的车,混蛋!
Sentiment: negative
Confidence: 0.9664

Text: 骗子公司,我要报警
Sentiment: negative
Confidence: 0.9750

Text: 半夜装修,烦死了
Sentiment: negative
Confidence: 0.9906

Text: 商家太坑人了!
Sentiment: negative
Confidence: 0.8367

Text: 宝宝会走路了!
Sentiment: positive
Confidence: 0.9125

Text: 升职加薪啦!
Sentiment: positive
Confidence: 0.9727

Text: 论文发表成功!
Sentiment: positive
Confidence: 0.9998

Text: 收到offer了!
Sentiment: positive
Confidence: 0.7036

The optimized model shows excellent performance in Chinese sentiment analysis. It now correctly identifies both positive and negative sentiments with appropriate confidence levels, while maintaining more moderate confidence for nuanced cases.
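
One final implementation note: the "You seem to be using the pipelines sequentially on GPU" warning appeared because test_sentiment feeds texts to the pipeline one at a time in a Python loop. Pipelines also accept a whole list (optionally with a batch_size), which lets the GPU process inputs in batches; a minimal sketch:

from transformers import pipeline
import torch

analyzer = pipeline(
    "sentiment-analysis",
    model="BarryzZ/sentiment-weibo-100k-fine-tuned-bert",
    tokenizer="bert-base-chinese",
    device=0 if torch.cuda.is_available() else -1
)

# Passing the full list lets the pipeline batch inputs on the GPU
results = analyzer(test_texts, batch_size=8)
for text, result in zip(test_texts, results):
    sentiment = "positive" if result['label'] == 'LABEL_1' else "negative"
    print(f"{text} -> {sentiment} ({result['score']:.4f})")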