Hugging Face Datasets


Introduction

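The Hugging Face Datasets library (`datasets`) provides easy access to thousands of datasets for NLP, audio, and computer vision on the Hugging Face Hub. The same `load_dataset` function also reads local files such as CSV and JSON.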

Loading Datasets

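from datasets import load_dataset

# Load a single split from the Hub (returns a Dataset)
dataset = load_dataset('mnist', split='train')
print(dataset)

# Load a local CSV file
dataset = load_dataset('csv', data_files='data.csv', split='train')

# Load JSON whose records are nested under a top-level "data" key
dataset = load_dataset('json', data_files='data.json', field='data', split='train')

# Load multiple files, mapping each file to a named split.
# Without split=..., load_dataset returns a DatasetDict keyed by split name.
# (A plain list of files would merge them all into a single 'train' split.)
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})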

Dataset Operations

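# `dataset` is assumed to be a single split (a Dataset, not a DatasetDict)
# with a 'text' column, e.g. the result of load_dataset(..., split='train')

# Access elements
print(dataset[0])  # First example, as a dict
print(dataset['text'][:5])  # First 5 texts

# Properties
print(dataset.features)  # Column names and types
print(dataset.num_rows)
print(dataset.column_names)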

Transform with map

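from transformers import AutoTokenizer
# Any Hub checkpoint works here; 'bert-base-uncased' is only an illustrative choice
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=128)

# batched=True passes a batch (dict of lists), so batch['text'] is a list of strings
tokenized = dataset.map(tokenize, batched=True)
print(tokenized)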

Filter

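# Keep examples with more than 10 whitespace-separated words
filtered = dataset.filter(lambda x: len(x['text'].split()) > 10)

# Keep only examples whose label is 0 or 1
# (filter selects rows; it does not select columns)
filtered = dataset.filter(lambda x: x['label'] in [0, 1])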

Train/Test Split

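# Split dataset (returns a DatasetDict with 'train' and 'test' keys)
train_test = dataset.train_test_split(test_size=0.2)
train = train_test['train']
test = train_test['test']

# Stratified split: preserves label proportions in both parts.
# The column passed to stratify_by_column must be a ClassLabel feature.
stratified = dataset.train_test_split(test_size=0.2, stratify_by_column='label')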

Practice Problems

  1. Load dataset from Hub
  2. Access data elements
  3. Tokenize with map
  4. Filter examples
  5. Create train/test split
