What Is String Manipulation in Python?
String manipulation is a core skill in Python for data science: most real-world text arrives messy, and it has to be cleaned, split, and normalised before any analysis or modelling can use it.
Core Idea: String manipulation turns raw text into consistent, structured tokens. Those tokens are then vectorised so that a model can learn patterns from them and make predictions or decisions.
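To make the idea concrete, here is a minimal sketch of everyday string manipulation using only built-in string methods and the standard `re` module. The input record and its layout are invented for illustration; real data will need its own rules.

```python
import re

# Hypothetical raw record, invented for illustration.
raw = "  Order #A-1042:   shipped to  LONDON, uk  "

# Collapse runs of whitespace and trim both ends.
cleaned = " ".join(raw.split())

# Pull out the order ID with a regular expression.
match = re.search(r"#([A-Z]-\d+)", cleaned)
order_id = match.group(1) if match else None

# Take everything after " to ", then split the destination on the comma.
destination = cleaned.partition(" to ")[2]
city, country = [part.strip() for part in destination.split(",")]

# Normalise casing and assemble a tidy summary with an f-string.
summary = f"{order_id}: {city.title()}, {country.upper()}"
print(summary)  # A-1042: London, UK
```

`str.split()` with no arguments splits on any whitespace run, which is why joining the pieces back with a single space is a common way to normalise spacing.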
Key Concepts
NLP Fundamentals
Text data requires special preprocessing before modelling.
Text preprocessing pipeline:
Raw Text
→ Lowercasing
→ Tokenisation (split into words/subwords)
→ Stop word removal
→ Stemming / Lemmatisation
→ Vectorisation (TF-IDF, Word2Vec, BERT embeddings)
→ Model input
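The first few stages of this pipeline can be sketched with plain string operations. The stop-word list and the suffix-stripping "stemmer" below are deliberately tiny stand-ins (assumptions for illustration), not substitutes for a real NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def naive_stem(token: str) -> str:
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    lowered = text.lower()                              # lowercasing
    tokens = re.findall(r"[a-z0-9]+", lowered)          # tokenisation
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [naive_stem(t) for t in kept]                # stemming

tokens = preprocess("The cats are chasing the painted toys in the garden!")
print(tokens)  # ['cat', 'chas', 'paint', 'toy', 'garden']
```

The output `chas` shows why naive suffix stripping is only a rough approximation: a proper stemmer or lemmatiser handles such cases with language-specific rules.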
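TF-IDF: a term's weight in a document is its term frequency scaled by its inverse document frequency, so words that appear in many documents count for less. In the standard form, tfidf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain term t.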
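As a rough illustration of TF-IDF weighting, the sketch below computes the unsmoothed textbook variant, tf × log(N / df), on a made-up three-document corpus. Libraries often differ in detail; scikit-learn's `TfidfVectorizer`, for example, applies smoothing and L2 normalisation by default, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: how many documents contain each term.
df = Counter(term for doc in tokenised for term in set(doc))

def tfidf(doc_tokens):
    """Return {term: tf * idf} for one document (unsmoothed variant)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tfidf(tokenised[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:<5} {w:.3f}")
```

In the first document, "cat" outranks "the" even though "the" occurs twice, because "the" also appears in another document and is discounted by its lower inverse document frequency.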
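Word Embeddings (Word2Vec CBOW): the continuous bag-of-words model learns word vectors by predicting a centre word from the words in a fixed-size window around it, so words that occur in similar contexts end up with similar vectors.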
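The sketch below covers only the data-preparation side of CBOW: sliding a window over a token list to build (context, target) training pairs. The function name `cbow_pairs` and the window size are illustrative choices; training the embedding itself needs a neural model and is out of scope here.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side, excluding the target itself.
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

pairs = cbow_pairs("the quick brown fox jumps".split(), window=1)
for context, target in pairs:
    print(context, "->", target)
```

Words at the edges of the sentence get a shorter context, which is one reason real implementations often pad or subsample.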
| Representation | Dimension | Context | Best For |
|---|---|---|---|
| Bag of Words | Vocab size | None | Simple baseline |
| TF-IDF | Vocab size | None | Document similarity |
| Word2Vec | 100–300 | Local window | Word similarity |
| GloVe | 100–300 | Global corpus | Analogy tasks |
| BERT | 768+ | Full sentence | All NLP tasks |
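To illustrate the first row of the table, the following sketch builds bag-of-words vectors whose length equals the vocabulary size and compares documents with cosine similarity. The corpus is invented, and whitespace splitting stands in for a proper tokeniser.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["red apple green apple", "green pear", "red apple"]

# The vocabulary fixes the vector dimension: one slot per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [bow_vector(d) for d in docs]
print(vocab)       # ['apple', 'green', 'pear', 'red']
print(vectors[0])  # [2, 1, 0, 1]
print(f"doc0 vs doc2: {cosine(vectors[0], vectors[2]):.3f}")
print(f"doc0 vs doc1: {cosine(vectors[0], vectors[1]):.3f}")
```

Because the vectors ignore word order and context, two documents score as similar only when they share surface words, which is the limitation the denser representations in the table try to address.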
Python Implementation
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# Load example dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {dict(pd.Series(y).value_counts())}")
print(f"Train / Test split: {len(X_train)} / {len(X_test)}")
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Try multiple models and compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=42),
}
results = {}
for name, clf in models.items():
    # 5-fold CV on the training split, then one fit for the held-out test score
    cv = cross_val_score(clf, X_train_s, y_train, cv=5, scoring="f1")
    clf.fit(X_train_s, y_train)
    test_acc = accuracy_score(y_test, clf.predict(X_test_s))
    results[name] = {"CV F1": cv.mean(), "Test Acc": test_acc}
    print(f"{name:<25} CV F1={cv.mean():.4f} Test={test_acc:.4f}")
# Select by cross-validated F1 rather than assuming a winner in advance
best_name = max(results, key=lambda n: results[n]["CV F1"])
model = models[best_name]
print(f"Selected model: {best_name}")
Evaluation & Results
# Evaluate model performance
y_pred = model.predict(X_test_s)
print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# Cross-validation for robust estimate
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring="f1")
print(f"\n5-Fold CV F1: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Visualise results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=data.target_names).plot(ax=axes[0])
axes[0].set_title("Confusion Matrix")
# CV scores
axes[1].bar(range(1, len(cv_scores) + 1), cv_scores, color="#3b82f6", edgecolor="white")
axes[1].axhline(cv_scores.mean(), color="red", linestyle="--",
                label=f"Mean={cv_scores.mean():.4f}")
axes[1].set_xlabel("Fold")
axes[1].set_ylabel("F1 Score")
axes[1].set_title("Cross-Validation Scores")
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()
Comparison with Related Methods
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| String manipulation (rules, regex) | Transparent, needs no training data | Brittle on varied or noisy text | Cleaning, extraction, preprocessing |
| Random Forest | Robust, handles missing data | Slow inference | Tabular data |
| XGBoost | High accuracy, fast | Many hyperparameters | Competitions, production |
| Logistic Reg. | Interpretable, fast | Linear boundary only | Binary baseline |
| SVM | Good in high-dim | Slow on large data | Text, images |
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# C, gamma and kernel are SVC hyperparameters, so the search runs on a fresh SVC
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "linear"],
}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5,
                    scoring="f1", n_jobs=-1, verbose=0)
grid.fit(X_train_s, y_train)
print(f"Best params : {grid.best_params_}")
print(f"Best CV F1 : {grid.best_score_:.4f}")
print(f"Test F1 : {grid.score(X_test_s, y_test):.4f}")
Key Takeaways
- String manipulation is the foundation of text processing: clean, tokenise, and normalise text before vectorising it
- Always scale features before applying distance-based or regularised methods
- Use cross-validation — never evaluate on the same data used for training
- Start simple — a strong baseline prevents over-engineering
- Visualise everything — confusion matrices, learning curves, feature importances