ML-03: The Complete ML Workflow

Summary
Mastering the pocket algorithm for non-separable data, understanding parameters vs. hyperparameters, and following the complete ML training workflow.

Learning Objectives

After reading this post, you will be able to:

  • Implement the pocket algorithm for non-separable data
  • Distinguish between parameters and hyperparameters
  • Explain the train/validation/test split strategy
  • Perform K-fold cross-validation

Theory

The Problem with Basic Perceptron

In the last post, we saw that the perceptron converges only for linearly separable data. Real-world data, however, often contains:

  • Noise: Measurement errors or mislabeled samples
  • Overlapping classes: Genuine class overlap

Comparison diagram: Left side shows "Linearly Separable Data" with clean separation between blue and red points. Right side shows "Non-Separable Data (Real World)" with overlapping clusters and some outliers. Arrow between them labeled "Reality Check".

For such data, the basic perceptron will never converge — it keeps oscillating.

The Pocket Algorithm

The pocket algorithm is a simple fix: keep the best weights “in your pocket” while the perceptron keeps updating.

Flowchart of the pocket algorithm:

  1. Initialize w, b and set best_w = w, best_b = b
  2. Find a misclassified point and update w, b
  3. If the new accuracy beats the best so far, put the new weights in the pocket; otherwise keep the pocket unchanged
  4. Repeat until max iterations, then return the pocket weights

Key insight: Even if the perceptron ends in a bad state, the algorithm returns the best weights ever seen.
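
In code, the whole idea fits in a short function. The following is a minimal sketch of the loop described above; a fuller, class-based implementation appears in the Code Practice section below.

🐍 Python
import numpy as np

def pocket_fit(X, y, lr=1.0, max_iter=1000):
    """Minimal sketch of the pocket loop (see Code Practice for the full class)."""
    w, b = np.zeros(X.shape[1]), 0.0
    best_w, best_b, best_acc = w.copy(), b, 0.0

    for _ in range(max_iter):
        margins = y * (X @ w + b)
        bad = np.where(margins <= 0)[0]              # indices of misclassified points
        if len(bad) == 0:
            break                                    # perfectly separated: stop early
        i = bad[0]
        w, b = w + lr * y[i] * X[i], b + lr * y[i]   # ordinary perceptron update

        acc = np.mean(np.sign(X @ w + b) == y)
        if acc > best_acc:                           # only the pocket remembers the best run
            best_w, best_b, best_acc = w.copy(), b, acc

    return best_w, best_b                            # return the pocket, not the final state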

Parameters vs Hyperparameters

| Type            | Description         | Examples                      | How to Set         |
|-----------------|---------------------|-------------------------------|--------------------|
| Parameters      | Learned from data   | Weights w, bias b             | Training algorithm |
| Hyperparameters | Set before training | Learning rate, max iterations | Cross-validation   |

💡 Parameters are the model’s “knowledge”; hyperparameters are the model’s “settings.”
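
As a quick illustration (a sketch using the PocketPerceptron class implemented later in this post, and assuming X_train and y_train are already available):

🐍 Python
# Hyperparameters: chosen by us before training even starts
model = PocketPerceptron(learning_rate=0.5, max_iter=200)

# Parameters: learned from the data during training
model.fit(X_train, y_train)
print("Learned weights:", model.w, "bias:", model.b)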

The Complete ML Workflow

Flowchart of the data-splitting workflow:

  1. Split all data into a training set (60%), a validation set (20%), and a test set (20%)
  2. Train candidate models on the training set
  3. Evaluate them on the validation set and select the best model
  4. Run the final evaluation on the test set

| Dataset    | Purpose              | When to Use          |
|------------|----------------------|----------------------|
| Training   | Learn parameters     | During training      |
| Validation | Tune hyperparameters | Model selection      |
| Test       | Final evaluation     | Only once at the end |

⚠️ Never use test data during model development! It must remain “unseen” for honest evaluation.

K-Fold Cross-Validation

When data is limited, setting aside a large validation set may not be affordable: every sample held out for validation is one the model cannot learn from. K-fold cross-validation solves this:

5-Fold Cross-Validation:

  Fold 1: Val   | Train | Train | Train | Train  -> Score 1
  Fold 2: Train | Val   | Train | Train | Train  -> Score 2
  Fold 3: Train | Train | Val   | Train | Train  -> Score 3
  Fold 4: Train | Train | Train | Val   | Train  -> Score 4
  Fold 5: Train | Train | Train | Train | Val    -> Score 5

  Final result: average of the five scores

Process:

  1. Split data into K equal parts (folds)
  2. For each fold, use it as validation, train on the rest
  3. Average the K validation scores
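
The same three steps can be sketched by hand with sklearn's KFold splitter (a minimal sketch, assuming the X, y arrays and the PocketPerceptron class from the Code Practice section below):

🐍 Python
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = PocketPerceptron(max_iter=100)
    model.fit(X[train_idx], y[train_idx])                 # train on the other K-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # validate on the held-out fold

print(f"Mean CV accuracy: {np.mean(scores):.2%}")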

Code Practice

Implementing the Pocket Algorithm

🐍 Python
import numpy as np

class PocketPerceptron:
    def __init__(self, learning_rate=1.0, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None
        
    def _accuracy(self, X, y):
        predictions = np.sign(np.dot(X, self.w) + self.b)
        return np.mean(predictions == y)
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize weights
        self.w = np.zeros(n_features)
        self.b = 0
        
        # Pocket: store best weights
        best_w = self.w.copy()
        best_b = self.b
        best_accuracy = self._accuracy(X, y)
        
        for iteration in range(self.max_iter):
            # Find a misclassified point
            for xi, yi in zip(X, y):
                if yi * (np.dot(self.w, xi) + self.b) <= 0:
                    # Update weights
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    
                    # Check if this is better
                    current_acc = self._accuracy(X, y)
                    if current_acc > best_accuracy:
                        best_accuracy = current_acc
                        best_w = self.w.copy()
                        best_b = self.b
                    break
            else:
                # No misclassifications, we're done
                break
        
        # Return the pocket (best) weights
        self.w = best_w
        self.b = best_b
        return self
    
    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

Comparing Perceptron vs Pocket on Noisy Data

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Basic Perceptron (from ML-02)
class Perceptron:
    def __init__(self, learning_rate=1.0, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0
        
        for _ in range(self.max_iter):
            misclassified = False
            for xi, yi in zip(X, y):
                if yi * (np.dot(self.w, xi) + self.b) <= 0:
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    misclassified = True
            if not misclassified:
                break
        return self
    
    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Generate non-separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           flip_y=0.15, random_state=42)
y = np.where(y == 0, -1, 1)  # Convert to -1, +1

# Train both algorithms
percep = Perceptron(max_iter=100)
percep.fit(X, y)

pocket = PocketPerceptron(max_iter=100)
pocket.fit(X, y)

print(f"Basic Perceptron Accuracy: {percep.score(X, y):.2%}")
print(f"Pocket Algorithm Accuracy: {pocket.score(X, y):.2%}")

Output:

Basic Perceptron Accuracy: 64.00%
Pocket Algorithm Accuracy: 86.00%

Train/Validation/Test Split

🐍 Python
from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)

print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X):.0%})")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X):.0%})")

Output:

Training set:   60 samples (60%)
Validation set: 20 samples (20%)
Test set:       20 samples (20%)
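
With the split in place, the validation set can drive model selection. The following is a small sketch using the PocketPerceptron class from above; the candidate max_iter values are arbitrary choices for illustration:

🐍 Python
# Try a few hyperparameter settings, keep the one that does best on validation data
best_model, best_val_acc = None, -1.0
for n_iter in [20, 100, 500]:                    # candidate iteration budgets (arbitrary)
    model = PocketPerceptron(max_iter=n_iter).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Touch the test set only once, after model selection is finished
print(f"Best validation accuracy: {best_val_acc:.2%}")
print(f"Test accuracy:            {best_model.score(X_test, y_test):.2%}")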

K-Fold Cross-Validation

🐍 Python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Perceptron

# Using sklearn's perceptron
clf = Perceptron(max_iter=1000, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%} (+/- {scores.std() * 2:.2%})")

Output:

Cross-validation scores: [0.95 0.75 0.5  0.8  0.85]
Mean accuracy: 77.00% (+/- 30.07%)

Hyperparameter Tuning with Cross-Validation

🐍 Python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization strength
    'gamma': [1, 0.1, 0.01, 0.001]  # Kernel coefficient
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")

# Final evaluation on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2%}")

Output:

Best hyperparameters: {'C': 1, 'gamma': 1}
Best CV score: 86.67%
Test set accuracy: 85.00%

Visualizing Cross-Validation Results

🐍 Python
import pandas as pd

# Extract results
results = pd.DataFrame(grid_search.cv_results_)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_C',
    columns='param_gamma'
)

# Heatmap
plt.figure(figsize=(8, 5))
plt.imshow(pivot.values, cmap='YlGn', aspect='auto')
plt.colorbar(label='Mean CV Accuracy')
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.xlabel('gamma')
plt.ylabel('C (regularization)')
plt.title('Hyperparameter Grid Search Results')

for i in range(len(pivot.index)):
    for j in range(len(pivot.columns)):
        plt.text(j, i, f'{pivot.values[i, j]:.2f}',
                ha='center', va='center')

plt.tight_layout()
plt.savefig('assets/grid_search.png', dpi=150)
plt.show()

Grid Search Results

Deep Dive

Q1: How many folds should I use for cross-validation?

Common choices:

  • K=5 or K=10: Standard choices, good balance between bias and variance
  • K=N (Leave-One-Out): Maximum training data, but computationally expensive
  • Stratified K-Fold: Preserves class proportions in each fold (recommended for imbalanced data)
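
A minimal sketch of requesting stratified folds explicitly (reusing the X, y data from earlier; note that cross_val_score already defaults to stratified folds for classifiers):

🐍 Python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import Perceptron

# Explicitly stratified folds: each fold keeps the overall class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Perceptron(max_iter=1000, random_state=42), X, y, cv=cv)
print(f"Stratified CV accuracy: {scores.mean():.2%}")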

Q2: What if I have very little data?

  • Use higher K (more folds) to maximize training data
  • Consider Leave-One-Out cross-validation
  • Use data augmentation if applicable
  • Try simpler models with fewer parameters

Q3: Why not tune hyperparameters on the test set?

If you tune on the test set, you’re essentially “peeking” at it. The reported test accuracy becomes optimistically biased and won’t reflect real-world performance.

Summary

| Concept              | Key Points                                       |
|----------------------|--------------------------------------------------|
| Pocket Algorithm     | Keeps the best weights seen during training      |
| Parameters           | Learned from data (w, b)                         |
| Hyperparameters      | Set before training (learning rate, iterations)  |
| Train/Val/Test Split | 60/20/20 is typical                              |
| K-Fold CV            | Average performance across K folds               |
