ML-03: The Complete ML Workflow

Summary
Mastering the pocket algorithm for non-separable data, distinguishing parameters from hyperparameters, and walking through the complete ML training workflow.

Learning Objectives

After reading this post, you will be able to:

  • Implement the pocket algorithm for non-separable data
  • Distinguish between parameters and hyperparameters
  • Explain the train/validation/test split strategy
  • Perform K-fold cross-validation

Theory

This article covers three key topics: handling non-separable data with the pocket algorithm, understanding the difference between parameters and hyperparameters, and the complete training workflow.

The Problem with Basic Perceptron

In the last post, the perceptron was shown to converge only for linearly separable data. But real-world data often contains:

  • Noise: Measurement errors or mislabeled samples
  • Overlapping classes: Genuine class overlap

Comparison diagram: linearly separable data (clean separation between classes) vs. non-separable real-world data (overlapping clusters and outliers).

For such data, the basic perceptron will never converge — it keeps oscillating.
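
A minimal sketch (not from the original post) makes the oscillation visible. It reuses the same kind of noisy dataset generated in the Code Practice section below and simply prints training accuracy after each epoch; the accuracy bounces around instead of settling:

🐍 Python
import numpy as np
from sklearn.datasets import make_classification

# Noisy, non-separable data (same recipe as in Code Practice below)
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           flip_y=0.15, random_state=42)
y = np.where(y == 0, -1, 1)

# Basic perceptron updates: accuracy fluctuates because some points
# can never be classified correctly by any single line
w, b = np.zeros(X.shape[1]), 0.0
for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified point -> update
            w += yi * xi
            b += yi
    acc = np.mean(np.sign(X @ w + b) == y)
    print(f"Epoch {epoch + 1}: training accuracy = {acc:.2f}")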

The Pocket Algorithm

The pocket algorithm is a simple fix: keep the best weights “in your pocket” while the perceptron keeps updating.

graph LR
    A[Initialize w, b] --> B[Initialize best_w = w, best_b = b]
    B --> C[Find misclassified point]
    C --> D[Update w, b]
    D --> E{New accuracy > best?}
    E -->|Yes| F[Put new weights in pocket]
    E -->|No| G[Keep pocket unchanged]
    F --> H{Max iterations?}
    G --> H
    H -->|No| C
    H -->|Yes| I[Return pocket weights]

Key insight: Even if the perceptron ends in a bad state, the algorithm returns the best weights ever seen.

Parameters vs Hyperparameters

With the pocket algorithm in hand, the next question is: what values should be used for learning_rate and max_iter? This leads to the distinction between parameters and hyperparameters.

| Type | Description | Examples | How to Set |
|---|---|---|---|
| Parameters | Learned from data | Weights $w$, bias $b$ | Training algorithm |
| Hyperparameters | Set before training | Learning rate, max iterations | Cross-validation |

💡 Parameters are the model’s “knowledge”; hyperparameters are the model’s “settings.”
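
A small sketch (an illustration, not part of the original post) shows the split in practice using scikit-learn's Perceptron: the constructor arguments are the hyperparameters, and the attributes that fit() fills in (coef_, intercept_) are the learned parameters. The synthetic dataset is arbitrary and exists only for the example:

🐍 Python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Arbitrary synthetic data, just for illustration
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)

# Hyperparameters: chosen before training ("settings")
clf = Perceptron(eta0=1.0, max_iter=1000, random_state=42)

# Parameters: learned from the data during training ("knowledge")
clf.fit(X, y)
print("Learned weights w:", clf.coef_)
print("Learned bias b:   ", clf.intercept_)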

The Complete ML Workflow

To properly tune hyperparameters without cheating, data needs to be split strategically. This is the complete ML workflow.

graph LR
    A[All Data] --> B[Training Set 60%]
    A --> C[Validation Set 20%]
    A --> D[Test Set 20%]
    B --> E[Train Models]
    E --> F[Evaluate on Validation]
    F --> G{Select Best Model}
    G --> H[Final Evaluation on Test]
    style B fill:#c8e6c9
    style C fill:#fff9c4
    style D fill:#ffcdd2

| Dataset | Purpose | When to Use |
|---|---|---|
| Training | Learn parameters | During training |
| Validation | Tune hyperparameters | Model selection |
| Test | Final evaluation | Only once at the end |

⚠️ Never use test data during model development! It must remain “unseen” for honest evaluation.

K-Fold Cross-Validation

What if the dataset is too small to afford a separate validation set? K-fold cross-validation solves this by reusing the data more efficiently: every sample is used for training in K-1 folds and for validation in exactly one fold:

graph LR
    subgraph "5-Fold Cross-Validation"
        F1["Fold 1: Val | Train | Train | Train | Train"]
        F2["Fold 2: Train | Val | Train | Train | Train"]
        F3["Fold 3: Train | Train | Val | Train | Train"]
        F4["Fold 4: Train | Train | Train | Val | Train"]
        F5["Fold 5: Train | Train | Train | Train | Val"]
    end
    F1 --> A[Score 1]
    F2 --> B[Score 2]
    F3 --> C[Score 3]
    F4 --> D[Score 4]
    F5 --> E[Score 5]
    A --> AVG[Average Score]
    B --> AVG
    C --> AVG
    D --> AVG
    E --> AVG

Process:

  1. Split data into K equal parts (folds)
  2. For each fold, use it as validation, train on the rest
  3. Average the K validation scores
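
Written out explicitly, the three steps above look roughly like the following sketch (assuming scikit-learn's KFold and Perceptron; the Code Practice section later uses cross_val_score, which wraps exactly this loop):

🐍 Python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold

# Synthetic data just for the sketch
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)    # step 1: K folds
scores = []
for train_idx, val_idx in kf.split(X):
    clf = Perceptron(max_iter=1000, random_state=42)
    clf.fit(X[train_idx], y[train_idx])                   # step 2: train on K-1 folds...
    scores.append(clf.score(X[val_idx], y[val_idx]))      # ...validate on the held-out fold
print("Mean CV accuracy:", np.mean(scores))               # step 3: average the K scores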

Code Practice

Implementing the Pocket Algorithm

🐍 Python
import numpy as np

class PocketPerceptron:
    def __init__(self, learning_rate=1.0, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None
        
    def _accuracy(self, X, y):
        predictions = np.sign(np.dot(X, self.w) + self.b)
        return np.mean(predictions == y)
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize weights
        self.w = np.zeros(n_features)
        self.b = 0
        
        # Pocket: store best weights
        best_w = self.w.copy()
        best_b = self.b
        best_accuracy = self._accuracy(X, y)
        
        for iteration in range(self.max_iter):
            # Find a misclassified point
            for xi, yi in zip(X, y):
                if yi * (np.dot(self.w, xi) + self.b) <= 0:
                    # Update weights
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    
                    # Check if this is better
                    current_acc = self._accuracy(X, y)
                    if current_acc > best_accuracy:
                        best_accuracy = current_acc
                        best_w = self.w.copy()
                        best_b = self.b
                    break
            else:
                # No misclassifications, we're done
                break
        
        # Return the pocket (best) weights
        self.w = best_w
        self.b = best_b
        return self
    
    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

Comparing Perceptron vs Pocket on Noisy Data

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Basic Perceptron (from ML-02)
class Perceptron:
    def __init__(self, learning_rate=1.0, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.w = None
        self.b = None
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0
        
        for _ in range(self.max_iter):
            misclassified = False
            for xi, yi in zip(X, y):
                if yi * (np.dot(self.w, xi) + self.b) <= 0:
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    misclassified = True
            if not misclassified:
                break
        return self
    
    def predict(self, X):
        return np.sign(np.dot(X, self.w) + self.b)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Generate non-separable data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           flip_y=0.15, random_state=42)
y = np.where(y == 0, -1, 1)  # Convert to -1, +1

# Train both algorithms
percep = Perceptron(max_iter=100)
percep.fit(X, y)

pocket = PocketPerceptron(max_iter=100)
pocket.fit(X, y)

print(f"Basic Perceptron Accuracy: {percep.score(X, y):.2%}")
print(f"Pocket Algorithm Accuracy: {pocket.score(X, y):.2%}")

Output:

Basic Perceptron Accuracy: 64.00%
Pocket Algorithm Accuracy: 86.00%

Train/Validation/Test Split

🐍 Python
from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)

print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X):.0%})")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X):.0%})")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X):.0%})")

Output:

Training set:   60 samples (60%)
Validation set: 20 samples (20%)
Test set:       20 samples (20%)
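
With the three subsets in place, model selection can proceed without touching the test set. The sketch below (a continuation of this example, not from the original post) tunes max_iter of the PocketPerceptron on the validation set and reports the test accuracy exactly once at the end:

🐍 Python
# Tune a hyperparameter on the validation set, report the test score once
best_max_iter, best_val_acc = None, -1.0
for max_iter in [10, 50, 200]:                  # candidate hyperparameter values
    model = PocketPerceptron(max_iter=max_iter)
    model.fit(X_train, y_train)                 # parameters learned on the training set
    val_acc = model.score(X_val, y_val)         # hyperparameters judged on the validation set
    if val_acc > best_val_acc:
        best_max_iter, best_val_acc = max_iter, val_acc

print(f"Best max_iter by validation accuracy: {best_max_iter}")

final = PocketPerceptron(max_iter=best_max_iter).fit(X_train, y_train)
print(f"Test accuracy (reported once): {final.score(X_test, y_test):.2%}")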

K-Fold Cross-Validation

🐍 Python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Perceptron

# Using sklearn's perceptron
clf = Perceptron(max_iter=1000, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.2%} (+/- {scores.std() * 2:.2%})")

Output:

Cross-validation scores: [0.95 0.75 0.5  0.8  0.85]
Mean accuracy: 77.00% (+/- 30.07%)

Hyperparameter Tuning with Cross-Validation

🐍 Python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization strength
    'gamma': [1, 0.1, 0.01, 0.001]  # Kernel coefficient
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    SVC(kernel='rbf', random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")

# Final evaluation on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2%}")

Output:

Best hyperparameters: {'C': 1, 'gamma': 1}
Best CV score: 86.67%
Test set accuracy: 85.00%

Visualizing Cross-Validation Results

🐍 Python
import pandas as pd

# Extract results
results = pd.DataFrame(grid_search.cv_results_)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_C',
    columns='param_gamma'
)

# Heatmap
plt.figure(figsize=(8, 5))
plt.imshow(pivot.values, cmap='YlGn', aspect='auto')
plt.colorbar(label='Mean CV Accuracy')
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.xlabel('gamma')
plt.ylabel('C (regularization)')
plt.title('Hyperparameter Grid Search Results')

for i in range(len(pivot.index)):
    for j in range(len(pivot.columns)):
        plt.text(j, i, f'{pivot.values[i, j]:.2f}',
                ha='center', va='center')

plt.tight_layout()
plt.savefig('assets/grid_search.png', dpi=150)
plt.show()

Grid Search Results

Deep Dive

Q1: How many folds should I use for cross-validation?

Common choices:

  • K=5 or K=10: Standard choices, good balance between bias and variance
  • K=N (Leave-One-Out): Maximum training data, but computationally expensive
  • Stratified K-Fold: Preserves class proportions in each fold (recommended for imbalanced data)
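
For the stratified variant mentioned above, a minimal sketch (reusing the X and y from the earlier examples; not from the original post) is to pass a StratifiedKFold splitter to cross_val_score:

🐍 Python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the class proportions similar in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = Perceptron(max_iter=1000, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf)
print(f"Stratified 5-fold accuracy: {scores.mean():.2%}")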

Q2: What if I have very little data?

  • Use higher K (more folds) to maximize training data
  • Consider Leave-One-Out cross-validation (a sketch follows this list)
  • Use data augmentation if applicable
  • Try simpler models with fewer parameters
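
For the Leave-One-Out option, a sketch (again reusing X and y from earlier; an illustration, not the post's own code) is to pass a LeaveOneOut splitter, which trains one model per sample:

🐍 Python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Every sample takes one turn as the validation "set" -> N models are trained
clf = Perceptron(max_iter=1000, random_state=42)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"LOO mean accuracy over {len(scores)} folds: {scores.mean():.2%}")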

Q3: Why not tune hyperparameters on the test set?

If you tune on the test set, you’re essentially “peeking” at it. The reported test accuracy becomes optimistically biased and won’t reflect real-world performance.
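
A small sketch (a synthetic demonstration, not from the original post) makes this bias concrete: among many purely random "models", the one that happens to score best on a small test set looks much better than chance there, yet performs at chance on genuinely new data.

🐍 Python
import numpy as np

rng = np.random.default_rng(0)
y_test_small = rng.choice([-1, 1], size=20)     # a small "test" set with random labels
y_fresh = rng.choice([-1, 1], size=20)          # genuinely unseen data

# "Tune" by keeping whichever of 100 random predictors scores best on the test set
best_acc, best_preds = 0.0, None
for _ in range(100):
    preds = rng.choice([-1, 1], size=20)
    acc = np.mean(preds == y_test_small)
    if acc > best_acc:
        best_acc, best_preds = acc, preds

print(f"Accuracy on the test set we selected on: {best_acc:.0%}")             # optimistically biased
print(f"Same predictor on fresh data: {np.mean(best_preds == y_fresh):.0%}")  # about chance level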

Summary

| Concept | Key Points |
|---|---|
| Pocket Algorithm | Keeps best weights during training |
| Parameters | Learned from data (w, b) |
| Hyperparameters | Set before training (learning rate, iterations) |
| Train/Val/Test Split | 60/20/20 is typical |
| K-Fold CV | Average performance across K folds |

References