Modular Programming: From Notebooks to Production

As data scientists, we often start our work in Jupyter Notebooks. It's a fantastic environment for exploration, visualization, and quick feedback loops. However, as a project matures and moves toward production, the "giant notebook" approach becomes a bottleneck.

The Problem with Monolithic Scripts

Imagine a single notebook with 500 cells.

  • Debuggability: If cell 304 fails, you have to rerun cells 1-303 to reproduce the state.
  • Reusability: You can't easily import a function from a notebook into another script.
  • Collaboration: Merging changes in .ipynb files via Git is painful, because notebooks are stored as JSON that mixes code with outputs and metadata.

The Solution: Modular Programming

Modular programming is the process of breaking down a large codebase into smaller, independent, and interchangeable modules. In Python, a module is simply a .py file.

Key Principles

  1. Single Responsibility Principle (SRP): Each function or module should do one thing and do it well.
  2. Separation of Concerns: Keep your data loading logic separate from your model training logic (the sketch below illustrates both principles).
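
To make these concrete, here is a minimal sketch (the column and function names are illustrative, not from a real project). The first function mixes computation with presentation; the refactored pair gives each concern its own function:

import pandas as pd

# Violates SRP: computes a statistic AND prints a report in one function.
def summarize_and_print(df: pd.DataFrame) -> None:
    mean_value = df["value"].mean()
    print(f"Mean value: {mean_value:.2f}")

# SRP-compliant: one function computes, another presents.
def compute_mean(df: pd.DataFrame) -> float:
    """Return the mean of the 'value' column."""
    return df["value"].mean()

def print_report(mean_value: float) -> None:
    """Formatting lives here, not in the math."""
    print(f"Mean value: {mean_value:.2f}")

The refactored version is also easier to unit test, because compute_mean returns a value instead of printing it.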

How to Refactor

Step 1: Identify Functional Blocks

Look at your code and group lines that perform a specific task; the skeleton after this list shows one function per block.

  • Data Loading
  • Data Cleaning
  • Feature Engineering
  • Model Training
  • Evaluation
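
One way to picture the target shape is a skeleton with one stub per block. This is a sketch, not a prescribed API; the names and signatures are placeholders you would adapt to your project:

import pandas as pd

# One stub per functional block; bodies are intentionally omitted.
def load_data(filepath: str) -> pd.DataFrame: ...             # Data Loading
def clean_data(df: pd.DataFrame) -> pd.DataFrame: ...         # Data Cleaning
def engineer_features(df: pd.DataFrame) -> pd.DataFrame: ...  # Feature Engineering
def train_model(features: pd.DataFrame): ...                  # Model Training
def evaluate(model, features: pd.DataFrame) -> dict: ...      # Evaluation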

Step 2: Create Functions

Turn those blocks into functions.

Before (Script style):

# Cell 1
import pandas as pd
df = pd.read_csv('data.csv')
df = df.dropna()
# ... more cleaning

After (Functional style):

import pandas as pd

def load_and_clean_data(filepath: str) -> pd.DataFrame:
    """Loads data and performs initial cleaning."""
    df = pd.read_csv(filepath)
    df = df.dropna()
    return df
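
Note the type hints and the docstring: they make the function self-documenting, and because it returns the DataFrame rather than mutating shared state, the call site collapses to a single testable line: df = load_and_clean_data('data.csv').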

Step 3: Move to .py Modules

Create a directory structure like this:

project/
├── data_pipeline/
│   ├── __init__.py
│   ├── loader.py
│   └── cleaner.py
├── modeling/
│   ├── __init__.py
│   └── trainer.py
└── main.py
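
To make the layout concrete, here is a minimal sketch of what loader.py could contain. The load_data name matches the import used in main.py below; reading from CSV is just an assumption for illustration:

# data_pipeline/loader.py
import pandas as pd

def load_data(filepath: str) -> pd.DataFrame:
    """Read a raw CSV file into a DataFrame."""
    return pd.read_csv(filepath)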

Now, your main.py (or your notebook!) becomes very clean:

from data_pipeline.loader import load_data
from modeling.trainer import train_model

df = load_data('data.csv')
model = train_model(df)
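
One optional refinement: wrap the steps in a main() function and guard the entry point, so main.py can also be imported (for example, by tests) without immediately running the pipeline. A common pattern looks like this:

from data_pipeline.loader import load_data
from modeling.trainer import train_model

def main() -> None:
    df = load_data('data.csv')
    model = train_model(df)  # hand the model off to evaluation or saving here

if __name__ == "__main__":
    main()  # runs only when executed directly, not when imported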

Benefits

  1. Testing: You can write unit tests for cleaner.py without loading the entire dataset (see the sketch after this list).
  2. Readability: New team members can understand the high-level logic by just reading main.py.
  3. Maintainability: If the data source changes, you only edit loader.py.
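
For example, a first test for the cleaning step could look like this. It is a sketch: pytest is assumed as the test runner, and drop_missing is a hypothetical function living in data_pipeline/cleaner.py:

# tests/test_cleaner.py
import pandas as pd
from data_pipeline.cleaner import drop_missing  # hypothetical helper

def test_drop_missing_removes_na_rows():
    df = pd.DataFrame({"value": [1.0, None, 3.0]})
    cleaned = drop_missing(df)
    assert cleaned["value"].isna().sum() == 0  # no missing values remain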

Start small. Next time you find yourself scrolling up and down a long notebook, take that as a sign to extract a function and move it to a .py file.