Defensive Programming in Data Science

Data is messy. APIs fail. Servers run out of memory. If your code assumes "happy path" inputs, it will break—and usually at 3 AM. Defensive programming is the practice of anticipating failure and guarding against it.

1. Fail Fast with Assertions

When you make an assumption about your data, enforce it explicitly.

Assumption: "This dataframe has no nulls."

Defensive Code:

import pandas as pd

df = pd.read_csv("data.csv")
assert df.isnull().sum().sum() == 0, "Input data contains NA values!"

If the data is bad, the script crashes immediately, saving you from training a model on garbage data for 4 hours before realizing something was wrong.
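One caveat: Python removes `assert` statements when run with the `-O` flag, so a check that must hold in production is safer as an explicit raise. The following is a minimal sketch of that idea. The helper name `require_no_nulls` and the toy dataframes are hypothetical, chosen only for illustration.

```python
import pandas as pd


def require_no_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Raise ValueError if any column contains missing values."""
    null_counts = df.isnull().sum()
    bad = null_counts[null_counts > 0]
    if not bad.empty:
        # Name the offending columns so the failure is easy to debug.
        raise ValueError(f"Input data contains NA values: {bad.to_dict()}")
    return df


# Toy data (hypothetical) to show both outcomes.
clean = pd.DataFrame({"age": [31, 45], "city": ["Oslo", "Lima"]})
dirty = pd.DataFrame({"age": [31, None], "city": ["Oslo", "Lima"]})

require_no_nulls(clean)  # passes silently
try:
    require_no_nulls(dirty)
except ValueError as err:
    message = str(err)  # mentions the "age" column
```

Unlike a bare assertion, this version still runs under `python -O` and tells you which columns are affected.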

2. Parameter Validation

When writing functions, validate inputs first.

def calculate_metrics(y_true, y_pred):
    if len(y_true) != len(y_pred):
        raise ValueError(f"Length mismatch: {len(y_true)} vs {len(y_pred)}")
    # ... logic ...
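To make the pattern concrete, here is one way the function might look when filled in. This is a sketch only: the accuracy calculation stands in for whatever the real logic is, and the extra empty-input check is an assumption about what "valid" means here.

```python
from typing import Sequence


def calculate_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    # Validate before doing any work, so bad input fails with a clear message.
    if len(y_true) != len(y_pred):
        raise ValueError(f"Length mismatch: {len(y_true)} vs {len(y_pred)}")
    if len(y_true) == 0:
        raise ValueError("Cannot compute metrics on empty inputs")
    # Placeholder logic (an assumption): fraction of matching labels.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {"accuracy": correct / len(y_true)}


metrics = calculate_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```

Without the empty-input check, the division would raise a `ZeroDivisionError` that says nothing about the real problem.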

3. Graceful Error Handling (`try/except`)

Sometimes, you expect things to fail (e.g., a network request). Handle these known errors gracefully.

Bad:

response = requests.get(url)  # No timeout or error handling: any failure crashes the script
data = response.json()

Good:

import logging
import requests

logger = logging.getLogger(__name__)

try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # Raise an error for 4xx/5xx status codes
    data = response.json()
except requests.exceptions.RequestException as e:
    logger.error(f"Failed to fetch data: {e}")
    data = None  # Or return a default value / retry
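The "retry" option mentioned in the comment can be sketched as a small helper. This is an illustrative sketch, not a standard API: `call_with_retry`, its parameters, and the `flaky` function are all hypothetical names, and the exponential backoff schedule is just one reasonable choice.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    exceptions: tuple = (ConnectionError, TimeoutError),
) -> Optional[T]:
    """Call func, retrying on the listed exceptions with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as e:
            logger.warning("Attempt %d/%d failed: %s", attempt, retries, e)
            if attempt < retries:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                time.sleep(base_delay * 2 ** (attempt - 1))
    return None  # Caller must handle the "no data" case explicitly.


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retry(flaky, retries=3, base_delay=0.01)
```

With `requests`, you could pass something like `lambda: requests.get(url, timeout=5)` as `func` and `(requests.exceptions.RequestException,)` as `exceptions`. Only the listed exceptions are retried; anything unexpected still propagates and fails fast.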

4. Type Hinting (Again)

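Type hints are a form of defensive programming. Python does not enforce them at runtime, but a static type checker such as mypy (or a linter with type checking enabled) can catch type errors before the code even runs.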
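As a small illustration, consider the annotated function below. The function name `mean` and the sample values are assumptions made for this sketch.

```python
from typing import List


def mean(values: List[float]) -> float:
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


average = mean([2.0, 4.0, 9.0])

# A static checker such as mypy would flag a call like mean("2, 4, 9")
# before the script runs; Python itself only fails at runtime.
```

The hints cost nothing at runtime, and they also document what the function expects for the next person who reads it.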

Summary

Trust no one (especially not your data source).
Assert your assumptions.
Catch specific exceptions (never a bare `except:`).
Fail fast so you can fix it fast.