DRY Principle: Don't Repeat Yourself
One of the most fundamental rules in software engineering is DRY (Don't Repeat Yourself). The opposite of DRY is WET (Write Everything Twice, or We Enjoy Typing).
For data scientists, copy-pasting code blocks is a common habit, especially when testing different models or processing similar datasets.
Why Copy-Pasting is Dangerous
- Maintenance Nightmare: If you find a bug in one copy, you have to remember to fix it in all other copies. You will forget one.
- Readability: It bloats your code. Reading 50 lines of identical logic three times is mentally exhausting.
The Rule of Three
A good rule of thumb is:
- First time: Just write the code.
- Second time: You can copy it, but wince a little.
- Third time: Refactor it into a function.
Example: Data Preprocessing
Imagine you are processing train and test sets separately.
The "WET" Way:
# Process training data
train_df['age'] = train_df['age'].fillna(train_df['age'].mean())
train_df['income'] = np.log(train_df['income'])
train_df = train_df[train_df['status'] == 'active']
# ... later in the code ...
# Process test data (Copy-pasted!)
test_df['age'] = test_df['age'].fillna(test_df['age'].mean()) # BUG: Should use train mean!
test_df['income'] = np.log(test_df['income'])
test_df = test_df[test_df['status'] == 'active']
Notice the bug? In the copy-pasted test section, we might accidentally use stats from the test set instead of the train set, causing data leakage. Or we might change the log transformation to square root in the train block and forget to update the test block.
The "DRY" Way:
def preprocess_data(df: pd.DataFrame, mean_age: float = None) -> pd.DataFrame:
"""
Applies standard preprocessing steps.
If mean_age is provided, uses it for filling NaNs (for test set).
Otherwise calculates it from the df (for train set).
"""
if mean_age is None:
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)
df['income'] = np.log(df['income'])
df = df[df['status'] == 'active']
return df, mean_age
# Usage
train_df_clean, train_age_mean = preprocess_data(train_df)
test_df_clean, _ = preprocess_data(test_df, mean_age=train_age_mean)
Benefits of DRY
- Single Source of Truth: Logic lives in one place. Change it once, update it everywhere.
- Less Code: Fewer lines of code mean fewer places for bugs to hide.
- Self-Documenting: The function name
preprocess_datatells you what is happening, hiding the how.
Next time you hit Ctrl+C Ctrl+V, stop and ask yourself: "Can I make this a function?"