SOLID Principles in Data Science

SOLID is an acronym for five design principles intended to make software designs more understandable, flexible, and maintainable. While often cited in object-oriented programming, they apply remarkably well to data science pipelines. Let's focus on the first two: S and O.

S: Single Responsibility Principle (SRP)

"A class (or function) should have only one reason to change."

In simple terms: Do one thing and do it well.

Bad: A "god function" that loads data, cleans it, trains a model, and plots results.

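import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def process_data_and_train():
    df = pd.read_csv("data.csv")  # load
    df = df.fillna(0)  # clean
    X, y = df.drop(columns="target"), df["target"]  # assumes a "target" label column
    model = RandomForestClassifier().fit(X, y)  # train
    plt.bar(X.columns, model.feature_importances_)  # plot
    plt.show()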

Good: Split it up.

def load_data(path): ...
def clean_data(df): ...
def train_model(df): ...
def plot_results(model): ...
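
To make the split concrete, here is a minimal, dependency-free sketch of the same idea. It is an illustration under stated assumptions, not the pipeline from the example above: rows are plain dicts instead of a DataFrame, the "model" is just the mean of a hypothetical target column `y`, and `report_results` stands in for `plot_results` so the snippet runs without matplotlib.

```python
import csv
import io
from statistics import mean


def load_data(source):
    """Only reads: turn a CSV file-like object into a list of dict rows."""
    return list(csv.DictReader(source))


def clean_data(rows):
    """Only cleans: cast values to float, filling empty cells with 0.0."""
    return [
        {key: float(value) if value != "" else 0.0 for key, value in row.items()}
        for row in rows
    ]


def train_model(rows, target="y"):
    """Only trains: a stand-in 'model' that stores the mean of the target."""
    return {"prediction": mean(row[target] for row in rows)}


def report_results(model):
    """Only reports: format the result (stand-in for plotting)."""
    return f"baseline prediction = {model['prediction']:.2f}"


# The pipeline is now a readable composition of single-purpose steps.
raw = io.StringIO("x,y\n1,2\n,4\n3,\n")  # in-memory CSV with two missing cells
rows = clean_data(load_data(raw))
model = train_model(rows)
print(report_results(model))
```

Each function now has exactly one reason to change: a new file format touches only `load_data`, a new imputation rule touches only `clean_data`, and so on.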

Why for DS?

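  • Debugging is easier, because a failure is isolated to one small step instead of a whole pipeline.
  • You can unit test the cleaning logic separately from the training logic.
  • You can reuse the same cleaning function at inference time, so training and prediction data are preprocessed the same way.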
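
The testing and reuse benefits can be sketched as follows. The names `clean_record` and `predict` are hypothetical, records are plain dicts, and the "model" is any callable; the point is only that the cleaning step can be tested on its own and then called unchanged at inference time.

```python
def clean_record(record, default=0.0):
    """Return a copy of the record with missing (None) values replaced."""
    return {key: default if value is None else value for key, value in record.items()}


# Unit tests for the cleaning logic alone: no data file, no model required.
def test_clean_record_fills_missing_values():
    assert clean_record({"age": None, "income": 50.0}) == {"age": 0.0, "income": 50.0}


def test_clean_record_does_not_mutate_input():
    original = {"age": None}
    clean_record(original)
    assert original == {"age": None}


# Reuse at inference: the same function runs before every prediction.
def predict(model, record):
    return model(clean_record(record))


test_clean_record_fills_missing_values()
test_clean_record_does_not_mutate_input()

toy_model = lambda record: record["age"] + record["income"]  # stand-in model
print(predict(toy_model, {"age": None, "income": 50.0}))
```

With a test runner such as pytest, the two `test_` functions would be collected automatically; they are called by hand here only to keep the sketch self-contained.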

O: Open/Closed Principle (OCP)

"Software entities should be open for extension, but closed for modification."

You should be able to add new functionality without changing existing code.

Scenario: You have a pipeline that runs a Random Forest. Now you want to try XGBoost.

Bad (Violating OCP): Modifying the existing function with if/else.

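from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def train(X, y, model_type):
    if model_type == 'rf':
        model = RandomForestClassifier()
    elif model_type == 'xgb': # <--- You had to edit this existing, working function
        model = XGBClassifier()
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    model.fit(X, y)
    return model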

Good (Following OCP): Using polymorphism or passing the strategy.

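from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def train(X, y, model_instance):
    # This function doesn't care WHAT model it is, as long as it has .fit()
    model_instance.fit(X, y)
    return model_instance

# Usage (X and y are your feature matrix and labels)
train(X, y, RandomForestClassifier())
train(X, y, XGBClassifier()) # <--- Added functionality without touching the train function!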
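
Sometimes the model still has to be chosen by name, for example from a config file. One way to keep `train` closed to modification in that case is a small registry. The sketch below is an assumption-laden illustration: `MODEL_REGISTRY`, `register`, and the two toy models are hypothetical names, and the models are trivial stand-ins rather than real estimators.

```python
MODEL_REGISTRY = {}


def register(name):
    """Decorator that records a model class in the registry under `name`."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register("mean")
class MeanModel:
    def fit(self, X, y):
        self.value_ = sum(y) / len(y)
        return self


# Adding a model means adding a class; train() below is never edited.
@register("max")
class MaxModel:
    def fit(self, X, y):
        self.value_ = max(y)
        return self


def train(X, y, model_name):
    """Look the model up by name and fit it; no if/else chain to extend."""
    model = MODEL_REGISTRY[model_name]()
    return model.fit(X, y)


X, y = [[0.0], [1.0]], [1.0, 3.0]
print(train(X, y, "mean").value_)
print(train(X, y, "max").value_)
```

An unknown name raises a `KeyError` at lookup, so a typo in the config fails loudly instead of silently skipping training.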

Why for DS?

  • It makes your experiment pipeline robust. You can add 10 new models without risking breaking the code that runs the experiment.
  • It encourages designing common interfaces (like Scikit-Learn's fit/predict).
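
A common interface can also be written down explicitly. The sketch below uses Python's `typing.Protocol` to describe a fit/predict contract in the spirit of Scikit-Learn's; the `Model` protocol, the two baseline classes, and `run_experiment` are hypothetical names for illustration, not part of any library.

```python
from typing import List, Protocol, Sequence


class Model(Protocol):
    """Anything with fit and predict satisfies this interface."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "Model": ...

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]: ...


class MeanBaseline:
    """Predicts the training mean for every row."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LastValueBaseline:
    """Predicts the last training target for every row."""

    def fit(self, X, y):
        self.last_ = y[-1]
        return self

    def predict(self, X):
        return [self.last_ for _ in X]


def run_experiment(model: Model, X, y) -> List[float]:
    """Experiment code depends only on the interface, not on a concrete model."""
    return model.fit(X, y).predict(X)


X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 9.0]
print(run_experiment(MeanBaseline(), X, y))
print(run_experiment(LastValueBaseline(), X, y))
```

Neither baseline inherits from `Model`; they conform structurally, which is also how `run_experiment` would accept a real estimator that exposes the same two methods.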

Summary

  • SRP: Keep functions small and focused.
  • OCP: Design systems that accept new behaviors as arguments/plugins rather than editing the source code.

Applying these two principles consistently is a solid first step from fragile scripts toward maintainable, professional-grade ML code.