SOLID Principles in Data Science
SOLID is an acronym for five design principles intended to make software designs more understandable, flexible, and maintainable. While often cited in object-oriented programming, they apply remarkably well to data science pipelines. Let's focus on the first two: S and O.
S: Single Responsibility Principle (SRP)
"A class (or function) should have only one reason to change."
In simple terms: Do one thing and do it well.
Bad: A "god function" that loads data, cleans it, trains a model, and plots results.
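import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def process_data_and_train():
    df = pd.read_csv("data.csv")
    df.fillna(0, inplace=True)
    X, y = df.drop(columns="target"), df["target"]  # "target" is an assumed label column
    model = RandomForestClassifier()
    model.fit(X, y)
    plt.plot(model.feature_importances_)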
Good: Split it up.
def load_data(path): ...
def clean_data(df): ...
def train_model(df): ...
def plot_results(model): ...
Why for DS?
- Debugging is easier: a failure points to one small function instead of somewhere inside a long script.
- You can unit test the cleaning logic separately from the training logic.
- You can reuse the cleaning function for inference.
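To make the last two points concrete, here is a minimal sketch in plain Python. It is deliberately dependency-free, so rows are dicts rather than a DataFrame. The column names (age, income), the fill value, and the toy mean-predicting "model" are assumptions for illustration only; a real pipeline would use its own cleaning rules and estimator.

```python
from statistics import mean

def clean_data(rows, fill_value=0.0):
    """Return new rows with missing (None) values replaced by fill_value."""
    return [
        {key: (fill_value if value is None else value) for key, value in row.items()}
        for row in rows
    ]

def train_model(rows, target="income"):
    """Toy stand-in for a real model: always predicts the mean of the target column."""
    average = mean(row[target] for row in rows)
    return lambda row: average

def test_clean_data_fills_missing_values():
    # Exercises the cleaning logic alone: no file I/O, no model, no plotting.
    raw = [{"age": 30, "income": None}, {"age": None, "income": 50.0}]
    cleaned = clean_data(raw)
    assert cleaned == [{"age": 30, "income": 0.0}, {"age": 0.0, "income": 50.0}]
    assert raw[0]["income"] is None  # the input rows are left untouched

# Training time
train_rows = clean_data([{"age": 30, "income": 40.0}, {"age": None, "income": 60.0}])
model = train_model(train_rows)

# Inference time: the same clean_data runs on new, unseen rows
new_rows = clean_data([{"age": None}])
predictions = [model(row) for row in new_rows]
```

Because clean_data has a single job, the test needs no CSV file and no trained model, and the inference path cannot drift away from the cleaning applied during training.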
O: Open/Closed Principle (OCP)
"Software entities should be open for extension, but closed for modification."
You should be able to add new functionality without changing existing code.
Scenario: You have a pipeline that runs a Random Forest. Now you want to try XGBoost.
Bad (Violating OCP): Modifying the existing function with if/else.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def train(X, y, model_type):
    if model_type == 'rf':
        model = RandomForestClassifier()
    elif model_type == 'xgb':  # <--- You had to edit this existing, working function
        model = XGBClassifier()
    model.fit(X, y)
    return model
Good (Following OCP): Using polymorphism or passing the strategy.
def train(X, y, model_instance):
    # This function doesn't care WHAT model it is, as long as it has .fit()
    model_instance.fit(X, y)
    return model_instance

# Usage (X is the feature matrix, y the labels)
train(X, y, RandomForestClassifier())
train(X, y, XGBClassifier())  # <--- Added functionality without touching the train function!
Why for DS?
- It makes your experiment pipeline robust. You can add 10 new models without touching, and therefore without risking, the code that runs the experiment.
- It encourages designing common interfaces (like scikit-learn's fit/predict).
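One way to make such a common interface explicit is a typing.Protocol. The sketch below is illustrative only: Model, run_experiment, MeanBaseline, and MedianBaseline are hypothetical names, and the toy baselines stand in for real estimators such as a random forest. Scoring on the training data is a simplification to keep the example short.

```python
from statistics import mean, median
from typing import Protocol, Sequence

class Model(Protocol):
    """Anything with fit and predict counts as a model (structural typing)."""
    def fit(self, X: Sequence[float], y: Sequence[float]) -> "Model": ...
    def predict(self, X: Sequence[float]) -> list[float]: ...

def run_experiment(models: dict[str, Model], X, y) -> dict[str, float]:
    """Fit each model and return its mean absolute error on the training data.

    Closed for modification: adding a model never requires editing this function.
    """
    scores = {}
    for name, model in models.items():
        predictions = model.fit(X, y).predict(X)
        scores[name] = mean(abs(p - t) for p, t in zip(predictions, y))
    return scores

class MeanBaseline:
    def fit(self, X, y):
        self.value_ = mean(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]

# Added later, in new code only: the interface is the extension point.
class MedianBaseline:
    def fit(self, X, y):
        self.value_ = median(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]

X = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 3.0, 10.0]
scores = run_experiment({"mean": MeanBaseline(), "median": MedianBaseline()}, X, y)
```

Neither baseline inherits from Model; each satisfies the protocol simply by providing fit and predict, which is the same duck-typed contract that train relies on above.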
Summary
- SRP: Keep functions small and focused.
- OCP: Design systems that accept new behaviors as arguments/plugins rather than editing the source code.
Applying these two principles consistently turns fragile scripts into pipelines that are easier to test, extend, and maintain.