Python for Data Science & Machine Learning: Complete 10-Section Study Guide
Python 3.11 | Updated 2025
From Data Wrangling to Model Deployment
🐍 Core Python Fundamentals
Python is the backbone of modern Data Science and ML workflows. Key libraries include numpy, pandas, and matplotlib.
NumPy
Numerical computing with multi-dimensional arrays
Use for: Matrix operations, vectorized math, statistical functions
Key methods: np.array(), np.mean(), np.dot()
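A quick sketch of those calls on a small array:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
print(np.mean(a))     # 2.5 (mean of all four elements)
print(np.dot(a, a))   # matrix product: [[ 7 10] [15 22]]
```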
Pandas
Data wrangling and manipulation
Use for: DataFrames, cleaning, joining, groupby operations
Key methods: pd.DataFrame(), df.merge(), df.groupby()
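For instance, a minimal groupby on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'score': [1, 3, 5]})
means = df.groupby('team')['score'].mean()
print(means)   # A: 2.0, B: 5.0
```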
Matplotlib
Data visualization
Use for: Line plots, histograms, scatter plots
Key methods: plt.plot(), plt.hist()
Version Compatibility
| Python Version | NumPy | Pandas |
|---|---|---|
| 3.9 | 1.24 | 1.5 |
| 3.10 | 1.25 | 2.0 |
| 3.11 | 1.26 | 2.1 |
| 3.12 | 1.26 | 2.2 |
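To check which versions are actually installed in your environment:

```python
import sys
import numpy as np
import pandas as pd

print(sys.version_info[:2])             # e.g. (3, 11)
print(np.__version__, pd.__version__)   # installed library versions
```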
🐼 Pandas Power Techniques
Master Pandas for efficient data wrangling: method chaining, groupby operations, merging, handling missing data, and performance tuning.
Method Chaining Example
df_clean = (df
.dropna()
.query('age > 18')
.assign(age_group=lambda x: pd.cut(x.age, bins=[0,30,60,100]))
.groupby('age_group')['salary'].mean()
)
# Anti-pattern: row-by-row loop (slow)
for i, row in df.iterrows():
    df.at[i, 'new'] = row['A'] * 2

# Vectorized equivalent (fast)
df['new'] = df['A'] * 2
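The chaining pattern above, runnable end-to-end on a small made-up DataFrame (the column names and bin edges are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 70, 15],
                   'salary': [30_000, 50_000, 40_000, 10_000]})
result = (df
    .dropna()
    .query('age > 18')                     # drops the age-15 row
    .assign(age_group=lambda x: pd.cut(x.age, bins=[0, 30, 60, 100]))
    .groupby('age_group', observed=True)['salary'].mean()
)
print(result)   # mean salary per age bracket
```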
🔢 NumPy Essentials
NumPy allows high-performance computation on large arrays. Broadcasting eliminates loops and enables vectorized operations.
Broadcasting Example
import numpy as np

# Shape (3,1) + Shape (4,) → Shape (3,4)
a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30, 40])
result = a + b  # auto-broadcasts to shape (3, 4)
Broadcasting avoids explicit loops by expanding array dimensions automatically.
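A common practical use of this rule is centering each column of a matrix; this sketch assumes a small example matrix:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)
col_means = X.mean(axis=0)        # shape (4,)
X_centered = X - col_means        # (3,4) - (4,) broadcasts across rows
print(X_centered.mean(axis=0))    # each column now has mean ~0
```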
Speed Comparison
# In IPython/Jupyter, time each approach with the %timeit line magic
# (assumes numpy is imported as np):
%timeit [x*2 for x in range(10000)]   # Python list comprehension
%timeit np.arange(10000) * 2          # NumPy: ~29x faster (machine-dependent)
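Outside Jupyter, the standard-library timeit module gives the same comparison; the exact speedup varies by machine:

```python
import timeit
import numpy as np

t_list = timeit.timeit(lambda: [x * 2 for x in range(10_000)], number=200)
t_numpy = timeit.timeit(lambda: np.arange(10_000) * 2, number=200)
print(f'NumPy speedup: {t_list / t_numpy:.0f}x')   # machine-dependent
```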
🤖 ML Workflow Pipeline
End-to-end ML pipeline from data preparation to evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('data.csv')
X = df.drop(columns=['target'])   # assumes a 'target' label column
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
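A fully runnable version of the same pipeline, substituting scikit-learn's built-in Iris dataset for the data.csv file:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_test, y_test))   # typically very high on this easy dataset
```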
⚙️ Data Preprocessing
Preprocessing ensures your data is clean, scaled, and ready for ML models. Key steps include handling missing values, scaling, and encoding.
Scaling Decision Tree
Choose the correct scaler based on data characteristics:
- Outliers present: use RobustScaler
- Normally distributed: use StandardScaler
- Other distributions: use MinMaxScaler
Example: StandardScaler maps [1, 100, 10000] to roughly [-0.72, -0.70, 1.41]
# StandardScaler example
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
🤖 Model Selection
Choosing the right algorithm is crucial. Consider linear vs tree-based models, complexity, and interpretability.
Linear Regression
Use when: Linear relationship, few features
Pros: Fast, interpretable
Cons: Assumes linearity
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
Random Forest
Use when: Non-linear, feature interactions
Pros: Handles non-linearity, feature importance
Cons: Slower, less interpretable
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100).fit(X, y)
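A quick comparison sketch on synthetic linear data (the make_regression parameters are arbitrary illustrative choices; on genuinely linear data the linear model fits very well):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(lin.score(X, y), rf.score(X, y))   # R^2 on training data
```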
📈 Evaluation Metrics
Evaluate models using appropriate metrics and visuals like confusion matrices.
Classification Metrics
- Accuracy: Correct predictions / Total predictions
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1 Score: Harmonic mean of Precision & Recall
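All four metrics are available directly in scikit-learn; here they are computed on the same toy labels used in the confusion-matrix example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 1]
print(accuracy_score(y_true, y_pred))    # 0.75  (3 of 4 correct)
print(precision_score(y_true, y_pred))   # 1.0   (no false positives)
print(recall_score(y_true, y_pred))      # 0.5   (1 of 2 positives found)
print(f1_score(y_true, y_pred))          # ~0.667 (harmonic mean)
```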
Confusion Matrix Example
from sklearn.metrics import confusion_matrix
y_true = [0,1,0,1]
y_pred = [0,0,0,1]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Output:
# [[2 0]
# [1 1]]
⚡ Feature Engineering
Create meaningful features to improve model performance.
Encoding Example
| Technique | Use Case | Example |
|---|---|---|
| Label Encoding | Ordinal data | LabelEncoder().fit_transform(['Low','High']) |
| One-Hot Encoding | Nominal data | pd.get_dummies(df['color']) |
Input: ['red', 'blue', 'green', 'red']

| red | blue | green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
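Both techniques in action; note that LabelEncoder orders classes alphabetically, not by ordinal meaning:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

print(pd.get_dummies(pd.Series(['red', 'blue', 'green', 'red'])))
codes = LabelEncoder().fit_transform(['Low', 'Medium', 'High'])
print(codes)   # [1 2 0] -- alphabetical, NOT Low<Medium<High
```

For genuinely ordinal data, an explicit mapping such as `df['col'].map({'Low': 0, 'Medium': 1, 'High': 2})` preserves the intended order.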
🛠️ Tools & Ecosystem
Essential tools for DS & ML workflows: libraries, visualization, notebooks, and environment management.
Scikit-learn
Machine Learning algorithms and utilities
pip install scikit-learn
Matplotlib & Seaborn
Visualization libraries
pip install matplotlib seaborn
Jupyter Notebook
Interactive Python environment
pip install notebook
🚀 Resources & Deployment
Deploy models, access datasets, and practice hands-on skills.
🎯 Practice Challenge: Titanic Dataset
Feature engineering & preprocessing.
- Create FamilySize = SibSp + Parch + 1
- Extract Title from the Name column
- Create an IsAlone boolean
💡 Hint
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
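The hint runs end-to-end on a tiny made-up sample with Titanic-style columns (not the real dataset):

```python
import pandas as pd

# Made-up rows shaped like the Titanic CSV
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'SibSp': [1, 0],
    'Parch': [0, 0],
})
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
print(df[['FamilySize', 'Title', 'IsAlone']])
```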
Deploy Model API (Flask)
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    pred = model.predict([data['features']])
    return jsonify({'prediction': pred.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
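The endpoint can be exercised without a trained model on disk by using Flask's in-process test client with a stub; StubModel below is a stand-in for joblib.load('model.pkl'):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class StubModel:                      # stand-in for a real trained model
    def predict(self, X):
        return [sum(row) for row in X]

model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    pred = model.predict([data['features']])
    return jsonify({'prediction': list(pred)})

# Call the route in-process, no server needed
resp = app.test_client().post('/predict', json={'features': [1, 2, 3]})
print(resp.get_json())   # {'prediction': [6]}
```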