Python for Data Science & Machine Learning: Complete 10-Section Study Guide

🐍 Python for Data Science & ML Study Guide (2025 Edition)

Python 3.11 | Updated 2025

From Data Wrangling to Model Deployment

🔧Data Prep
🤖Model Building
📈Evaluation
🚀Deployment

🐍 Core Python Fundamentals

Python is the backbone of modern data science and ML workflows. Key libraries include NumPy, pandas, and Matplotlib.

NumPy

Numerical computing with multi-dimensional arrays

Use for: Matrix operations, vectorized math, statistical functions
Key methods: np.array(), np.mean(), np.dot()
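A minimal sketch of these three calls in action (toy values):

```python
import numpy as np

# Build a 2x2 matrix and a vector
A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])

print(np.mean(A))    # average of all elements -> 2.5
print(np.dot(A, v))  # matrix-vector product -> [3. 7.]
```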

Pandas

Data wrangling and manipulation

Use for: DataFrames, cleaning, joining, groupby operations
Key methods: pd.DataFrame(), df.merge(), df.groupby()
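A small sketch tying the three methods together (toy tables with invented names):

```python
import pandas as pd

employees = pd.DataFrame({'name': ['Ann', 'Bob', 'Cal'],
                          'dept_id': [1, 2, 1],
                          'salary': [50, 65, 70]})
depts = pd.DataFrame({'dept_id': [1, 2], 'dept': ['Eng', 'Ops']})

# Join the two tables, then average salary per department
merged = employees.merge(depts, on='dept_id')
avg = merged.groupby('dept')['salary'].mean()
print(avg)  # Eng -> 60.0, Ops -> 65.0
```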

Matplotlib

Data visualization

Use for: Line plots, histograms, scatter plots
Key methods: plt.plot(), plt.hist()
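A quick sketch of both plot types, saved to a hypothetical `quadratic.png` (the Agg backend keeps it runnable without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window needed
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3]
ys = [0, 1, 4, 9]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(xs, ys)       # line plot
ax2.hist(ys, bins=4)   # histogram
fig.savefig('quadratic.png')
```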

Version Compatibility

Python Version   NumPy            Pandas
3.9              1.24             1.5
3.10             1.25             1.6
3.11             1.26             2.1
3.12             1.27 (preview)   2.2 (preview)

🐼 Pandas Power Techniques

Master Pandas for efficient data wrangling: method chaining, groupby operations, merging, handling missing data, and performance tuning.

Method Chaining Example

df_clean = (df
    .dropna()
    .query('age > 18')
    .assign(age_group=lambda x: pd.cut(x.age, bins=[0,30,60,100]))
    .groupby('age_group')['salary'].mean()
)
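On a small invented DataFrame, the same chain behaves like this (`observed=True` silences the categorical-groupby warning in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({'age': [15, 25, 45, 70],
                   'salary': [0, 40, 80, 30]})

result = (df
    .dropna()
    .query('age > 18')
    .assign(age_group=lambda x: pd.cut(x.age, bins=[0, 30, 60, 100]))
    .groupby('age_group', observed=True)['salary'].mean()
)
print(result)  # one mean salary per age bin: 40.0, 80.0, 30.0
```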
❌ Slow (Loop)
for i, row in df.iterrows():
    df.at[i, 'new'] = row['A'] * 2
✅ Fast (Vectorized)
df['new'] = df['A'] * 2

🔢 NumPy Essentials

NumPy allows high-performance computation on large arrays. Broadcasting eliminates loops and enables vectorized operations.

Broadcasting Example

# Shape (3,1) + Shape (4,) → Shape (3,4)
a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30, 40])
result = a + b  # Auto-broadcasts!

Broadcasting avoids explicit loops by expanding array dimensions automatically.

Speed Comparison

%timeit [x*2 for x in range(10000)]  # Python list comprehension

%timeit np.arange(10000) * 2         # NumPy: often ~30x faster

🤖 ML Workflow Pipeline

End-to-end ML pipeline from data preparation to evaluation.

1. Load Data
df = pd.read_csv('data.csv')
2. Split Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
3. Train Model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
4. Evaluate
accuracy = model.score(X_test, y_test)
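The four steps can be stitched into one runnable sketch; `make_classification` stands in here for the `data.csv` file:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f'Test accuracy: {accuracy:.2f}')
```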

⚙️ Data Preprocessing

Preprocessing ensures your data is clean, scaled, and ready for ML models. Key steps include handling missing values, scaling, and encoding.

Scaling Decision Tree

Choose the correct scaler based on data characteristics:

  • Outliers present: Use RobustScaler
  • Normally distributed: Use StandardScaler
  • Other distributions: Use MinMaxScaler
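A sketch contrasting two of these scalers on data with one extreme outlier (values invented): the outlier drags StandardScaler's mean and standard deviation, while RobustScaler (median and IQR based) keeps the bulk of the data near zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [1000.0]])  # one extreme outlier

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

print(std.ravel())  # outlier pulls the other points well below zero
print(rob.ravel())  # non-outlier points stay close to zero
```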
Before Scaling
[1, 100, 10000]
After StandardScaler
[-0.72, -0.70, 1.41]
# StandardScaler example
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

🤖 Model Selection

Choosing the right algorithm is crucial. Consider linear vs tree-based models, complexity, and interpretability.

Linear Regression

Use when: Linear relationship, few features

Pros: Fast, interpretable

Cons: Assumes linearity

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)

Random Forest

Use when: Non-linear, feature interactions

Pros: Handles non-linearity, feature importance

Cons: Slower, less interpretable

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100).fit(X, y)

📈 Evaluation Metrics

Evaluate models using appropriate metrics and visuals like confusion matrices.

Classification Metrics

  • Accuracy: Correct predictions / Total predictions
  • Precision: True Positives / (True Positives + False Positives)
  • Recall: True Positives / (True Positives + False Negatives)
  • F1 Score: Harmonic mean of Precision & Recall

Confusion Matrix Example

from sklearn.metrics import confusion_matrix
y_true = [0,1,0,1]
y_pred = [0,0,0,1]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Output:
# [[2 0]
#  [1 1]]
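For the same y_true/y_pred, the four metrics listed above come out as:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))   # 3/4 = 0.75
print(precision_score(y_true, y_pred))  # 1/(1+0) = 1.0
print(recall_score(y_true, y_pred))     # 1/(1+1) = 0.5
print(f1_score(y_true, y_pred))         # 2*1.0*0.5/1.5 ≈ 0.667
```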

⚡ Feature Engineering

Create meaningful features to improve model performance.

Encoding Example

Technique          Use Case       Example
Label Encoding     Ordinal data   LabelEncoder().fit_transform(['Low','High'])
One-Hot Encoding   Nominal data   pd.get_dummies(df['color'])
Original Data
['red', 'blue', 'green', 'red']
After One-Hot Encoding
red  blue  green
 1     0      0
 0     1      0
 0     0      1
 1     0      0
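Both techniques on the same color list, as a runnable sketch (note that LabelEncoder assigns integers in alphabetical order of the categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'red']

# Label encoding: one integer per category (alphabetical: blue=0, green=1, red=2)
labels = LabelEncoder().fit_transform(colors)
print(labels)  # [2 0 1 2]

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(pd.Series(colors))
print(onehot)  # columns: blue, green, red
```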

🛠️ Tools & Ecosystem

Essential tools for DS & ML workflows: libraries, visualization, notebooks, and environment management.

Scikit-learn

Machine Learning algorithms and utilities

pip install scikit-learn

Matplotlib & Seaborn

Visualization libraries

pip install matplotlib seaborn

Jupyter Notebook

Interactive Python environment

pip install notebook

🚀 Resources & Deployment

Deploy models, access datasets, and practice hands-on skills.

🎯 Practice Challenge: Titanic Dataset

Feature engineering & preprocessing.

  1. Create FamilySize = SibSp + Parch + 1
  2. Extract Title from Name column
  3. Create IsAlone boolean
💡 Hint
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
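On a toy frame (invented rows, not the real Titanic data), all three engineered features come out like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen', 'Cumings, Mrs. John', 'Allen, Miss. Elisabeth'],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
print(df[['FamilySize', 'Title', 'IsAlone']])
```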

Deploy Model API (Flask)

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    pred = model.predict([data['features']])
    return jsonify({'prediction': pred.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
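Flask's built-in test client lets you exercise the endpoint without starting a server. This sketch mirrors the app above but swaps in a stub model (`StubModel` is invented here) in place of the joblib file:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class StubModel:
    # Stand-in for the joblib-loaded model
    def predict(self, rows):
        return [sum(r) for r in rows]

model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    pred = model.predict([data['features']])
    return jsonify({'prediction': list(pred)})

# Exercise the route in-process, no server needed
client = app.test_client()
resp = client.post('/predict', json={'features': [1, 2, 3]})
print(resp.get_json())  # {'prediction': [6]}
```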

Google Colab Starter

Interactive notebook for all code examples.