Python for Data Science & Machine Learning: Complete 10-Section Study Guide
Python 3.11 | Updated 2025
From Data Wrangling to Model Deployment
🐍 Core Python Fundamentals
Python is the backbone of modern Data Science and ML workflows. Key libraries include numpy, pandas, and matplotlib.
NumPy
Numerical computing with multi-dimensional arrays
Use for: Matrix operations, vectorized math, statistical functions
Key methods: np.array(), np.mean(), np.dot()
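A quick sketch of those calls on a small array:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
print(np.mean(a))     # 2.5 (mean of all four elements)
print(np.dot(a, a))   # matrix product: [[ 7 10] [15 22]]
```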
Pandas
Data wrangling and manipulation
Use for: DataFrames, cleaning, joining, groupby operations
Key methods: pd.DataFrame(), df.merge(), df.groupby()
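For instance, a minimal groupby on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'], 'score': [1, 3, 5]})
means = df.groupby('team')['score'].mean()
print(means)   # A: 2.0, B: 5.0
```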
Matplotlib
Data visualization
Use for: Line plots, histograms, scatter plots
Key methods: plt.plot(), plt.hist()
Version Compatibility
| Python Version | NumPy | Pandas |
|---|---|---|
| 3.9 | 1.24 | 1.5 |
| 3.10 | 1.25 | 2.0 |
| 3.11 | 1.26 | 2.1 |
| 3.12 | 1.26 | 2.2 |
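To check which versions are actually installed in your environment:

```python
import sys
import numpy as np
import pandas as pd

print(sys.version_info[:2])             # e.g. (3, 11)
print(np.__version__, pd.__version__)   # installed library versions
```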
🐼 Pandas Power Techniques
Master Pandas for efficient data wrangling: method chaining, groupby operations, merging, handling missing data, and performance tuning.
Method Chaining Example
df_clean = (df
.dropna()
.query('age > 18')
.assign(age_group=lambda x: pd.cut(x.age, bins=[0,30,60,100]))
.groupby('age_group')['salary'].mean()
)
# Anti-pattern: row-by-row loop (slow)
for i, row in df.iterrows():
    df.at[i, 'new'] = row['A'] * 2

# Vectorized equivalent (fast)
df['new'] = df['A'] * 2
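The chaining pattern above, runnable end-to-end on a small made-up DataFrame (the column names and bin edges are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 70, 15],
                   'salary': [30_000, 50_000, 40_000, 10_000]})
result = (df
    .dropna()
    .query('age > 18')                     # drops the age-15 row
    .assign(age_group=lambda x: pd.cut(x.age, bins=[0, 30, 60, 100]))
    .groupby('age_group', observed=True)['salary'].mean()
)
print(result)   # mean salary per age bracket
```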
🔢 NumPy Essentials
NumPy allows high-performance computation on large arrays. Broadcasting eliminates loops and enables vectorized operations.
Broadcasting Example
import numpy as np

# Shape (3,1) + Shape (4,) → Shape (3,4)
a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30, 40])
result = a + b  # auto-broadcasts to shape (3, 4)
Broadcasting avoids explicit loops by expanding array dimensions automatically.
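A common practical use of this rule is centering each column of a matrix; this sketch assumes a small example matrix:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)
col_means = X.mean(axis=0)        # shape (4,)
X_centered = X - col_means        # (3,4) - (4,) broadcasts across rows
print(X_centered.mean(axis=0))    # each column now has mean ~0
```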
Speed Comparison
# In IPython/Jupyter, time each approach with the %timeit line magic
# (assumes numpy is imported as np):
%timeit [x*2 for x in range(10000)]   # Python list comprehension
%timeit np.arange(10000) * 2          # NumPy: ~29x faster (machine-dependent)
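Outside Jupyter, the standard-library timeit module gives the same comparison; the exact speedup varies by machine:

```python
import timeit
import numpy as np

t_list = timeit.timeit(lambda: [x * 2 for x in range(10_000)], number=200)
t_numpy = timeit.timeit(lambda: np.arange(10_000) * 2, number=200)
print(f'NumPy speedup: {t_list / t_numpy:.0f}x')   # machine-dependent
```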
🤖 ML Workflow Pipeline
End-to-end ML pipeline from data preparation to evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('data.csv')
X = df.drop(columns=['target'])   # assumes a 'target' label column
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
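A fully runnable version of the same pipeline, substituting scikit-learn's built-in Iris dataset for the data.csv file:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_test, y_test))   # typically very high on this easy dataset
```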
⚙️ Data Preprocessing
Preprocessing ensures your data is clean, scaled, and ready for ML models. Key steps include handling missing values, scaling, and encoding.
Scaling Decision Tree
Choose the correct scaler based on data characteristics:
- Outliers present: use RobustScaler
- Normally distributed: use StandardScaler
- Other distributions: use MinMaxScaler
Example: StandardScaler maps [1, 100, 10000] to roughly [-0.72, -0.70, 1.41]
# StandardScaler example
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
🤖 Model Selection
Choosing the right algorithm is crucial. Consider linear vs tree-based models, complexity, and interpretability.
Linear Regression
Use when: Linear relationship, few features
Pros: Fast, interpretable
Cons: Assumes linearity
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
Random Forest
Use when: Non-linear, feature interactions
Pros: Handles non-linearity, feature importance
Cons: Slower, less interpretable
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100).fit(X, y)
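A quick comparison sketch on synthetic linear data (the make_regression parameters are arbitrary illustrative choices; on genuinely linear data the linear model fits very well):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
lin = LinearRegression().fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(lin.score(X, y), rf.score(X, y))   # R^2 on training data
```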
📈 Evaluation Metrics
Evaluate models using appropriate metrics and visuals like confusion matrices.
Classification Metrics
- Accuracy: Correct predictions / Total predictions
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- F1 Score: Harmonic mean of Precision & Recall
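All four metrics are available directly in scikit-learn; here they are computed on the same toy labels used in the confusion-matrix example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 1]
print(accuracy_score(y_true, y_pred))    # 0.75  (3 of 4 correct)
print(precision_score(y_true, y_pred))   # 1.0   (no false positives)
print(recall_score(y_true, y_pred))      # 0.5   (1 of 2 positives found)
print(f1_score(y_true, y_pred))          # ~0.667 (harmonic mean)
```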
Confusion Matrix Example
from sklearn.metrics import confusion_matrix
y_true = [0,1,0,1]
y_pred = [0,0,0,1]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Output:
# [[2 0]
# [1 1]]
⚡ Feature Engineering
Create meaningful features to improve model performance.
Encoding Example
| Technique | Use Case | Example |
|---|---|---|
| Label Encoding | Ordinal data | LabelEncoder().fit_transform(['Low','High']) |
| One-Hot Encoding | Nominal data | pd.get_dummies(df['color']) |
Input: ['red', 'blue', 'green', 'red']

| red | blue | green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
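Both techniques in action; note that LabelEncoder orders classes alphabetically, not by ordinal meaning:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

print(pd.get_dummies(pd.Series(['red', 'blue', 'green', 'red'])))
codes = LabelEncoder().fit_transform(['Low', 'Medium', 'High'])
print(codes)   # [1 2 0] -- alphabetical, NOT Low<Medium<High
```

For genuinely ordinal data, an explicit mapping such as `df['col'].map({'Low': 0, 'Medium': 1, 'High': 2})` preserves the intended order.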
🛠️ Tools & Ecosystem
Essential tools for DS & ML workflows: libraries, visualization, notebooks, and environment management.
Scikit-learn
Machine Learning algorithms and utilities
pip install scikit-learn
Matplotlib & Seaborn
Visualization libraries
pip install matplotlib seaborn
Jupyter Notebook
Interactive Python environment
pip install notebook
🚀 Resources & Deployment
Deploy models, access datasets, and practice hands-on skills.
🎯 Practice Challenge: Titanic Dataset
Feature engineering & preprocessing.
- Create FamilySize = SibSp + Parch + 1
- Extract Title from the Name column
- Create an IsAlone boolean
💡 Hint
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
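The hint runs end-to-end on a tiny made-up sample with Titanic-style columns (not the real dataset):

```python
import pandas as pd

# Made-up rows shaped like the Titanic CSV
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'SibSp': [1, 0],
    'Parch': [0, 0],
})
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
print(df[['FamilySize', 'Title', 'IsAlone']])
```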
Deploy Model API (Flask)
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    pred = model.predict([data['features']])
    return jsonify({'prediction': pred.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
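The endpoint can be exercised without a trained model on disk by using Flask's in-process test client with a stub; StubModel below is a stand-in for joblib.load('model.pkl'):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class StubModel:                      # stand-in for a real trained model
    def predict(self, X):
        return [sum(row) for row in X]

model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    pred = model.predict([data['features']])
    return jsonify({'prediction': list(pred)})

# Call the route in-process, no server needed
resp = app.test_client().post('/predict', json={'features': [1, 2, 3]})
print(resp.get_json())   # {'prediction': [6]}
```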