
How to Learn Machine Learning for Beginners

Machine Learning is a branch of AI that enables computers to learn from data. Let's work through the fundamentals.

What is Machine Learning?

Definition

Machine Learning means:
- Computers learn from data
- Without being explicitly programmed
- Performance improves with experience
- Pattern recognition and prediction

Types of Machine Learning

1. Supervised Learning
   - Has a label/target
   - Learns from examples
   - Classification & Regression

2. Unsupervised Learning
   - No labels
   - Finds patterns
   - Clustering & Dimensionality reduction

3. Reinforcement Learning
   - Learns from reward/punishment
   - Trial and error
   - Games, robotics (see the sketch after this list)
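
Scikit-learn does not cover reinforcement learning, but the trial-and-error idea fits in a few lines. Below is a minimal, purely illustrative sketch of an epsilon-greedy agent on a hypothetical two-armed bandit; the reward probabilities are made up for the example.

import random

# Hypothetical two-armed bandit: arm 1 pays off more often than arm 0
reward_prob = [0.3, 0.7]
values = [0.0, 0.0]  # estimated value of each arm
counts = [0, 0]
epsilon = 0.1  # exploration rate

for step in range(1000):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if random.random() < epsilon:
        arm = random.randrange(2)
    else:
        arm = values.index(max(values))
    reward = 1 if random.random() < reward_prob[arm] else 0
    # Incremental mean update of the value estimate
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(values)  # estimates approach [0.3, 0.7]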

Prerequisites

Mathematics

Linear Algebra:
- Vectors and matrices
- Matrix operations
- Eigenvalues/eigenvectors (see the NumPy sketch below)
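
These objects map directly onto NumPy. A minimal sketch, assuming nothing beyond NumPy itself:

import numpy as np

v = np.array([1, 2])            # vector
A = np.array([[2, 0], [0, 3]])  # matrix

print(A @ v)  # matrix-vector product
print(A @ A)  # matrix-matrix product
print(A.T)    # transpose

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # [2. 3.] for this diagonal matrix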

Statistics:
- Probability distributions
- Bayes' theorem (worked example below)
- Hypothesis testing
- Mean, variance, standard deviation
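
As a worked example of Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), here with made-up numbers for a diagnostic test:

# Hypothetical: 1% prevalence, 95% sensitivity, 10% false positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.10

# Law of total probability: overall chance of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive test)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3f}")  # about 0.088, lower than intuition suggests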

Calculus:
- Derivatives
- Gradient descent (sketched below)
- Chain rule
- Partial derivatives
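
Derivatives matter because gradient descent, the workhorse of model training, follows them downhill. A minimal sketch minimizing f(x) = (x - 3)^2, whose derivative is 2(x - 3):

# Plain gradient descent on f(x) = (x - 3)**2
x = 0.0  # starting point
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (x - 3)         # f'(x)
    x -= learning_rate * gradient  # step against the gradient

print(x)  # converges to 3.0, the minimizer of f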

Programming

Python basics:
- Variables, data types
- Control flow
- Functions
- OOP

Libraries (quick example below):
- NumPy (arrays)
- Pandas (data manipulation)
- Matplotlib/Seaborn (visualization)
- Scikit-learn (ML algorithms)
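
A quick, self-contained taste of the first three (scikit-learn is demonstrated throughout the rest of this guide), using fabricated data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: generate synthetic data
heights = np.random.normal(170, 10, size=100)

# Pandas: wrap in a DataFrame and summarize
df = pd.DataFrame({'height_cm': heights})
print(df.describe())

# Matplotlib: visualize the distribution
df['height_cm'].hist(bins=20)
plt.xlabel('Height (cm)')
plt.show()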

Supervised Learning

Classification

# Binary Classification Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))

Regression

# Linear Regression Example
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 3 + np.random.randn(100)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Common Algorithms

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Decision Tree
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# SVM
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train_scaled, y_train)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)

Unsupervised Learning

Clustering

from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Find optimal K (Elbow method)
inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Gaussian Mixture
gmm = GaussianMixture(n_components=4, random_state=42)
gmm_labels = gmm.fit_predict(X)

Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2f}")

# t-SNE (for visualization)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1])
axes[0].set_title('PCA')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1])
axes[1].set_title('t-SNE')
plt.show()

Model Evaluation

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve
)

# Basic metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# ROC Curve
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Regression Metrics

import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")

Cross-Validation

from sklearn.model_selection import cross_val_score, KFold

# K-Fold Cross Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Feature Engineering

Handling Missing Values

import pandas as pd
from sklearn.impute import SimpleImputer

# df: your pandas DataFrame (e.g., df = pd.read_csv('data.csv'))
# Check missing values
print(df.isnull().sum())

# Drop
df_clean = df.dropna()

# Impute with mean/median/mode
imputer = SimpleImputer(strategy='mean')  # 'median', 'most_frequent'
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

Encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label Encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['category'])

# Or using sklearn (sparse_output replaced sparse in scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['category']])

Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler (z-score)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# MinMaxScaler (0-1)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)

# RobustScaler (resistant to outliers)
robust = RobustScaler()
X_robust = robust.fit_transform(X)

Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif, RFE

# Select K Best (assumes X is a pandas DataFrame here)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()

# Recursive Feature Elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Feature Importance
rf = RandomForestClassifier()
rf.fit(X, y)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid Search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}

# Random Search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")

ML Pipeline

Complete Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Define preprocessing
numeric_features = ['age', 'income']
categorical_features = ['gender', 'city']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Create pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Save model
import joblib
joblib.dump(pipeline, 'model.pkl')

# Load model
loaded_model = joblib.load('model.pkl')

Resources

Learning Path

1. Mathematics fundamentals (2-4 weeks)
2. Python and libraries (2-4 weeks)
3. Supervised learning (4-6 weeks)
4. Unsupervised learning (2-4 weeks)
5. Deep learning basics (4-6 weeks)
6. Projects and practice (ongoing)

Courses

Free:
- Andrew Ng's ML Course (Coursera)
- Fast.ai
- Google ML Crash Course
- Kaggle Learn

Paid:
- DataCamp
- Udemy
- Coursera Specializations

Practice

- Kaggle Competitions
- UCI ML Repository
- Scikit-learn toy datasets
- Real-world projects

Conclusion

Machine Learning is a broad field. Start with the fundamentals, practice on real datasets, and gradually move on to advanced topics such as deep learning.
