How to Learn Machine Learning for Beginners
Machine Learning is a branch of AI that enables computers to learn from data. Let's work through the fundamentals.
What is Machine Learning?
Definisi
Machine Learning means:
- Computers learning from data
- Without being explicitly programmed
- Performance improving with experience
- Pattern recognition & prediction
Types of Machine Learning
1. Supervised Learning
- Labeled data (targets)
- Learns from examples
- Classification & Regression
2. Unsupervised Learning
- No labels
- Finds patterns
- Clustering & Dimensionality reduction
3. Reinforcement Learning
- Learns from rewards and punishments
- Trial and error
- Games, robotics
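The difference between the first two paradigms is easiest to see in code. A minimal sketch using scikit-learn toy data (the dataset and hyperparameters here are illustrative, not from the article):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy 2D data with two well-separated groups
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Supervised: the labels y are part of training
clf = LogisticRegression().fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.2f}")

# Unsupervised: only X is given; the model finds groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])
```

Reinforcement learning has no equivalent one-liner: it needs an environment the agent interacts with over many steps, which is why it usually lives in separate libraries.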
Prerequisites
Mathematics
Linear Algebra:
- Vectors and matrices
- Matrix operations
- Eigenvalues/eigenvectors
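These operations map directly onto NumPy, for example:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 1.0])

# Matrix-vector product
print(A @ v)  # [2. 3.]

# Eigendecomposition: for a diagonal matrix, the eigenvalues
# are simply the diagonal entries
eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))  # [2. 3.]
```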
Statistics:
- Probability distributions
- Bayes' theorem
- Hypothesis testing
- Mean, variance, standard deviation
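Bayes' theorem in particular shows up constantly in ML. A worked example with made-up numbers (the prevalence and test rates below are hypothetical):

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01            # hypothetical prevalence: 1%
p_pos_given_disease = 0.99  # sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # 0.167
```

Despite the 99% sensitivity, a positive result only implies about a 17% chance of disease, because the condition is rare. This is exactly the kind of reasoning Naive Bayes classifiers are built on.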
Calculus:
- Derivatives
- Gradient descent
- Chain rule
- Partial derivatives
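Gradient descent is the single most important calculus idea here. A minimal sketch minimizing f(w) = (w - 3)^2, whose derivative is f'(w) = 2(w - 3):

```python
# Gradient descent on f(w) = (w - 3)**2; the minimum is at w = 3
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)      # f'(w)
    w -= learning_rate * grad

print(round(w, 4))  # 3.0
```

Training a neural network is this same loop, with the chain rule (backpropagation) supplying the gradient for millions of parameters at once.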
Programming
Python basics:
- Variables, data types
- Control flow
- Functions
- OOP
Libraries:
- NumPy (arrays)
- Pandas (data manipulation)
- Matplotlib/Seaborn (visualization)
- Scikit-learn (ML algorithms)
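A quick taste of how the first two libraries fit together (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# NumPy: fast numeric arrays
ages = np.array([25, 32, 47, 51])

# Pandas: labeled tabular data built on top of NumPy
df = pd.DataFrame({"age": ages, "group": ["A", "B", "A", "B"]})
print(df.groupby("group")["age"].mean())
```

Matplotlib/Seaborn would then plot this DataFrame, and Scikit-learn would fit models to it, which is the pattern the rest of this article follows.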
Supervised Learning
Classification
# Binary Classification Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Load data
data = load_breast_cancer()
X, y = data.data, data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
Regression
# Linear Regression Example
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 3 + np.random.randn(100)
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
Common Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# SVM
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train_scaled, y_train)
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
# Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)
Unsupervised Learning
Clustering
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# Generate sample data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# K-Means
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
# Find optimal K (Elbow method)
inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# Gaussian Mixture
gmm = GaussianMixture(n_components=4, random_state=42)
gmm_labels = gmm.fit_predict(X)
Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2f}")
# t-SNE (for visualization)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1])
axes[0].set_title('PCA')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1])
axes[1].set_title('t-SNE')
plt.show()
Model Evaluation
Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve
)
# Basic metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# ROC Curve
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Regression Metrics
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)
print(f"MSE: {mean_squared_error(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
# K-Fold Cross Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Feature Engineering
Handling Missing Values
import pandas as pd
from sklearn.impute import SimpleImputer
# Check for missing values (df is assumed to be a pandas DataFrame)
print(df.isnull().sum())
# Drop
df_clean = df.dropna()
# Impute with mean/median/mode
imputer = SimpleImputer(strategy='mean') # 'median', 'most_frequent'
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label Encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['category'])
# Or using sklearn
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['category']])
Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler (z-score)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# MinMaxScaler (0-1)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
# RobustScaler (resistant to outliers)
robust = RobustScaler()
X_robust = robust.fit_transform(X)
Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# Select K Best (X assumed to be a pandas DataFrame)
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
# Recursive Feature Elimination
from sklearn.ensemble import RandomForestClassifier
rfe = RFE(RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
# Feature Importance
rf = RandomForestClassifier()
rf.fit(X, y)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Grid Search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Use best model
best_model = grid_search.best_estimator_
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define distributions
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}
# Random Search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
ML Pipeline
Complete Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Define preprocessing
numeric_features = ['age', 'income']
categorical_features = ['gender', 'city']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
# Create pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Save model
import joblib
joblib.dump(pipeline, 'model.pkl')
# Load model
loaded_model = joblib.load('model.pkl')
Resources
Learning Path
1. Mathematics fundamentals (2-4 weeks)
2. Python and libraries (2-4 weeks)
3. Supervised learning (4-6 weeks)
4. Unsupervised learning (2-4 weeks)
5. Deep learning basics (4-6 weeks)
6. Projects and practice (ongoing)
Courses
Free:
- Andrew Ng's ML Course (Coursera)
- Fast.ai
- Google ML Crash Course
- Kaggle Learn
Paid:
- DataCamp
- Udemy
- Coursera Specializations
Practice
- Kaggle Competitions
- UCI ML Repository
- Scikit-learn toy datasets
- Real-world projects
Conclusion
Machine Learning is a broad field. Start with the fundamentals, practice on real datasets, and gradually move on to advanced topics such as deep learning.
Post link: https://www.tirinfo.com/cara-belajar-machine-learning-pemula/
Editor: Hendra Wijaya
Publisher: Tirinfo
Read time: 5 minutes
Updated: 7 January 2026