How to Learn Data Science for Beginners
Data Science is one of the most in-demand skills today. Let's walk through a roadmap for starting from zero.
What is Data Science?
Definition
Data Science is a combination of:
- Statistics & Mathematics
- Programming
- Domain Knowledge
These are used to extract insights from data and make data-driven decisions.
The Data Scientist's Role
Responsibilities:
- Collect and clean data
- Exploratory Data Analysis (EDA)
- Build predictive models
- Communicate insights
- Deploy ML models
Roadmap Belajar
Phase 1: Foundations (1-2 months)
1. Python Basics
- Variables, data types
- Control flow
- Functions
- OOP basics
2. Basic Statistics
- Mean, median, mode
- Standard deviation
- Probability
- Distributions
3. Linear Algebra
- Vectors and matrices
- Matrix operations
- Eigenvalues
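The statistics concepts in this phase can be explored right away with Python's built-in `statistics` module; a quick sketch with a made-up sample:

```python
import statistics

# Hypothetical sample data, just for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))    # average -> 5
print(statistics.median(data))  # middle value -> 4.5
print(statistics.mode(data))    # most frequent value -> 4
print(statistics.pstdev(data))  # population standard deviation -> 2.0
```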
Phase 2: Data Analysis (2-3 months)
1. NumPy
- Arrays
- Broadcasting
- Linear algebra operations
2. Pandas
- DataFrames
- Data manipulation
- Groupby, merge, pivot
3. Data Visualization
- Matplotlib
- Seaborn
- Plotly
Phase 3: Machine Learning (3-4 months)
1. Supervised Learning
- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forest
- SVM
2. Unsupervised Learning
- K-Means Clustering
- PCA
- Hierarchical Clustering
3. Model Evaluation
- Train/test split
- Cross-validation
- Metrics (accuracy, precision, recall)
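Before reaching for a library, it helps to see what these metrics actually compute. A minimal sketch in pure Python, using hypothetical binary labels and predictions:

```python
# Hypothetical ground-truth labels and model predictions (binary)
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)   # fraction of correct predictions
precision = tp / (tp + fp)           # of predicted positives, how many were right
recall = tp / (tp + fn)              # of actual positives, how many were found

print(accuracy, precision, recall)   # 0.75 0.75 0.75
```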
Phase 4: Advanced Topics (ongoing)
- Deep Learning (TensorFlow/PyTorch)
- Natural Language Processing
- Computer Vision
- Time Series Analysis
- Big Data (Spark)
Python for Data Science
Setup Environment
# Install Anaconda
# Download from anaconda.com
# Or use pip
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
# Start Jupyter
jupyter notebook
NumPy Basics
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Array operations
print(arr + 10) # [11, 12, 13, 14, 15]
print(arr * 2) # [2, 4, 6, 8, 10]
print(np.mean(arr)) # 3.0
print(np.std(arr)) # 1.414
# Matrix operations
print(matrix.shape) # (2, 3)
print(matrix.T) # Transpose
# Random
np.random.seed(42)
random_arr = np.random.randn(5) # Normal distribution
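Broadcasting, listed above but not shown, is what makes NumPy code so concise: a smaller array is automatically "stretched" to match a larger one's shape. A small illustration:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

# Broadcasting: the 1-D row is applied to every row of the matrix
result = matrix + row
print(result)
# [[11 22 33]
#  [14 25 36]]
```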
Pandas Basics
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'nama': ['Budi', 'Ani', 'Citra'],
    'umur': [25, 23, 28],
    'kota': ['Jakarta', 'Bandung', 'Surabaya']
})
# Basic operations
print(df.head()) # First 5 rows
print(df.info()) # Data types
print(df.describe()) # Statistics
# Selection
print(df['nama']) # Single column
print(df[['nama', 'umur']]) # Multiple columns
print(df[df['umur'] > 24]) # Filter
# Read/Write data
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)
# Groupby
df.groupby('kota')['umur'].mean()
# Handling missing values
df.fillna(0)
df.dropna()
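The roadmap lists merge and pivot alongside groupby, but the snippet above only shows groupby. A small sketch with hypothetical tables (`sales` and `regions` are made up for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    'kota': ['Jakarta', 'Bandung', 'Jakarta', 'Bandung'],
    'produk': ['A', 'A', 'B', 'B'],
    'jumlah': [10, 5, 7, 3],
})
regions = pd.DataFrame({
    'kota': ['Jakarta', 'Bandung'],
    'region': ['West', 'West'],
})

# Merge: SQL-style join on a shared key column
merged = pd.merge(sales, regions, on='kota', how='left')

# Pivot table: cities as rows, products as columns, quantities summed
pivot = sales.pivot_table(index='kota', columns='produk',
                          values='jumlah', aggfunc='sum')
print(pivot)
```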
Data Visualization
Matplotlib
import matplotlib.pyplot as plt
# Line plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Line Plot')
plt.show()
# Scatter plot
plt.scatter(x, y)
plt.show()
# Bar plot
categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.bar(categories, values)
plt.show()
# Histogram
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.show()
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].bar(categories, values)
axes[1, 1].hist(data)
plt.tight_layout()
plt.show()
Seaborn
import seaborn as sns
# Load sample dataset
tips = sns.load_dataset('tips')
# Distribution plot
sns.histplot(tips['total_bill'])
plt.show()
# Box plot
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()
# Scatter with hue
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=tips)
plt.show()
# Heatmap (numeric columns only; tips also has categorical columns)
correlation = tips.corr(numeric_only=True)
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()
# Pair plot
sns.pairplot(tips, hue='sex')
plt.show()
Machine Learning with Scikit-learn
Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Prepare data
X = df[['feature1', 'feature2']]
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"Coefficients: {model.coef_}")
Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Cross-Validation
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
Feature Importance
# Get feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance)
# Visualize
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
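The roadmap lists K-Means and PCA under unsupervised learning, but the scikit-learn examples so far are all supervised. A sketch on synthetic data (generated here purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated synthetic groups of 3-D points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 3)),
    rng.normal(5, 0.5, size=(50, 3)),
])

# K-Means: partition the points into 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# PCA: project 3-D data down to 2 dimensions (e.g. for plotting)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (100, 2)
```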
Exploratory Data Analysis (EDA)
EDA Checklist
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Basic info
print(df.shape)
print(df.info())
print(df.describe())
# 3. Check missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)
# 4. Check duplicates
print(df.duplicated().sum())
# 5. Data types
print(df.dtypes)
# 6. Unique values
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
# 7. Distributions
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.histplot(df[col])
    plt.title(col)
    plt.show()
# 8. Correlations
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
# 9. Outliers
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure()
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()
Data Preprocessing
Handling Missing Values
# Check missing
print(df.isnull().sum())
# Drop missing
df_clean = df.dropna()
# Fill with mean/median
df['column'] = df['column'].fillna(df['column'].mean())
# Fill with mode (categorical)
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Forward/backward fill (fillna(method=...) is deprecated)
df['column'] = df['column'].ffill()  # or .bfill() for backward fill
Feature Encoding
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'])
# Scikit-learn OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed in scikit-learn 1.2
encoded = ohe.fit_transform(df[['category']])
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization (0-1)
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X)
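One common pitfall: fitting the scaler on the full dataset leaks test-set statistics into preprocessing. A scikit-learn Pipeline (sketched here on synthetic data) fits the scaler on the training split only and reapplies it automatically at predict time:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic data stands in for a real dataset here
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted inside the pipeline, on training data only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(score)
```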
Project Ideas for Your Portfolio
Beginner Projects
1. Titanic Survival Prediction
- Kaggle classic
- Classification problem
2. House Price Prediction
- Regression problem
- Feature engineering
3. Customer Segmentation
- Clustering
- RFM analysis
Intermediate Projects
1. Sentiment Analysis
- NLP
- Twitter/review data
2. Stock Price Prediction
- Time series
- LSTM
3. Recommendation System
- Collaborative filtering
- Content-based
Resources
Learning Platforms
Free:
- Kaggle Learn
- Google ML Crash Course
- Fast.ai
- DataCamp (some free content)
Paid:
- Coursera (Andrew Ng courses)
- DataCamp
- Udemy
Practice
- Kaggle Competitions
- DrivenData
- Analytics Vidhya
- HackerRank
Conclusion
Data Science is a long journey. Start with Python and statistics, then gradually build up to machine learning. Practicing on real projects is the key.
Post link: https://www.tirinfo.com/cara-belajar-data-science-pemula/
Editor: Hendra Wijaya
Publisher: Tirinfo
Read time: 5 minutes
Updated: 7 January 2026