Practical Machine Learning Basics, Week 1 (Ensembles)

edcrfv458 2025. 3. 17. 14:39

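Goals

Understand the principles and trade-offs of ensemble methods (bagging and boosting)
Tell overfitting apart from underfitting and learn how to address each
Learn how to optimize a model through hyperparameter tuning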

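1. Ensemble Methods

Combine several models to get better predictive performance than any single model

Why use them

Combining different perspectives (models) can reduce error
The bias and variance of individual models offset one another (bagging mainly lowers variance, boosting mainly lowers bias)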
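
To make the variance claim concrete, here is a minimal simulation sketch. It assumes each model's error is independent Gaussian noise around the true value; the names `noisy_prediction` and `ensemble_prediction` and the noise settings are made up for illustration. Real models have correlated errors, so the reduction is usually smaller than what this shows.

```python
import random
import statistics

def noisy_prediction(rng, true_value=10.0, noise_std=2.0):
    """One hypothetical model's prediction: the truth plus independent Gaussian noise."""
    return rng.gauss(true_value, noise_std)

def ensemble_prediction(rng, n_models):
    """Average the predictions of n_models independent noisy models."""
    return statistics.fmean(noisy_prediction(rng) for _ in range(n_models))

rng = random.Random(0)  # fixed seed so the run is repeatable
single = [noisy_prediction(rng) for _ in range(2000)]
averaged = [ensemble_prediction(rng, n_models=10) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"variance of a single model : {var_single:.3f}")
print(f"variance of a 10-model mean: {var_averaged:.3f}")
```

Averaging only shrinks the spread of the predictions. If every model shares the same bias, the average keeps that bias, which is why bagging is mostly a variance-reduction technique.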

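Bagging (Bootstrap Aggregating)

How it works
- Draw several bootstrap samples from the training data (random sampling with replacement, so the same row can appear more than once) and train one model per sample independently
- At prediction time, combine the models' outputs by averaging (regression) or majority vote (classification)
Example
- Random forest: works for both classification and regression
  - Each decision tree is trained on a random sample of the data and considers a random subset of features at each split (feature sampling + data sampling)
  - A decision tree is a model that splits the data on a series of conditions and predicts from the resulting tree structure
Pros
- Each model trains independently, so training parallelizes well and can be fast
- Models do not interfere with one another, which makes the ensemble stable
- Reduces overfitting (lowers the variance of the predictions)
Cons
- Many models must be trained and stored, so memory use grows
- Harder to interpret than a single model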
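
The two mechanical pieces of bagging, bootstrap sampling and aggregation, can be sketched in plain Python. This is only an illustration: the helper names `bootstrap_sample`, `majority_vote`, and `average` are hypothetical, and the integer list stands in for real training rows.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Aggregate classifier outputs: the most common label wins."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Aggregate regressor outputs: the arithmetic mean."""
    return sum(values) / len(values)

rng = random.Random(42)
rows = list(range(1000))  # stand-in for 1000 training rows
sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)

print(f"bootstrap sample size: {len(sample)}, unique rows: {unique_fraction:.1%}")
print("majority vote:", majority_vote([1, 0, 1, 1, 0]))
print("average:", average([2.0, 3.0, 4.0]))
```

Because sampling is done with replacement, each bootstrap sample leaves out roughly a third of the original rows, so every model sees a slightly different dataset.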

Code

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1. Load the data
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

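# 3. Create the random forest model
# n_estimators is the number of trees; max_depth is the maximum depth of each tree
# (None lets each tree grow until its leaves are pure or too small to split).
# Larger values cost more time and computation but let the model capture more complex patterns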
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=42
)

# 4. Train the model
rf_model.fit(X_train, y_train)

# 5. Predict
y_pred = rf_model.predict(X_test)

# 6. Evaluate performance
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {acc:.4f}")
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", report)

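Boosting

How it works
- Train models sequentially, each one designed to correct the prediction errors of the models before it
- How the errors are targeted depends on the algorithm: AdaBoost puts more weight on the samples the previous models got wrong, while gradient boosting fits each new model to the residuals (the negative gradient of the loss)
Representative algorithms: all support both classification and regression
- XGBoost (Extreme Gradient Boosting)
- LightGBM
- CatBoost (well suited to categorical variables)
Pros
- Can reach high accuracy
- Correcting errors at every stage lets it capture complex patterns in the data
Cons
- Trees are built one after another, so training is harder to parallelize than bagging
- Many hyperparameters, and tuning them is tricky
Example walkthrough (XGBoost)
- Train a base model (a weak decision tree) ➡️ check its prediction errors
- Compute the remaining error (loss gradient) for each sample, so samples with large errors have the most influence on the next tree
- Train the next model (decision tree) on those errors ➡️ correct the errors again
- Repeat this many times and sum the outputs of all trees for the final prediction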
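
The loop above can be sketched as a tiny gradient-boosting regressor in plain Python. This is a simplified illustration, not XGBoost itself: it assumes squared-error loss (where the negative gradient is just the residual), a single 1-D feature, and depth-1 trees, and it leaves out the regularization and second-order terms that XGBoost adds. The names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # errors left to correct
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        # Final prediction: base value plus the scaled sum of every stump's output
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [x * x for x in xs]
predict, history = boost(xs, ys)
print(f"training MSE after round 1 : {history[0]:.2f}")
print(f"training MSE after round 20: {history[-1]:.2f}")
```

The training error falls every round because each stump is fit to whatever the current ensemble still gets wrong. The learning rate deliberately slows this down, which is one of the knobs that makes boosting sensitive to tuning.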

Code

XGBoost
- Categorical columns must be converted to numeric values first in this setup

# 1. Prepare the data (Titanic example: contains categorical columns)
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np

# Load the Titanic dataset from OpenML
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame

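# Keep only the main columns and drop rows with missing values (keeps the XGBoost and LightGBM examples simple)
# pclass (ticket class, categorical), sex (categorical), age (continuous), fare (ticket fare, continuous)
# embarked (port of embarkation, categorical), survived (survival, target)
# Selecting and dropping in one chained step returns a new frame and avoids SettingWithCopyWarning
df = df[['pclass', 'sex', 'age', 'fare', 'embarked', 'survived']].dropna()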

# Separate inputs (X) and target (y)
X = df.drop('survived', axis=1)
y = df['survived'].astype(int)  # convert the survived column to int


# 2. Preprocessing
#    Encode the categorical columns as integers so XGBoost/LightGBM receive numeric input
from sklearn.preprocessing import LabelEncoder

cat_cols = ['sex', 'embarked']  # columns treated as categorical
for col in cat_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# 3. Train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# 4. XGBoost example
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

print("=== XGBoost ===")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))

LightGBM
- Likewise, categorical variables are converted first

# 5. LightGBM example
from lightgbm import LGBMClassifier

lgb_model = LGBMClassifier(random_state=42)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)

print("\n=== LightGBM ===")
print("Accuracy:", accuracy_score(y_test, y_pred_lgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lgb))
print("Classification Report:\n", classification_report(y_test, y_pred_lgb))

CatBoost
- Can train without converting categorical variables

# 6. CatBoost example (specifying categorical features directly)
from catboost import CatBoostClassifier

# Data for CatBoost: drop rows with missing values from the original df (same as above)
df_cat = titanic.frame[['pclass', 'sex', 'age', 'fare', 'embarked', 'survived']].dropna()
X_cat = df_cat.drop('survived', axis=1)
y_cat = df_cat['survived'].astype(int)

# cat_features: positions of the 'sex' and 'embarked' columns in X_cat
# Column indices are used here; with a pandas DataFrame, CatBoost also accepts column names
#   - pclass: 0, sex: 1, age: 2, fare: 3, embarked: 4
cat_features_idx = [1, 4]

X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(
    X_cat, y_cat, test_size=0.2, random_state=42, stratify=y_cat
)

cat_model = CatBoostClassifier(
    cat_features=cat_features_idx,
    verbose=1,           # print training progress
    random_state=42
)
cat_model.fit(X_cat_train, y_cat_train)
y_pred_cat = cat_model.predict(X_cat_test)

print("\n=== CatBoost ===")
print("Accuracy:", accuracy_score(y_cat_test, y_pred_cat))
print("Confusion Matrix:\n", confusion_matrix(y_cat_test, y_pred_cat))
print("Classification Report:\n", classification_report(y_cat_test, y_pred_cat))

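2. Overfitting vs. Underfitting

Overfitting
- The model is tuned too closely to the training data, so performance drops on new (test) data
- The model fails to generalize
Underfitting
- The model has not learned the patterns in the data well enough to fit even the training data
- The model has not learned properly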
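
In practice the two cases are usually told apart by comparing the training score with the validation score. The helper below is a rough heuristic sketch only: the function name `diagnose_fit` and both thresholds (`good_enough`, `max_gap`) are assumptions chosen for illustration and would need to be set per problem.

```python
def diagnose_fit(train_score, val_score, good_enough=0.85, max_gap=0.05):
    """Classify a model's fit from its training and validation accuracy.

    good_enough and max_gap are illustrative thresholds, not standard values.
    """
    if train_score < good_enough:
        # The model cannot even fit the data it was trained on
        return "underfitting"
    if train_score - val_score > max_gap:
        # Fits the training data well but does not carry over to unseen data
        return "overfitting"
    return "reasonable fit"

print(diagnose_fit(0.99, 0.80))
print(diagnose_fit(0.60, 0.58))
print(diagnose_fit(0.93, 0.91))
```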

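Causes of overfitting

Too many model parameters (degrees of freedom), so the model is overly complex
Not enough training data
Too many epochs
Noisy training data, where the model learns the noise as if it were a pattern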

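Ways to address overfitting

Regularization
- e.g. L1 and L2 regularization: penalize large weights to restrain over-learning
Dropout
- Randomly deactivate a fraction of neurons during training
- Used mainly in deep learning
Data augmentation
- For image data, create new samples by rotation, translation, flipping, and so on
- Natural language data can be augmented in a similar spirit
- Signal data can be augmented by adding Gaussian noise
Early stopping
- Stop training once the validation loss starts to increase
Ensembles
- Combine different models to lower the risk of overfitting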
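
Of the remedies above, early stopping is the easiest to show in a few lines. This sketch assumes a commonly used "patience" rule (stop after the validation loss has failed to improve for a set number of epochs); the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran for every epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving at first, then rising as the model overfits
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

Waiting a few epochs instead of stopping at the first increase avoids quitting on a temporary bump in the validation loss.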

Ways to address underfitting

Increase model complexity
Train longer
Change the model structure (neural network, trees, etc.)

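3. Hyperparameter Tuning

Hyperparameters are values a person sets before the model starts training
Examples: the maximum depth of a decision tree (max_depth), the number of training iterations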

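Preparing data for tuning

Splitting the dataset (Training/Validation/Test)
- Training set: used to fit the model
- Validation set: used for hyperparameter tuning and model selection
- Test set: used for the final performance evaluation
Cross-validation
- Split the data into training and validation sets several times, with validation sets that do not overlap
- K-fold cross-validation
  - Split the data into K folds; each fold takes a turn as the validation set while the remaining folds are used for training
  - The average score across folds is the performance estimate for the model
- Pro: gives a stable performance estimate even when data is scarce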
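
The fold rotation in K-fold cross-validation can be sketched without any library. This is a simplified illustration: the function name `k_fold_indices` is hypothetical, and the split is done in order with no shuffling or stratification, which real workflows usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample lands in a validation fold exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

A model would be trained and scored once per fold, and the mean of those scores is reported as the cross-validation estimate.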

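Tuning methods

Grid search: try every combination of predefined hyperparameter candidates
- Pro: exhaustive, so the best combination within the grid is never missed
- Con: computation grows rapidly as candidates are added
Randomized search: try only a fixed number of randomly sampled hyperparameter combinations
- Pro: covers a wide range quickly, so it is fast
- Con: may not find the exact best combination
Bayesian optimization: use past search results to focus on the most promising hyperparameter regions
- Pro: usually needs fewer trials, so the search is shorter and more efficient
- Con: more complex to implement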
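
The difference between grid search and randomized search comes down to how candidates are generated. The sketch below only enumerates candidates and does not train anything; `grid_candidates` and `random_candidates` are hypothetical helper names, and the parameter values are examples.

```python
import itertools
import random

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

def grid_candidates(grid):
    """Every combination of the listed values (what grid search evaluates)."""
    keys = list(grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(grid[k] for k in keys))]

def random_candidates(grid, n_iter, seed=0):
    """n_iter independently sampled combinations (what randomized search evaluates)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]

grid = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=4)
print(len(grid), "grid candidates;", len(sampled), "random candidates")
```

With 5-fold cross-validation the full grid costs 9 × 5 = 45 model fits, while 4 random candidates cost 20. Adding a third parameter with three values triples the grid but leaves the random budget unchanged. Note that this simple sampler draws with replacement, so it can repeat a combination.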

Code

GridSearchCV

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load the data
iris = load_iris()
X = iris.data
y = iris.target

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# 3. Define the hyperparameter candidates
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10]
}

# 4. Create the GridSearchCV object
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,              # number of cross-validation folds
    scoring='accuracy',
    n_jobs=-1,          # parallel processing (use all available cores)
)

# 5. Fit (runs the grid search)
grid_search.fit(X_train, y_train)

# 6. Check the best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# 7. Check performance on the test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_acc)

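4. Additional Machine Learning Concepts

Optimization

Hyperparameter tuning (GridSearchCV, RandomizedSearchCV, etc.)
Feature engineering (creating new derived variables, removing unnecessary ones)
Overfitting prevention (cross-validation, regularization, dropout, etc.)

Deployment

Deploy the trained model to the production environment
Build an API server, on the cloud (AWS, GCP) or on edge devices (embedded environments)
Monitor continuously and set a retraining schedule for when model performance degrades

MLOps (Machine Learning Operations)

A blend of Machine Learning + DevOps
A methodology for automating and efficiently running the whole lifecycle: model development, deployment, monitoring, retraining, rollback, and so on

Why MLOps matters

Finishing the project is not the end ➡️ in production, continuous monitoring and data/model updates are required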

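5. Model Interpretability (Explainable AI, XAI)

Why it is needed

Machine learning models, deep learning models in particular, behave like black boxes
Regulated industries such as healthcare and finance require an explanation of why a given result was produced

Main techniques

LIME (Local Interpretable Model-agnostic Explanations)
SHAP (SHapley Additive exPlanations)
- LIME and SHAP show how much each feature contributed to an individual prediction
Feature importance visualization (tree-based)
- Feature importance shows which variables matter to the model overall
- It does not show how a variable's specific values push an individual prediction up or down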
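
The idea behind SHAP, attributing one prediction to its features with Shapley values, can be computed exactly for a tiny model. This is a sketch under stated assumptions, not how the shap library works internally. It treats a feature as "absent" by replacing it with a baseline value, it enumerates every feature ordering (feasible only for a handful of features), and `toy_model` is a made-up function.

```python
import itertools
import math

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input by averaging marginal contributions
    over every ordering in which features are switched from baseline to x."""
    n = len(x)
    values = [0.0] * n
    for order in itertools.permutations(range(n)):
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            new = predict(current)
            values[i] += new - prev    # its marginal contribution in this ordering
            prev = new
    n_orders = math.factorial(n)
    return [v / n_orders for v in values]

def toy_model(features):
    """Hypothetical model with an interaction between features 0 and 1."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * b - 3.0 * c

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print("contributions:", phi)
print("sum of contributions:", sum(phi))
print("prediction - baseline prediction:", toy_model(x) - toy_model(baseline))
```

The contributions always add up to the gap between this prediction and the baseline prediction, and their signs show which features pushed the prediction up or down. A global feature importance score does not give that per-prediction view.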

Code

Feature Importance

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 1. Load the data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# 3. Train the random forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# 4. Extract the feature importances
importances = rf.feature_importances_

# 5. Visualize
plt.bar(range(len(importances)), importances)
plt.xticks(range(len(importances)), feature_names, rotation=45)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Feature Importances in RandomForest")
plt.tight_layout()
plt.show()

# Most important feature
most_important_idx = importances.argmax()
most_important_feature = feature_names[most_important_idx]
print("Most important feature:", most_important_feature)

Practice

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from xgboost import XGBClassifier

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred), '\n')
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred), '\n')
print("Classification report:\n", classification_report(y_test, y_pred))

importances = model.feature_importances_
most_importance_idx = importances.argmax()
most_importance_feature = feature_names[most_importance_idx]
print("Most important feature:", most_importance_feature)

Tips

When to use GridSearchCV: the parameter range to search is narrow and there are few candidates
When to use RandomizedSearchCV: the range is very wide and there are many candidates