
[kaggle] Binary Classification Code - KernelPCA / GMM / HIST FEATURE / STACKING

Machine Learning / Deep Learning

Qcoding · 2023. 12. 12. 14:59

https://www.yes24.com/Product/Goods/120528346

 

캐글 메달리스트가 알려주는 캐글 노하우 (Yes24)

* This exercise follows the material from the book above.

 

 

### Keywords

Magic feature; preprocessing: KernelPCA, Gaussian Mixture Model, Hist; stage-1 models: NuSVC, QuadraticDiscriminantAnalysis, SVC, KNeighborsClassifier, LogisticRegression

Stage-2 models: LightGBM / MLPClassifier

 

 

 

### Checking the data

train.head(5)

 

### Splitting numeric / categorical columns

### Numeric / categorical data analysis
num_cols = train.select_dtypes(include=['int','float']).columns
cat_cols = train.select_dtypes(include=['object']).columns
## Check unique value counts
train_num_cols = [col for col in num_cols if col not in ['id','target']]
train[train_num_cols].nunique().sort_values(ascending=False)

The other numeric columns each have roughly 260,000 unique values, while wheezy-copper-turtle-magic has only 512 unique values.

 

### Summary statistics of the numeric columns
describe_train = train[train_num_cols].describe().T.drop("count",axis=1)
cmap = sns.diverging_palette(5,250, as_cmap=True)
describe_train.T.style.background_gradient(cmap,axis=1)

The summary statistics can be visualized by applying a background-gradient style to the DataFrame.

 

 

 

 

### Key takeaways ###

--> What this example teaches: the wheezy-copper-turtle-magic feature has 512 unique values (0-511), and a separate model is trained for each value; those per-value models are then used for stacking.
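A minimal sketch of that per-value idea, assuming the train / test DataFrames are loaded and numpy / pandas are imported (LogisticRegression is only a placeholder here; the real model list appears later):

# Minimal sketch: fit one model per value of wheezy-copper-turtle-magic (placeholder model)
from sklearn.linear_model import LogisticRegression

cols = [c for c in train.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]
preds = np.zeros(len(test))

for magic in range(512):
    trn = train[train['wheezy-copper-turtle-magic'] == magic]
    tst = test[test['wheezy-copper-turtle-magic'] == magic]
    if len(trn) == 0 or len(tst) == 0:
        continue
    clf = LogisticRegression(solver='liblinear')
    clf.fit(trn[cols], trn['target'])
    preds[tst.index] = clf.predict_proba(tst[cols])[:, 1]  # assumes test has a default RangeIndex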

 

--> Feature engineering --> KernelPCA / GMM

1) KernelPCA is applied to the basic train / test set to extract features, reducing them to a small number of components.

-> INPUT : TRAIN / TEST FEATURES

all_data = KernelPCA(n_components=len(train_num_cols[:5]), kernel='cosine', random_state=42).fit_transform(train[train_num_cols[:5]])

 

 

2) GMM is used to create new features.

-> INPUT : the PCA components (one per selected column) obtained from KernelPCA above

-> The n PCA components computed above are fed into a GMM that clusters them into 5 labels; each sample's predicted cluster probabilities (pred) and log-likelihood score (score) are then used as features.

gmm = GMM(n_components=5, random_state=42, max_iter=1000, init_params='kmeans').fit(all_data)
gmm_pred = gmm.predict_proba(all_data)
gmm_score = gmm.score_samples(all_data).reshape(-1, 1)
gmm_label = gmm.predict(all_data)
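For reference, with N rows in all_data the arrays produced above have the following shapes (gmm_label is later used to stratify the CV folds in run_model):

# gmm_pred : (N, 5)  cluster membership probabilities
# gmm_score: (N, 1)  per-sample log-likelihood
# gmm_label: (N,)    hard cluster assignment
print(gmm_pred.shape, gmm_score.shape, gmm_label.shape)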

 

 

3) For frequency-related problems, histogram bin-count (frequency) features are used.

class hist_model(object):
    
    def __init__(self, bins=50):
        self.bins = bins
        
    def fit(self, X):
        
        bin_hight, bin_edge = [], []
        
        for var in X.T:  ######## iterate over each column #######
            # get bin heights (counts) and bin edges for this column
            bh, bedge = np.histogram(var, bins=self.bins)
            bin_hight.append(bh)
            bin_edge.append(bedge)
        
        self.bin_hight = np.array(bin_hight)
        self.bin_edge = np.array(bin_edge)

    def predict(self, X):
        
        scores = []
        for obs in X: ######## iterate over each row #######
            obs_score = []
            for i, var in enumerate(obs):
                # find which bin the value falls into
                bin_num = (var > self.bin_edge[i]).argmin()-1
                obs_score.append(self.bin_hight[i, bin_num]) # look up the bin height (count)
            
            scores.append(np.mean(obs_score))
        
        return np.array(scores)
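A quick usage sketch on toy data (the random matrix is an assumption, only to show the input and output shapes):

import numpy as np

rng = np.random.RandomState(42)
X_toy = rng.normal(size=(1000, 3))   # 1000 rows, 3 columns

hist = hist_model(bins=50)
hist.fit(X_toy)                      # builds a 50-bin histogram per column
scores = hist.predict(X_toy)         # one mean bin-count per row
print(scores.shape)                  # (1000,)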

-> Walking through the code above

1) Iterating over X.T, var holds the values of one column at a time, so each column's own distribution is what gets used:

        for var in X.T:
            # get bin heights (counts) and bin edges for this column
            bh, bedge = np.histogram(var, bins=self.bins)
            bin_hight.append(bh)
            bin_edge.append(bedge)
        
        self.bin_hight = np.array(bin_hight)
        self.bin_edge = np.array(bin_edge)

The bin counts (bh) and bin edges (bedge) are computed; the number of bin edges is always the number of bins (counts) + 1.

bh, bedge = np.histogram(all_data.T[0], bins=50)

2) predict loops over each row. For every column value in the row, it compares the value against the bin edges computed in fit to find which bin the value falls into (bin_num, an index), looks up the count stored at that index in the frequency table (bin_hight), and finally averages those counts over all columns to get a single frequency score for the row. A small worked example of the bin lookup follows the code below.

    def predict(self, X):
        
        scores = []
        for obs in X:
            obs_score = []
            for i, var in enumerate(obs):
                # find which bin the value falls into
                bin_num = (var > self.bin_edge[i]).argmin()-1
                obs_score.append(self.bin_hight[i, bin_num]) # look up the bin height (count)
            
            scores.append(np.mean(obs_score))
        
        return np.array(scores)
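As a small worked example of the bin-lookup expression above (the numbers here are made up):

import numpy as np

edges = np.array([0.0, 1.0, 2.0, 3.0])   # bin edges for 3 bins
var = 1.7
# var > edges -> [True, True, False, False]; argmin() returns the index of the
# first False (2), and subtracting 1 gives bin index 1, i.e. the interval [1.0, 2.0)
bin_num = (var > edges).argmin() - 1
print(bin_num)  # 1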

 

In short, the mean bin count across each row's columns is used as the hist feature.

 

In summary, the following are used as features for the prediction models:

GMM        -> gmm_pred / gmm_score
KernelPCA  -> transforms the full TRAIN / TEST matrix into a set of components
HIST class -> hist_pred
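Putting the three together, the feature assembly looks roughly like this (a minimal sketch mirroring run_model below, assuming all_data, gmm_pred, gmm_score and a fitted hist model exist as above):

from sklearn.preprocessing import StandardScaler

hist_pred = hist.predict(all_data).reshape(-1, 1)   # column vector, as in run_model below

# concatenate the PCA components with the GMM and histogram features, then standardize
all_data = np.hstack([all_data, gmm_pred, gmm_score, hist_pred])
all_data = StandardScaler().fit_transform(all_data)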

 

 

 

##### Example prediction-model code #####

import warnings
warnings.filterwarnings('ignore')

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm_notebook

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import StratifiedKFold

from sklearn.decomposition import KernelPCA
from sklearn.mixture import GaussianMixture as GMM
from sklearn import svm, neighbors, linear_model, neural_network
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import lightgbm as lgbm
svnu_params = {'probability':True, 'kernel':'poly','degree':4,'gamma':'auto','nu':0.4,'coef0':0.08, 'random_state':4}
svnu2_params = {'probability':True, 'kernel':'poly','degree':2,'gamma':'auto','nu':0.4,'coef0':0.08, 'random_state':4}
qda_params = {'reg_param':0.111}
svc_params = {'probability':True,'kernel':'poly','degree':4,'gamma':'auto', 'random_state':4}
neighbor_params = {'n_neighbors':16}
lr_params = {'solver':'liblinear','penalty':'l1','C':0.05,'random_state':42}

 

def run_model(clf_list, train, test, random_state, gmm_init_params='kmeans'):
    
    MODEL_COUNT = len(clf_list)
    
    oof_train = np.zeros((len(train), MODEL_COUNT))
    oof_test = np.zeros((len(test), MODEL_COUNT))
    train_columns = [c for c in train.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]
    
    for magic in tqdm_notebook(range(512)):
        x_train = train[train['wheezy-copper-turtle-magic'] == magic]
        x_test = test[test['wheezy-copper-turtle-magic'] == magic]
        print("Magic: ", magic, x_train.shape, x_test.shape)
        
        train_idx_origin = x_train.index
        test_idx_origin = x_test.index
        
        train_std = x_train[train_columns].std()
        cols = list(train_std.index.values[np.where(train_std >2)])
        
        x_train = x_train.reset_index(drop=True)
        y_train = x_train.target
        
        x_train = x_train[cols].values
        x_test = x_test[cols].values
        
        all_data = np.vstack([x_train, x_test])
        # print("all_data: ", all_data.shape)
        # Kernel PCA
        all_data = KernelPCA(n_components=len(cols), kernel='cosine', random_state=random_state).fit_transform(all_data)
        
        # GMM
        gmm = GMM(n_components=5, random_state=random_state, max_iter=1000, init_params=gmm_init_params).fit(all_data)
        gmm_pred = gmm.predict_proba(all_data)
        gmm_score = gmm.score_samples(all_data).reshape(-1, 1)
        gmm_label = gmm.predict(all_data)
        
        # hist feature
        hist = hist_model()
        hist.fit(all_data)
        hist_pred = hist.predict(all_data).reshape(-1, 1)
        
        all_data = np.hstack([all_data, gmm_pred, gmm_pred, gmm_pred, gmm_pred, gmm_pred])

        # Add Some Features
        all_data = np.hstack([all_data, hist_pred, gmm_score, gmm_score, gmm_score])
        
        # STANDARD SCALER
        all_data = StandardScaler().fit_transform(all_data)

        # new train/test
        x_train = all_data[:x_train.shape[0]]
        x_test = all_data[x_train.shape[0]:]
        # print("data size: ", x_train.shape, x_test.shape)
        # shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
        fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
        for trn_idx, val_idx in fold.split(x_train, gmm_label[:x_train.shape[0]]):
            for model_index, clf in enumerate(clf_list):
                clf.fit(x_train[trn_idx], y_train[trn_idx])
                oof_train[train_idx_origin[val_idx], model_index] = clf.predict_proba(x_train[val_idx])[:,1]
                
                # 2023/03/02: the data format changed, so predicting on an empty x_test raised an error; skip those cases
                if x_test.shape[0] == 0:
                    continue
                    
                #print(oof_test[test_idx_origin, model_index].shape)
                #print(x_test.shape)
                #print(clf.predict_proba(x_test)[:,1])
                oof_test[test_idx_origin, model_index] += clf.predict_proba(x_test)[:,1] / fold.n_splits
    
    for i, clf in enumerate(clf_list):
        print(clf)
        print(roc_auc_score(train['target'], oof_train[:, i]))
        print()
        
    oof_train_df = pd.DataFrame(oof_train)
    oof_test_df = pd.DataFrame(oof_test)
    
    return oof_train_df, oof_test_df
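The four OOF frames averaged below are not produced in the code above; presumably they come from calling run_model with the two GMM init schemes and two seeds. A hedged sketch (the clf_list follows the stage-1 models listed in the keywords; the specific seed values are assumptions):

clf_list = [
    svm.NuSVC(**svnu_params),
    svm.NuSVC(**svnu2_params),
    QuadraticDiscriminantAnalysis(**qda_params),
    svm.SVC(**svc_params),
    neighbors.KNeighborsClassifier(**neighbor_params),
    linear_model.LogisticRegression(**lr_params),
]

oof_train_kmeans_seed1, oof_test_kmeans_seed1 = run_model(clf_list, train, test, random_state=1, gmm_init_params='kmeans')
oof_train_kmeans_seed2, oof_test_kmeans_seed2 = run_model(clf_list, train, test, random_state=2, gmm_init_params='kmeans')
oof_train_random_seed1, oof_test_random_seed1 = run_model(clf_list, train, test, random_state=1, gmm_init_params='random')
oof_train_random_seed2, oof_test_random_seed2 = run_model(clf_list, train, test, random_state=2, gmm_init_params='random')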

 

train_second = (oof_train_kmeans_seed1 + oof_train_kmeans_seed2 + oof_train_random_seed1 + oof_train_random_seed2)/4
test_second = (oof_test_kmeans_seed1 + oof_test_kmeans_seed2 + oof_test_random_seed1 + oof_test_random_seed2)/4
print('Ensemble', roc_auc_score(train['target'], train_second.mean(1)))

 

lgbm_meta_param = {
        #'bagging_freq': 5,
        #'bagging_fraction': 0.8,
        'min_child_weight':6.790,
        "subsample_for_bin":50000,
        'bagging_seed': 0,
        'boost_from_average':'true',
        'boost': 'gbdt',
        'feature_fraction': 0.450,
        'bagging_fraction': 0.343,
        'learning_rate': 0.025,
        'max_depth': 10,
        'metric':'auc',
        'min_data_in_leaf': 78,
        'min_sum_hessian_in_leaf': 8, 
        'num_leaves': 18,
        'num_threads': 8,
        'tree_learner': 'serial',
        'objective': 'binary', 
        'verbosity': 1,
        'lambda_l1': 7.961,
        'lambda_l2': 7.781
        #'reg_lambda': 0.3,
    }

mlp16_params = {'activation':'relu','solver':'lbfgs','tol':1e-06, 'hidden_layer_sizes':(16, ), 'random_state':42}


SEED_NUMBER = 4
NFOLD = 5

y_train = train['target']
oof_lgbm_meta_train = np.zeros((len(train), SEED_NUMBER))
oof_lgbm_meta_test = np.zeros((len(test), SEED_NUMBER))
oof_mlp_meta_train = np.zeros((len(train), SEED_NUMBER))
oof_mlp_meta_test = np.zeros((len(test), SEED_NUMBER))

for seed in range(SEED_NUMBER):
    print("SEED Ensemble:", seed)
    mlp16_params['random_state'] = seed
    lgbm_meta_param['seed'] = seed
    folds = StratifiedKFold(n_splits=NFOLD, shuffle=True, random_state=seed)
    for fold_index, (trn_index, val_index) in enumerate(folds.split(train_second, y_train), 1):
        print(f"{fold_index} FOLD Start")
        trn_x, trn_y = train_second.iloc[trn_index], y_train.iloc[trn_index]
        val_x, val_y = train_second.iloc[val_index], y_train.iloc[val_index]
        
        mlp_meta_model = neural_network.MLPClassifier(**mlp16_params)
        mlp_meta_model.fit(trn_x, trn_y)
        
        oof_mlp_meta_train[val_index, seed] = mlp_meta_model.predict_proba(val_x)[:,1]
        oof_mlp_meta_test[:, seed] += mlp_meta_model.predict_proba(test_second)[:,1]/NFOLD
        print("MLP META SCORE: ", roc_auc_score(val_y, oof_mlp_meta_train[val_index, seed]))
        
        # lgbm meta model
        dtrain = lgbm.Dataset(trn_x, label=trn_y, silent=True)
        dcross = lgbm.Dataset(val_x, label=val_y, silent=True)
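        # note: `silent` above and `verbose_eval` / `early_stopping_rounds` below are the
        # pre-4.0 LightGBM API; LightGBM >= 4 removed them in favor of callbacks
        # (e.g. callbacks=[lgbm.early_stopping(100)]).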

        lgbm_meta_model = lgbm.train(lgbm_meta_param, train_set=dtrain, valid_sets=[dtrain, dcross], 
                                     verbose_eval=False, early_stopping_rounds=100)
        
        oof_lgbm_meta_train[val_index, seed] = lgbm_meta_model.predict(val_x)
        oof_lgbm_meta_test[:, seed] += lgbm_meta_model.predict(test_second)/NFOLD
        print("LGBM META SCORE: ", roc_auc_score(val_y, oof_lgbm_meta_train[val_index, seed]))