Course Notes: Extreme Gradient Boosting with XGBoost


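Honestly, this is not very easy to learn.

When searching over parameters for a pipeline, each key in the search grid is the model (step) name, followed by two underscores (__), followed by the parameter name.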

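# Create the parameter grid
# np.arange is used for learning_rate because the built-in range() only accepts integers
import numpy as np

gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': range(3, 10, 1),
    'clf__n_estimators': range(50, 200, 50)
}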

Here the name of the model step is clf.
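To make the naming rule concrete, here is a minimal pure-Python sketch of how a key such as clf__max_depth breaks down into a step name and a parameter name, and how many candidates such a grid implies. The helpers frange and split_param_name are hypothetical names introduced only for illustration; frange stands in for a float-stepping range, since the built-in range() accepts integers only.

```python
from itertools import product


def frange(start, stop, step):
    """Float-stepping range (hypothetical helper); built-in range() is int-only."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n)]


def split_param_name(key):
    """Split 'step__param' into (step, param) at the first double underscore."""
    step, _, param = key.partition("__")
    return step, param


gbm_param_grid = {
    "clf__learning_rate": frange(0.05, 1, 0.05),
    "clf__max_depth": list(range(3, 10, 1)),
    "clf__n_estimators": list(range(50, 200, 50)),
}

for key, values in gbm_param_grid.items():
    step, param = split_param_name(key)
    print(f"step={step!r} param={param!r} candidates={len(values)}")

# An exhaustive grid search would try every combination of candidate values.
n_combinations = len(list(product(*gbm_param_grid.values())))
print("total combinations:", n_combinations)
```

With this many combinations an exhaustive search gets expensive quickly, which is one reason a randomized search over the same grid is often preferred.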

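Boosting combines many weak predictors and aggregates their outputs into a single, more accurate result. This kind of approach is called a meta-model (meta-algorithm).

Before using XGBoost's native functions (such as xgb.cv and xgb.train), the data has to be converted to the DMatrix format.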
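The idea of adding weak predictors one after another can be sketched in plain Python. The example below is an illustrative toy, not XGBoost's actual algorithm: it assumes a single numeric feature, squared-error loss, and one-split "stumps" as the weak learners. The names fit_stump, predict_stump and boost are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    # The largest value is excluded so the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return t, lmean, rmean


def predict_stump(stump, x):
    t, lmean, rmean = stump
    return lmean if x <= t else rmean


def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for x, p in zip(xs, preds)]
        # Record the training mean squared error after this round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 5.0]
base, stumps, history = boost(xs, ys)
print("MSE after round 1:", history[0])
print("MSE after round 10:", history[-1])
```

No single stump fits this data well, but the training error shrinks as more of them are combined, which is the point of the ensemble.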

import xgboost as xgb

# Create arrays for the features and the target: X, y
# (churn_data is a pandas DataFrame loaded beforehand; the last column is the target)
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
# metrics is the evaluation measure reported for each boosting round;
# "error" here is the classification error rate
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy of the last round, since that row reflects the model after all boosting rounds
print(((1-cv_results["test-error-mean"]).iloc[-1]))

Result:

   train-error-mean  train-error-std  test-error-mean  test-error-std
0             0.282            0.002            0.284       1.932e-03
1             0.270            0.002            0.272       1.932e-03
2             0.256            0.003            0.258       3.963e-03
3             0.251            0.002            0.254       3.827e-03
4             0.247            0.002            0.249       9.344e-04
0.751480015401492

Using AUC as the evaluation metric

# Perform cross-validation: cv_results
# as_pandas=True so the result is a DataFrame and .iloc works below
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])
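For intuition about what the AUC number means, here is a small sketch that computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name auc_score and the toy labels and scores are made up for illustration; this is not how XGBoost computes the metric internally.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Unlike the error rate, this measure depends only on how the scores are ordered, not on any particular decision threshold.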

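When XGBoost is a good fit

More than 1,000 training samples and fewer than 100 features; at the very least, the number of features should be smaller than the number of training samples
Features that are a mix of categorical and numeric, or purely numeric

When it is not a good fit

Natural language processing
Image recognition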

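Names of the loss functions (objectives) in XGBoost

reg:linear -- for regression problems (renamed reg:squarederror in newer XGBoost releases)
reg:logistic -- for classification problems, when only the class decision is wanted
binary:logistic -- for binary classification problems, when the predicted probability of each class is wanted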
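The difference between a probability and a decision can be shown with a few lines of Python. A logistic objective works with a raw score (margin) that is squashed into the range 0 to 1 by the sigmoid function; a class decision is then obtained by thresholding that value. This is a sketch of the relationship only, not of XGBoost's internals, and the 0.5 threshold and the function names are assumptions made for the example.

```python
import math


def sigmoid(margin):
    """Map a raw score (any real number) to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_decision(probability, threshold=0.5):
    """Turn a probability into a 0/1 class label (threshold is an assumed default)."""
    return 1 if probability >= threshold else 0


raw_margins = [-2.0, -0.1, 0.0, 0.3, 1.5]
probabilities = [sigmoid(m) for m in raw_margins]
decisions = [to_decision(p) for p in probabilities]

for m, p, d in zip(raw_margins, probabilities, decisions):
    print(f"margin={m:+.1f}  probability={p:.3f}  decision={d}")
```

Keeping the probability lets the threshold be chosen later, for example to trade precision against recall.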

Visualization

plot_tree() can be used to draw individual trees of the model

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear", "max_depth":2}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the first tree
xgb.plot_tree(xg_reg,num_trees=0)
plt.show()

# Plot the fifth tree
xgb.plot_tree(xg_reg,num_trees=4)
plt.show()

# Plot the last tree sideways
xgb.plot_tree(xg_reg,num_trees=9,rankdir="LR")
plt.show()

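Another way to visualize the model

Count how many times each feature is used in a split node. In principle, the more often a feature is used to split, the more informative it is for the prediction.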
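The counting itself is simple, as the following sketch shows. It walks a couple of hand-written trees and tallies how often each feature appears in a split node. The nested-dict tree layout, the feature names rooms and age, and the function count_splits are all invented for illustration and do not reflect XGBoost's internal tree format.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how many split nodes use each feature."""
    if "leaf" in node:
        return  # leaves hold a prediction value, not a split
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


# Two toy trees: split nodes name a feature, leaves carry a value.
trees = [
    {"feature": "rooms",
     "left": {"leaf": -1.0},
     "right": {"feature": "age", "left": {"leaf": 0.5}, "right": {"leaf": 1.2}}},
    {"feature": "rooms",
     "left": {"feature": "rooms", "left": {"leaf": -0.3}, "right": {"leaf": 0.1}},
     "right": {"leaf": 0.4}},
]

counts = Counter()
for tree in trees:
    count_splits(tree, counts)

print(counts.most_common())
```

A split count says how often a feature is used, not how much each split improves the loss, so it is only one of several possible importance measures.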

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(X,y)

# Create the parameter dictionary: params
params = {"objective":"reg:linear","max_depth":4}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# Plot the feature importances
xgb.plot_importance(xg_reg)
plt.show()

Tuning parameters

Early stopping can be used to cut short the search over the number of boosting rounds

cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=50, metrics="rmse", as_pandas=True, seed=123,early_stopping_rounds=10)

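early_stopping_rounds=10

means that training stops once the hold-out error has not improved for 10 consecutive boosting rounds. It should be paired with a fairly large number of boosting rounds, such as num_boost_round=50.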
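The stopping rule can be sketched on a plain list of per-round validation errors. The function early_stopping_round and the error values below are hypothetical; the sketch only mirrors the "no improvement for N consecutive rounds" idea and is not XGBoost's implementation.

```python
def early_stopping_round(errors, patience):
    """Return (best_round, rounds_run) for a sequence of validation errors.

    Training stops as soon as `patience` rounds in a row fail to beat the best error so far.
    """
    best_round, best_error = 0, float("inf")
    for i, err in enumerate(errors):
        if err < best_error:
            best_round, best_error = i, err
        elif i - best_round >= patience:
            return best_round, i + 1  # stopped early after i + 1 rounds
    return best_round, len(errors)  # never triggered: all rounds were run


# Made-up validation errors, one per boosting round.
errors = [0.50, 0.42, 0.38, 0.37, 0.39, 0.38, 0.40, 0.41, 0.36]
best_round, rounds_run = early_stopping_round(errors, patience=3)
print("best round:", best_round, "rounds run:", rounds_run)
```

In this toy sequence the search stops before reaching the later, lower error of 0.36, which shows the trade-off: a small patience saves time but can miss late improvements.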