
What happens if you remove the first tree of a Gradient Boosting Classifier?

Rudi 2020. 8. 26. 10:36

Gradient Boosting 🚀

Tree Based Models

 A Gradient Boosted Tree model trains one tree at a time, and each new tree concentrates on the errors (such as misclassified samples) left by the trees before it, which is how it raises predictive accuracy. This is why one of its hyperparameters is n_estimators, the number of trees. For example, with n_estimators set to 5000, 5000 trees are trained and applied one after another.

 


Number of Trees 🌲

Once the model has enough trees, adding more does not improve its predictions any further. Unlike Random Forest, which uses its trees in parallel, Gradient Boosting Tree computes them sequentially. So let's look at how a tree's position in the sequence relates to performance.
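
To see where extra trees stop helping, one option is to track the held-out log loss as trees are added. The sketch below is not the experiment's code: it assumes scikit-learn's GradientBoostingClassifier, and the synthetic dataset and the settings (200 trees, learning rate 0.1) are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the experiment's dataset (an assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)

# staged_predict_proba yields the ensemble's probabilities after 1, 2, ..., n trees.
staged_losses = [log_loss(y_test, proba) for proba in clf.staged_predict_proba(X_test)]

# Number of trees at which the held-out log loss is lowest.
best_n_trees = min(range(len(staged_losses)), key=staged_losses.__getitem__) + 1
print(f"trees fitted: {len(staged_losses)}, lowest held-out log loss at n_trees={best_n_trees}")
```

Plotting staged_losses against the tree count gives the kind of curve where the loss flattens out after some number of trees.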

n_trees: performance stopped improving beyond 800

 

Only the experiment results are posted here.

The source code for the experiment is available on GitHub.

(1) Logloss using all trees: 0.0003138062408693649
(2) Logloss using all trees but last: 0.0003138062408693649
(3) Logloss using all trees but first: 0.00032031490565337053

Lower Logloss is better.

There is no performance difference between using all trees (1) and removing the last tree (2).

The difference between using all trees (1) and removing the first tree (3) is small but visible: the Logloss is about 2% higher.
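
The comparison above can be sketched roughly as follows. This is a minimal illustration, not the code from the GitHub repository: it assumes scikit-learn's GradientBoostingClassifier with its default log-loss objective on a binary problem, and the dataset, the helper name logloss_without_tree, and the choice to evaluate on a held-out split are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary data stands in for the experiment's dataset (an assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def logloss_without_tree(clf, X, y, drop_index=None):
    """Log loss of the ensemble, optionally with one tree's contribution removed."""
    # Raw score = initial score + learning_rate * sum of all tree outputs.
    raw = clf.decision_function(X)
    if drop_index is not None:
        # Binary case: estimators_ has one regression tree per boosting stage.
        tree = clf.estimators_[drop_index, 0]
        raw = raw - clf.learning_rate * tree.predict(X)
    # The sigmoid maps raw scores to P(y = 1) for the default log-loss objective.
    proba = 1.0 / (1.0 + np.exp(-raw))
    return log_loss(y, proba)


clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)

loss_all = logloss_without_tree(clf, X_test, y_test)
loss_no_last = logloss_without_tree(clf, X_test, y_test, drop_index=-1)
loss_no_first = logloss_without_tree(clf, X_test, y_test, drop_index=0)

print("Logloss using all trees:", loss_all)
print("Logloss using all trees but last:", loss_no_last)
print("Logloss using all trees but first:", loss_no_first)
```

Subtracting a single tree's scaled output from decision_function avoids touching private attributes. The numbers it prints will differ from the ones listed above, since the data and settings are not the same.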

 

# With higher learning rate
Logloss using all trees: 3.0493086180356206e-06
Logloss using all trees but last: 3.054629157905757e-06
Logloss using all trees but first: 2.097409130044673

With a higher learning rate the gap becomes obvious: removing the first tree takes the Logloss from about 3e-06 to about 2.1, while removing the last tree barely changes it.
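
One way to read this result is through the additive form of a boosted model. The notation below is introduced here for illustration and is not from the original post: F_0 is the initial score, h_m is the m-th tree, \eta is the learning rate, and M is the number of trees.

```latex
F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)
\qquad\Longrightarrow\qquad
F_M^{(-k)}(x) = F_M(x) - \eta \, h_k(x)
```

Removing tree k shifts every raw score by \eta h_k(x). The first tree is fit when the errors are still largest, so its outputs are typically the largest in magnitude, while the last trees correct residuals that are already close to zero. A larger \eta scales up the shift, and every later tree was fit on the assumption that the first tree's step had been taken.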


Conclusion

 

Decision Tree uses only a single tree.

Random Forest uses its trees in parallel, so every tree is equally important.

Gradient Boosted Tree trains its trees sequentially, and the earlier trees matter more for the prediction.

 

 


Ref.

https://www.coursera.org/learn/competitive-data-science/