๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

๋”ฅ๋Ÿฌ๋‹

Bayesian 1 - Think Bayesian approach

Think Bayesian approach ๐Ÿ‘จ‍๐ŸŒพ

 

ํ™•๋ฅ ๋ก ์—์„œ Frequentist์™€ Bayesian์€ ๋นผ๋†“์„ ์ˆ˜ ์—†๋Š” ๋…ผ์Ÿ ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค. ๋‘ ๊ฐœ์˜ ์ ‘๊ทผ๋ฒ•์€ ๊ทผ๊ฐ„์ด ๋˜๋Š” ๊ฐ€์ •์ด ์„œ๋กœ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์—, ๋ชจ๋ธ๊ณผ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฒฌํ•ด๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. 

 ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋กœ ์‹œ์ž‘ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋กœ๋˜๋ฅผ ๊ตฌ๋งคํ–ˆ์„ ๋•Œ, ์šฐ๋ฆฌ๋Š” ๋‹น์ฒจ์ด ๋  ํ™•๋ฅ ์ด ๊ต‰์žฅ์ด ๋‚ฎ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ๊ณ , ์‹ค์ œ๋กœ ๊ทธ ํ™•๋ฅ ์„ ์ˆ˜์น˜์ƒ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ˆ˜์น˜๋ฅผ ์•ˆ๋‹ค๊ณ  ํ•ด์„œ ๋‚ด๊ฐ€ ๋กœ๋˜์— ๋‹น์ฒจ๋  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ถˆํ™•์‹คํ•œ ํ˜„์ƒ๊ณผ ๋žœ๋ค์˜ ์„ฑ์งˆ์ด Frequentist์™€ Bayesian์„ ๋‚˜๋ˆ„๋Š” ๊ทผ๊ฐ„์ด ๋ฉ๋‹ˆ๋‹ค. 

 

  2์›”์˜ ๋„ท์งธ ์ฃผ ํ† ์š”์ผ์— ๋กœ๋˜ ๋ฒˆํ˜ธ $X$ ๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค. $X$๋Š” ์–ด๋– ํ•œ ๋ชจ๋ธ $\theta$์— ์˜ํ•ด์„œ ์ƒ์„ฑ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.  ์ฆ‰ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ชจ๋ธ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์œ„์—์„œ ์„ค๋ช…ํ•œ ํ˜„์ƒ์€ $X$์™€ $\theta$ ์ค‘์— ์–ด๋–ค ๊ณณ์—์„œ ๋ฐœ์ƒ๋œ ๊ฒƒ์ผ๊นŒ์š”? ๋ฐ์ดํ„ฐ ์ž์ฒด์— Randomํ•œ ์„ฑ์งˆ์ด ์žˆ๋Š”์ง€ ์•„๋‹ˆ๋ฉด ๋ชจ๋ธ์— ์žˆ๋Š”์ง€ ํ™•์‹ ํ•  ์ˆ˜๋Š” ์—†์ง€๋งŒ, ์ด์— ๋Œ€ํ•ด์„œ ์ด์•ผ๊ธฐํ•ด๋ณผ ์ˆ˜๋Š” ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

 

๐ŸŠ Bayesian vs Frequentist 

View point

 

Frequentist๋Š” ์ผ์–ด๋‚œ ํ˜„์ƒ์— ๋Œ€ํ•ด์„œ ์ƒ๊ฐ์„ ํ•˜๋Š” ๋ฐ˜๋ฉด์— Baysian์€ ๊ทธ ํ˜„์ƒ์„ ์œ ๋ฐœํ•˜๋Š” ๋ฌด์–ธ๊ฐ€์— ๋Œ€ํ•œ ์กฐ๊ฑด์„ ์ด์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค. 

 

| | Frequentist | Bayesian |
| --- | --- | --- |
| View point | Objective | Subjective |
| Data and parameters | $X$ is random and $\theta$ is fixed | $\theta$ is random and $X$ is fixed |
| Data size | $\lvert X \rvert \gg \lvert \theta \rvert$ | Works for any $\lvert X \rvert$ |
| Training | Maximum likelihood: $\hat{\theta} = \arg\max_\theta P(X \mid \theta)$ | Bayes' theorem: $P(\theta \mid X) = \frac{P(X \mid \theta)P(\theta)}{P(X)}$ |
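To make the contrast in the last row concrete, here is a minimal sketch (a toy example, not taken from the course) that estimates the heads probability of a coin from ten hypothetical tosses: the frequentist answer is a single maximum-likelihood value, while the Bayesian answer is a full posterior distribution obtained from an assumed Beta(2, 2) prior.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 10 coin tosses, 1 = heads, 0 = tails.
tosses = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])
heads, n = int(tosses.sum()), len(tosses)

# Frequentist: theta is a fixed unknown; report the value maximizing P(X | theta).
theta_mle = heads / n
print(f"MLE point estimate: {theta_mle:.2f}")

# Bayesian: theta is random; combine the likelihood with an assumed Beta(2, 2) prior.
# Beta is conjugate to the Bernoulli likelihood, so the posterior is again a Beta.
a0, b0 = 2, 2
posterior = stats.beta(a0 + heads, b0 + n - heads)
print(f"Posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```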

 


๐ŸŠ Classification of Bayesian

Training

Bayesian์˜ ํ›ˆ๋ จ์€ ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ๋ชจ๋ธ์˜ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ํ›ˆ๋ จ์‹œํ‚ค๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด์„œ, ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ๊ฐ€์žฅ ๋†’์„ ํ™•๋ฅ ์„ ์ง€๋‹ˆ๊ฒŒ ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

 

$P(\theta|X_{tr}, y_{tr}) = \frac{P(y_{tr} | X_{tr}, \theta)P(\theta)}{P(y_{tr}|X_{tr})}$

Prediction

์ˆ˜๋งŽ์€ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ์—, Prediction์€ ๋ชจ๋“  $\theta$์— ๋Œ€ํ•ด์„œ ํ™•๋ฅ ์„ Integralํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค. 

 

$P(y_{ts}|X_{ts}, X_{tr}, y_{tr}) = \int P(y_{ts}|X_{ts},\theta)P(\theta|X_{tr},y_{tr})d\theta$
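Below is a minimal sketch of this marginalization for a hypothetical Bernoulli model, replacing the integral with a sum over a grid of $\theta$ values (the data counts and the flat prior are assumptions for illustration only).

```python
import numpy as np

# Hypothetical training data: 7 successes out of 10 Bernoulli trials.
successes, n = 7, 10

# Grid of candidate parameters and a flat prior P(theta).
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)

# Posterior P(theta | data) is proportional to likelihood * prior.
likelihood = theta**successes * (1 - theta)**(n - successes)
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta)

# Predictive probability that the next observation is 1:
# integrate P(y = 1 | theta) * P(theta | data) over theta.
p_next = np.trapz(theta * posterior, theta)
print(f"P(next trial = 1 | data) = {p_next:.3f}")   # ~0.667 with a flat prior
```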

 

 

On-line Learning

Bayesian์˜ ์žฅ์  ์ค‘ ํ•˜๋‚˜๋Š” On-line Learning์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค. On-line learning์€ ํ˜„์žฌ ํ•™์Šต๋œ ์ƒํƒœ์—์„œ ์ถ”๊ฐ€๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.  $x_k$๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, Bayes Theorem์„ ์ด์šฉํ•ด์„œ ๊ธฐ์กด์˜ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ $P_k(\theta)$๋ฅผ Prior๋กœ ์„ค์ •ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. 

 

$P_k(\theta) = P(\theta|x_k) = \frac{P(x_k|\theta)P_{k-1}(\theta)}{P(x_k)}$
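As a sketch of this update rule, assume a Beta–Bernoulli conjugate pair so that each posterior $P_k(\theta)$ can be reused directly as the prior for the next data point (the data stream and the true parameter below are made up).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
stream = rng.binomial(1, 0.3, size=100)   # hypothetical data stream, true theta = 0.3

# P_0(theta) is a uniform Beta(1, 1) prior.
a, b = 1.0, 1.0
for x_k in stream:
    # Bayes update: the posterior P_k(theta) becomes the prior for the next point.
    a += x_k
    b += 1 - x_k

print(f"Posterior mean after {len(stream)} points: {a / (a + b):.3f}")
print(f"95% credible interval: {stats.beta(a, b).interval(0.95)}")
```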

 

 

Frequentist approach to classification

Frequentist๋Š” ์ฐธ์ธ ๋ชจ๋ธ์ด ์กด์žฌํ•˜๋ฏ€๋กœ, ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•˜๋Š” ๋ชจ๋ธ์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. 

 


๐ŸŠ Bayesian Network

 

Bayesian์—์„œ ์ค‘์š”ํ•œ ์š”์†Œ๋Š” ํ˜„์žฌ ์ƒํƒœ๋ฅผ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด์„œ, Prior๊ฐ€ ์กด์žฌํ•œ๋‹ค๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค. ๋™์ „์˜ ํ™•๋ฅ ์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„œ, ๋ฌด์ˆ˜ํžˆ ๋ฐ˜๋ณตํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ •ํ•˜๋Š” ๋™์ „์˜ ํ™•๋ฅ ์ด Prior๋กœ ์ž‘์šฉํ•˜๊ณ , ์ด๋ฅผ ํ† ๋Œ€๋กœ Likelihood๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค๋ฉด, ์ตœ์ข…์ ์œผ๋กœ ๋ชจ๋ธ์— ๋Œ€ํ•œ ํ™•๋ฅ ์ธ Posterior๊ฐ€ ๊ตฌํ•ด์ง‘๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ด€๊ณ„๋ฅผ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด์„œ Bayesian Network์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

 

- Nodes: random variables
- Edges: direct influence between them

$P(X_1, \cdots , X_n) = \prod_{k=1}^{n}P(X_k|Pa(X_k))$

Here, for example, $Pa(A)=\{C\}$ and $Pa(B)=\{A, C\}$, so $P(A,B,C) = P(C)P(A|C)P(B|A,C)$.
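To make this factorization concrete, here is a small sketch with three binary variables wired as $C \to A$, $C \to B$, $A \to B$; the conditional probability tables are made-up numbers, and the joint is evaluated as the product over parents.

```python
from itertools import product

# Hypothetical conditional probability tables for binary C, A, B
# with edges C -> A, C -> B, A -> B, so Pa(A) = {C} and Pa(B) = {A, C}.
P_C = {1: 0.3, 0: 0.7}
P_A1_given_C = {1: 0.9, 0: 0.2}                     # P(A=1 | C=c)
P_B1_given_AC = {(1, 1): 0.8, (1, 0): 0.5,          # P(B=1 | A=a, C=c)
                 (0, 1): 0.4, (0, 0): 0.1}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(C) * P(A | C) * P(B | A, C)."""
    p_a = P_A1_given_C[c] if a == 1 else 1 - P_A1_given_C[c]
    p_b = P_B1_given_AC[(a, c)] if b == 1 else 1 - P_B1_given_AC[(a, c)]
    return P_C[c] * p_a * p_b

# The joint sums to 1 over all 2^3 assignments.
print(sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3)))  # 1.0
print(joint(1, 1, 1))  # 0.3 * 0.9 * 0.8 = 0.216
```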

 

์—ฌ๊ธฐ์„œ Parent(Pa)๊ฐ€ ์šฐ๋ฆฌ๊ฐ€ ๊ธฐ์กด์— ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฏฟ์Œ Prior๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฏฟ์Œ ๋•Œ๋ฌธ์—, ์œ„ํ•ด์„œ Bayesian์˜ View point๋ฅผ Subject๋ผ๊ณ  ์ด์•ผ๊ธฐ ํ–ˆ์Šต๋‹ˆ๋‹ค. 


Naive Bayes Classifier

 

Bayesian๊ณผ Classification๋ชจ๋ธ์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

 

Assume that there is a class $c$ and features $f_i,\ i=1,\cdots,N$, each of which is influenced by $c$.

 

Class์ธ C์— ๋Œ€ํ•ด์„œ ๊ฐ๊ฐ์˜ ํŠน์„ฑ(Feature)์— ๋Œ€ํ•œ ์„ ํ˜ธ๋„ ํ˜น์€ ํ™•๋ฅ ์„ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํƒ€์ดํƒ€๋‹‰์„ ์˜ˆ๋กœ ๋“ค์–ด๋ณด์ž๋ฉด, ์ƒ์กดํ•œ ์‚ฌ๋žŒ(C)๊ณผ ์„ฑ๋ณ„ ํŠน์„ฑ(Feature)์— ๋Œ€ํ•˜์—ฌ C=์ƒ์กด ์ด๋ผ๋ฉด, ์—ฌ์„ฑ์ผ ํ™•๋ฅ ์ด ๋‚จ์„ฑ์ผ ํ™•๋ฅ ๋ณด๋‹ค ๋†’์Šต๋‹ˆ๋‹ค. 

(์ด๋Š” ํƒ€์ดํƒ€๋‹‰ EDA๋ฅผ ํ†ตํ•ด์„œ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.) 

 

$P(c, f_1, \cdots, f_N) = P(c)\prod_{i=1}^{N}P(f_i|c)$
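A minimal sketch of this classifier in the spirit of the Titanic example, with made-up probabilities (not the real Titanic statistics): the class is survival, the features are sex and passenger class, and prediction picks the class maximizing $P(c)\prod_i P(f_i|c)$.

```python
# Hypothetical class priors and per-feature likelihoods (illustrative numbers only).
P_c = {"survived": 0.4, "died": 0.6}
P_sex = {("female", "survived"): 0.70, ("male", "survived"): 0.30,
         ("female", "died"): 0.15, ("male", "died"): 0.85}
P_pclass = {(1, "survived"): 0.40, (2, "survived"): 0.25, (3, "survived"): 0.35,
            (1, "died"): 0.15, (2, "died"): 0.20, (3, "died"): 0.65}

def predict(sex, pclass):
    """Return argmax_c P(c) * P(sex | c) * P(pclass | c) and the normalized posteriors."""
    scores = {c: P_c[c] * P_sex[(sex, c)] * P_pclass[(pclass, c)] for c in P_c}
    total = sum(scores.values())
    return max(scores, key=scores.get), {c: s / total for c, s in scores.items()}

print(predict("female", 1))   # -> ('survived', {...})
print(predict("male", 3))     # -> ('died', {...})
```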

 

๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๋”ฅ๋Ÿฌ๋‹ ๊ฐ™์€ ๊ฒฝ์šฐ ๋ณดํ†ต, ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ˆ˜๊ฐ€ ๋ฌด์ฒ™์ด๋‚˜ ๋งŽ์ด ๋•Œ๋ฌธ์— ์ด๋Ÿฌํ•œ ํ˜•ํƒœ๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์ถ•์•ฝํ•ด์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 

We can use plate notation for this.


๐ŸŠ Linear Regression

[์„ค๋ช… ์ถ”๊ฐ€ ์˜ˆ์ •]

 

 

Univariate normal

$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Multivariate normal

$\mathcal{N}(x|\mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]$
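As a quick sanity check of this formula, the sketch below (toy numbers) evaluates the density directly and compares it with `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-D example: evaluate N(x | mu, Sigma) at a single point.
x = np.array([1.0, 2.0])
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

diff = x - mu
norm_const = np.sqrt(np.linalg.det(2 * np.pi * Sigma))      # sqrt(|2 pi Sigma|)
density = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

print(density)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))        # should match
```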

Covariance matrix and number of free parameters (for data dimension $D$):

- Full: $D(D+1)/2$
- Diagonal: $D$
- Spherical: $1$
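For example, with $D = 3$ the full covariance matrix has $3 \cdot 4 / 2 = 6$ free parameters (3 variances and 3 covariances, since the matrix is symmetric), the diagonal case has 3, and the spherical case ($\Sigma = \sigma^2 I$) has only 1.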

Least squares problem

$$L(w) = \sum_{i=1}^N(w^Tx_i - y_i)^2 = \|w^TX - y\|^2 \to \min_w$$

We can define the probabilistic model as follows:

$$
\begin{aligned}
P(w,y|X) &= P(y|X,w)P(w) \\
P(y|X,w) &= \mathcal{N}(y|w^TX, \sigma^2 I) \\
P(w) &= \mathcal{N}(w|0, \gamma^2 I)
\end{aligned}
$$

$P(w|y,X)$ is what we have to maximize.

$$
P(w|y,X) = \frac{P(y,w|X)}{P(y|X)}
$$

Since the term $P(y|X)$ does not depend on $w$, it is enough to maximize $P(y,w|X)$.

Since the log function is monotonically increasing, maximizing the log of this quantity is equivalent, so we get

$$
\begin{aligned}
P(w,y|X) &= P(y|X,w)P(w) \\
\log{P(w,y|X)} &= \log\left(P(y|X,w)P(w)\right) \\
&= \log{P(y|X,w)} + \log{P(w)}
\end{aligned}
$$

$$
\begin{aligned}
\log{P(y|X,w)} + \log{P(w)} &= \log\left(C_1 \exp\left(-\tfrac{1}{2}(y-w^TX)^T(\sigma^2 I)^{-1}(y-w^TX)\right)\right) \\
&\quad + \log\left(C_2 \exp\left(-\tfrac{1}{2}w^T(\gamma^2 I)^{-1}w\right)\right) \\
&= -\frac{1}{2\sigma^2}(y-w^TX)^T(y-w^TX) - \frac{1}{2\gamma^2}w^Tw + \text{const}
\end{aligned}
$$

If we turn this maximization into a minimization (multiplying by $-2\sigma^2$ and dropping the constant), we get the least squares problem plus L2 regularization, with $\lambda = \sigma^2/\gamma^2$.

$$
\|y-w^TX\|^2 + \lambda\|w\|^2 \to \min_w
$$
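A numerical sketch of this correspondence with toy data (using the usual design-matrix convention $y \approx Xw$): the MAP weights solve the regularized normal equations, and setting the regularization to zero recovers ordinary least squares; `lam` plays the role of $\lambda = \sigma^2/\gamma^2$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: y = X @ w_true + Gaussian noise.
n, d = 50, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=0.1, size=n)

def map_weights(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via the regularized normal equations."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_mle = map_weights(X, y, lam=0.0)   # plain least squares (maximum likelihood)
w_map = map_weights(X, y, lam=1.0)   # MAP with a Gaussian prior on w
print("MLE:", np.round(w_mle, 3))
print("MAP:", np.round(w_map, 3))    # slightly shrunk toward zero
```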

 

References 

 

[1] Coursera www.coursera.org/learn/bayesian-methods-in-machine-learning/home/welcome