[Paper Short Review] Do sequence-to-sequence VAEs learn global features of sentences?
Keypoints
- VAE architecture.
- Classification datasets, but with quite an interesting training setup.
- It is again unclear how the latent code acts.
- It uses the $\delta$-VAE free-bits objective; why? Just to prevent posterior collapse.
Questions and Answers
Which model does it use?
A VAE built on a seq2seq LSTM autoencoder.
$$
\text{$L$ words: } x = (x_1, x_2, \cdots, x_L) \\
\text{$L$ embedded vectors: } (e_1, \cdots, e_L) \\
h_1, \cdots, h_L = \mathbf{LSTM}(e_1, \cdots, e_L)
$$
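A rough PyTorch sketch of this encoding step; the sizes and variable names here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes -- assumptions for this sketch, not values from the paper.
vocab_size, emb_dim, hid_dim, L = 10000, 256, 512, 7

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)

x = torch.randint(0, vocab_size, (1, L))  # one sentence of L word ids
e = embedding(x)                          # embedded vectors e_1, ..., e_L
h, _ = encoder(e)                         # hidden states h_1, ..., h_L
h_L = h[:, -1]                            # last hidden state h_L, shape (1, hid_dim)
```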
Next, the latent vector is generated from the last hidden state $h_L$.
$$
\mu = \mathit{L_1}h_L \\
\sigma^2 = \exp{(\mathit{L_2}h_L)} \\
q_\phi(z|x) = \mathcal{N}(z|\mu, \mathbf{diag}(\sigma^2))
$$
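A sketch of these two linear maps and the reparameterized sample, again with assumed sizes (`h_L` stands in for the last encoder hidden state from the sketch above):

```python
import torch
import torch.nn as nn

hid_dim, latent_dim = 512, 32        # assumed sizes
h_L = torch.randn(1, hid_dim)        # stand-in for the last encoder hidden state

L1 = nn.Linear(hid_dim, latent_dim)  # mean head
L2 = nn.Linear(hid_dim, latent_dim)  # log-variance head

mu = L1(h_L)                         # mu = L_1 h_L
log_var = L2(h_L)                    # sigma^2 = exp(L_2 h_L)
sigma = torch.exp(0.5 * log_var)

# Reparameterization trick: z ~ N(mu, diag(sigma^2))
z = mu + sigma * torch.randn_like(sigma)
```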
Then the decoding step:
$$
h_1', \cdots, h_L' = \mathbf{LSTM}([e_{BOS};z], [e_1;z], \cdots, [e_{L-1};z])
$$
Finally,
$$
p_\theta(x_i \mid x_1,\cdots, x_{i-1}, z) = \mathrm{softmax}(wh_i'+b)
$$
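Putting the decoding step together: $z$ is concatenated to every input embedding, and each decoder hidden state is projected to a distribution over the vocabulary. A sketch with the same assumed sizes as above:

```python
import torch
import torch.nn as nn

# Assumed sizes, matching the earlier sketches.
vocab_size, emb_dim, hid_dim, latent_dim, L = 10000, 256, 512, 32, 7

embedding = nn.Embedding(vocab_size, emb_dim)
decoder = nn.LSTM(emb_dim + latent_dim, hid_dim, batch_first=True)
out_proj = nn.Linear(hid_dim, vocab_size)        # the (w, b) softmax layer

x = torch.randint(0, vocab_size, (1, L))         # target sentence x_1, ..., x_L
z = torch.randn(1, latent_dim)                   # latent code (from the encoder/posterior)

bos = torch.zeros(1, 1, dtype=torch.long)        # assumed BOS id of 0
prev = embedding(torch.cat([bos, x[:, :-1]], dim=1))                   # e_BOS, e_1, ..., e_{L-1}
inputs = torch.cat([prev, z.unsqueeze(1).expand(-1, L, -1)], dim=-1)   # [e_i; z] at each step

h_dec, _ = decoder(inputs)                       # h'_1, ..., h'_L
logits = out_proj(h_dec)                         # w h'_i + b at every position
log_probs = torch.log_softmax(logits, dim=-1)    # log p(x_i | x_<i, z)
```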
The objective function is the evidence lower bound (ELBO) on the marginal log-likelihood:
$$
\mathrm{ELBO}(x, \theta, \phi) = -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi} [\log{p_\theta}(x|z)]
$$
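Since the posterior is a diagonal Gaussian and the prior a standard normal, the KL term has a closed form; below is a sketch of the resulting negative ELBO loss, with assumed toy shapes.

```python
import torch

def neg_elbo(log_probs, targets, mu, log_var):
    """Negative ELBO = KL(q(z|x) || N(0, I)) - E_q[log p(x|z)], averaged over the batch."""
    # Reconstruction term: sum of log p(x_i | x_<i, z) over the words of each sentence
    rec = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(dim=1)
    # Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=1)
    return (kl - rec).mean()

# Toy tensors with assumed shapes: batch of 1, L=7 words, vocab 10000, latent 32
log_probs = torch.log_softmax(torch.randn(1, 7, 10000), dim=-1)
targets = torch.randint(0, 10000, (1, 7))
mu, log_var = torch.randn(1, 32), torch.randn(1, 32)
loss = neg_elbo(log_probs, targets, mu, log_var)
```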
What data is used?
Four small versions of labeled datasets (topic or sentiment), around $\sim$70MB:
- AG News
- Amazon
- Yahoo
- Yelp
How does it deal with posterior collapse?
The objective function is modified using the free-bits formulation of the $\delta$-VAE [2]. For a desired rate $\lambda$, the loss to minimize becomes
$$
\max(D_{KL}(q_\phi(z|x) \,\|\, p(z)), \lambda) - \mathbb{E}_{q_\phi}[\log{p_\theta(x|z)}]
$$
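Compared to the plain negative ELBO, the only change is clamping the KL term from below at the target rate $\lambda$; a sketch of that modification (with `lam` as an assumed rate):

```python
import torch

def free_bits_loss(rec_log_prob, kl, lam=3.0):
    """max(KL, lambda) - E_q[log p(x|z)]; lam is an assumed target rate."""
    return torch.clamp(kl, min=lam) - rec_log_prob

# When the KL falls below lambda, the clamp zeroes its gradient, so the model is no
# longer rewarded for shrinking the KL further -- the intended guard against collapse.
kl = torch.tensor(0.5)
rec_log_prob = torch.tensor(-42.0)
loss = free_bits_loss(rec_log_prob, kl)
```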
Contribution
- Measure which words benefit most from the latent information.
Experiments
References
[1] Tom Bosc and Pascal Vincent. 2020. Do Sequence-to-Sequence VAEs Learn Global Features of Sentences? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[2] Ali Razavi, Aaron van den Oord, Ben Poole, and Oriol Vinyals. 2019. Preventing Posterior Collapse with delta-VAEs. In International Conference on Learning Representations (ICLR).