
[Paper Short Review] Do sequence-to-sequence VAEs learn global features of sentences

Keypoints

  • VAE architecture.
  • A classification task, but with a quite interesting training setup.
  • It is again unclear how the latent code acts.
  • It uses the $\delta$-VAE free-bits formulation, but why? Apparently just to prevent posterior collapse.

Questions and Answers

Which model does it use?

A VAE built on a seq2seq LSTM autoencoder.
$$
\text{$L$ words: } x = (x_1, x_2, \cdots, x_L) \\
\text{$L$ embedded vectors: } (e_1, \cdots, e_L) \\
h_1, \cdots, h_L = \mathbf{LSTM}(e_1, \cdots, e_L)
$$

Next, the latent code is generated from the last hidden state $h_L$:

$$
\mu = \mathit{L_1}h_L \\
\sigma^2 = \exp{(\mathit{L_2}h_L)} \\
q_\phi(z|x) = \mathcal{N}(z|\mu, \mathbf{diag}(\sigma^2))
$$
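
A minimal PyTorch sketch of the encoder side (tokens → embeddings → LSTM → $\mu$, $\sigma^2$ → sampled $z$). The class and dimension names (`Encoder`, `emb_dim`, `hid_dim`, `lat_dim`) are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, lat_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(hid_dim, lat_dim)       # L_1 in the equations above
        self.to_logvar = nn.Linear(hid_dim, lat_dim)   # L_2 in the equations above

    def forward(self, x):
        # x: (batch, L) token ids -> embedded vectors (e_1, ..., e_L)
        e = self.embed(x)
        # h_1, ..., h_L = LSTM(e_1, ..., e_L); keep only the final hidden state h_L
        _, (h_L, _) = self.lstm(e)
        h_L = h_L.squeeze(0)                           # (batch, hid_dim)
        mu = self.to_mu(h_L)
        logvar = self.to_logvar(h_L)                   # log(sigma^2)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```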

Then the decoding step:
$$
h_0', \cdots, h_L' = \mathbf{LSTM}([e_{BOS};z], [e_1;z], \cdots, [e_L;z])
$$

Finally, the next-word distribution is
$$
p_\theta(x_{i+1}|x_1,\cdots, x_i, z) = \mathrm{softmax}(wh_i'+b)
$$
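
A matching sketch of the decoder, again assuming PyTorch: $z$ is concatenated to every input embedding (teacher forcing with a prepended BOS token), and a linear layer plus softmax gives the next-word distribution. `bos_id` and the dimensions are assumptions.

```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, lat_dim=32, bos_id=1):
        super().__init__()
        self.bos_id = bos_id
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # the latent code is concatenated to every input embedding: [e_i ; z]
        self.lstm = nn.LSTM(emb_dim + lat_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)       # w, b in the softmax equation

    def forward(self, x, z):
        # teacher forcing: prepend BOS so step i sees x_1..x_i when predicting x_{i+1}
        bos = torch.full((x.size(0), 1), self.bos_id, dtype=torch.long, device=x.device)
        e = self.embed(torch.cat([bos, x], dim=1))       # (batch, L+1, emb_dim)
        z_rep = z.unsqueeze(1).expand(-1, e.size(1), -1) # repeat z at every time step
        h, _ = self.lstm(torch.cat([e, z_rep], dim=-1))  # h'_0, ..., h'_L
        return self.proj(h)                              # logits; softmax -> p(x_{i+1} | x_1..x_i, z)
```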

The objective function is the evidence lower bound (ELBO) on the marginal log-likelihood:

$$
\mathrm{ELBO}(x, \theta, \phi) = -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi} [\log{p_\theta}(x|z)]
$$
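
For concreteness, a sketch of the ELBO as a training loss (negated, so it is minimized), assuming the `Encoder`/`Decoder` sketches above: the KL term against $\mathcal{N}(0, I)$ is computed in closed form and the reconstruction term via cross-entropy. `pad_id` is an assumption.

```python
import torch.nn.functional as F

def neg_elbo(encoder, decoder, x, pad_id=0):
    z, mu, logvar = encoder(x)
    logits = decoder(x, z)                             # (batch, L+1, vocab)
    # reconstruction: h'_i predicts x_{i+1}; the trailing EOS prediction is dropped for brevity
    rec = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                          x.reshape(-1), ignore_index=pad_id, reduction='sum')
    # closed-form KL( q_phi(z|x) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl                                    # minimizing this maximizes the ELBO
```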

What data is used?

Four small versions of labeled datasets with topic or sentiment labels ($\sim$70MB):

  • AG News
  • Amazon
  • Yahoo
  • Yelp

How does it deal with posterior collapse?

The objective function is modified using the free-bits formulation of the $\delta$-VAE [2]. For a desired rate $\lambda$:

$$
\max(D_{KL}(q_\phi(z|x) \,\|\, p(z)), \lambda) - \mathbb{E}_{q_\phi}[\log{p_\theta(x|z)}]
$$
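
A sketch of the free-bits modification on top of the loss above: the KL term is clamped from below at the target rate $\lambda$, so once it falls under $\lambda$ there is no gradient pushing it further toward zero (the mechanism behind posterior collapse).

```python
def neg_elbo_free_bits(encoder, decoder, x, lam=8.0, pad_id=0):
    z, mu, logvar = encoder(x)
    logits = decoder(x, z)
    rec = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                          x.reshape(-1), ignore_index=pad_id, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # max(KL, lambda): below the target rate the KL contributes only a constant
    return rec + torch.clamp(kl, min=lam)
```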

Contribution

  • Measures which words benefit most from the latent information (a rough sketch of one way to do this follows below).
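
An illustrative sketch under my own assumptions, not necessarily the paper's exact measurement protocol: compare per-token log-likelihoods when decoding with the inferred latent code versus a sample from the prior; tokens with a large gap are the ones that benefit most from $z$.

```python
def per_word_latent_benefit(encoder, decoder, x):
    _, mu, _ = encoder(x)
    z_prior = torch.randn_like(mu)                     # z ~ p(z) = N(0, I)
    logp_post = F.log_softmax(decoder(x, mu), dim=-1)  # decode with the inferred code (posterior mean)
    logp_prior = F.log_softmax(decoder(x, z_prior), dim=-1)
    idx = x.unsqueeze(-1)                              # pick out log-probs of the actual tokens
    gap = logp_post[:, :-1].gather(-1, idx) - logp_prior[:, :-1].gather(-1, idx)
    return gap.squeeze(-1)                             # (batch, L): per-word benefit from z
```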

Experiments

References

[1] Tom Bosc and Pascal Vincent. 2020. Do sequence-to-sequence VAEs learn global features of sentences? In Empirical Methods in Natural Language Processing (EMNLP).

[2] Ali Razavi, Aaron van den Oord, Ben Poole, and Oriol Vinyals. 2019. Preventing Posterior Collapse with delta-VAEs. In International Conference on Learning Representations (ICLR).