Text Similarity Using Word Embeddings🤚

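Word embedding is a way of placing words in a vector space so that their positions reflect semantic similarity.
One-hot encoding is also an option, but it makes it hard to capture how similar two words are.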
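To see why one-hot encoding hides similarity, here is a minimal sketch with a hypothetical three-word vocabulary (the words and helper names are made up for illustration). Every pair of distinct one-hot vectors is orthogonal, so the cosine similarity is the same for every pair of words.

```python
import math

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary, for illustration only
vocab = ['king', 'queen', 'apple']
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# 'king' looks exactly as unrelated to 'queen' as it does to 'apple'
sim_king_queen = cosine(vectors['king'], vectors['queen'])
sim_king_apple = cosine(vectors['king'], vectors['apple'])
print(sim_king_queen, sim_king_apple)
```

A dense embedding avoids this by letting related words share directions in the space.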

Suppose there are 10M words. We can think of projecting these 10,000,000 words into 1000-dimensional vectors x = (x_1, ..., x_1000). Two words are treated as closely related when they appear together in the same sentence.
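The co-occurrence idea can be sketched by simply counting how often two words share a sentence. The sentences below are hypothetical, and this is only a counting illustration: Word2Vec itself learns vectors from context windows rather than storing raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized sentences, for illustration only
sentences = [
    ['the', 'king', 'rules', 'the', 'kingdom'],
    ['the', 'queen', 'rules', 'the', 'kingdom'],
    ['i', 'ate', 'an', 'apple'],
]

cooccurrence = Counter()
for sentence in sentences:
    # Count each unordered pair once per sentence; sorting gives a stable key order
    unique_words = sorted(set(sentence))
    for pair in combinations(unique_words, 2):
        cooccurrence[pair] += 1

print(cooccurrence[('kingdom', 'rules')])  # appear together in two sentences
print(cooccurrence[('king', 'rules')])     # appear together in one sentence
print(cooccurrence[('apple', 'king')])     # never appear together
```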

Let's walk through the code and look at the details.

Key libraries

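%matplotlib inline

import os
import subprocess  # used below to decompress the model with zcat
from keras.utils import get_file
import gensim
import numpy as np
import matplotlib.pyplot as plt  # used for the t-SNE scatter plot

from sklearn.manifold import TSNE
import json
from collections import Counter
from itertools import chain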


 

Using Word2Vec word vectors pretrained on Google News data

MODEL = 'GoogleNews-vectors-negative300.bin'
# Download the gzipped model (get_file caches it after the first download)
path = get_file(MODEL + '.gz', 'https://deeplearning4jblob.blob.core.windows.net/resources/wordvectors/%s.gz' % MODEL)
if not os.path.isdir('generated'):
    os.mkdir('generated')

# Decompress once with zcat; gzip data is binary, so open the input in 'rb' mode
unzipped = os.path.join('generated', MODEL)
if not os.path.isfile(unzipped):
    with open(unzipped, 'wb') as fout, open(path, 'rb') as fin:
        zcat = subprocess.Popen(['zcat'], stdin=fin, stdout=fout)
        zcat.wait()




model = gensim.models.KeyedVectors.load_word2vec_format(unzipped, binary=True)


Gensim is a library for working with Word2Vec models; here it loads the pretrained vectors.

A is to B as C is to what?

Using vector addition and subtraction (B + C - A), the function returns res, the words whose vectors lie closest to the result.

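def A_is_to_B_as_C_is_to(a, b, c, topn=1):
    # Accept either a single word or a list of words for each argument
    a, b, c = map(lambda x: x if isinstance(x, list) else [x], (a, b, c))
    # b + c concatenates the word lists; most_similar adds the positive
    # vectors, subtracts the negative ones, and returns the nearest words
    res = model.most_similar(positive=b + c, negative=a, topn=topn)
    if len(res):
        if topn == 1:
            return res[0][0]
        return [x[0] for x in res]
    return None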
 
A_is_to_B_as_C_is_to('man', 'woman', 'king')
 
out[13]:
    'queen'
 
# Application 1
for country in 'Italy', 'France', 'India', 'China':
    print('%s is the capital of %s' % 
      (A_is_to_B_as_C_is_to('Germany', 'Berlin', country), country))
 
out[]:
    Rome is the capital of Italy
    Paris is the capital of France
    Delhi is the capital of India
    Beijing is the capital of China
 
# Application 2
for company in 'Google', 'IBM', 'Boeing', 'Microsoft', 'Samsung':
    products = A_is_to_B_as_C_is_to(
        ['Starbucks', 'Apple'], 
        ['Starbucks_coffee', 'iPhone'], 
        company, topn=3)
    print('%s -> %s' % 
        (company, ', '.join(products)))
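The arithmetic behind the analogy function can be sketched without the full model. The 3-dimensional vectors below are hypothetical toy values, not real Word2Vec vectors, and the `analogy` helper is an illustration: gensim's `most_similar` also normalizes the vectors before combining them, which this sketch skips.

```python
import math

# Hypothetical 3-d toy vectors, for illustration only
toy = {
    'man':   [1.0, 0.0, 0.0],
    'woman': [1.0, 1.0, 0.0],
    'king':  [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
    'apple': [0.0, 0.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to (b + c - a), excluding the three inputs."""
    target = [vb + vc - va
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy('man', 'woman', 'king', toy)
print(result)
```

Excluding the input words matters: without it, the nearest word to the target is often one of the inputs themselves.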



t-SNE (t-distributed stochastic neighbor embedding)🤓

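t-SNE is an embedding that projects high-dimensional vectors into a low-dimensional space. Because high-dimensional vectors are hard to visualize, they are usually projected onto a 2D plane. Setting n_components=d selects the target number of dimensions.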
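As a small sketch of how n_components controls the output, the snippet below runs scikit-learn's TSNE on random data. The 30 random points, the perplexity of 5, and the fixed random_state are assumptions chosen only to keep the example fast and repeatable; they are not the settings used in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

# 30 hypothetical points in 50 dimensions (random data, illustration only)
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 50))

# perplexity must be smaller than the number of samples
embedded_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(points)
embedded_3d = TSNE(n_components=3, perplexity=5, random_state=0).fit_transform(points)

# One row per input point, one column per requested dimension
print(embedded_2d.shape, embedded_3d.shape)
```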

TSNE is available in scikit-learn via from sklearn.manifold import TSNE.

  
# item_vectors is assumed to be a list of (word, vector) pairs built earlier
vectors = np.asarray([x[1] for x in item_vectors])
# Scale every vector to unit length before running t-SNE
lengths = np.linalg.norm(vectors, axis=1)
norm_vectors = (vectors.T / lengths).T

tsne = TSNE(n_components=2, perplexity=10, verbose=2).fit_transform(norm_vectors)

## (t-SNE fitting log omitted) ---------------------

x = tsne[:, 0]
y = tsne[:, 1]

fig, ax = plt.subplots()
ax.scatter(x, y)

# Label each point with its word
for item, x1, y1 in zip(item_vectors, x, y):
    ax.annotate(item[0], (x1, y1), size=14)

plt.show()
 



Rudi
