How to run pretraining code with fairseq
1. Download an unlabeled dataset for pretraining.
First, download the data. Any dataset will do as long as it comes with train and valid splits.
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
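If you want to sanity-check the download, the archive unpacks into a wikitext-103-raw/ directory with one plain-text file per split. The sketch below (plain Python, paths assumed from the unzip step above) just counts lines; any corpus stored as plain text with the same train/valid/test layout can be swapped in.
# Optional sanity check: confirm the extracted splits exist and peek at their size.
# Assumes unzip created wikitext-103-raw/ with wiki.{train,valid,test}.raw inside.
from pathlib import Path

raw_dir = Path("wikitext-103-raw")
for split in ("train", "valid", "test"):
    path = raw_dir / f"wiki.{split}.raw"
    n_lines = sum(1 for _ in path.open(encoding="utf-8"))
    print(f"{path}: {n_lines} lines")  # plain text, one paragraph/heading per line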
2. Encode text with GPT-2 BPE
Apply GPT-2's Byte Pair Encoding (BPE) to the text.
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
# dict.txt is required by --srcdict in the preprocessing step below
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
for SPLIT in train valid test; do \
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json gpt2_bpe/encoder.json \
--vocab-bpe gpt2_bpe/vocab.bpe \
--inputs wikitext-103-raw/wiki.${SPLIT}.raw \
--outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
--keep-empty \
--workers 60; \
done
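To see what this step produces, you can run the same GPT-2 BPE from Python: each line of text becomes a sequence of integer token ids, and the .bpe files store those ids as space-separated strings. A minimal sketch, assuming the gpt2_bpe/ files downloaded above (the module path may differ slightly between fairseq versions):
# Minimal sketch: apply the same GPT-2 BPE to a single line of text.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

bpe = get_encoder("gpt2_bpe/encoder.json", "gpt2_bpe/vocab.bpe")
ids = bpe.encode("Pretraining data gets masked at the token level.")
print(ids)              # list of integer BPE token ids
print(bpe.decode(ids))  # round-trips back to the original text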
Once this is done, keep the following three files under the data/ directory (and put the wikitext-103-raw/ folder with its .bpe outputs there as well, since the preprocessing step below reads everything from data/).
data/
- encoder.json
- dict.txt
- vocab.bpe
3. Preprocess and binarize
python fairseq_cli/preprocess.py \
--only-source \
--srcdict data/dict.txt \
--trainpref data/wikitext-103-raw/wiki.train.bpe \
--validpref data/wikitext-103-raw/wiki.valid.bpe \
--testpref data/wikitext-103-raw/wiki.test.bpe \
--destdir data/wikitext-103-bin \
--workers 60
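After this step, data/wikitext-103-bin should contain dict.txt plus a .bin/.idx pair per split. A quick way to confirm the binarization worked is to load a split back with fairseq's data utilities; a minimal sketch, with paths assumed from the --destdir above and APIs that may vary by fairseq version:
# Minimal sketch: reload the binarized training split to confirm it is readable.
from fairseq.data import Dictionary, data_utils

vocab = Dictionary.load("data/wikitext-103-bin/dict.txt")
dataset = data_utils.load_indexed_dataset("data/wikitext-103-bin/train", vocab)
print(f"vocab size: {len(vocab)}, train examples: {len(dataset)}")
print(vocab.string(dataset[0]))  # first example as space-separated BPE ids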
4. Train RoBERTa
TOTAL_UPDATES=125000 # Total number of training steps
WARMUP_UPDATES=10000 # Warmup the learning rate over this many updates
PEAK_LR=0.0005 # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512 # Max sequence length
MAX_POSITIONS=512 # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16 # Number of sequences per batch (batch size)
UPDATE_FREQ=16 # Increase the batch size 16x
DATA_DIR=data/wikitext-103-bin # Must match the --destdir used in the preprocessing step
fairseq-train --fp16 $DATA_DIR \
--task masked_lm --criterion masked_lm \
--arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--batch-size $MAX_SENTENCES --update-freq $UPDATE_FREQ \
--max-update $TOTAL_UPDATES --log-format simple --log-interval 1
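By default fairseq writes checkpoints to checkpoints/. Once training has saved one, you can load it through the RobertaModel hub interface and try the masked-LM head; a minimal sketch, where checkpoint_best.pt and the data path are assumptions based on fairseq defaults and the directories used above:
# Minimal sketch: load the pretrained checkpoint and query the masked-LM head.
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data/wikitext-103-bin",
)
roberta.eval()  # disable dropout
print(roberta.fill_mask("The capital of France is <mask>.", topk=3))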
References
[1] https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md