Week 11: Transformer

✅ Problems with RNN

RNNs cannot be parallelized because hidden states are computed sequentially: computing the t-th hidden state requires the (t-1)-th hidden state first.
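The sequential dependency can be seen in a minimal NumPy sketch of a vanilla RNN. The function name `rnn_hidden_states`, the tanh update rule, and the toy shapes are assumptions chosen for illustration, not something taken from the lecture.

```python
import numpy as np

def rnn_hidden_states(xs, W_xh, W_hh, b_h):
    """Vanilla RNN (assumed form): h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)."""
    h = np.zeros(W_hh.shape[0])          # h_0 is all zeros
    states = []
    for x_t in xs:                       # step t cannot start before step t-1 has produced h
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)              # shape (T, d_h)

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
xs = rng.normal(size=(T, d_in))
W_xh = rng.normal(size=(d_in, d_h))
W_hh = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)

H = rnn_hidden_states(xs, W_xh, W_hh, b_h)
```

The `for` loop is the bottleneck: each iteration reads the `h` written by the previous one, so the T steps cannot be computed at the same time.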
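Even the gated variants (GRU, LSTM) do not fully solve long-range dependencies, and like the vanilla RNN they remain sequential ⇒ a parallelizable model is needed
Attention module: parallel
- The decoder hidden state is compared against all encoder hidden states at once: the dot products, the softmax, and the weighted sum are each computed in a single step, producing the current context vector.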
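The three steps of the attention module (dot product, softmax, weighted sum) can be sketched in NumPy as follows. The function name `dot_product_attention` and the toy vectors are assumptions for illustration; the sketch uses plain unscaled dot-product scores, as described in the note above.

```python
import numpy as np

def dot_product_attention(dec_h, enc_hs):
    """dec_h: (d,) decoder hidden state; enc_hs: (T, d) encoder hidden states."""
    scores = enc_hs @ dec_h                  # all T dot products in one matrix-vector product
    scores = scores - scores.max()           # shift for numerical stability (softmax is unchanged)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the T encoder positions
    context = weights @ enc_hs               # weighted sum of encoder states, shape (d,)
    return context, weights

# Toy example: three encoder states, one decoder state.
enc_hs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
dec_h = np.array([1.0, 0.0])

context, weights = dot_product_attention(dec_h, enc_hs)
```

No step loops over encoder positions: every position is scored, normalized, and combined with array operations, which is what makes the module easy to run in parallel.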
RNNs are slow because they compute sequentially: remove the sequential dependency and compute in parallel in both the encoder and the decoder ⇒ the Transformer (NeurIPS)

CPU	GPU
Fast
Handles only a few operations at a time	Handles many operations at a time
Computes one by one	Computes all at once

✅ Transformer Overview

Encoder-Decoder approach
The paper uses machine translation with a parallel corpus as the task.
Final cost/error function is standard cross-entropy error on top of a softmax classifier.
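As a rough illustration of the cost function, the sketch below computes the mean cross-entropy of a softmax classifier over a short target sequence. The function name `softmax_cross_entropy`, the vocabulary size of 3, and the example logits are assumptions made for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """logits: (T, V) scores per target position; targets: (T,) integer token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    picked = log_probs[np.arange(len(targets)), targets]        # log-probability of each correct token
    return -picked.mean()                                       # average negative log-likelihood

# Toy example: 2 target positions, vocabulary of 3 tokens.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])

loss = softmax_cross_entropy(logits, targets)
```

The second position has uniform logits, so its contribution is log(3), the loss of a classifier that guesses uniformly over three tokens.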


✅ Encoder Internals


1) positional encoding

2) multi-head attention: self-attention

3) feed forward: neural network

4) add & norm
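How the four components fit together can be sketched as one simplified encoder block in NumPy. This is a sketch under several assumptions, not the paper's implementation: it uses a single attention head instead of multiple heads, layer normalization without learned gain and bias, random weights, and placeholder positional values. All names (`encoder_block_forward`, the keys of `p`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance (no learned parameters here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block_forward(x, pos_enc, p):
    """x, pos_enc: (T, d) arrays; p: dict of weight matrices."""
    h = x + pos_enc                                    # 1) positional encoding added to the input
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn_weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # every position attends to every position
    attn = attn_weights @ v @ p["Wo"]                  # 2) self-attention (one head in this sketch)
    h = layer_norm(h + attn)                           # 4) add & norm
    ff = np.maximum(0.0, h @ p["W1"]) @ p["W2"]        # 3) feed forward (ReLU between two linear maps)
    return layer_norm(h + ff)                          # 4) add & norm

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, d, d_ff = 4, 8, 16
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                         ("Wo", (d, d)), ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
x = rng.normal(size=(T, d))
pos_enc = rng.normal(size=(T, d)) * 0.1   # placeholder values, not a real positional-encoding formula

out = encoder_block_forward(x, pos_enc, p)
```

The output has the same shape as the input, which is what lets such blocks be stacked on top of each other.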

1️⃣ Positional encoding