Week 10: Tokenizer & Subwords

How to process data before training a model

Tokenization splits text into the smallest meaningful units (tokens) so that the data can be used for training.
It is an important part of the pre-processing step in NLP.
Once we get a piece of text, we can break it into meaningful chunks, or units, that can be processed together.
Sometimes called “parsers” or “tokenizers”: tools that perform tokenization
Natural Language Toolkit (NLTK)
- Python library: import nltk
- Features: morphological analyzer, part-of-speech (POS) tagging
Issues in tokenization: how should the text be split?
- Finland’s capital → Finland? Finlands? Finland’s?
- Hewlett-Packard → Hewlett and Packard as two tokens
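A minimal sketch of why these cases are ambiguous, using only Python's standard re module rather than NLTK. The three splitting rules and all variable names are illustrative assumptions, not the behavior of any particular NLTK tokenizer; the point is only that each rule gives a different answer for the same input.

```python
import re

text = "Finland's capital and Hewlett-Packard"

# Rule 1 (assumed): split on whitespace only.
# Apostrophes and hyphens stay inside the word.
whitespace_tokens = text.split()

# Rule 2 (assumed): every run of word characters or run of punctuation
# is its own token, so "Finland's" and "Hewlett-Packard" are broken up.
punct_split_tokens = re.findall(r"\w+|[^\w\s]+", text)

# Rule 3 (assumed): drop punctuation entirely and keep only word characters,
# which yields "Finland" and a stray "s", and two tokens for Hewlett-Packard.
word_only_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(punct_split_tokens)
print(word_only_tokens)
```

None of the three rules is "correct" in general: which one is appropriate depends on whether the downstream task needs the possessive or the hyphenated name preserved as a single unit.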
Types of preprocessing for English: features provided by nltk