Tokenizer
yunajoe
2022. 9. 11. 13:00
What does tokenization mean?
- Tokenization splits a piece of text into smaller chunks, called tokens.
- There are 3 major types:
- Word Tokenization
==> Word tokenization usually splits on spaces.
ex) we will run ==> we, will, run
- Character Tokenization
ex) Relaxing ==> R-e-l-a-x-i-n-g
- Subword Tokenization
ex) Relaxing ==> Relax-ing
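The three types above can be sketched in plain Python. This is only a minimal illustration: the helper names and the `SUFFIXES` list are made up for this example, and real subword tokenizers learn their splits from data instead of using a fixed suffix list.

```python
def word_tokenize(text):
    # Word tokenization: split on whitespace.
    return text.split()


def char_tokenize(text):
    # Character tokenization: every character becomes a token.
    return list(text)


# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "ly")


def subword_tokenize(word):
    # Subword tokenization (toy version): peel off one known suffix.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]


print(word_tokenize("we will run"))   # ['we', 'will', 'run']
print(char_tokenize("Relaxing"))      # ['R', 'e', 'l', 'a', 'x', 'i', 'n', 'g']
print(subword_tokenize("Relaxing"))   # ['Relax', 'ing']
```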
Keras Tokenizer Class
- It is used to vectorize a text corpus.
- Each text is converted into either a sequence of integers or a vector that holds one coefficient per token, for example a binary value.
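To make the two output forms concrete, here is a rough pure-Python sketch. It is not the Keras implementation: the function names are hypothetical, and words are numbered in order of first appearance to keep the code short, whereas Keras numbers them by frequency. Slot 0 is left unused because Keras reserves index 0 and never assigns it to a word.

```python
def build_index(texts):
    # Assign integers starting at 1, in order of first appearance
    # (a simplification; Keras orders words by frequency).
    index = {}
    for text in texts:
        for word in text.lower().split():
            index.setdefault(word, len(index) + 1)
    return index


def to_sequence(text, index):
    # Integer sequence: one index per known word, unknown words are skipped.
    return [index[w] for w in text.lower().split() if w in index]


def to_binary_vector(text, index):
    # Binary vector: 1 if the word occurs in the text, 0 otherwise.
    vec = [0] * (len(index) + 1)  # slot 0 stays unused
    for i in to_sequence(text, index):
        vec[i] = 1
    return vec


corpus = ["deep learning", "machine learning"]
index = build_index(corpus)

print(index)                                        # {'deep': 1, 'learning': 2, 'machine': 3}
print(to_sequence("machine learning", index))       # [3, 2]
print(to_binary_vector("machine learning", index))  # [0, 0, 1, 1]
```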
Methods of Keras Tokenizer Class
- fit_on_texts
- texts_to_sequences
- texts_to_matrix
- sequences_to_matrix
fit_on_texts
- It updates the internal vocabulary based on a list of texts.
- It must be called before using texts_to_sequences or texts_to_matrix.
- attributes:
word_counts : A dictionary of words along with their counts.
word_docs : A dictionary of words that tells us how many documents contain each word.
word_index : A dictionary that assigns a unique integer to each word.
document_count : An integer giving the total number of documents used to fit the tokenizer.
from keras.preprocessing.text import Tokenizer
token = Tokenizer()
# Define a list of 4 documents
fit_text = ['Machine Learning Knowledge',
'Machine Learning',
'Deep Learning',
'Artificial Intelligence']
token.fit_on_texts(fit_text)
# Number of documents in our corpus (the number of sentences in the text list)
print(f"Number of documents in the corpus: {token.document_count}")
print(f"Word frequencies: {token.word_counts}")
print(f"Word index: {token.word_index}")
print(f"Number of documents containing each word: {token.word_docs}")
Number of documents in the corpus: 4
Word frequencies: OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])
Word index: {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
Number of documents containing each word: defaultdict(<class 'int'>, {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'intelligence': 1, 'artificial': 1})
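To see where the four values printed above come from, they can be recomputed with the standard library alone. This is a sketch under simplifying assumptions, not the Keras source: it only lowercases and splits on spaces (the punctuation filtering Keras applies by default is skipped, which does not matter for this corpus), and the variable names simply mirror the attribute names.

```python
from collections import Counter, OrderedDict

docs = ['Machine Learning Knowledge',
        'Machine Learning',
        'Deep Learning',
        'Artificial Intelligence']

word_counts = OrderedDict()  # word -> total number of occurrences
word_docs = Counter()        # word -> number of documents containing it

for doc in docs:
    words = doc.lower().split()
    for w in words:
        word_counts[w] = word_counts.get(w, 0) + 1
    for w in set(words):     # count each word once per document
        word_docs[w] += 1

document_count = len(docs)

# Most frequent word gets index 1. sorted() is stable, so words with
# the same count keep their first-seen order.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {w: i + 1 for i, (w, _) in enumerate(ranked)}

print(document_count)     # 4
print(dict(word_counts))
print(word_index)
print(dict(word_docs))
```

'learning' appears 3 times, so it gets index 1, followed by 'machine' with 2 occurrences; the remaining words all appear once and are numbered in the order they were first seen.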
## Sources