Tokenizer
AI/DeepLearning | 2022. 9. 11. 13:00
What does tokenization mean?
- Tokenization splits a piece of text into smaller chunks called tokens
- There are 3 major types
- Word Tokenization
==> word tokenization usually splits the text on spaces
ex) we will run ==> we, will, run
- Character Tokenization
ex) Relaxing ==> R-e-l-a-x-i-n-g
- Subword tokenization
ex) Relaxing ==> Relax-ing
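The three types above can be sketched in plain Python. This is only an illustration: the function names and the tiny `SUFFIXES` list are made up for this example, and real subword tokenizers learn their splits from data rather than from a fixed suffix list.

```python
# Hypothetical suffix list, used only to illustrate a subword split.
SUFFIXES = ("ing", "ed", "ly")


def word_tokenize(text):
    # Word tokenization: split on whitespace.
    return text.split()


def char_tokenize(text):
    # Character tokenization: every character becomes a token.
    return list(text)


def subword_tokenize(word):
    # Naive subword tokenization: peel off one known suffix if present.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


print(word_tokenize("we will run"))   # ['we', 'will', 'run']
print(char_tokenize("Relaxing"))      # ['R', 'e', 'l', 'a', 'x', 'i', 'n', 'g']
print(subword_tokenize("Relaxing"))   # ['Relax', 'ing']
```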
Keras Tokenizer Class
- It is used to vectorize a text corpus
- Each text is converted into either a sequence of integers or a vector that holds one coefficient per token, for example a binary value
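To make the two output forms concrete, here is a minimal pure-Python sketch. It is not how Keras implements the class: `build_index`, `to_sequence`, and `to_binary_vector` are hypothetical helpers, and the index here follows first appearance, whereas the Keras example in this post shows indices ordered by word frequency. Slot 0 of the vector is left unused on the assumption that index 0 is reserved.

```python
def build_index(texts):
    # Give each distinct lowercase word an integer, starting at 1,
    # in order of first appearance (a simplification for this sketch).
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index


def to_sequence(text, index):
    # Integer sequence: replace each known word with its integer.
    return [index[w] for w in text.lower().split() if w in index]


def to_binary_vector(text, index):
    # Binary vector: 1 at the position of every word present in the text.
    vector = [0] * (len(index) + 1)  # position 0 stays unused
    for i in to_sequence(text, index):
        vector[i] = 1
    return vector


texts = ["Machine Learning Knowledge", "Deep Learning"]
index = build_index(texts)

print(index)                                     # {'machine': 1, 'learning': 2, 'knowledge': 3, 'deep': 4}
print(to_sequence("Deep Learning", index))       # [4, 2]
print(to_binary_vector("Deep Learning", index))  # [0, 0, 1, 0, 1]
```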
Methods of Keras Tokenizer Class
- fit_on_texts
- texts_to_sequences
- texts_to_matrix
- sequences_to_matrix
fit_on_texts
- It updates the internal vocabulary based on a list of texts
- It must be called before using texts_to_sequences or texts_to_matrix
- attributes:
word_counts : It is a dictionary of words along with the counts.
word_docs : It is also a dictionary of words; it tells us how many documents contain each word.
word_index : In this dictionary, we have unique integers assigned to each word.
document_count : This integer count will tell us the total number of documents used for fitting the tokenizer.
    from keras.preprocessing.text import Tokenizer

    token = Tokenizer()

    # Define a list of 4 documents
    fit_text = ['Machine Learning Knowledge',
                'Machine Learning',
                'Deep Learning',
                'Artificial Intelligence']

    token.fit_on_texts(fit_text)

    # Number of documents (sentences) in the corpus
    print(f"Number of sentences in the corpus: {token.document_count}")
    print(f"Word frequencies: {token.word_counts}")
    print(f"Word index: {token.word_index}")
    print(f"Number of sentences containing each word: {token.word_docs}")

Output:

    Number of sentences in the corpus: 4
    Word frequencies: OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])
    Word index: {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
    Number of sentences containing each word: defaultdict(<class 'int'>, {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'intelligence': 1, 'artificial': 1})
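To see what the four attributes mean, the same numbers can be recomputed without Keras. The sketch below is a rough approximation, not the library's code: `fit` is a hypothetical helper, it only lowercases and splits on whitespace (no punctuation filtering), and it assumes the index is assigned by descending frequency with ties kept in first-seen order, which is what the output above suggests.

```python
from collections import Counter


def fit(texts):
    word_counts = Counter()  # total occurrences of each word
    word_docs = Counter()    # number of documents containing each word
    for text in texts:
        words = text.lower().split()
        word_counts.update(words)
        word_docs.update(set(words))  # count each word once per document

    # Sort by count, highest first; sorted() is stable, so ties keep
    # the order in which the words were first seen.
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

    document_count = len(texts)
    return word_counts, word_docs, word_index, document_count


fit_text = ['Machine Learning Knowledge',
            'Machine Learning',
            'Deep Learning',
            'Artificial Intelligence']

word_counts, word_docs, word_index, document_count = fit(fit_text)

print(document_count)     # 4
print(dict(word_counts))  # {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'artificial': 1, 'intelligence': 1}
print(word_index)         # {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
```

Here word_counts and word_docs happen to be equal because no word appears twice in the same document.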