  • Tokenizer
    AI/DeepLearning 2022. 9. 11. 13:00

    What does tokenization mean?!

    • Tokenization splits a given text into small chunks, called tokens
    • There are 3 major types (a small Python sketch follows the examples below)
    1. Word Tokenization

    ==> Word tokenization usually splits the text on spaces

    ex) we will run ==> we, will, run

    2. Character Tokenization

    ex) Relaxing ==> R-e-l-a-x-i-n-g

    3. Subword Tokenization

    ex) Relaxing ==> Relax-ing
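
    The three splitting strategies can be illustrated with plain Python. The snippet below is only a sketch: the word and character splits mirror the examples above, while naive_subword_split is a hypothetical hard-coded rule, since real subword tokenizers (e.g. BPE, WordPiece) learn their splits from data.

    text = "we will run"

    # 1. Word tokenization: split on whitespace
    word_tokens = text.split()
    print(word_tokens)                      # ['we', 'will', 'run']

    # 2. Character tokenization: every character becomes a token
    char_tokens = list("Relaxing")
    print(char_tokens)                      # ['R', 'e', 'l', 'a', 'x', 'i', 'n', 'g']

    # 3. Subword tokenization: split a word into smaller meaningful pieces.
    #    This rule-based split is only for illustration (hypothetical);
    #    real subword tokenizers learn the set of pieces from a corpus.
    def naive_subword_split(word, suffixes=("ing", "ed", "s")):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                return [word[:-len(suffix)], suffix]
        return [word]

    print(naive_subword_split("Relaxing"))  # ['Relax', 'ing']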

    Keras Tokenizer Class

    • is used for vectorizing a text corpus
    • each text is converted into a sequence of integers, or into a vector with a coefficient for each token (for example, binary values)

    Methods of Keras Tokenizer Class

    1. fit_on_texts
    2. texts_to_sequences
    3. texts_to_matrix
    4. sequences_to_matrix

    fit_on_texts

    • is used to update the internal vocabulary from the list of texts
    • It must be called before using the other methods such as texts_to_sequences or texts_to_matrix
    • attributes:

    word_counts : a dictionary mapping each word to how many times it appears in the corpus.

    word_docs : a dictionary mapping each word to the number of documents that contain it.

    word_index : a dictionary assigning a unique integer index to each word.

    document_count : the total number of documents used to fit the tokenizer.

     

    from keras.preprocessing.text import Tokenizer

    token = Tokenizer()

    # A corpus of 4 documents
    fit_text = ['Machine Learning Knowledge',
                'Machine Learning',
                'Deep Learning',
                'Artificial Intelligence']

    token.fit_on_texts(fit_text)

    # number of documents in the corpus
    print(f"number of documents in the corpus: {token.document_count}")
    print(f"word frequencies: {token.word_counts}")
    print(f"word indices: {token.word_index}")
    print(f"number of documents each word appears in: {token.word_docs}")
    
    
    number of documents in the corpus: 4
    word frequencies: OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])
    word indices: {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
    number of documents each word appears in: defaultdict(<class 'int'>, {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'intelligence': 1, 'artificial': 1})
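
    The example above only exercises fit_on_texts. Below is a minimal sketch of the other three methods, refitting the same corpus so the snippet runs on its own; test_text is a made-up input list, and the values in the comments follow from the word_index shown above.

    from keras.preprocessing.text import Tokenizer

    token = Tokenizer()
    fit_text = ['Machine Learning Knowledge',
                'Machine Learning',
                'Deep Learning',
                'Artificial Intelligence']
    token.fit_on_texts(fit_text)

    # test_text is a hypothetical input list for illustration
    test_text = ['Machine Learning', 'Deep Learning Knowledge']

    # texts_to_sequences: each word is replaced by its integer index from word_index
    sequences = token.texts_to_sequences(test_text)
    print(sequences)                 # [[2, 1], [4, 1, 3]]

    # texts_to_matrix: one row per document, one column per vocabulary index
    # (mode can be 'binary', 'count', 'freq' or 'tfidf')
    matrix = token.texts_to_matrix(test_text, mode='binary')
    print(matrix.shape)              # (2, 7): 6 vocabulary words + reserved index 0

    # sequences_to_matrix: the same idea, but starting from the integer sequences
    count_matrix = token.sequences_to_matrix(sequences, mode='count')
    print(count_matrix)

    Note that the matrix has 7 columns even though the vocabulary has 6 words: the Tokenizer reserves index 0 (it is typically used for padding), so word indices start at 1.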

    ## Source

    https://machinelearningknowledge.ai/keras-tokenizer-tutorial-with-examples-for-fit_on_texts-texts_to_sequences-texts_to_matrix-sequences_to_matrix/#Introduction
