
토크나이저(Tokenizer)

yunajoe 2022. 9. 11. 13:00

What does Tokenization mean?

  • splits a given text into small chunks, or tokens
  • 3 major types (see the sketch after the examples below):
  1. Word Tokenization

==> Word tokenization usually splits on whitespace

ex) we will run ==> we, will, run

  2. Character Tokenization

ex) Relaxing ==> R-e-l-a-x-i-n-g

  3. Subword Tokenization

ex) Relaxing ==> Relax-ing
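
To make the three styles concrete, here is a minimal plain-Python sketch. The subword split is hard-coded to mirror the example above; real subword tokenizers such as BPE or WordPiece learn their pieces from data.

text = "we will run"

# 1. Word tokenization: split on whitespace
print(text.split())        # ['we', 'will', 'run']

# 2. Character tokenization: every character becomes a token
print(list("Relaxing"))    # ['R', 'e', 'l', 'a', 'x', 'i', 'n', 'g']

# 3. Subword tokenization: words break into meaningful pieces
#    (hard-coded here; normally learned from a corpus)
print(["Relax", "ing"])    # ['Relax', 'ing']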

Keras Tokenizer Class

  • is used for vectorizing a text corpus
  • each text is converted into a sequence of integers, or into a vector with a coefficient for each token (e.g. binary values, counts, or TF-IDF scores)

Methods of Keras Tokenizer Class

  1. fit_on_texts
  2. texts_to_sequences
  3. texts_to_matrix
  4. sequences_to_matrix

fit_on_texts

  • updates the internal vocabulary based on the given list of texts
  • must be called before using the other methods, texts_to_sequences or texts_to_matrix
  • attributes:

word_counts : a dictionary mapping each word to the number of times it appears in the training texts.

word_docs : a dictionary mapping each word to the number of documents that contain it.

word_index : a dictionary assigning a unique integer index to each word.

document_count : an integer giving the total number of documents used to fit the tokenizer.

 

from keras.preprocessing.text import Tokenizer

token = Tokenizer()

# Define a corpus of 4 documents
fit_text = ['Machine Learning Knowledge',
            'Machine Learning',
            'Deep Learning',
            'Artificial Intelligence']

token.fit_on_texts(fit_text)

# Inspect the attributes populated by fit_on_texts
print(f"number of documents in the corpus: {token.document_count}")
print(f"word frequencies across the corpus: {token.word_counts}")
print(f"index of each word: {token.word_index}")
print(f"number of documents containing each word: {token.word_docs}")


number of documents in the corpus: 4
word frequencies across the corpus: OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])
index of each word: {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
number of documents containing each word: defaultdict(<class 'int'>, {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'intelligence': 1, 'artificial': 1})
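
The remaining methods build on the vocabulary fitted above. A short sketch continuing with the same token object (by default the matrix has len(word_index) + 1 columns, with column 0 reserved):

# texts_to_sequences: replace each word with its integer from word_index
print(token.texts_to_sequences(['Machine Learning', 'Deep Learning Knowledge']))
# [[2, 1], [4, 1, 3]]

# texts_to_matrix: one row per document; mode can be
# 'binary' (default), 'count', 'freq', or 'tfidf'
print(token.texts_to_matrix(['Machine Learning'], mode='binary'))
# [[0. 1. 1. 0. 0. 0. 0.]]  -> 1s at columns 1 ('learning') and 2 ('machine')

# sequences_to_matrix: same idea, starting from integer sequences
print(token.sequences_to_matrix([[2, 1]], mode='count'))
# [[0. 1. 1. 0. 0. 0. 0.]]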

## Source

https://machinelearningknowledge.ai/keras-tokenizer-tutorial-with-examples-for-fit_on_texts-texts_to_sequences-texts_to_matrix-sequences_to_matrix/#Introduction