  • Tokenizer
    AI/DeepLearning 2022. 9. 11. 13:00

    What does tokenization mean?!

    • Tokenization splits a given text into small chunks, called tokens
    • There are 3 major types (a small Python sketch follows the examples below)
    1. Word Tokenization

    ==> Word tokenization usually splits the text on spaces

    ex) we will run ==> we, will, run

    2. Character Tokenization

    ex) Relaxing ==> R-e-l-a-x-i-n-g

    3. Subword Tokenization

    ex) Relaxing ==> Relax-ing
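
    The three splitting strategies can be illustrated with plain Python. The snippet below is only a sketch: the word and character splits mirror the examples above, while naive_subword_split is a hypothetical hard-coded rule, since real subword tokenizers (e.g. BPE, WordPiece) learn their splits from data.

    text = "we will run"

    # 1. Word tokenization: split on whitespace
    word_tokens = text.split()
    print(word_tokens)                      # ['we', 'will', 'run']

    # 2. Character tokenization: every character becomes a token
    char_tokens = list("Relaxing")
    print(char_tokens)                      # ['R', 'e', 'l', 'a', 'x', 'i', 'n', 'g']

    # 3. Subword tokenization: split a word into smaller meaningful pieces.
    #    This rule-based split is only for illustration (hypothetical);
    #    real subword tokenizers learn the set of pieces from a corpus.
    def naive_subword_split(word, suffixes=("ing", "ed", "s")):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                return [word[:-len(suffix)], suffix]
        return [word]

    print(naive_subword_split("Relaxing"))  # ['Relax', 'ing']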

    Keras Tokenizer Class

    • is used for vectorizing a text corpus
    • each text is converted into a sequence of integers, or into a vector with a coefficient for each token (for example, binary values)

    Methods of Keras Tokenizer Class

    1. fit_on_texts
    2. texts_to_sequences
    3. texts_to_matrix
    4. sequences_to_matrix

    fit_on_texts

    • is used to update the internal vocabulary from the list of texts
    • It must be called before using the other methods such as texts_to_sequences or texts_to_matrix
    • attributes:

    word_counts : a dictionary mapping each word to how many times it appears in the corpus.

    word_docs : a dictionary mapping each word to the number of documents that contain it.

    word_index : a dictionary assigning a unique integer index to each word.

    document_count : the total number of documents used to fit the tokenizer.

     

    from keras.preprocessing.text import Tokenizer

    token = Tokenizer()

    # A corpus of 4 documents
    fit_text = ['Machine Learning Knowledge',
                'Machine Learning',
                'Deep Learning',
                'Artificial Intelligence']

    token.fit_on_texts(fit_text)

    # number of documents in the corpus
    print(f"number of documents in the corpus: {token.document_count}")
    print(f"word frequencies: {token.word_counts}")
    print(f"word indices: {token.word_index}")
    print(f"number of documents each word appears in: {token.word_docs}")
    
    
    number of documents in the corpus: 4
    word frequencies: OrderedDict([('machine', 2), ('learning', 3), ('knowledge', 1), ('deep', 1), ('artificial', 1), ('intelligence', 1)])
    word indices: {'learning': 1, 'machine': 2, 'knowledge': 3, 'deep': 4, 'artificial': 5, 'intelligence': 6}
    number of documents each word appears in: defaultdict(<class 'int'>, {'machine': 2, 'learning': 3, 'knowledge': 1, 'deep': 1, 'intelligence': 1, 'artificial': 1})
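
    The example above only exercises fit_on_texts. Below is a minimal sketch of the other three methods, refitting the same corpus so the snippet runs on its own; test_text is a made-up input list, and the values in the comments follow from the word_index shown above.

    from keras.preprocessing.text import Tokenizer

    token = Tokenizer()
    fit_text = ['Machine Learning Knowledge',
                'Machine Learning',
                'Deep Learning',
                'Artificial Intelligence']
    token.fit_on_texts(fit_text)

    # test_text is a hypothetical input list for illustration
    test_text = ['Machine Learning', 'Deep Learning Knowledge']

    # texts_to_sequences: each word is replaced by its integer index from word_index
    sequences = token.texts_to_sequences(test_text)
    print(sequences)                 # [[2, 1], [4, 1, 3]]

    # texts_to_matrix: one row per document, one column per vocabulary index
    # (mode can be 'binary', 'count', 'freq' or 'tfidf')
    matrix = token.texts_to_matrix(test_text, mode='binary')
    print(matrix.shape)              # (2, 7): 6 vocabulary words + reserved index 0

    # sequences_to_matrix: the same idea, but starting from the integer sequences
    count_matrix = token.sequences_to_matrix(sequences, mode='count')
    print(count_matrix)

    Note that the matrix has 7 columns even though the vocabulary has 6 words: the Tokenizer reserves index 0 (it is typically used for padding), so word indices start at 1.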

    ## Source

    https://machinelearningknowledge.ai/keras-tokenizer-tutorial-with-examples-for-fit_on_texts-texts_to_sequences-texts_to_matrix-sequences_to_matrix/#Introduction
