image์—์„œ ๊ฐ pixel์€ ์ด๋ฏธ ์ˆซ์ž์˜€์ง€๋งŒ, text์˜ ๊ฒฝ์šฐ ๊ทธ๋ ‡์ง€ ์•Š๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด text๋ฅผ NN์— ๋„ฃ์–ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์–ด๋–ป๊ฒŒ ํ•ด์•„ํ• ๊นŒ?

 

Q. ๋งŒ์•ฝ ๊ฐ letter๋ฅผ ASCII๋กœ ๋ฐ”๊พธ์–ด ์ˆซ์ž๋กœ ๋งŒ๋“ค์–ด์ค€๋‹ค๋ฉด?

A. SILENT๋‚˜ LISTEN์ด๋‚˜ ๋‹จ์–ด๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์ˆซ์ž๊ฐ€ ๋™์ผํ•˜๋‹ค.

 

Q. ๋งŒ์•ฝ ๋‹จ์–ด ๋‹จ์œ„๋กœ ์ˆซ์ž๋ฅผ ๋ถ€์—ฌํ•ด์ค€๋‹ค๋ฉด?

A. ๋‹จ์–ด๊ฐ„์˜ ์œ ์‚ฌ๋„๋ฅผ ํŒŒ์•…ํ•˜๊ธฐ๊ฐ€ ์‰ฝ์ง€ ์•Š๋‹ค. ์™œ๋ƒํ•˜๋ฉด ๊ฐ ๋‹จ์–ด์— ๋ถ€์—ฌ๋˜๋Š” ์ˆซ์ž์—๋Š” ์˜๋ฏธ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ๋„๋Œ€์ฒด ๋ฌธ์ž๋ฅผ ์ˆซ์ž๋กœ ์–ด๋–ป๊ฒŒ ๋ฐ”๊พธ์–ด์ค˜์•ผ ํ•˜๋Š”๊ฐ€?

์—ฌ๊ธฐ์„œ๋Š” word2vec์— ๋Œ€ํ•ด์„œ ์†Œ๊ฐœํ•œ๋‹ค.

 

Tokenizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

# ๋นˆ๋„ ๊ธฐ์ค€ top 100 words๋ฅผ ์„ ๋ณ„
tokenizer = Tokenizer(num_words=100)

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# tokenizer์—์„œ๋Š” (1) punctuation ์ž๋™ ์ œ๊ฑฐ + (2) ๋ชจ๋“  ๋ฌธ์ž๋ฅผ ์†Œ๋ฌธ์ž๋กœ ํ†ต์ผ
tokenizer.fit_on_texts(sentences)

# a dict of key, value pairs (key: word, value: index)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
# {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

print(sequences) # tokens replacing the words
# [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
# [[4, 2, 1, 3], [1, 3, 1]]
# 1. really๊ฐ€ ์‚ฌ๋ผ์กŒ๋‹ค
# 2. loves, manatee๊ฐ€ ์‚ฌ๋ผ์กŒ๋‹ค

 

OOV token

  • oov = out of vocabulary 
  • In other words, a word that never appeared in the training data.
  • In the example above, words unseen during training were simply dropped, which is usually not what we want.
  • For these cases, you can pass an oov_token to the Tokenizer.
  • The OOV token is assigned the index 1, and texts_to_sequences then maps every unseen word in the validation / test data to that index.
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
train_seq = tokenizer.texts_to_sequences(sentences)
test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(test_seq)
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
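To make the OOV mapping concrete, here is a tiny pure-Python sketch of what texts_to_sequences is doing (a simplified stand-in, not the real Keras implementation; the word_index values are copied from the printout above):

```python
# word_index as printed above, with '<OOV>' at index 1
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def to_sequence(sentence, word_index, oov_index=1):
    # lowercase and strip punctuation, like the Tokenizer does, then look up each word
    words = sentence.lower().replace('!', '').replace('?', '').split()
    # unknown words fall back to the OOV index instead of being dropped
    return [word_index.get(w, oov_index) for w in words]

print(to_sequence('I really love my dog', word_index))
# [5, 1, 3, 2, 4]  -- 'really' maps to the OOV index
print(to_sequence('my dog loves my manatee', word_index))
# [2, 4, 1, 2, 1]  -- 'loves' and 'manatee' map to the OOV index
```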

 

padding

  • ๋ฌธ์žฅ์˜ ๊ธธ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ณผ์ •
  • ๋ชจ๋ธ์— ๋„ฃ์–ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ฌธ์žฅ์˜ ๊ธธ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ๋งž์ถฐ์ฃผ์–ด์•ผ ํ•œ๋‹ค.
  • ์ด ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด tensorflow.keras.preprocessing.sequence.pad_sequences ์ด๋‹ค.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_data = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data)
word_index = tokenizer.word_index

train_seq = tokenizer.texts_to_sequences(train_data)

padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding

print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(padded)
# [[5 3 2 4 0]
#  [5 3 2 7 0]
#  [6 3 2 4 0]
#  [8 6 9 2 4]]
# ์—ฌ๊ธฐ์„œ ์•Œ์•„๋‘์–ด์•ผ ํ•  ์ฝ”๋“œ!

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding

 

  • padding='post'
    • Where the padding goes (the default is 'pre', i.e. zeros at the front; here we changed it to pad at the end).
  • maxlen=5
    • Sets the maximum length.
    • If omitted, every sequence is padded to the length of the longest sentence in the data.
    • A sentence longer than maxlen (say, length 10) gets cut off at the front or the back.
  • truncating='post'
    • When maxlen forces a long sentence to be cut, this argument decides whether tokens are dropped from the front or the back.
    • 'post' cuts from the back (the default is 'pre', which cuts from the front).

 

 

'๐Ÿ™‚ > Coursera_TF' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[TF] return_sequences vs return_states  (0) 2021.01.11
WEEK2 : NLP in Tensorflow (Embedding)  (0) 2021.01.11
WEEK4 : CNN in TensorFlow (multi-class classification)  (0) 2021.01.10
WEEK3 : CNN in TensorFlow (transfer learning, dropout)  (0) 2021.01.10
WEEK2 : CNN in TensorFlow (data augmentation)  (0) 2021.01.10

+ Recent posts