image์์ ๊ฐ pixel์ ์ด๋ฏธ ์ซ์์์ง๋ง, text์ ๊ฒฝ์ฐ ๊ทธ๋ ์ง ์๋ค. ๊ทธ๋ ๋ค๋ฉด text๋ฅผ NN์ ๋ฃ์ด์ฃผ๊ธฐ ์ํด์๋ ์ด๋ป๊ฒ ํด์ํ ๊น?
Q. ๋ง์ฝ ๊ฐ letter๋ฅผ ASCII๋ก ๋ฐ๊พธ์ด ์ซ์๋ก ๋ง๋ค์ด์ค๋ค๋ฉด?
A. SILENT๋ LISTEN์ด๋ ๋จ์ด๋ฅผ ๊ตฌ์ฑํ๋ ์ซ์๊ฐ ๋์ผํ๋ค.
Q. ๋ง์ฝ ๋จ์ด ๋จ์๋ก ์ซ์๋ฅผ ๋ถ์ฌํด์ค๋ค๋ฉด?
A. ๋จ์ด๊ฐ์ ์ ์ฌ๋๋ฅผ ํ์ ํ๊ธฐ๊ฐ ์ฝ์ง ์๋ค. ์๋ํ๋ฉด ๊ฐ ๋จ์ด์ ๋ถ์ฌ๋๋ ์ซ์์๋ ์๋ฏธ๊ฐ ์๊ธฐ ๋๋ฌธ์ด๋ค.
๊ทธ๋ ๋ค๋ฉด ๋๋์ฒด ๋ฌธ์๋ฅผ ์ซ์๋ก ์ด๋ป๊ฒ ๋ฐ๊พธ์ด์ค์ผ ํ๋๊ฐ?
์ฌ๊ธฐ์๋ word2vec์ ๋ํด์ ์๊ฐํ๋ค.
Tokenizer
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
# ๋น๋ ๊ธฐ์ค top 100 words๋ฅผ ์ ๋ณ
tokenizer = Tokenizer(num_words=100)
sentences = [
'I love my dog',
'I love my cat',
'You love my dog!',
'Do you think my dog is amazing?'
]
# tokenizer์์๋ (1) punctuation ์๋ ์ ๊ฑฐ + (2) ๋ชจ๋ ๋ฌธ์๋ฅผ ์๋ฌธ์๋ก ํต์ผ
tokenizer.fit_on_texts(sentences)
# dict of key-value pairs (key: word, value: index)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
# {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
print(sequences) # tokens replacing the words
# [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
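The mapping can also be inverted to decode a sequence back into words. A minimal sketch using the word_index printed above (the reversed dict is built by hand here; recent Keras versions also expose it as tokenizer.index_word):

```python
# word_index as printed above
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
index_word = {v: k for k, v in word_index.items()}  # invert: index -> word

print(' '.join(index_word[i] for i in [4, 2, 1, 3]))  # i love my dog
```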
test_data = [
'I really love my dog',
'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
# [[4, 2, 1, 3], [1, 3, 1]]
# 1. really๊ฐ ์ฌ๋ผ์ก๋ค
# 2. loves, manatee๊ฐ ์ฌ๋ผ์ก๋ค
OOV token
- oov = out of vocabulary
- That is, words that never appeared in the training data.
- In the example above, words not seen during training were simply dropped. That is usually not a good idea.
- For such cases, you can set oov_token when creating the Tokenizer.
- Then, when texts_to_sequences is called on validation / test data, every unseen word is mapped to the OOV token's index (1 in this example).
sentences = [
'I love my dog',
'I love my cat',
'You love my dog!',
'Do you think my dog is amazing?'
]
test_data = [
'I really love my dog',
'my dog loves my manatee'
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
train_seq = tokenizer.texts_to_sequences(sentences)
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(test_seq)
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
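The OOV behavior can be mimicked in plain Python: any word missing from word_index falls back to the index of '<OOV>'. (to_seq below is a hypothetical helper that only approximates the tokenizer's real lowercasing/filtering.)

```python
# word_index as fit on the training sentences above
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def to_seq(sentence):
    # rough stand-in for texts_to_sequences: lowercase, split,
    # and map unknown words to the <OOV> index
    return [word_index.get(w, word_index['<OOV>'])
            for w in sentence.lower().split()]

print(to_seq('my dog loves my manatee'))  # [2, 4, 1, 2, 1]
```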
padding
- The process of making all sentences the same length.
- Sequences must be padded to a uniform length before they can be fed into the model as one tensor.
- The tool for this is tensorflow.keras.preprocessing.sequence.pad_sequences.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
train_data = [
'I love my dog',
'I love my cat',
'You love my dog!',
'Do you think my dog is amazing?'
]
test_data = [
'I really love my dog',
'my dog loves my manatee'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data)
word_index = tokenizer.word_index
train_seq = tokenizer.texts_to_sequences(train_data)
padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding
print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
print(padded)
# [[5 3 2 4 0]
# [5 3 2 7 0]
# [6 3 2 4 0]
# [8 6 9 2 4]]
# ์ฌ๊ธฐ์ ์์๋์ด์ผ ํ ์ฝ๋!
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding
- padding='post'
  - Controls where the padding goes (the default is 'pre', i.e. in front; here we changed it to pad at the end).
- maxlen=5
  - Sets the maximum length.
  - If not specified, sequences are padded to the length of the longest sentence in the data.
  - A sentence longer than maxlen (say, length 10) gets cut off at the front or the back.
- truncating='post'
  - For sentences longer than maxlen, this argument decides whether the front or the back is cut off.
  - Setting it to 'post' cuts the back (the default is 'pre', which cuts the front).
'๐ > Coursera_TF' ์นดํ ๊ณ ๋ฆฌ์ ๋ค๋ฅธ ๊ธ
[TF] return_sequences vs return_states (0) | 2021.01.11 |
---|---|
WEEK2 : NLP in Tensorflow (Embedding) (0) | 2021.01.11 |
WEEK4 : CNN in TensorFlow (multi-class classification) (0) | 2021.01.10 |
WEEK3 : CNN in TensorFlow (transfer learning, dropout) (0) | 2021.01.10 |
WEEK2 : CNN in TensorFlow (data augmentation) (0) | 2021.01.10 |