image์—์„œ ๊ฐ pixel์€ ์ด๋ฏธ ์ˆซ์ž์˜€์ง€๋งŒ, text์˜ ๊ฒฝ์šฐ ๊ทธ๋ ‡์ง€ ์•Š๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด text๋ฅผ NN์— ๋„ฃ์–ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ๋Š” ์–ด๋–ป๊ฒŒ ํ•ด์•„ํ• ๊นŒ?

 

Q. ๋งŒ์•ฝ ๊ฐ letter๋ฅผ ASCII๋กœ ๋ฐ”๊พธ์–ด ์ˆซ์ž๋กœ ๋งŒ๋“ค์–ด์ค€๋‹ค๋ฉด?

A. SILENT๋‚˜ LISTEN์ด๋‚˜ ๋‹จ์–ด๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์ˆซ์ž๊ฐ€ ๋™์ผํ•˜๋‹ค.

 

Q. ๋งŒ์•ฝ ๋‹จ์–ด ๋‹จ์œ„๋กœ ์ˆซ์ž๋ฅผ ๋ถ€์—ฌํ•ด์ค€๋‹ค๋ฉด?

A. ๋‹จ์–ด๊ฐ„์˜ ์œ ์‚ฌ๋„๋ฅผ ํŒŒ์•…ํ•˜๊ธฐ๊ฐ€ ์‰ฝ์ง€ ์•Š๋‹ค. ์™œ๋ƒํ•˜๋ฉด ๊ฐ ๋‹จ์–ด์— ๋ถ€์—ฌ๋˜๋Š” ์ˆซ์ž์—๋Š” ์˜๋ฏธ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ๋„๋Œ€์ฒด ๋ฌธ์ž๋ฅผ ์ˆซ์ž๋กœ ์–ด๋–ป๊ฒŒ ๋ฐ”๊พธ์–ด์ค˜์•ผ ํ•˜๋Š”๊ฐ€?

์—ฌ๊ธฐ์„œ๋Š” word2vec์— ๋Œ€ํ•ด์„œ ์†Œ๊ฐœํ•œ๋‹ค.

 

Tokenizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

# ๋นˆ๋„ ๊ธฐ์ค€ top 100 words๋ฅผ ์„ ๋ณ„
tokenizer = Tokenizer(num_words=100)

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# tokenizer์—์„œ๋Š” (1) punctuation ์ž๋™ ์ œ๊ฑฐ + (2) ๋ชจ๋“  ๋ฌธ์ž๋ฅผ ์†Œ๋ฌธ์ž๋กœ ํ†ต์ผ
tokenizer.fit_on_texts(sentences)

# a dict of key, value pairs (key: word, value: index)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
# {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

print(sequences) # tokens replacing the words
# [[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)
# [[4, 2, 1, 3], [1, 3, 1]]
# 1. really๊ฐ€ ์‚ฌ๋ผ์กŒ๋‹ค
# 2. loves, manatee๊ฐ€ ์‚ฌ๋ผ์กŒ๋‹ค

 

OOV token

  • oov = out of vocabulary 
  • In other words, a word that never appeared in the training data.
  • In the example above, words unseen during training were simply dropped, which is usually not what we want.
  • For these cases, you can pass an oov_token to the Tokenizer.
  • The OOV token is assigned the index 1, and texts_to_sequences then maps every unseen word in the validation / test data to that index.
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
train_seq = tokenizer.texts_to_sequences(sentences)
test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(test_seq)
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
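To make the OOV mapping concrete, here is a tiny pure-Python sketch of what texts_to_sequences is doing (a simplified stand-in, not the real Keras implementation; the word_index values are copied from the printout above):

```python
# word_index as printed above, with '<OOV>' at index 1
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def to_sequence(sentence, word_index, oov_index=1):
    # lowercase and strip punctuation, like the Tokenizer does, then look up each word
    words = sentence.lower().replace('!', '').replace('?', '').split()
    # unknown words fall back to the OOV index instead of being dropped
    return [word_index.get(w, oov_index) for w in words]

print(to_sequence('I really love my dog', word_index))
# [5, 1, 3, 2, 4]  -- 'really' maps to the OOV index
print(to_sequence('my dog loves my manatee', word_index))
# [2, 4, 1, 2, 1]  -- 'loves' and 'manatee' map to the OOV index
```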

 

padding

  • ๋ฌธ์žฅ์˜ ๊ธธ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ณผ์ •
  • ๋ชจ๋ธ์— ๋„ฃ์–ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ฌธ์žฅ์˜ ๊ธธ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ๋งž์ถฐ์ฃผ์–ด์•ผ ํ•œ๋‹ค.
  • ์ด ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด tensorflow.keras.preprocessing.sequence.pad_sequences ์ด๋‹ค.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_data = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(train_data)
word_index = tokenizer.word_index

train_seq = tokenizer.texts_to_sequences(train_data)

padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding

print(word_index)
# {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

print(train_seq)
# [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(padded)
# [[5 3 2 4 0]
#  [5 3 2 7 0]
#  [6 3 2 4 0]
#  [8 6 9 2 4]]
# ์—ฌ๊ธฐ์„œ ์•Œ์•„๋‘์–ด์•ผ ํ•  ์ฝ”๋“œ!

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(train_seq, padding='post', maxlen=5, truncating='post') # padding

 

  • padding='post'
    • Where the padding goes (the default is 'pre', i.e. zeros at the front; here we changed it to pad at the end).
  • maxlen=5
    • Sets the maximum length.
    • If omitted, every sequence is padded to the length of the longest sentence in the data.
    • A sentence longer than maxlen (say, length 10) gets cut off at the front or the back.
  • truncating='post'
    • When maxlen forces a long sentence to be cut, this argument decides whether tokens are dropped from the front or the back.
    • 'post' cuts from the back (the default is 'pre', which cuts from the front).

 

 

'๐Ÿ™‚ > Coursera_TF' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[TF] return_sequences vs return_states  (0) 2021.01.11
WEEK2 : NLP in Tensorflow (Embedding)  (0) 2021.01.11
WEEK4 : CNN in TensorFlow (multi-class classification)  (0) 2021.01.10
WEEK3 : CNN in TensorFlow (transfer learning, dropout)  (0) 2021.01.10
WEEK2 : CNN in TensorFlow (data augmentation)  (0) 2021.01.10

+ Recent posts