Additional things I learned
- Text data overfits more easily than image data.
- This is because the validation set contains many vocabulary items that never appear in the training set.
- Words that show up only at validation time carry no learnable signal, so the train/validation gap widens into overfitting (see the sketch below).
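A minimal sketch (toy, made-up sentences) of why this happens with the Keras Tokenizer used in these notes: any word not seen during fit_on_texts collapses to the <OOV> token, so validation text carries less usable information than training text.
from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["good movie", "bad movie"]   # hypothetical training sentences
val_texts = ["good storm"]                  # "storm" never appears in training

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
print(tokenizer.texts_to_sequences(val_texts))  # "storm" is mapped to the <OOV> index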
RNN
www.coursera.org/lecture/nlp-sequence-models/deep-rnns-ehs0S
LSTM
www.coursera.org/lecture/nlp-sequence-models/long-short-term-memory-lstm-KXoay
1-layer LSTM
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
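A quick sanity-check sketch (hypothetical input sizes) showing that Bidirectional concatenates the forward and backward LSTM outputs, so the Dense layer above actually receives a 128-dimensional vector even though the LSTM has 64 units:
import tensorflow as tf

x = tf.random.normal((2, 10, 64))  # (batch, timesteps, features), made-up sizes
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
print(bilstm(x).shape)  # (2, 128): 64 forward units concatenated with 64 backward units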

2-layer LSTM
- The first LSTM layer must be given return_sequences=True (see the shape sketch after this block).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
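A minimal sketch (made-up sizes) of why the first layer needs return_sequences=True: a stacked LSTM expects a 3D (batch, timesteps, features) input, and only return_sequences=True keeps the per-timestep outputs that provide it.
import tensorflow as tf

x = tf.random.normal((2, 10, 8))  # (batch, timesteps, features), made-up sizes
print(tf.keras.layers.LSTM(4)(x).shape)                         # (2, 4): only the last step
print(tf.keras.layers.LSTM(4, return_sequences=True)(x).shape)  # (2, 10, 4): one output per step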

Text CNN
- Conv1D can be used to run a convolution over text (its main advantage over an LSTM is speed).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),  # 128 filters, kernel size 5
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
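A rough sketch (hypothetical vocab_size, embedding_size, and max_length values) of the shapes flowing through this model: with the default 'valid' padding, Conv1D shortens the sequence to max_length - kernel_size + 1 steps, and GlobalAveragePooling1D then averages those steps into a single 128-dimensional vector.
import tensorflow as tf

vocab_size, embedding_size, max_length = 1000, 16, 120  # made-up values

probe = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_size),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
])
dummy_ids = tf.zeros((2, max_length), dtype=tf.int32)  # a fake batch of padded token ids
print(probe(dummy_ids).shape)  # (2, 128): 120 -> 116 conv steps, then pooled away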

Example
import json
import tensorflow as tf
import csv
import random
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import regularizers
embedding_dim = 100
max_length = 16
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 160000
test_portion = .1
corpus = []

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv \
    -O /tmp/training_cleaned.csv
num_sentences = 0
with open("/tmp/training_cleaned.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        # The text is in row[5]; the label in row[0] is '0' or '4'.
        # Map '0' to label 0, anything else to 1, and count the sentences in num_sentences.
        list_item = []
        text = row[5]
        label = 0 if row[0] == "0" else 1
        list_item.append(text)
        list_item.append(label)
        corpus.append(list_item)
        num_sentences = num_sentences + 1
sentences = []
labels = []
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
vocab_size = len(word_index)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

split = int(test_portion * training_size)
test_sequences = padded[0:split]
training_sequences = padded[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
# Note this is the 100-dimension version of GloVe from Stanford.
# I unzipped and hosted it on my site to make this notebook easier.
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt \
    -O /tmp/glove.6B.100d.txt

embeddings_index = {}
with open('/tmp/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Row i of the matrix holds the GloVe vector for the word with index i
# (rows stay zero for words without a GloVe entry).
embeddings_matrix = np.zeros((vocab_size + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    # Experiment with combining different layer types, such as convolutions and LSTMs.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

# Convert everything to numpy arrays (the labels are still plain Python lists at this point).
training_padded = np.array(training_sequences)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences)
testing_labels = np.array(test_labels)

num_epochs = 50
history = model.fit(training_padded, training_labels, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels), verbose=2)
print("Training Complete")
Alternatively, a Text CNN-style model (Conv1D plus pooling in front of an LSTM):
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
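Not in the original notebook, but a common follow-up: a minimal sketch for plotting the accuracy curves from the history object returned by model.fit above, which makes the train/validation gap (the overfitting discussed at the top) visible. The dictionary keys follow the metric name passed to compile ('acc'/'val_acc' for the LSTM model above; 'accuracy'/'val_accuracy' if you compile with metrics=['accuracy']).
import matplotlib.pyplot as plt

plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()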