Additional things I learned
- Text data overfits more easily than image data.
- This is because the validation set contains many vocabulary items that never appear in the training set.
- Words that show up only at validation time carry no learnable signal, so the train/validation gap widens into overfitting (see the sketch below).
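A minimal sketch (toy, made-up sentences) of why this happens with the Keras Tokenizer used in these notes: any word not seen during fit_on_texts collapses to the <OOV> token, so validation text carries less usable information than training text.
from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["good movie", "bad movie"]   # hypothetical training sentences
val_texts = ["good storm"]                  # "storm" never appears in training

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
print(tokenizer.texts_to_sequences(val_texts))  # "storm" is mapped to the <OOV> index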
RNN
www.coursera.org/lecture/nlp-sequence-models/deep-rnns-ehs0S
LSTM
www.coursera.org/lecture/nlp-sequence-models/long-short-term-memory-lstm-KXoay
1-layer LSTM
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
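A quick sanity-check sketch (hypothetical input sizes) showing that Bidirectional concatenates the forward and backward LSTM outputs, so the Dense layer above actually receives a 128-dimensional vector even though the LSTM has 64 units:
import tensorflow as tf

x = tf.random.normal((2, 10, 64))  # (batch, timesteps, features), made-up sizes
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
print(bilstm(x).shape)  # (2, 128): 64 forward units concatenated with 64 backward units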

2-layer LSTM
- The first LSTM layer must be given return_sequences=True (see the shape sketch after this block).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
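A minimal sketch (made-up sizes) of why the first layer needs return_sequences=True: a stacked LSTM expects a 3D (batch, timesteps, features) input, and only return_sequences=True keeps the per-timestep outputs that provide it.
import tensorflow as tf

x = tf.random.normal((2, 10, 8))  # (batch, timesteps, features), made-up sizes
print(tf.keras.layers.LSTM(4)(x).shape)                         # (2, 4): only the last step
print(tf.keras.layers.LSTM(4, return_sequences=True)(x).shape)  # (2, 10, 4): one output per step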

Text CNN
- Conv1D can be used to run a convolution over text (its main advantage over an LSTM is speed).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_size, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),  # 128 filters, kernel size 5
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
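A rough sketch (hypothetical vocab_size, embedding_size, and max_length values) of the shapes flowing through this model: with the default 'valid' padding, Conv1D shortens the sequence to max_length - kernel_size + 1 steps, and GlobalAveragePooling1D then averages those steps into a single 128-dimensional vector.
import tensorflow as tf

vocab_size, embedding_size, max_length = 1000, 16, 120  # made-up values

probe = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_size),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
])
dummy_ids = tf.zeros((2, max_length), dtype=tf.int32)  # a fake batch of padded token ids
print(probe(dummy_ids).shape)  # (2, 128): 120 -> 116 conv steps, then pooled away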

Example
import json
import tensorflow as tf
import csv
import random
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import regularizers
embedding_dim = 100
max_length = 16
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 160000
test_portion = .1
corpus = []

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv \
    -O /tmp/training_cleaned.csv
num_sentences = 0
with open("/tmp/training_cleaned.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        # The text is in row[5]; the label in row[0] is '0' or '4'.
        # Map '0' to label 0, anything else to 1, and count the sentences in num_sentences.
        list_item = []
        text = row[5]
        label = 0 if row[0] == "0" else 1
        list_item.append(text)
        list_item.append(label)
        corpus.append(list_item)
        num_sentences = num_sentences + 1
sentences = []
labels = []
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
vocab_size = len(word_index)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

split = int(test_portion * training_size)
test_sequences = padded[0:split]
training_sequences = padded[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
# Note this is the 100-dimension version of GloVe from Stanford.
# I unzipped and hosted it on my site to make this notebook easier.
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt \
    -O /tmp/glove.6B.100d.txt

embeddings_index = {}
with open('/tmp/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Row i of the matrix holds the GloVe vector for the word with index i
# (rows stay zero for words without a GloVe entry).
embeddings_matrix = np.zeros((vocab_size + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    # Experiment with combining different layer types, such as convolutions and LSTMs.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

# Convert everything to numpy arrays (the labels are still plain Python lists at this point).
training_padded = np.array(training_sequences)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences)
testing_labels = np.array(test_labels)

num_epochs = 50
history = model.fit(training_padded, training_labels, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels), verbose=2)
print("Training Complete")
Alternatively, a Text CNN-style model (Conv1D plus pooling in front of an LSTM):
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
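Not in the original notebook, but a common follow-up: a minimal sketch for plotting the accuracy curves from the history object returned by model.fit above, which makes the train/validation gap (the overfitting discussed at the top) visible. The dictionary keys follow the metric name passed to compile ('acc'/'val_acc' for the LSTM model above; 'accuracy'/'val_accuracy' if you compile with metrics=['accuracy']).
import matplotlib.pyplot as plt

plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()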