Using the Sequencing APIs
Sequencing converts sentences into arrays of token indices. For example, take the two sentences "I have a car" and "Car have a door".
Tokenized: [I:1] [have:2] [a:3] [car:4] [door:5]
These two sentences can then be represented as [1 2 3 4] and [4 2 3 5]. Sequencing is a useful way to represent sentence data as input to a neural network.
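The idea above can be sketched without any library. This is a minimal illustration of the concept, not the Keras API used below (Keras additionally orders indices by word frequency):

```python
def tokenize(sentences):
    """Assign each new word an index in order of first appearance."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def to_sequences(sentences, word_index):
    """Replace each word with its index."""
    return [[word_index[w] for w in s.lower().split()] for s in sentences]

sentences = ["I have a car", "Car have a door"]
word_index = tokenize(sentences)
print(word_index)                            # {'i': 1, 'have': 2, 'a': 3, 'car': 4, 'door': 5}
print(to_sequences(sentences, word_index))   # [[1, 2, 3, 4], [4, 2, 3, 5]]
```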
Code
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sentence data (a list, so the order of the sequences below is preserved)
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple and a banana'
]

# Create a tokenizer with at most 100 tokens; unseen words map to the OOV index
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Fit the tokenizer on the sentence data
tokenizer.fit_on_texts(sentences)

# Print the word index built from the sentence data.
# The more often a word is used, the lower its index.
word_index = tokenizer.word_index
print(word_index)

# Convert the sentences to sequences of token indices
sequences = tokenizer.texts_to_sequences(sentences)

# Pad every sequence to length 5, which helps when training an NLP neural network
padded = pad_sequences(sequences, maxlen=5)

print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)

# Try sentences with words the tokenizer wasn't fit on:
# 'big' and 'like' are unseen, so they are replaced with <OOV>:1
test_data = [
    'I have a big car',
    'He like banana'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
Result:
{'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Word Index = {'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Sequences = [[4, 3, 2, 5], [4, 3, 2, 6], [4, 3, 2, 7], [8, 3, 2, 9, 10, 2, 11]]
Padded Sequences:
[[ 0 4 3 2 5]
[ 0 4 3 2 6]
[ 0 4 3 2 7]
[ 2 9 10 2 11]]
Test Sequence = [[4, 3, 2, 1, 5], [8, 1, 11]]
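Note that by default `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`), which is why the zeros appear at the start of each row and the long sentence lost its first two tokens. A small sketch of the alternatives, assuming TensorFlow is installed:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[4, 3, 2, 5], [8, 3, 2, 9, 10, 2, 11]]

# Default: zeros and truncation at the front
print(pad_sequences(seqs, maxlen=5))
# [[ 0  4  3  2  5]
#  [ 2  9 10  2 11]]

# Zeros and truncation at the back instead
print(pad_sequences(seqs, maxlen=5, padding='post', truncating='post'))
# [[ 4  3  2  5  0]
#  [ 8  3  2  9 10]]
```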