Monday, June 22, 2020

Machine Learning Foundations NLP(2): Using the Sequencing APIs

Using the Sequencing APIs

Sequencing converts a sentence into an array of token indices.
For example:

"I have a car"
"Car have a door"
 
Tokenize: [I:1] [have:2] [a:3] [car:4] [door:5]
 
then these two sentences can be represented as:
[1 2 3 4] 
[4 2 3 5]
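The mapping above can be sketched in plain Python (the word_index dict here is written by hand; Keras' Tokenizer builds it automatically):

```python
# Hand-written token map for the two example sentences above
word_index = {"i": 1, "have": 2, "a": 3, "car": 4, "door": 5}

def to_sequence(sentence, word_index):
    # Lower-case each word and look up its token index
    return [word_index[w] for w in sentence.lower().split()]

print(to_sequence("I have a car", word_index))     # [1, 2, 3, 4]
print(to_sequence("Car have a door", word_index))  # [4, 2, 3, 5]
```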

Sequencing is useful for representing sentence data as input to a neural network.

Code

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sentences data
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple and a banana'
]
# Create a tokenizer with at most 100 words, mapping unseen words to the <OOV> token
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
# Tokenize the sentence data
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# Print the word index built from the sentence data
# The more often a word is used, the lower its index
word_index = tokenizer.word_index
print(word_index)
# Convert the sentences to sequences of token indices
sequences = tokenizer.texts_to_sequences(sentences)
# Pad/truncate every sequence to length 5; uniform length is needed when training an NLP neural network
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)
# Try with words that the tokenizer wasn't fit on
# 'big' and 'like' are unseen words, so they are replaced with the <OOV> index (1)
test_data = [
'I have a big car',
'He like banana'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

Result:

{'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Word Index = {'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Sequences = [[4, 3, 2, 5], [4, 3, 2, 6], [4, 3, 2, 7], [8, 3, 2, 9, 10, 2, 11]]
Padded Sequences:
[[ 0 4 3 2 5]
[ 0 4 3 2 6]
[ 0 4 3 2 7]
[ 2 9 10 2 11]]
Test Sequence = [[4, 3, 2, 1, 5], [8, 1, 11]]
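For the curious, pad_sequences defaults to pre-padding and pre-truncating, which is why the long sentence above keeps only its last five tokens. A plain-Python sketch of that default behaviour (illustration only, not the Keras implementation):

```python
# Sketch of pad_sequences' defaults: padding='pre', truncating='pre'
def pad_pre(seq, maxlen, value=0):
    seq = seq[-maxlen:]  # pre-truncate: keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # pre-pad with zeros

print(pad_pre([4, 3, 2, 5], 5))             # [0, 4, 3, 2, 5]
print(pad_pre([8, 3, 2, 9, 10, 2, 11], 5))  # [2, 9, 10, 2, 11]
```

Passing padding='post' or truncating='post' to pad_sequences changes which end gets the zeros or loses tokens.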

Wednesday, June 17, 2020

Machine Learning Foundations NLP(1): Tokenization for Natural Language Processing

Tokenization for Natural Language Processing


Tokenization means breaking a sentence down into separate words, for example:

I have a car. -> I / have / a / car

TensorFlow provides a tokenization tool, Tokenizer, which makes it easy to tokenize input sentences.


Code:


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
# Sentences data
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple'
]
# Create a tokenizer with at most 10 words
tokenizer = Tokenizer(num_words = 10)
# Tokenize the sentence data
tokenizer.fit_on_texts(sentences)
# Print the word index built from the sentence data
# The more often a word is used, the lower its index
word_index = tokenizer.word_index
print(word_index)

Result:

{'have': 1, 'a': 2, 'i': 3, 'he': 4, 'apple': 5, 'bike': 6, 'pen': 7, 'car': 8}
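The frequency-based indexing can be sketched in plain Python (illustration only; ties between equally frequent words may be ordered differently than Keras' Tokenizer):

```python
from collections import Counter

def build_word_index(sentences):
    # Count every lower-cased word across all sentences
    counts = Counter(w for s in sentences for w in s.lower().split())
    # More frequent words get lower indices; indices start at 1
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

sentences = ['I have a car', 'I have a pen', 'I have a bike', 'He have a apple']
print(build_word_index(sentences))
```

Here 'have' and 'a' each appear four times and 'i' three times, so they receive the lowest indices, matching the result above.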



Reference:

Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing


