Monday, June 22, 2020

Machine Learning Foundations NLP(2): Using the Sequencing APIs

Using the Sequencing APIs

Sequencing converts a sentence into an array of token indices.
For example:

"I have a car"
"Car have a door"
 
Tokenize: [I:1] [have:2] [a:3] [car:4] [door:5]
 
then these two sentences can be represented as:
[1 2 3 4] 
[4 2 3 5]
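The mapping above can be sketched in plain Python (the word_index dict here is written by hand; Keras' Tokenizer builds it automatically):

```python
# Hand-written token map for the two example sentences above
word_index = {"i": 1, "have": 2, "a": 3, "car": 4, "door": 5}

def to_sequence(sentence, word_index):
    # Lower-case each word and look up its token index
    return [word_index[w] for w in sentence.lower().split()]

print(to_sequence("I have a car", word_index))     # [1, 2, 3, 4]
print(to_sequence("Car have a door", word_index))  # [4, 2, 3, 5]
```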

Sequencing is useful for representing sentence data as input to a neural network.

Code

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Sentences data
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple and a banana'
]
# Create a tokenizer with at most 100 words, mapping unseen words to the <OOV> token
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
# Tokenize the sentence data
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# Print the word index built from the sentence data
# The more often a word is used, the lower its index
word_index = tokenizer.word_index
print(word_index)
# Convert the sentences to sequences of token indices
sequences = tokenizer.texts_to_sequences(sentences)
# Pad/truncate every sequence to length 5; uniform length is needed when training an NLP neural network
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)
# Try with words that the tokenizer wasn't fit on
# 'big' and 'like' are unseen words, so they are replaced with the <OOV> index (1)
test_data = [
'I have a big car',
'He like banana'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

Result:

{'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Word Index = {'<OOV>': 1, 'a': 2, 'have': 3, 'i': 4, 'car': 5, 'pen': 6, 'bike': 7, 'he': 8, 'apple': 9, 'and': 10, 'banana': 11}
Sequences = [[4, 3, 2, 5], [4, 3, 2, 6], [4, 3, 2, 7], [8, 3, 2, 9, 10, 2, 11]]
Padded Sequences:
[[ 0 4 3 2 5]
[ 0 4 3 2 6]
[ 0 4 3 2 7]
[ 2 9 10 2 11]]
Test Sequence = [[4, 3, 2, 1, 5], [8, 1, 11]]
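For the curious, pad_sequences defaults to pre-padding and pre-truncating, which is why the long sentence above keeps only its last five tokens. A plain-Python sketch of that default behaviour (illustration only, not the Keras implementation):

```python
# Sketch of pad_sequences' defaults: padding='pre', truncating='pre'
def pad_pre(seq, maxlen, value=0):
    seq = seq[-maxlen:]  # pre-truncate: keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # pre-pad with zeros

print(pad_pre([4, 3, 2, 5], 5))             # [0, 4, 3, 2, 5]
print(pad_pre([8, 3, 2, 9, 10, 2, 11], 5))  # [2, 9, 10, 2, 11]
```

Passing padding='post' or truncating='post' to pad_sequences changes which end gets the zeros or loses tokens.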

Wednesday, June 17, 2020

Machine Learning Foundations NLP(1): Tokenization for Natural Language Processing

Tokenization for Natural Language Processing


Tokenization means breaking a sentence down into separate words, for example:

I have a car. -> I / have / a / car

TensorFlow provides a tokenization tool, Tokenizer, which makes it easy to tokenize input sentences.


Code:


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
# Sentences data
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple'
]
# Create a tokenizer with at most 10 words
tokenizer = Tokenizer(num_words = 10)
# Tokenize the sentence data
tokenizer.fit_on_texts(sentences)
# Print the word index built from the sentence data
# The more often a word is used, the lower its index
word_index = tokenizer.word_index
print(word_index)

Result:

{'have': 1, 'a': 2, 'i': 3, 'he': 4, 'apple': 5, 'bike': 6, 'pen': 7, 'car': 8}
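The frequency-based indexing can be sketched in plain Python (illustration only; ties between equally frequent words may be ordered differently than Keras' Tokenizer):

```python
from collections import Counter

def build_word_index(sentences):
    # Count every lower-cased word across all sentences
    counts = Counter(w for s in sentences for w in s.lower().split())
    # More frequent words get lower indices; indices start at 1
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

sentences = ['I have a car', 'I have a pen', 'I have a bike', 'He have a apple']
print(build_word_index(sentences))
```

Here 'have' and 'a' each appear four times and 'i' three times, so they receive the lowest indices, matching the result above.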



Reference:

Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing


