Wednesday, June 17, 2020

Machine Learning Foundations NLP(1): Tokenization for Natural Language Processing



Tokenization means breaking a sentence down into separate words (tokens), for example:

I have a car -> I / have / a / car
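The idea can be shown with plain Python before reaching for a library (a minimal sketch, whitespace splitting only; real tokenizers also handle punctuation and case):

```python
# Naive tokenization: split a sentence on whitespace
sentence = "I have a car"
tokens = sentence.split()
print(tokens)  # ['I', 'have', 'a', 'car']
```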

TensorFlow provides a tokenization tool, Tokenizer, which makes it easy to tokenize input sentences.


Code:


from tensorflow.keras.preprocessing.text import Tokenizer

# Sentence data (a list, so word order is deterministic)
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple'
]
# Create a tokenizer that keeps at most the 10 most frequent words
tokenizer = Tokenizer(num_words = 10)
# Build the vocabulary from the sentence data
tokenizer.fit_on_texts(sentences)
# Print the word index built from the sentence data.
# The more often a word is used, the lower the index it is assigned.
word_index = tokenizer.word_index
print(word_index)

Result:

{'have': 1, 'a': 2, 'i': 3, 'car': 4, 'pen': 5, 'bike': 6, 'he': 7, 'apple': 8}
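Once the word index is built, the same tokenizer can turn sentences into integer sequences (a short sketch, not from the original post; the test sentence 'I have a dog' is made up for illustration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple'
]
tokenizer = Tokenizer(num_words = 10)
tokenizer.fit_on_texts(sentences)

# texts_to_sequences replaces each known word with its index from
# word_index; words never seen during fit_on_texts are dropped
# (no oov_token was configured).
sequences = tokenizer.texts_to_sequences(['I have a dog'])
print(sequences)  # 'dog' is unseen, so only i / have / a remain
```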



Reference:

Machine Learning Foundations: Ep #8 - Tokenization for Natural Language Processing


