Tokenization for Natural Language Processing
Tokenization means breaking a sentence down into separate words, for example: I have a car. -> I / have / a / car. TensorFlow provides a tokenization tool, Tokenizer, which makes it easy to tokenize an input sentence.
Code:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

# Sentence data (a list, so word order and counts are deterministic)
sentences = [
    'I have a car',
    'I have a pen',
    'I have a bike',
    'He have a apple'
]

# Create a tokenizer; num_words caps the vocabulary at the 10 most
# frequent words when texts are later converted to sequences
tokenizer = Tokenizer(num_words=10)

# Build the word index from the sentence data
tokenizer.fit_on_texts(sentences)

# Print the word index built from the sentence data
# The more often a word is used, the lower the index it is assigned
word_index = tokenizer.word_index
print(word_index)
Result:
{'have': 1, 'a': 2, 'i': 3, 'he': 4, 'apple': 5, 'bike': 6, 'pen': 7, 'car': 8}
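As a quick follow-up to the code above (continuing with the same tokenizer and sentences), texts_to_sequences converts text into lists of these integer indices. The test sentence below is just an illustration; note that a word the tokenizer has never seen, like 'dog', is silently dropped unless an oov_token is supplied when the Tokenizer is created.

# Encode the training sentences as integer sequences,
# using the word index learned by fit_on_texts
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

# An unseen word ('dog') is skipped, so only i/have/a are encoded
test_seq = tokenizer.texts_to_sequences(['I have a dog'])
print(test_seq)  # [[3, 1, 2]]

# To keep a placeholder for unseen words instead, create the tokenizer
# with an out-of-vocabulary token, e.g.:
# tokenizer = Tokenizer(num_words=10, oov_token='<OOV>')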