NLP | spaCy | Tokenization with spaCy

gcptutorials.com Python

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. This post explains how to use spaCy for tokenization.

if spaCy is not installed, follow install spaCy link.

Import spaCy, load model and generate tokens.


import spacy
model = spacy.load("en_core_web_sm")


text = "This post is published on gcptutorials.com"

tokens = model.tokenizer(text)

#print(help(tokens[0].lemma))

for token in tokens:
    print(token.text)

'''Output
This
post
is
published
on
gcptutorials.com
'''

Check if tokens are stop words.


import spacy
model = spacy.load("en_core_web_sm")


text = "This post is published on gcptutorials.com"

tokens = model.tokenizer(text)

#print(help(tokens[0].lemma))

for token in tokens:
    print(token.text, token.is_stop)

'''Output
This True
post False
is True
published False
on True
gcptutorials.com False

  '''

Check if tokens are alphanumeric.


import spacy
model = spacy.load("en_core_web_sm")


text = "This post is published on gcptutorials.com"

tokens = model.tokenizer(text)

#print(help(tokens[0].lemma))

for token in tokens:
    print(token.text, token.is_alpha)

'''Output
This True
post True
is True
published True
on True
gcptutorials.com False

'''

Category: Python