Intro to NLP with spaCy

An introduction to spaCy for natural language processing and machine learning with special help from Scikit-learn.

, in 13 July 2015


You can access the Ipython notebook code here. This is intended to be run in an Ipython notebook, but the code can be copied and pasted into a python interpreter and it should work as well. Follow the readme on the github page above to get the dependencies required to run this code.

This tutorial is intended as a way for people with some experience doing machine learning and natural language processing to get started performing complex tasks in Python using spaCy and scikit-learn.

# Set up spaCy
from spacy.en import English
parser = English()

# Test Data
multiSentence = "There is an art, it says, or rather, a knack to flying." \
                 "The knack lies in learning how to throw yourself at the ground and miss." \
                 "In the beginning the Universe was created. This has made a lot of people "\
                 "very angry and been widely regarded as a bad move."

spaCy does tokenization, sentence recognition, part of speech tagging, lemmatization, dependency parsing, and named entity recognition all at once!

# all you have to do to parse text is this:
#note: the first time you run spaCy in a file it takes a little while to load up its modules
parsedData = parser(multiSentence)

# Let's look at the tokens
# All you have to do is iterate through the parsedData
# Each token is an object with lots of different properties
# A property with an underscore at the end returns the string representation
# while a property without the underscore returns an index (int) into spaCy's vocabulary
# The probability estimate is based on counts from a 3 billion word
# corpus, smoothed using the Simple Good-Turing method.
for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    if i > 1:

original: 300 There

lowercased: 144 there

lemma: 300 There

shape: 187 Xxxxx

prefix: 32 T

suffix: 66 ere

log probability: -7.663576126098633

Brown cluster id: 1918

# Let's look at the sentences
sents = []
# the "sents" property returns spans
# spans have indices into the original string
# where each index value represents a token
for span in parsedData.sents:
    # go from the start to the end of each span, returning each token in the sentence
    # combine each token using join()
    sent = ''.join(parsedData[i].string for i in range(span.start, span.end)).strip()

for sentence in sents:

There is an art, it says, or rather, a knack to flying.

The knack lies in learning how to throw yourself at the ground and miss.

In the beginning the Universe was created.

This has made a lot of people very angry and been widely regarded as a bad move.

# Let's look at the part of speech tags of the first sentence
for span in parsedData.sents:
    sent = [parsedData[i] for i in range(span.start, span.end)]

for token in sent:
    print(token.orth_, token.pos_)

There DET


an DET

art NOUN



says VERB



rather ADV



knack NOUN

to ADP

flying NOUN


# Let's look at the dependencies of this example:
example = "The boy with the spotted dog quickly ran after the firetruck."
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

The det boy [] []

boy nsubj ran [‘The’] [‘with’]

with prep boy [] [‘dog’]

the det dog [] []

spotted amod dog [] []

dog pobj with [‘the’, ‘spotted’] []

quickly advmod ran [] []

ran ROOT ran [‘boy’, ‘quickly’] [‘after’, ‘.’]

after prep ran [] [‘firetruck’]

the det firetruck [] []

firetruck pobj after [‘the’] []

. punct ran [] []

# Let's look at the named entities of this example:
example = "Apple's stocks dropped dramatically after the death of Steve Jobs in October."
parsedEx = parser(example)
for token in parsedEx:
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)")

print("-------------- entities only ---------------")
# if you just want the entities and nothing else, you can do access the parsed examples "ents" property like this:
ents = list(parsedEx.ents)
for entity in ents:
    print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))

Apple ORG

’s (not an entity)

stocks (not an entity)

dropped (not an entity)

dramatically (not an entity)

after (not an entity)

the (not an entity)

death (not an entity)

of (not an entity)


Jobs (not an entity)

in (not an entity)

October DATE

. (not an entity)

————– entities only —————

274530 ORG Apple

112504 PERSON Steve Jobs

71288 DATE October

spaCy is trained to attempt to handle messy data, including emoticons and other web-based features

messyData = "lol that is rly funny :) This is gr8 i rate it 8/8!!!"
parsedData = parser(messyData)
for token in parsedData:
    print(token.orth_, token.pos_, token.lemma_)
# it does pretty well! Note that it does fail on the token "gr8", 
# taking it as a verb rather than an adjective meaning "great"
# and "lol" probably isn't a's more like an interjection

lol NOUN lol

that DET that

is VERB be

rly ADV rly

funny ADJ funny

:) PUNCT :)

This DET This

is VERB be

gr8 VERB gr8

i PRON i

rate VERB rate


8/8 NUM 8/8




spaCy has word vector representations built in!

from numpy import dot
from numpy.linalg import norm

# you can access known words from the parser's vocabulary
nasa = parser.vocab['NASA']

# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_repvec and w.orth_.islower() and w.lower_ != "nasa"})

# sort by similarity to NASA
allWords.sort(key=lambda w: cosine(w.repvec, nasa.repvec))
print("Top 10 most similar words to NASA:")
for word in allWords[:10]:   

Top 10 most similar words to NASA:











# Let's see if it can figure out this analogy
# Man is to King as Woman is to ??
king = parser.vocab['king']
man = parser.vocab['man']
woman = parser.vocab['woman']

result = king.repvec - man.repvec + woman.repvec

# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_repvec and w.orth_.islower() and w.lower_ != "king" and w.lower_ != "man" and w.lower_ != "woman"})
# sort by similarity to the result
allWords.sort(key=lambda w: cosine(w.repvec, result))
print("\n----------------------------\nTop 3 closest results for king - man + woman:")
for word in allWords[:3]:   
# it got it! Queen!

Top 3 closest results for king - man + woman:




You can do cool things like extract Subject, Verb, Object triples from the dependency parse if you use my code in Note: Doesn’t work on complicated sentences. Fails if the dependency parse is incorrect.

from subject_object_extraction import findSVOs

# can still work even without punctuation
parse = parser("he and his brother shot me and my sister")

[(‘he’, ‘shot’, ‘me’), (‘he’, ‘shot’, ‘sister’), (‘brother’, ‘shot’, ‘me’), (‘brother’, ‘shot’, ‘sister’)]

# very complex sample. Only some are correct. Some are missed.
parse = parser("Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun. "
                "Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended "
                "life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. "
                "This planet has – or rather had – a problem, which was this: most of the people living on it were unhappy for pretty much of the time. "
                "Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, "
                "which is odd because on the whole it wasn’t the small green pieces of paper that were unhappy. And so the problem remained; lots of the "
                "people were mean, and most of them were miserable, even the ones with digital watches.")

[(‘orbiting’, ‘is’, ‘planet’), (‘watches’, ‘are’, ‘idea’), (‘problem’, ‘was’, ‘this’), (‘it’, ‘wasn’t’, ‘pieces’), (‘most’, ‘were’, ‘ones’)]

If you want to include spaCy in your machine learning it is not too difficult

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re

# A custom stoplist
STOPLIST = set(stopwords.words('english') + ["n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS))
# List of symbols we don't care about
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-----", "---", "...", "“", "”", "'ve"]

# Every step in a pipeline needs to be a "transformer". 
# Define a custom transformer to clean text using spaCy
class CleanTextTransformer(TransformerMixin):
    Convert text to cleaned text

    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
# A custom function to clean the text before sending it into the vectorizer
def cleanText(text):
    # get rid of newlines
    text = text.strip().replace("\n", " ").replace("\r", " ")
    # replace twitter @mentions
    mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
    text = mentionFinder.sub("@MENTION", text)
    # replace HTML symbols
    text = text.replace("&amp;", "and").replace("&gt;", ">").replace("&lt;", "<")
    # lowercase
    text = text.lower()

    return text

# A custom function to tokenize the text using spaCy
# and convert to lemmas
def tokenizeText(sample):

    # get the tokens using spaCy
    tokens = parser(sample)

    # lemmatize
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas

    # stoplist the tokens
    tokens = [tok for tok in tokens if tok not in STOPLIST]

    # stoplist symbols
    tokens = [tok for tok in tokens if tok not in SYMBOLS]

    # remove large strings of whitespace
    while "" in tokens:
    while " " in tokens:
        tokens.remove(" ")
    while "\n" in tokens:
    while "\n\n" in tokens:

    return tokens

def printNMostInformative(vectorizer, clf, N):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
    print("Class 2 best: ")
    for feat in topClass2:
# the vectorizer and classifer to use
# note that I changed the tokenizer in CountVectorizer to use a custom function using spaCy's tokenizer
vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
clf = LinearSVC()
# the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])

# data
train = ["I love space. Space is great.", "Planets are cool. I am glad they exist in space", 
        "lol @twitterdude that is gr8", "twitter &amp; reddit are fun.", 
        "Mars is a planet. It is red.", "@Microsoft: y u skip windows 9?", 
        "Rockets launch from Earth and go to other planets.", "twitter social media &gt; &lt;", 
        "@someguy @somegirl @twitter #hashtag", "Orbiting the sun is a little blue-green planet."]
labelsTrain = ["space", "space", "twitter", "twitter", "space", "twitter", "space", "twitter", "twitter", "space"]

test = ["i h8 riting comprehensibly #skoolsux", "planets and stars and rockets and stuff"]
labelsTest = ["twitter", "space"]

# train, labelsTrain)

# test
preds = pipe.predict(test)
for (sample, pred) in zip(test, preds):
    print(sample, ":", pred)
print("accuracy:", accuracy_score(labelsTest, preds))

print("Top 10 features used to predict: ")
# show the top features
printNMostInformative(vectorizer, clf, 10)

print("The original data as it appeared to the classifier after tokenizing, lemmatizing, stoplisting, etc")
# let's see what the pipeline was transforming the data into
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
transform = pipe.fit_transform(train, labelsTrain)

# get the features that the vectorizer learned (its vocabulary)
vocab = vectorizer.get_feature_names()

# the values from the vectorizer transformed data (each item is a row,column index with value as # times occuring in the sample, stored as a sparse matrix)
for i in range(len(train)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences =[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))
    print("Sample {}: {}".format(i, s))


i h8 riting comprehensibly #skoolsux : twitter

planets and stars and rockets and stuff : space

accuracy: 1.0

Top 10 features used to predict:

Class 1 best: 
(-0.52882810587037121, 'planet')
(-0.35193565503626856, 'space')
(-0.2182987490483107, 'mar')
(-0.2182987490483107, 'red')
(-0.15592826214493352, 'earth')
(-0.15592826214493352, 'launch')
(-0.15592826214493352, 'rocket')
(-0.1482804579342584, 'great')
(-0.1482804579342584, 'love')
(-0.099226355509375405, 'blue')
Class 2 best: 
(0.41129938045689757, 'twitter')
(0.34038557663231445, [email protected]')
(0.23401502570811406, 'lol')
(0.23401502570811406, 'gr8')
(0.20564996854629114, 'social')
(0.20564996854629114, 'medium')
(0.20564941191060651, 'reddit')
(0.20564941191060651, 'fun')
(0.10637055092420053, 'y')
(0.10637055092420053, 'window')

The original data as it appeared to the classifier after tokenizing, lemmatizing, stoplisting, etc

Sample 0: ('love', 1)('space', 2)('great', 1)
Sample 1: ('space', 1)('planet', 1)('cool', 1)('glad', 1)('exist', 1)
Sample 2: ('lol', 1)([email protected]', 1)('gr8', 1)
Sample 3: ('twitter', 1)('reddit', 1)('fun', 1)
Sample 4: ('planet', 1)('mar', 1)('red', 1)
Sample 5: ([email protected]', 1)('y', 1)('u', 1)('skip', 1)('window', 1)('9', 1)
Sample 6: ('planet', 1)('rocket', 1)('launch', 1)('earth', 1)
Sample 7: ('twitter', 1)('social', 1)('medium', 1)
Sample 8: ([email protected]', 3)('hashtag', 1)
Sample 9: ('planet', 1)('orbit', 1)('sun', 1)('little', 1)('blue', 1)('green', 1)