spaCy tutorial: natural language work

Theatre 15

Wednesday 20th - 15:00 to 17:00


This course provides an introduction to natural language work based on the spaCy framework in Python.

We will cover the basics of using spaCy 2.x, including how to parse text documents, identify parts of speech, perform lemmatization, etc.
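
As a rough illustration of those basics, the sketch below runs a text through a spaCy pipeline and collects each token's text, part-of-speech tag, and lemma. The helper name `describe_tokens` is hypothetical, and the model name `en_core_web_sm` in the comments is an assumption about which English model is installed; the course notebooks may do this differently.

```python
import spacy


def describe_tokens(nlp, text):
    """Return a (text, part-of-speech, lemma) tuple for each token in `text`.

    `nlp` is any loaded spaCy pipeline. The tag and lemma fields are only
    meaningful when the pipeline includes a tagger and lemmatizer.
    """
    doc = nlp(text)
    return [(token.text, token.pos_, token.lemma_) for token in doc]


# Typical use, assuming the small English model was installed beforehand
# with `python -m spacy download en_core_web_sm`:
#
#     nlp = spacy.load("en_core_web_sm")
#     for row in describe_tokens(nlp, "The cats were sitting on the mats."):
#         print(row)
```

A blank pipeline created with `spacy.blank("en")` only tokenizes, so it is useful for checking the plumbing but leaves the part-of-speech field empty.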

We will show how to extract text from HTML using Beautiful Soup and from PDF documents using PDFx, along with how to handle character encoding, e.g., when working with multiple languages.
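
The character-encoding step can be sketched with the Python standard library alone. This is a minimal example under stated assumptions, not the course's own code: the helper name `decode_text` and the list of candidate encodings are hypothetical, and real documents may need proper encoding detection rather than a fixed fallback order.

```python
import unicodedata


def decode_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each candidate encoding in order.

    Note that latin-1 accepts any byte sequence, so placing it last makes
    it a catch-all fallback rather than a real detection step.
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the input")
    # Normalize to NFC so that accented characters have one canonical form,
    # regardless of whether the source used precomposed or combining marks.
    return unicodedata.normalize("NFC", text)
```

Normalizing before parsing means that the same word compares equal across documents that encode accents differently, which matters when mixing sources in several languages.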

More advanced topics include document similarity and named entity resolution, along with means for visualizing parsed and annotated text.
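
To give a feel for what document similarity measures, here is a deliberately simple stand-in: cosine similarity over bag-of-words counts, using only the standard library. This is not spaCy's method, which compares vector representations from a model; the function names `bag_of_words` and `cosine_similarity` are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in `text`."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a, counts_b = bag_of_words(text_a), bag_of_words(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Word counts ignore meaning, so "cat" and "kitten" score zero here; model-based vectors are one way to capture that kind of relatedness.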

Then we will review the advances since 2018 in embedded language models, many of which are built on transformers. Looking at projects such as ELMo, BERT, GPT-2, DistilBERT, etc., we will explore how these approaches have changed the field of natural language so dramatically through the use of deep learning and especially transfer learning.

Additionally, we'll discuss how to assess fairness and bias in the data, the interpretability and visualization of models, and the implications of rapidly evolving novel hardware, along with the issues of extreme energy use and policy regarding environmental concerns.

There are more notebooks with extra material than we will have time to run during the course -- for example, some of the deep learning examples take a long time to run (up to an hour). We will review these during the training, and people can then run the notebooks later as a deep dive into specific topics.


- Each person taking the course should have some hands-on experience coding in Python, plus some familiarity with machine learning.

- Please bring a laptop. You will need to have a Google account (Gmail) and it helps to have a GitHub account too.

- We will use Jupyter notebooks running on Google Colab via GitHub, so there is no need to install or configure anything specially for the exercises.


- You are a Python programmer and need to learn how to use available packages for NLP and deep learning.

- You are a data scientist with some Python experience and need to leverage NLP, text mining, and deep learning.

- You are interested in deep learning, knowledge graphs, and related AI work, and want to understand the basics for preparing text data for those kinds of use cases.