In this post you will learn about some NLP (Natural Language Processing) techniques, hopefully with no mathematical formulas. As developers, we're always trying to broaden our skill sets. Recently, we wanted to carry out an educational but also practical exercise related to the global pandemic, so we participated in one of the COVID-19 challenges hosted on the Kaggle platform.
Kaggle, founded in 2010 and acquired by Google in 2017, is an online community for data scientists. It offers education and resources: machine learning competitions, datasets, notebooks, access to training accelerators, and more.
The Kaggle COVID-19 challenge presented a variety of tasks intended to help the global community and health organizations stay informed and make data-driven decisions.
Our objective was to help researchers better access and understand the research that’s already out there. There were 10 specific research topics in total as part of the Kaggle challenge, covering areas like vaccination and community transmission. We decided to leverage a dataset of 44k articles, and use machine learning techniques to classify and score articles by relevance for each topic. Theoretically, we’d be able to point each researcher to the most relevant articles for their topic of interest.
NLP (Natural Language Processing) is the field of artificial intelligence that studies human languages. NLP is often applied to classifying text data: understanding the meaning behind written text in different contexts, and how we might categorize it. Most machine learning algorithms cannot process strings or plain text in their raw form; they require a numeric representation as input in order to function. There are many ways to produce such a representation. The simplest just count how many times each word appears in each document, but at the cost of losing the words' meaning. We were aiming for a technique that captures the meaning of words, their semantic relationships, and the contexts they are used in. A technique that meets all these requirements is word embeddings.
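To see what that simple counting approach looks like, and what it loses, here is a minimal sketch using scikit-learn's CountVectorizer on two toy sentences of our own. It is an illustration, not the code from our notebook:

```python
# A toy bag-of-words example: each sentence becomes a vector of word counts.
# Note how the two sentences end up nearly identical, because raw counts
# discard word order and meaning ("not effective" vs. "effective").
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the vaccine was effective",
    "the vaccine was not effective",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['effective' 'not' 'the' 'vaccine' 'was']
print(counts.toarray())
# [[1 0 1 1 1]
#  [1 1 1 1 1]]
```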
The numeric representations are calculated from the probability distribution of each word appearing before or after another. They also encapsulate different relations between words, like synonyms, antonyms, or analogies, such as this one:
Male groups king and man together, while female groups queen and woman. But there also exists a relationship across the pairs: the vector from man to king is roughly the same as the vector from woman to queen.
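This can be reproduced as plain vector arithmetic. The following sketch uses a small pretrained GloVe model from gensim's downloader, not anything we trained ourselves:

```python
# king - man + woman ≈ queen, expressed as vector arithmetic on pretrained
# GloVe embeddings (downloaded via gensim on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the closest word to king - man + woman: ('queen', ...)
```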
To identify when an article is relevant, we decided to apply unsupervised machine learning NLP algorithms. The process being unsupervised means there is no way to influence it with keywords (e.g. covid, virus, flu). We expected this process to generate a model that correlates the articles and also stores an understanding of the content of each article. In the end, with the trained model, we would be able to test any sentence (a query), and the model should be capable of finding the articles most related to that specific query.
It is important to understand that a machine learning system is highly influenced by our assumptions.
We used multiple algorithms during the challenge (Doc2vec and TF-IDF), but the overall process was the same.
We pre-processed all the words to lowercase, removing punctuation and stop words. Stop words are usually the most common words in a language. Search engines like Google remove those words from searches, saving space and time when processing and indexing results. Similarly in NLP, stop words don't offer much information to uniquely identify an article, so models usually perform better without them. (In the case of Doc2vec, it is possible to keep them.)
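A minimal sketch of this pre-processing with gensim's built-in helpers (the real pipeline in our notebook differs in the details):

```python
# Lowercase, strip punctuation, and drop stop words.
# simple_preprocess lowercases and removes punctuation and very short tokens;
# STOPWORDS is gensim's built-in English stop word list.
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print(preprocess("The incubation period of the virus is not yet understood."))
# roughly: ['incubation', 'period', 'virus', 'understood']
```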
This step refers to creating valuable data from the output of the previous one. There are multiple sub-steps (which we are not going to cover) that convert the initial text into a final numeric vector that expresses relationships between words and identifies articles by their content. It is also valuable to analyze some of the data these steps produce. One of the first outputs could be word clouds, which simply display the most frequent words.
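For instance, a word cloud can be generated with the wordcloud package. The token list below is a made-up stand-in for the pre-processed articles:

```python
# Build a word cloud from pre-processed tokens; frequent words are drawn largest.
from wordcloud import WordCloud

tokens = ["virus", "infection", "patients", "incubation", "transmission", "virus"]
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(" ".join(tokens))
cloud.to_file("frequent_words.png")  # save the rendered cloud as an image
```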
Each algorithm processes the text and creates a numeric representation of each article. Each algorithm takes different routes and makes different assumptions, but the output is always a big vector of numbers.
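As an illustration, here is roughly what that looks like with gensim's Doc2vec. The two tiny token lists and the hyperparameters are placeholders, not our actual setup:

```python
# Train Doc2vec on pre-processed articles; each article ends up represented
# by a fixed-size numeric vector (here 100 dimensions).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = [
    ["incubation", "period", "virus", "days"],
    ["vaccine", "trial", "immune", "response"],
]  # pre-processed tokens, one list per article

corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(articles)]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv[0])  # the 100-number vector that now represents article 0
```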
Once a model had been trained with all these numeric representations, it could answer two important questions: which words are most similar to a given word, and which articles are most similar to a given article.
And finally, it was possible to achieve our objective: finding articles related to a word or a sentence. We were actually able to classify completely new articles too, but that wasn't our aim for this challenge.
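With a trained Doc2vec model like the one sketched above, both questions become one-liners (the query word and tag are illustrative):

```python
# Which words does the model consider similar to a given word?
print(model.wv.most_similar("virus", topn=3))

# Which articles are most similar to a given article (here, article 0)?
print(model.dv.most_similar(0, topn=1))
```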
We tested the model using one of the challenge's full research questions and looked at how well the articles it returned correlated with it. Our question was the following:
What are the range of incubation periods for the disease in humans (and how does this vary across age and health status) and how long are individuals contagious, even after recovery?
The result was a list of the 10 best-matching articles (we capped it at 10 for display purposes only).
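A sketch of how such a query can be run, reusing the preprocess helper and the Doc2vec model from the earlier sketches:

```python
# Turn the full research question into a vector and find the 10 closest articles.
question = (
    "What are the range of incubation periods for the disease in humans "
    "and how long are individuals contagious, even after recovery?"
)
query_vector = model.infer_vector(preprocess(question))
top_articles = model.dv.most_similar([query_vector], topn=10)
print(top_articles)  # list of (article_tag, similarity_score) pairs
```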
And that is it!
We were really happy with the results we achieved. Understanding the algorithms and representing the data well is really important, but there are also a lot of additional challenges: how best to represent the results, how to improve performance and accuracy, and how to compare our model with other models. Honestly, we were a long way from mastering these techniques, but they were really interesting to explore. If you are interested in NLP or machine learning, you can start by checking out the notebook that we created for the COVID-19 Challenge. Kaggle itself also has a lot of courses and content for people who want to get started with machine learning.