In this post you will learn about some NLP (Natural Language Processing) techniques, hopefully with no mathematical formulas. As developers, we're always trying to broaden our skill sets. Recently, we wanted to carry out an educational but also practical exercise related to the global pandemic, so we participated in one of the COVID-19 challenges hosted on the Kaggle platform.
Kaggle, founded in 2010 and acquired by Google in 2017, is an online community for data scientists. It offers education and resources: machine learning competitions, datasets, notebooks, access to training accelerators, and more.
The Kaggle COVID-19 challenge presented a variety of tasks intended to help the global community and health organizations stay informed and make data-driven decisions.
Our objective was to help researchers better access and understand the research that’s already out there. There were 10 specific research topics in total as part of the Kaggle challenge, covering areas like vaccination and community transmission. We decided to leverage a dataset of 44k articles, and use machine learning techniques to classify and score articles by relevance for each topic. Theoretically, we’d be able to point each researcher to the most relevant articles for their topic of interest.
NLP (Natural Language Processing) is the field of artificial intelligence that studies human languages. NLP is often applied to classifying text data: understanding the meaning behind written text in different contexts, and how we might categorize it. Most machine learning algorithms cannot process strings or plain text in their raw form; they require a numeric representation as input in order to function. There are many ways to produce such a representation. The simplest just count how many times each word appears in each document, but at the cost of losing the words' meaning. We were aiming for a technique that captures the meaning of words, their semantic relationships, and the contexts they are used in. A technique that meets all these requirements is word embeddings.
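To see what that simple counting approach looks like, and what it loses, here is a minimal sketch using scikit-learn's CountVectorizer on two toy sentences of our own. It is an illustration, not the code from our notebook:

```python
# A toy bag-of-words example: each sentence becomes a vector of word counts.
# Note how the two sentences end up nearly identical, because raw counts
# discard word order and meaning ("not effective" vs. "effective").
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the vaccine was effective",
    "the vaccine was not effective",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['effective' 'not' 'the' 'vaccine' 'was']
print(counts.toarray())
# [[1 0 1 1 1]
#  [1 1 1 1 1]]
```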
The numeric representations are calculated from the probability distribution of each word appearing before or after another. They also encapsulate different relations between words, like synonyms, antonyms, or analogies, such as this one:
Male groups king and man together, while female groups queen and woman. But there also exists a relationship across the pairs: the vector from man to king is roughly the same as the vector from woman to queen.
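This can be reproduced as plain vector arithmetic. The following sketch uses a small pretrained GloVe model from gensim's downloader, not anything we trained ourselves:

```python
# king - man + woman ≈ queen, expressed as vector arithmetic on pretrained
# GloVe embeddings (downloaded via gensim on first run).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the closest word to king - man + woman: ('queen', ...)
```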
To identify when an article is relevant, we decided to apply unsupervised machine learning NLP algorithms. The process being unsupervised means there is no way to influence it with keywords (e.g. covid, virus, flu). We expected this process to generate a model that correlates the articles and also stores an understanding of the content of each article. In the end, with the trained model, we would be able to test any sentence (a query), and the model should be capable of finding the articles most related to that specific query.
It is important to understand that a machine learning system is highly influenced by our assumptions.
We used multiple algorithms during the challenge (Doc2vec and TF-IDF), but the overall process was the same.
We pre-processed all the words to lowercase, removing punctuation and stop words. Stop words are usually the most common words in a language. Search engines like Google remove those words from searches, saving space and time when processing and indexing results. Similarly in NLP, stop words don't offer much information to uniquely identify an article, so models usually perform better without them. (In the case of Doc2vec, it is possible to keep them.)
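A minimal sketch of this pre-processing with gensim's built-in helpers (the real pipeline in our notebook differs in the details):

```python
# Lowercase, strip punctuation, and drop stop words.
# simple_preprocess lowercases and removes punctuation and very short tokens;
# STOPWORDS is gensim's built-in English stop word list.
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print(preprocess("The incubation period of the virus is not yet understood."))
# roughly: ['incubation', 'period', 'virus', 'understood']
```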
This step refers to creating valuable data from the output of the previous one. There are multiple sub-steps (which we are not going to cover) that convert the initial text into a final numeric vector that expresses relationships between words and identifies articles by their content. It is also valuable to analyze some of the data these steps produce. One of the first outputs could be word clouds, which simply display the most frequent words.
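For instance, a word cloud can be generated with the wordcloud package. The token list below is a made-up stand-in for the pre-processed articles:

```python
# Build a word cloud from pre-processed tokens; frequent words are drawn largest.
from wordcloud import WordCloud

tokens = ["virus", "infection", "patients", "incubation", "transmission", "virus"]
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(" ".join(tokens))
cloud.to_file("frequent_words.png")  # save the rendered cloud as an image
```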
Each algorithm processes the text and creates a numeric representation of each article. Each algorithm takes different routes and makes different assumptions, but the output is always a big vector of numbers.
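As an illustration, here is roughly what that looks like with gensim's Doc2vec. The two tiny token lists and the hyperparameters are placeholders, not our actual setup:

```python
# Train Doc2vec on pre-processed articles; each article ends up represented
# by a fixed-size numeric vector (here 100 dimensions).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = [
    ["incubation", "period", "virus", "days"],
    ["vaccine", "trial", "immune", "response"],
]  # pre-processed tokens, one list per article

corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(articles)]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv[0])  # the 100-number vector that now represents article 0
```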
Once a model had been trained with all these numeric representations, it could answer two important questions: which words are most similar to a given word, and which articles are most similar to a given article.
And finally, it was possible to achieve our objective: finding articles related to a word or a sentence. We were actually able to classify completely new articles too, but that wasn't our aim for this challenge.
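With a trained Doc2vec model like the one sketched above, both questions become one-liners (the query word and tag are illustrative):

```python
# Which words does the model consider similar to a given word?
print(model.wv.most_similar("virus", topn=3))

# Which articles are most similar to a given article (here, article 0)?
print(model.dv.most_similar(0, topn=1))
```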
We tested the model using one of the challenge's full research questions and looked at how well the articles it returned correlated with it. Our question was the following:
What are the range of incubation periods for the disease in humans (and how does this vary across age and health status) and how long are individuals contagious, even after recovery?
The result was a list of the 10 best-matching articles (we capped it at 10 for display purposes only).
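A sketch of how such a query can be run, reusing the preprocess helper and the Doc2vec model from the earlier sketches:

```python
# Turn the full research question into a vector and find the 10 closest articles.
question = (
    "What are the range of incubation periods for the disease in humans "
    "and how long are individuals contagious, even after recovery?"
)
query_vector = model.infer_vector(preprocess(question))
top_articles = model.dv.most_similar([query_vector], topn=10)
print(top_articles)  # list of (article_tag, similarity_score) pairs
```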
And that is it!
We were really happy with the results we achieved. Understanding the algorithms and representing the data well is really important, but there are also a lot of additional challenges: how best to represent the results, how to improve performance and accuracy, and how to compare our model with other models. Honestly, we were a long way from mastering these techniques, but they were really interesting to explore. If you are interested in NLP or machine learning, you can start by checking out the notebook that we created for the COVID-19 Challenge. Kaggle itself also has a lot of courses and content for people who want to get started with machine learning.