State-of-the-Art Algorithms on Textual Data

Abhigya Verma
4 min read · Jan 25, 2023


With the ever-increasing amount of data in every format, it is essential to find better ways to analyze it. Beyond analysis, extracting meaningful outcomes from this data is an unavoidable task that data scientists and analysts face. Working with text data has branched out into its own field within Artificial Intelligence, called Natural Language Processing (NLP).

Textual data, or written text, poses multiple challenges that can become major hurdles, especially for the accuracy of Machine Learning models. These challenges are discussed in the following sections, along with methods to deal with them.

Many such challenging tasks are available on the popular data science platform Kaggle. In this study, I discuss one such textual data analysis challenge: Goodreads Books Reviews.

The Challenge:

The Goodreads Book Review challenge contains data collected from the Goodreads website: reviews people have posted on various books, along with corresponding parameters, the most significant being the rating on a scale of 0 to 5.

The problem presents the challenge of predicting the rating for the test data from the given parameters. A snippet of the sample data is presented in the following figure:

A snippet of the given problem’s training dataset

The Solution:

The following sections present the application of state-of-the-art Machine Learning algorithms and Recurrent Neural Networks to this data, to practice and learn beginner NLP concepts so that you can work with any text data in the future with ease.

To understand which algorithm will work best on the data, it is essential to first understand the data in detail and analyze its different aspects. Based on this analysis, the best model for the dataset can be chosen and fine modelling parameters can be tweaked to get the best results. We will save parameter tuning for another day; for now, let us focus on dealing with text data.

The data given by the challenge contains 900,000 data points and 11 parameters. Considering the textual nature of the key parameter, review_text, and the huge number of data points, we will reduce the data to make it easier to work with, and also show how data can be reduced while preserving the label distribution.

STEP 1: LOAD, REDUCE AND ANALYZE DATA

The first step in the process is to load the Dataset and analyze it. Below is the code for loading the data and taking a peek at the first five rows of the dataset.
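The loading code appeared as an image in the original post; a minimal sketch looks like the following (the file name `goodreads_train.csv` is an assumption based on how Kaggle challenges usually ship their training data):

```python
import pandas as pd

# Assumed path to the challenge's training file.
TRAIN_PATH = "goodreads_train.csv"

def load_reviews(path) -> pd.DataFrame:
    """Load the review dataset and print a quick summary."""
    df = pd.read_csv(path)
    print(df.head())   # peek at the first five rows
    print(df.shape)    # (number of rows, number of columns)
    return df
```

Calling `load_reviews(TRAIN_PATH)` loads the full dataset and shows its first five rows and overall shape.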

The next step is to reduce the size of this huge dataset. We decided to reduce the data by a factor of 100 and keep only around 9,000 entries for this analysis. Following is the code for the same.
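The reduction code was also an image in the original post. A sketch of a label-ratio-preserving reduction is shown below, sampling the same fraction from each rating class (the label column name `rating` is an assumption):

```python
import pandas as pd

def reduce_keep_ratio(df: pd.DataFrame, frac: float,
                      label: str = "rating", seed: int = 42) -> pd.DataFrame:
    """Sample `frac` of the rows from each rating class so that the
    label distribution of the reduced set mirrors the original."""
    return (df.groupby(label, group_keys=False)
              .sample(frac=frac, random_state=seed)
              .reset_index(drop=True))
```

With `frac=0.01`, 900,000 rows shrink to roughly 9,000 while each rating from 0 to 5 keeps its original share of the data.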

The whole process has been described in detail in this article: https://abhigya27.medium.com/reducing-large-datasets-for-ml-tasks-while-maintaining-label-ratio-aa050d02c68c

For this classification, only the review text column of the dataset was considered, as this study focuses on text-based classification.

STEP 2: DATA CLEANING

An essential part of text-based analysis is getting the data into the correct format, which is why various Python routines are defined for cleaning the data.

In this classification, we performed 5 cleaning processes on the data as follows:

1. Removing Hashtag

Data is often extracted from social media platforms, which use the hashtag mechanism to enhance the reach of a particular text. During analysis, hashtags can therefore skew the results and introduce bias, which makes hashtag removal an essential step.
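A simple regex-based sketch of this step (the original code was an image):

```python
import re

def remove_hashtags(text: str) -> str:
    # Drop the whole hashtag token: the '#' and the word after it.
    return re.sub(r"#\w+", "", text).strip()
```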

2. Removing Mentions

Very similar to hashtags, mentions should also be removed, for the same reasons.
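The same regex idea works for `@username` tokens:

```python
import re

def remove_mentions(text: str) -> str:
    # Drop '@username' tokens, analogous to hashtag removal.
    return re.sub(r"@\w+", "", text).strip()
```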

3. Removing URLs

Text often contains links or URLs redirecting to other related websites; these are of little use when we only want to analyse the text itself. Following is the code to remove them.
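A sketch covering both `http(s)://` links and bare `www.` links:

```python
import re

def remove_urls(text: str) -> str:
    # Remove http(s) links and bare www. links in one pass.
    return re.sub(r"(https?://\S+|www\.\S+)", "", text).strip()
```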

4. Lower Case text

Capitalised and lower-case versions of the same word can confuse the ML model into treating them as different tokens. Hence, all the text is converted to lower case.
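In Python this is a one-liner:

```python
def to_lower(text: str) -> str:
    # 'Great' and 'great' should map to the same token.
    return text.lower()
```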

5. Stopword Removal

English text contains many words that do not really add meaning to the statement or the point it is trying to make. These words are part of the grammar that makes a sentence meaningful for humans, but for machines they can be useless and confusing. Hence, these words are removed as follows:
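In practice the English stopword list from NLTK is a common choice; the sketch below hard-codes a few stopwords so it runs without downloading any NLTK data (the word set is illustrative, not the full list):

```python
# A small illustrative stopword set; NLTK's
# stopwords.words("english") is the usual full list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "in", "on",
             "at", "of", "to", "and", "or", "it", "this", "that"}

def remove_stopwords(text: str) -> str:
    # Keep only the tokens that carry meaning for the model.
    return " ".join(t for t in text.split()
                    if t.lower() not in STOPWORDS)
```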

STEP 3: TEXT VECTORIZATION

Plain English text cannot be passed directly as input to a Machine Learning algorithm, since ML algorithms can only decipher and understand numbers; the text therefore needs to be converted to numbers that the machine can understand. This process is called vectorization. There are multiple types of vectorization, but this study is restricted to the well-known Term Frequency–Inverse Document Frequency (TF-IDF).
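With scikit-learn, TF-IDF vectorization of the cleaned reviews can be sketched as follows (the three sample reviews are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the cleaned review_text column.
reviews = ["wonderful heartfelt story",
           "dull story with flat characters",
           "wonderful characters and pacing"]

# Each review becomes a sparse vector of term weights:
# frequent-in-this-review but rare-overall terms score highest.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
print(X.shape)  # (number of reviews, vocabulary size)
```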

STEP 4: APPLYING THE MODELS

Finally, the obtained data is split into training and testing parts, and five basic ML algorithms are applied to it.
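The original post does not list which five algorithms were used, so the sketch below picks five common scikit-learn baselines as an assumption; swap in whichever models you prefer:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def run_baselines(X, y, seed=42):
    """Split the vectorized data 80/20 and report test accuracy
    for five classic classifiers (an assumed selection)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "MultinomialNB": MultinomialNB(),
        "LinearSVC": LinearSVC(),
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
        "RandomForest": RandomForestClassifier(random_state=seed),
    }
    return {name: model.fit(X_tr, y_tr).score(X_te, y_te)
            for name, model in models.items()}
```

The stratified split keeps the rating distribution the same in both halves, so the reported accuracies are comparable across models.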

This basic tutorial is aimed at building your understanding of textual data as well as machine learning algorithms, serving as a starting point for text classification and other text-related tasks!

