Reducing large datasets for ML tasks while maintaining label ratio

Abhigya Verma
3 min read · Jan 10, 2023


More often than not, we come across datasets with an extremely large number of data points. Without reducing the size of such data, most algorithms will demand large amounts of processing power, storage, memory, and compute. These requirements may not be feasible for every individual, especially students who are working with such data for practice or learning purposes.

Data reduction while maintaining the label ratio

In such a situation, the question arises: how can we reduce the data? The most common approach that comes to mind is to shrink the dataset by reducing the number of data points.

Simply removing or cutting off a portion of the dataset can dramatically affect the accuracy and results of the machine learning process, so I propose this extremely simple but effective approach to dividing the data and reducing its size.

The following are the practical steps taken to reduce the large dataset from a Kaggle competition, Goodreads Book Review. The data is extremely large, with 900,000 data points, and the key feature used for processing and machine learning prediction is textual, which makes it even more difficult and tedious to work with.

STEP 1: Load the necessary libraries

The process starts with the obvious step of loading the Python packages and modules that we will need.
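The original post shows this step as an image; here is a minimal sketch of the imports the later steps rely on, assuming pandas for data handling and scikit-learn for the optional split in Step 2:

```python
# pandas for loading and manipulating the CSV data
import pandas as pd

# train_test_split for the optional manual split shown in Step 2
from sklearn.model_selection import train_test_split
```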

STEP 2: Load and explore the data

The next key step is to load the data from the CSV files and check their size and columns. The following code depicts how this process can easily be done:
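A sketch of how this might look (the file names are assumptions; substitute the paths of the actual competition files):

```python
# Load the training and test sets provided by the competition.
# File names are placeholders; adjust them to your local paths.
train = pd.read_csv("goodreads_train.csv")
test = pd.read_csv("goodreads_test.csv")

# Check the size and columns of the training data.
print(train.shape)    # e.g. (900000, n_columns)
print(train.columns)
print(train.head())
```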

In this sample code, we load the training and test data that are already provided. Sometimes, however, a dataset comes as a single file that has to be split into training and test sets yourself. For this example, we will use only the training data to demonstrate that second method and its procedure.
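For that single-file case, scikit-learn's train_test_split can do the split; passing stratify keeps the label ratio identical in both halves. The 80/20 split and fixed seed below are assumptions, not from the post:

```python
# Split a single file into training and test sets.
# stratify preserves the "rating" ratio in both halves.
data = pd.read_csv("goodreads_train.csv")
train, test = train_test_split(
    data,
    test_size=0.2,          # assumed 80/20 split
    random_state=42,        # fixed seed for reproducibility
    stratify=data["rating"],
)
```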

STEP 3: Sort and separate the data based on the labels

This is the most crucial step. In this data, the label, or dependent variable, is the “rating” column, which has six possible values, 0 to 5.

In this part of the code, we find the possible values of the label and separate the data on that basis.
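The original code is embedded as an image; a sketch of the idea, using a dictionary of per-label DataFrames rather than six separate variables, could look like this:

```python
# Find the distinct values the label takes.
labels = sorted(train["rating"].unique())
print(labels)  # expected: [0, 1, 2, 3, 4, 5]

# Build one DataFrame per label value.
subsets = {label: train[train["rating"] == label] for label in labels}
```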

Each of these per-label subsets contains all the data points with one value of the label.

STEP 4: Reduction

In our case, reduction is needed to run the models effectively, so we decided to reduce the data from 900,000 points to 9,000, a 1/100 reduction, so that all models can easily be run on it. This is where the importance of separating the data by label becomes clear: if we simply truncate the data and keep the points at the top, there is a high probability that the label ratios will change, and the whole point of the study will be lost, since training has to be done on the original ratio and distribution of the data.

Hence, even though we are reducing the data, the reduced set should imitate the label distribution of the original.

The 1/100 reduction is therefore applied individually to each per-label subset of the data.
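In pandas this can be expressed with DataFrame.sample; a sketch, assuming the subsets dictionary from Step 3 and a fixed seed for reproducibility:

```python
# Keep 1/100 of every per-label subset, so each label shrinks
# by the same factor and the overall ratio is preserved.
reduced = {
    label: df.sample(frac=0.01, random_state=42)
    for label, df in subsets.items()
}
```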

STEP 5: Combine the data

Now the separate reduced subsets need to be merged back into one dataset shaped like the original.
Apart from merging, the data also needs to be shuffled so that rows with the same label do not all occur together.
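A sketch of the merge and shuffle, with a final check that the label ratio survived (variable names continue the earlier sketches):

```python
# Merge the reduced per-label subsets into a single dataset.
small_train = pd.concat(reduced.values())

# Shuffle so rows with the same label no longer sit together.
small_train = small_train.sample(frac=1, random_state=42).reset_index(drop=True)

# Sanity check: label proportions should match the original data.
print(small_train["rating"].value_counts(normalize=True))
print(train["rating"].value_counts(normalize=True))
```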

Finally, after all the above steps, you have a reduced dataset that maintains the same label ratio as the original data.

The full code can be found on my GitHub.

I will soon be publishing another blog post working with this textual data and applying state-of-the-art algorithms to it. Stay tuned!

And yes, if you liked the tutorial, please do like and share!
