Keynote | Technical | talkReduce()

Unbalanced data: Same algorithms, different techniques

Friday 17th | 14:50 - 15:00 | Theatre 18

One-liner summary:

Unbalanced data, where one class is far rarer than the others, appears commonly in nature. Applying machine learning to this kind of data is difficult, and the problem is usually addressed with reduction techniques that discard part of the data. The proposed machine learning algorithm does not need to remove data to extract information. It is based on a combination of small algorithms. The strategy is similar to Random Forest, but with some new attributes designed so that the algorithm can take advantage of data that were not used before.

Keywords defining the session:

- Unbalanced-data

- Machine-Learning

The idea, as in other algorithms of this family, is to divide the training data set into small pieces and train a different algorithm on each piece. Algorithms of this kind are usually based on a majority vote: when a new data point arrives, every small algorithm is asked for a prediction and the most-voted class is selected.

But why the majority? Some data configurations and some problems need to be treated differently. A good example is illness detection versus stock trading. In the first problem you need to be sure that you detect all ill people, and producing some false positives is less important. In stock trading, on the other hand, missing some good signals to enter the market is not important, but it is necessary to keep false positives low. For the first class of problem it is therefore better to classify a patient as ill even when fewer than the majority of the algorithms say so, while in the second case it could be useful to require more small algorithms to predict a good signal before entering the market. Unbalanced data belongs to the first kind of problem.

What this algorithm does is evaluate each training data point by asking all the small trees, in order to obtain a score. For example, if the original dataset has been divided into 5 subspaces, the score is an integer between 0 and 5 representing the number of trees that predict a certain class. This number is then used as a feature to generate a new dataset with just one feature (this weight) plus the class of the data point. With this new dataset we can train another algorithm to determine the optimal weight for predicting one class or the other. The result is a “new algorithm” that is able to work on unbalanced data because it detects the need to promote the minority class and sets a weight threshold accordingly.
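
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation presented in the talk: the "small algorithms" are depth-1 trees (stumps), the second-stage learner is a simple threshold on the vote count chosen by balanced accuracy, and the data are synthetic. All names (`train_stump`, `vote_score`, `fit_vote_threshold`) are hypothetical.

```python
import random

def train_stump(rows):
    """Fit a depth-1 tree (assumed stand-in for a 'small algorithm')."""
    best = None
    for f in range(len(rows[0][0])):
        for thr in sorted({x[f] for x, _ in rows}):
            for positive_above in (True, False):
                # Count training points this split classifies correctly.
                correct = sum(((x[f] >= thr) == positive_above) == (y == 1)
                              for x, y in rows)
                if best is None or correct > best[0]:
                    best = (correct, f, thr, positive_above)
    _, f, thr, positive_above = best
    return lambda x: int((x[f] >= thr) == positive_above)

def vote_score(models, x):
    """Number of small models that predict the positive (minority) class."""
    return sum(m(x) for m in models)

def balanced_accuracy(scores, labels, t):
    """Mean of the per-class recalls when predicting positive iff score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
    return 0.5 * (tp / pos + tn / neg)

def fit_vote_threshold(scores, labels, k):
    """Second stage (assumed criterion): pick the vote count that maximizes
    balanced accuracy on the one-feature dataset (score, class)."""
    return max(range(1, k + 1),
               key=lambda t: balanced_accuracy(scores, labels, t))

# Synthetic unbalanced data (assumption): 200 negatives, 20 positives.
rng = random.Random(0)
rows = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(200)]
rows += [((rng.gauss(2, 1), rng.gauss(0, 1)), 1) for _ in range(20)]
rng.shuffle(rows)

k = 5                                         # number of pieces, as in the example
pieces = [rows[i::k] for i in range(k)]       # divide the training set
models = [train_stump(piece) for piece in pieces]

labels = [y for _, y in rows]
scores = [vote_score(models, x) for x, _ in rows]   # integer between 0 and k
threshold = fit_vote_threshold(scores, labels, k)

def predict(x):
    """Predict the minority class when enough small models vote for it."""
    return int(vote_score(models, x) >= threshold)
```

Because the majority threshold (3 of 5 here) is one of the candidates, the learned threshold can never score worse than majority voting on the training data under the chosen criterion. A different criterion, for example one that penalizes false positives more heavily, would fit the stock trading case described above.
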
This method has been tested on several unbalanced datasets, including some from Kaggle public files. It always achieves a more than acceptable score in comparison with other solutions. Another good characteristic is that the algorithm does not only work on unbalanced data: it has been tested on several other datasets, and in those tests it always obtained better accuracy than Random Forest. This is remarkable because Random Forest is a very common and very powerful algorithm, and these modifications open the possibility of improving it and making it less dependent on the nature of the data. In other words, the method gives the algorithm more autonomy to work with different kinds of data, such as unpredictable unbalanced data.