
What’s Cooking Text Classification Competition

Paulo Cezar Lacerda Neto

Instituto de Computação – Universidade Federal Fluminense (UFF) R. Passo da Pátria, 156 – 24210-240 – Niterói – RJ – Brazil

[email protected]

Abstract. This paper describes my participation in a text classification competition as part of the activities of a data mining class I attended at Universidade Federal Fluminense in 2015. The competition challenged participants to classify the type of cuisine of 9,944 recipes using a training set of 39,774 recipes. I solved the classification problem with a machine learning classifier based on the Naïve Bayes method. The results showed a good initial accuracy and some opportunities for improvement.

1. Introduction

Text classification, a topic in the areas of natural language processing and data mining, is a task with many real-world applications, such as spam detection, classification of patient clinical records, and classification of newspaper articles and scientific papers. Kaggle, an online community of data scientists, promotes public data mining contests. Competitions involving such interesting subjects are a great chance for students to put into practice the concepts learned in data mining courses, so the platform has received a lot of attention from the community.

What's Cooking [Kaggle 2015], a text classification competition, challenged participants to classify recipes into a single type of cuisine, based on a training data set. The competition rules determine that each participant is allowed to submit a maximum of 5 entries per day, and submissions are evaluated by accuracy, i.e. the proportion of recipes classified correctly. The organization provided three files: the first is a training data set with 39,774 recipes, each one categorized as one of the 20 available cuisine types; the second is a test data set containing a list of 9,944 recipes to be classified by the participants; and the third is a sample submission file in CSV format. The training and test files are in JSON format (see Table 1); the difference between the two is that the test data does not include the cuisine type, which is the class the participant needs to predict.

Table 1. Sample training data

{ "id": 24717, "cuisine": "indian", "ingredients": [ "onions", "spinach", "sweet potatoes" ]},


2. Solution Approach

The chosen approach to solve the text classification problem was to use a supervised machine learning method to classify each recipe according to its ingredients. The first step was to choose the machine learning algorithm, a decision described in Section 3 of this document. After selecting the algorithm, it was necessary to create a program to implement the method. Given the idea of using a supervised machine learning method, the processing activities were divided into two main steps: training and classification, as shown in Figure 1.

Figure 1. Training and Classification Steps

The training step (see Figure 2) is where the program receives the training data and builds a vocabulary with all the words that make up the recipes, to be used as the basis for the bag-of-words representation. Then, for each recipe, it extracts the features from the text to create a feature vector, and finally it feeds the machine learning algorithm with the feature vector and the related class to train it. A sketch of these two building blocks follows Figure 2.

Figure 2. Training Step
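To make the training pipeline concrete, the fragment below sketches how the vocabulary and the bag-of-words feature vectors could be built. The name createFeatVector mirrors the function used in the code fragments shown later; buildVocabulary is a hypothetical helper, and the whole fragment is an illustrative reconstruction, not the repository's exact implementation:

import numpy as np

def buildVocabulary(trainrecipes):
    # Collects every word that appears in any ingredient of any recipe.
    words = set()
    for recipe in trainrecipes:
        for ingredient in recipe['ingredients']:
            words.update(ingredient.lower().split())
    return sorted(words)

def createFeatVector(vocabulary, ingredients):
    # Bag-of-words: one counter per vocabulary word.
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = np.zeros(len(vocabulary))
    for ingredient in ingredients:
        for word in ingredient.lower().split():
            if word in index:
                vector[index[word]] += 1
    return vector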


After learning how to classify recipes from the training data, the algorithm produces a classifier model to be used in the second step of the process.

Figure 3. Classification Step

The classification step (see Figure 3) is where new recipes, i.e. recipes whose classes are unknown, are classified based on what the algorithm learned in the first step. After the recipes are classified, they can be submitted to Kaggle's website as a CSV file, as sketched below.
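Once each test recipe has a predicted cuisine, the result can be written out with Python's csv module. A minimal sketch, assuming the id,cuisine column layout of the provided sample submission file (writeSubmission is a hypothetical helper):

import csv

def writeSubmission(filename, ids, predicted_cuisines):
    # One row per test recipe: its id and the predicted cuisine type.
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'cuisine'])
        for recipe_id, cuisine in zip(ids, predicted_cuisines):
            writer.writerow([recipe_id, cuisine])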

3. Choosing the Machine Learning Method

In order to decide which method to use in this project, some research was done on the most common machine learning methods used in text classification. Based on a survey by [Aggarwal and Zhai 2012], the following methods were considered: Naïve Bayes, Logistic Regression, k-NN, Neural Networks, SVM and Ensembles. The guiding principle was that the chosen method should have a track record in the text classification field and also be easy to implement, so that an initial result and a baseline could be obtained quickly.

During this research, a reasonable number of mentions of Naïve Bayes in text classification and natural language processing were found. According to [Jurafsky and Martin 2009], Naïve Bayes is commonly employed as a good baseline method, often yielding results that are sufficiently good for practical use. Besides that, good code samples of Naïve Bayes implementations are available in machine learning textbooks such as [Harrington 2012].

The Naïve Bayes method was chosen among the alternatives mainly because of its simplicity and because it is considered a good baseline. However, it is important to notice that this method makes the naïve assumption of independence between the conditional probabilities of the feature values given a class, which may not work well for cuisine types (classes) where one ingredient depends on another, so it is important to evaluate other machine learning methods in order to get the best results in the competition.


4. Implementing the Naïve Bayes Algorithm

After selecting the machine learning algorithm, it was time to build the program responsible for reading the input files, using Naïve Bayes (NB) to learn and classify the recipes, and generating an output file with the classified recipes to be submitted to Kaggle's website. The following formula states Bayes' theorem in terms of a class c and a recipe r:

P(c|r) = P(r|c) P(c) / P(r)

The formula above shows how to calculate the probability of a class c given a recipe r, so the Naïve Bayes classifier's job is to find the class with the greatest probability P(c|r), which will be the class assigned to the recipe r. Since P(r) is the same for every class, it can be dropped, leading to the following simplification used by the algorithm:

c* = argmax_c P(c|r) = argmax_c P(r|c) P(c)

In the last equation, P(c) is the prior probability of class c and P(r|c) is the probability of a recipe r given a class c. Since Naïve Bayes uses the bag-of-words approach to represent feature vectors, P(r|c) can be written as P(r|c) = P(w1, w2, w3, ..., wn | c), and assuming that P(w1|c), P(w2|c), P(w3|c), ..., P(wn|c) are independent (the naïve part of the method), P(r|c) can be calculated as P(w1|c) × P(w2|c) × P(w3|c) × ... × P(wn|c). The training step of the NB classifier therefore calculates the a priori probabilities: P(c) for each class and the probability P(wn|c) of each word of the vocabulary given a class c.

A program in Python was built to implement the Naïve Bayes algorithm; it is available in a GitHub repository (https://github.com/placerda/whatscooking). Python was chosen because it is a simple yet powerful language for machine learning tasks and also works well with vectors (NumPy module). The program has two main functions: trainNB and classifyNB. Table 2 and Table 3 show fragments of both, respectively, with comments explaining their logic. The first function is responsible for training the Naïve Bayes classifier, creating the vectors of a priori probabilities; the second classifies the recipes with unknown cuisine type.

Table 2. Naïve Bayes Training Function

def trainNB(trainrecipes, vocabulary, classes):
    ...
    # Laplace smoothing (add-one numerator)
    numeratorPwc = np.array([[1.0] * len(vocabulary)] * len(classes))

    # Laplace smoothing (denominator)
    denominatorPwc = np.array([len(classes)] * len(vocabulary))
    ...


    # Builds the p(c) and p(w|c) vectors for each class
    for recipe in trainrecipes:
        ...
        # p(c): counts recipes per class
        pc[classIndex] += 1
        # p(w|c): accumulates word counts per class
        recipeFeatVector = createFeatVector(vocabulary, recipe['ingredients'])
        numeratorPwc[classIndex] += recipeFeatVector
        denominatorPwc += recipeFeatVector

    # Normalizes the class counts into the prior probabilities p(c)
    pc = pc / float(sum(pc))
    # Calculates p(w|c) using log to avoid underflow problems
    pwc = np.log(numeratorPwc / denominatorPwc)
    return pc, pwc

Table 3. Naïve Bayes Classifier Function

def classifyNB(pc, pwc, ingredFeatVector):
    # For each class, calculates p(c|w), the (log-space) probability of a
    # class c given a recipe w (represented as a feature vector)
    ...
    for i in range(len(pcw)):
        pcw[i] = sum(pwc[i] * ingredFeatVector) + np.log(pc[i])

    # Gets the index of the maximum p(c|w) value, which defines the best
    # class for this recipe
    maxClass = pcw.tolist().index(max(pcw))
    return maxClass
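For context, the sketch below shows how the two functions fit together, using the signatures from Tables 2 and 3; the vocabulary, classes and recipe lists are assumed to come from the loading and feature-extraction sketches shown earlier:

# Train on the labeled recipes, then classify one unlabeled recipe.
pc, pwc = trainNB(train_recipes, vocabulary, classes)

featVector = createFeatVector(vocabulary, test_recipes[0]['ingredients'])
classIndex = classifyNB(pc, pwc, featVector)
print(classes[classIndex])  # predicted cuisine for the first test recipe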

5. Results

After implementing the Naïve Bayes program, it was time to submit its output to Kaggle's website, but before doing that it was important to check the algorithm's accuracy on the training data. To evaluate the algorithm before submitting its results, 10-fold cross-validation was used: a program was created to split the 39,774 recipes of the training data set into 10 folds. For each fold, the cross-validation procedure trains the classifier with the data outside the fold and uses the fold as the test set to calculate that fold's accuracy; after doing this for the 10 folds, it calculates the average accuracy, as sketched below. The 10-fold cross-validation reported an accuracy of 0.57862. After this check, the program output with all 9,944 classified recipes was submitted in a CSV file to the competition's site.
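A minimal sketch of the 10-fold procedure described above, reusing trainNB, classifyNB and createFeatVector from Section 4 (crossValidate itself is a hypothetical helper, not the repository's code):

def crossValidate(recipes, vocabulary, classes, k=10):
    # Splits the training data into k folds and averages the fold accuracies.
    folds = [recipes[i::k] for i in range(k)]
    accuracies = []
    for i, fold in enumerate(folds):
        # Train on everything outside fold i, test on fold i.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        pc, pwc = trainNB(train, vocabulary, classes)
        hits = 0
        for recipe in fold:
            vec = createFeatVector(vocabulary, recipe['ingredients'])
            if classes[classifyNB(pc, pwc, vec)] == recipe['cuisine']:
                hits += 1
        accuracies.append(hits / float(len(fold)))
    return sum(accuracies) / len(accuracies)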


Figure 4. Kaggle’s Leaderboard

When Kaggle receives the results, the website calculates the score of the submission so it can be ranked in the competition's leaderboard, as shown in Figure 4. The score is the accuracy calculated from the submission and the actual classes of the recipes. The score obtained in the first submission was 0.57623, which was close to the accuracy obtained with the 10-fold cross-validation.

6. Conclusion and Next Steps

The results obtained with the first submission were not at the top of the ranking, nor did they show a high accuracy. This was not a surprise, since it was just the first submission and no data pre-processing was done before training, classification and submission; the first submission is only the beginning of the process of competing in a challenge like this, and there is room for improvement.

One possible improvement is related to the naïve assumption of the machine learning method used, because depending on the cuisine (class) there may be ingredients that influence other ingredients, meaning that the conditional probabilities P(wn|c) are not entirely independent of each other. For example, based on expert judgment, a Brazilian recipe that contains beans has a great chance of also including rice, so P(w=beans|c=Brazilian) and P(w=rice|c=Brazilian) may have some degree of dependency; with expert advice it would be possible to manually add weighting rules so that, during the learning phase, some features receive a greater weight depending on the cuisine.

Another way to improve accuracy is to do some data preparation work, including data normalization and stemming, as illustrated by the sketch at the end of this section. Some ingredients can be reduced to the same token; for example, "50% less sodium black beans" and "black beans" should arguably map to the same ingredient.

Finally, based on the observations made during the research phase, it is worth mentioning other algorithms that are good candidates to improve accuracy. Experiments by [Li and Yang 2003] showed that methods like SVM and kNN are good options for text classification.
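To illustrate the normalization idea above, the sketch below applies simple rule-based cleaning that strips marketing qualifiers before tokenization. The qualifier list is a hypothetical example, not taken from the repository:

import re

# Hypothetical qualifiers that add no culinary information.
QUALIFIERS = re.compile(r'\b(\d+% less sodium|low-fat|reduced fat|fresh|organic)\b')

def normalizeIngredient(ingredient):
    # Lowercases, drops qualifiers and stray punctuation, collapses whitespace.
    text = QUALIFIERS.sub('', ingredient.lower())
    text = re.sub(r'[^a-z ]', ' ', text)
    return ' '.join(text.split())

print(normalizeIngredient('50% less sodium black beans'))  # -> black beans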


References

Aggarwal, C. C. and Zhai, C. (2012), Mining Text Data, Chap. 6, Springer Publishing Company, Incorporated.

Harrington, P. (2012), Machine Learning in Action, Manning Publications Co., Greenwich, CT.

Jurafsky, D. and Martin, J. H. (2009), Speech and Language Processing (2nd Edition), Prentice-Hall, Inc., Upper Saddle River, NJ.

Kaggle (2015), "What's Cooking? Use recipe ingredients to categorize the cuisine", https://www.kaggle.com/c/whats-cooking. Retrieved November 23, 2015.

Li, F. and Yang, Y. (2003), "A Loss Function Analysis for Classification Methods in Text Categorization", Carnegie Mellon University, Pittsburgh, PA.