What's cooking

Posted on Thu 05 April 2018   •   13 min read

What's cooking - Part 1: Exploratory data analysis

In [6]:
import pandas as pd

In this post I explore the dataset to understand what kind of data we are dealing with. Getting to know the data helps with cleaning it up during preprocessing, so that the dataset is ready for classification.

Let's read the data:

In [7]:
df_train = pd.read_json('train.json')
In [8]:
df_train.head()
Out[8]:
cuisine id ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...
3 indian 22213 [water, vegetable oil, wheat, salt]
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...

How are the different cuisines represented in the dataset?

In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [10]:
df_train['cuisine'].value_counts().plot(kind='bar')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x23069cc8588>
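
The bar chart shows absolute counts. Since the data is meant for classification, it can also help to look at each cuisine's share of the recipes, which makes class imbalance easier to read. A minimal sketch on a tiny made-up frame (`toy` and its values are hypothetical stand-ins for `df_train`, not the real data):

```python
import pandas as pd

# Hypothetical toy data standing in for df_train; the real data comes from train.json.
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'italian', 'mexican', 'greek'],
})

# normalize=True returns each cuisine's share of all recipes instead of raw counts.
shares = toy['cuisine'].value_counts(normalize=True)
print(shares)
```

On the real frame the same call would be `df_train['cuisine'].value_counts(normalize=True)`.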

What are the most common ingredients?

In [11]:
from collections import Counter
In [12]:
counters = {}
for cuisine in df_train['cuisine'].unique():
    counters[cuisine] = Counter()
    indices = (df_train['cuisine'] == cuisine)
    for ingredients in df_train[indices]['ingredients']:
        counters[cuisine].update(ingredients)
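
The loop above builds one `Counter` per cuisine. The same counts can probably be obtained with pandas alone by putting every ingredient on its own row first. This is only a sketch, assuming pandas 0.25 or newer (for `explode`) and using a made-up `toy` frame instead of `df_train`:

```python
import pandas as pd

# Hypothetical toy data with the same two columns the loop above relies on.
toy = pd.DataFrame({
    'cuisine': ['greek', 'italian', 'italian'],
    'ingredients': [['salt', 'feta cheese crumbles'],
                    ['salt', 'olive oil'],
                    ['olive oil', 'garlic']],
})

# explode() turns every list element into its own row, repeating the cuisine.
long_form = toy.explode('ingredients')

# Count how often each ingredient occurs within each cuisine.
counts = long_form.groupby('cuisine')['ingredients'].value_counts()
print(counts.loc['italian'])
```

The result is a Series indexed by (cuisine, ingredient), so `counts.loc['italian']` plays the role of `counters['italian']`.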

What are the most used ingredients in the different cuisines?

In [13]:
counters['italian'].most_common(10)
Out[13]:
[('salt', 3454),
 ('olive oil', 3111),
 ('garlic cloves', 1619),
 ('grated parmesan cheese', 1580),
 ('garlic', 1471),
 ('ground black pepper', 1444),
 ('extra-virgin olive oil', 1362),
 ('onions', 1240),
 ('water', 1052),
 ('butter', 1030)]
In [14]:
top10 = pd.DataFrame([[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
            index=[cuisine for cuisine in counters],
            columns=['top{}'.format(i) for i in range(1, 11)])
display(top10.head(8))
cuisine | top1 | top2 | top3 | top4 | top5 | top6 | top7 | top8 | top9 | top10
greek | salt | olive oil | dried oregano | garlic cloves | feta cheese crumbles | extra-virgin olive oil | fresh lemon juice | ground black pepper | garlic | pepper
southern_us | salt | butter | all-purpose flour | sugar | large eggs | baking powder | water | unsalted butter | milk | buttermilk
filipino | salt | garlic | water | onions | soy sauce | pepper | oil | sugar | carrots | ground black pepper
indian | salt | onions | garam masala | water | ground turmeric | garlic | cumin seed | ground cumin | vegetable oil | oil
jamaican | salt | onions | water | garlic | ground allspice | pepper | scallions | dried thyme | black pepper | garlic cloves
spanish | salt | olive oil | garlic cloves | extra-virgin olive oil | onions | water | tomatoes | ground black pepper | red bell pepper | pepper
italian | salt | olive oil | garlic cloves | grated parmesan cheese | garlic | ground black pepper | extra-virgin olive oil | onions | water | butter
mexican | salt | onions | ground cumin | garlic | olive oil | chili powder | jalapeno chilies | sour cream | avocado | corn tortillas
In [15]:
df_train['all_ingredients'] = df_train['ingredients'].map(";".join)
In [16]:
df_train.head()
Out[16]:
cuisine id ingredients all_ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes... romaine lettuce;black olives;grape tomatoes;ga...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g... plain flour;ground pepper;salt;tomatoes;ground...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g... eggs;pepper;salt;mayonaise;cooking oil;green c...
3 indian 22213 [water, vegetable oil, wheat, salt] water;vegetable oil;wheat;salt
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe... black pepper;shallots;cornflour;cayenne pepper...

Check garlic cloves as a differentiator

In [17]:
indices = df_train['all_ingredients'].str.contains('garlic cloves')
df_train[indices]['cuisine'].value_counts().plot(kind='bar',
                                                 title='garlic cloves as found per cuisine')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x2306f0261d0>
In [18]:
# Share of each cuisine's recipes that contain garlic cloves
relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
relative_freq = relative_freq.sort_values()
relative_freq.plot(kind='bar')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x2306f1cfa90>
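
One detail about dividing two `value_counts()` results: pandas aligns them by cuisine name, so a cuisine with no matching recipe ends up as NaN rather than 0. A possible alternative, sketched here on a made-up `toy` frame (the column `has_garlic_cloves` is a name I introduce for illustration), is to take the mean of a boolean flag per cuisine:

```python
import pandas as pd

# Hypothetical toy data; 'all_ingredients' mimics the ';'-joined column above.
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'indian', 'mexican'],
    'all_ingredients': ['garlic cloves;salt', 'olive oil;salt',
                        'garlic cloves;garam masala', 'avocado;salt'],
})

# One boolean per recipe; regex=False makes this a plain substring search.
toy['has_garlic_cloves'] = toy['all_ingredients'].str.contains('garlic cloves', regex=False)

# The mean of a boolean column is the share of True values, so cuisines
# without any match get 0.0 instead of NaN.
relative_freq = toy.groupby('cuisine')['has_garlic_cloves'].mean().sort_values()
print(relative_freq)
```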
In [19]:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
Out[19]:
array(['all-purpose flour', 'avocado', 'baking powder', 'baking soda',
       'black pepper', 'butter', 'buttermilk', 'cachaca',
       'cajun seasoning', 'carrots', 'cayenne pepper', 'chili powder',
       'coconut milk', 'corn starch', 'corn tortillas', 'cumin seed',
       'dried oregano', 'dried thyme', 'eggs', 'extra-virgin olive oil',
       'feta cheese crumbles', 'fish sauce', 'fresh lemon juice',
       'fresh lime juice', 'garam masala', 'garlic', 'garlic cloves',
       'ginger', 'grated parmesan cheese', 'green bell pepper',
       'green onions', 'ground allspice', 'ground black pepper',
       'ground cinnamon', 'ground cumin', 'ground ginger',
       'ground turmeric', 'jalapeno chilies', 'large eggs', 'lime',
       'milk', 'mirin', 'oil', 'olive oil', 'onions', 'paprika', 'pepper',
       'potatoes', 'red bell pepper', 'rice vinegar', 'sake', 'salt',
       'scallions', 'sesame oil', 'sesame seeds', 'shallots',
       'sour cream', 'soy sauce', 'sugar', 'tomatoes', 'unsalted butter',
       'vegetable oil', 'water'], dtype=object)
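
The array above only says which ingredients appear in at least one top-10 list. To get a feeling for which of them might separate cuisines, one could count in how many lists each ingredient shows up. A rough sketch with made-up short lists (`top_by_cuisine` is hypothetical and much smaller than `top10`):

```python
from collections import Counter

# Made-up top-3 lists per cuisine, standing in for the rows of top10.
top_by_cuisine = {
    'greek': ['salt', 'olive oil', 'feta cheese crumbles'],
    'italian': ['salt', 'olive oil', 'grated parmesan cheese'],
    'indian': ['salt', 'onions', 'garam masala'],
}

# Count in how many cuisines each ingredient makes the top list.
cuisine_count = Counter(
    ingredient
    for ingredients in top_by_cuisine.values()
    for ingredient in set(ingredients)
)

# Ingredients that are "top" for exactly one cuisine are candidate differentiators.
distinctive = sorted(name for name, n in cuisine_count.items() if n == 1)
print(distinctive)
```

An ingredient like salt, which is in every list, carries little information, while one that appears in a single list is worth a closer look.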
In [32]:
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ingredient, ax_index in zip(unique, range(4)):
    indices = df_train['all_ingredients'].str.contains(ingredient)
    relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
    relative_freq.plot(kind='bar', ax=axes.ravel()[ax_index], fontsize=7, title=ingredient)
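
A caveat on `str.contains`: it does substring matching (and treats the pattern as a regular expression by default), so a short name such as `garlic` also matches recipes that only list `garlic cloves`. If exact ingredient names are wanted, testing membership in the original list column is one option. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same columns as df_train after the join step.
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'indian'],
    'ingredients': [['garlic cloves', 'salt'],
                    ['garlic', 'olive oil'],
                    ['garlic', 'garam masala']],
})
toy['all_ingredients'] = toy['ingredients'].map(';'.join)

# Substring search: 'garlic' also hits the recipe that only has 'garlic cloves'.
substring_hits = toy['all_ingredients'].str.contains('garlic', regex=False)

# Exact membership test against the list of ingredients.
exact_hits = toy['ingredients'].map(lambda items: 'garlic' in items)

print(int(substring_hits.sum()), int(exact_hits.sum()))
```

Which of the two is preferable depends on whether variants like `garlic` and `garlic cloves` should count as the same ingredient.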