2018-04-05
What's Cooking
Exploratory data analysis of Kaggle's "What's Cooking?" competition dataset, to understand what kind of data we are dealing with and to build intuition about the dependencies in it.
In this post I explore the dataset from Kaggle's "What's Cooking?" competition to understand what kind of data we are dealing with. Getting to know the data helps with cleaning it up during preprocessing, which prepares the dataset for classification.
In [9]:
import pandas as pd
Reading the data
In [10]:
df_train = pd.read_json("train.json")
In [11]:
df_train.head()
Out[11]:
 | cuisine | id | ingredients |
---|---|---|---|
0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |
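Before looking at individual ingredients, a few basic checks are useful: how many recipes and cuisines there are, whether any values are missing, and how long the ingredient lists are. The sketch below is an illustration only; `sample` is a hypothetical three-row stand-in with the same columns as `train.json`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for train.json (same three columns).
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "cuisine": ["greek", "indian", "indian"],
        "ingredients": [
            ["salt", "olive oil", "feta cheese crumbles"],
            ["water", "vegetable oil", "wheat", "salt"],
            ["salt", "garam masala"],
        ],
    }
)

n_recipes = len(sample)                         # number of rows (recipes)
n_cuisines = sample["cuisine"].nunique()        # number of distinct labels
n_ingredients = sample["ingredients"].map(len)  # ingredients per recipe
missing = sample.isna().sum()                   # missing values per column

print(n_recipes, n_cuisines)
print(n_ingredients.describe())
print(missing)
```

On the real data the same lines would run against `df_train` instead of `sample`.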
How are the different cuisines represented in the dataset?
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
In [18]:
df_train["cuisine"].value_counts().plot(kind="barh", figsize=(8, 6));
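Raw counts show the ranking of the classes; for a classification task it can also help to see each class as a share of the whole. The following sketch uses `value_counts(normalize=True)` on a hypothetical six-element `cuisines` series, which stands in for `df_train["cuisine"]`; the `imbalance` ratio is an illustrative summary of my own, not something the competition defines.

```python
import pandas as pd

# Hypothetical labels standing in for df_train["cuisine"].
cuisines = pd.Series(
    ["italian", "italian", "italian", "mexican", "mexican", "greek"],
    name="cuisine",
)

# Share of each cuisine instead of the raw count.
shares = cuisines.value_counts(normalize=True)

# Ratio between the largest and the smallest class (illustrative measure).
imbalance = shares.max() / shares.min()

print(shares)
print(imbalance)
```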
What are the most common ingredients?
In [19]:
from collections import Counter
In [20]:
# Build one Counter of ingredient occurrences per cuisine.
counters = {}
for cuisine in df_train["cuisine"].unique():
    counters[cuisine] = Counter()
    indices = df_train["cuisine"] == cuisine
    for ingredients in df_train[indices]["ingredients"]:
        counters[cuisine].update(ingredients)
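The per-cuisine counters can also be combined into a single overall ranking, since `Counter` objects support addition. The sketch below rebuilds the counters on a hypothetical three-recipe `sample` frame and cross-checks them against a pandas-only variant based on `DataFrame.explode` (available from pandas 0.25). It is one possible way to do this, not necessarily how the original notebook proceeded.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature stand-in for df_train.
sample = pd.DataFrame(
    {
        "cuisine": ["greek", "indian", "indian"],
        "ingredients": [
            ["salt", "olive oil"],
            ["salt", "water"],
            ["salt", "garam masala", "water"],
        ],
    }
)

# Same idea as the loop above: one Counter per cuisine.
counters = {}
for cuisine, group in sample.groupby("cuisine"):
    counters[cuisine] = Counter()
    for ingredients in group["ingredients"]:
        counters[cuisine].update(ingredients)

# Adding the Counters gives the counts over all cuisines.
overall = sum(counters.values(), Counter())
print(overall.most_common(3))

# pandas-only variant: one row per (recipe, ingredient), then count.
by_cuisine = (
    sample.explode("ingredients").groupby("cuisine")["ingredients"].value_counts()
)
print(by_cuisine)
```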
What are the most used ingredients in different cuisines?
In [21]:
counters["italian"].most_common(10)
Out[21]:
[('salt', 3454), ('olive oil', 3111), ('garlic cloves', 1619), ('grated parmesan cheese', 1580), ('garlic', 1471), ('ground black pepper', 1444), ('extra-virgin olive oil', 1362), ('onions', 1240), ('water', 1052), ('butter', 1030)]
In [22]:
top10 = pd.DataFrame(
    [[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
    index=[cuisine for cuisine in counters],
    columns=["top{}".format(i) for i in range(1, 11)],
)
display(top10.head(8))
 | top1 | top2 | top3 | top4 | top5 | top6 | top7 | top8 | top9 | top10 |
---|---|---|---|---|---|---|---|---|---|---|
greek | salt | olive oil | dried oregano | garlic cloves | feta cheese crumbles | extra-virgin olive oil | fresh lemon juice | ground black pepper | garlic | pepper |
southern_us | salt | butter | all-purpose flour | sugar | large eggs | baking powder | water | unsalted butter | milk | buttermilk |
filipino | salt | garlic | water | onions | soy sauce | pepper | oil | sugar | carrots | ground black pepper |
indian | salt | onions | garam masala | water | ground turmeric | garlic | cumin seed | ground cumin | vegetable oil | oil |
jamaican | salt | onions | water | garlic | ground allspice | pepper | scallions | dried thyme | black pepper | garlic cloves |
spanish | salt | olive oil | garlic cloves | extra-virgin olive oil | onions | water | tomatoes | ground black pepper | red bell pepper | pepper |
italian | salt | olive oil | garlic cloves | grated parmesan cheese | garlic | ground black pepper | extra-virgin olive oil | onions | water | butter |
mexican | salt | onions | ground cumin | garlic | olive oil | chili powder | jalapeno chilies | sour cream | avocado | corn tortillas |
In [23]:
df_train["all_ingredients"] = df_train["ingredients"].map(";".join)
In [24]:
df_train.head()
Out[24]:
 | cuisine | id | ingredients | all_ingredients |
---|---|---|---|---|
0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce;black olives;grape tomatoes;ga... |
1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour;ground pepper;salt;tomatoes;ground... |
2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs;pepper;salt;mayonaise;cooking oil;green c... |
3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water;vegetable oil;wheat;salt |
4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper;shallots;cornflour;cayenne pepper... |
Check garlic cloves as a differentiator
In [29]:
indices = df_train["all_ingredients"].str.contains("garlic cloves")
df_train[indices]["cuisine"].value_counts().plot(
    kind="barh", title="garlic cloves per cuisine", figsize=(8, 6)
);
In [32]:
relative_freq = (
    df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
)
relative_freq.sort_values(inplace=True)
relative_freq.plot(kind="barh", figsize=(8, 6));
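One thing to keep in mind with `str.contains` on the joined string is that it matches substrings: a search for "garlic" also hits "garlic cloves" and "garlic powder". The sketch below contrasts that with an exact membership test on the original list column. The `sample` frame is hypothetical, and `regex=False` plus `fillna(0)` are my additions: the first treats the search term as literal text, the second replaces the NaN that the division produces for cuisines with no matching recipe.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_train.
sample = pd.DataFrame(
    {
        "cuisine": ["italian", "italian", "mexican", "mexican"],
        "ingredients": [
            ["garlic cloves", "salt"],
            ["garlic", "salt"],
            ["garlic powder"],
            ["salt"],
        ],
    }
)
sample["all_ingredients"] = sample["ingredients"].map(";".join)

# Substring match: also true for "garlic cloves" and "garlic powder".
substring = sample["all_ingredients"].str.contains("garlic", regex=False)

# Exact match: true only when "garlic" is an element of the list.
exact = sample["ingredients"].map(lambda items: "garlic" in items)

totals = sample["cuisine"].value_counts()
rel_substring = (sample[substring]["cuisine"].value_counts() / totals).fillna(0)
rel_exact = (sample[exact]["cuisine"].value_counts() / totals).fillna(0)

print(rel_substring)
print(rel_exact)
```

Which of the two is preferable depends on the question: the substring version groups ingredient variants together, while the exact version keeps them apart.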
In [35]:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
Out[35]:
array(['all-purpose flour', 'avocado', 'baking powder', 'baking soda', 'black pepper', 'butter', 'buttermilk', 'cachaca', 'cajun seasoning', 'carrots', 'cayenne pepper', 'chili powder', 'coconut milk', 'corn starch', 'corn tortillas', 'cumin seed', 'dried oregano', 'dried thyme', 'eggs', 'extra-virgin olive oil', 'feta cheese crumbles', 'fish sauce', 'fresh lemon juice', 'fresh lime juice', 'garam masala', 'garlic', 'garlic cloves', 'ginger', 'grated parmesan cheese', 'green bell pepper', 'green onions', 'ground allspice', 'ground black pepper', 'ground cinnamon', 'ground cumin', 'ground ginger', 'ground turmeric', 'jalapeno chilies', 'large eggs', 'lime', 'milk', 'mirin', 'oil', 'olive oil', 'onions', 'paprika', 'pepper', 'potatoes', 'red bell pepper', 'rice vinegar', 'sake', 'salt', 'scallions', 'sesame oil', 'sesame seeds', 'shallots', 'sour cream', 'soy sauce', 'sugar', 'tomatoes', 'unsalted butter', 'vegetable oil', 'water'], dtype=object)
In [59]:
# np.unique returns sorted values, so this plots the first four
# ingredients in alphabetical order on a 2x2 grid.
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
for ingredient, ax_index in zip(unique, range(4)):
    indices = df_train["all_ingredients"].str.contains(ingredient)
    relative_freq = (
        df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
    )
    relative_freq.plot(
        kind="barh", ax=axes.ravel()[ax_index], fontsize=24, title=ingredient
    );
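The same relative frequency can be computed for every ingredient at once instead of one plot at a time. The following is a minimal sketch, assuming exact ingredient names (not substring matches) and a hypothetical three-recipe `sample` frame; it builds a cuisine-by-ingredient table with `pd.crosstab` and divides each row by the number of recipes in that cuisine.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_train.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "cuisine": ["italian", "italian", "mexican"],
        "ingredients": [
            ["salt", "olive oil"],
            ["olive oil", "butter"],
            ["salt", "avocado"],
        ],
    }
)

# One row per (recipe, ingredient); drop repeats within a recipe so that
# each recipe is counted at most once per ingredient.
exploded = sample.explode("ingredients").drop_duplicates(["id", "ingredients"])

# Number of recipes per cuisine that contain each ingredient.
counts = pd.crosstab(exploded["cuisine"], exploded["ingredients"])

# Divide each cuisine's row by its total number of recipes.
recipes_per_cuisine = sample["cuisine"].value_counts()
relative = counts.div(recipes_per_cuisine, axis=0)

print(relative)
```

A column of `relative` with one large value and the rest near zero points to an ingredient that separates cuisines well.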
To cite this article:
@article{Saf2018What's, author = {Krystian Safjan}, title = {What's Cooking}, journal = {Krystian's Safjan Blog}, year = {2018}, }