April 05, 2018
What’s cooking
In this post I explore the dataset from Kaggle’s “What’s Cooking?” competition to understand what kind of data we are dealing with. Getting to know the data helps with cleaning it up during preprocessing and with preparing the dataset for classification.
In [9]:
import pandas as pd
Reading the data
In [10]:
df_train = pd.read_json("train.json")
In [11]:
df_train.head()
Out[11]:
How are the different cuisines represented in the dataset?
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
In [18]:
df_train["cuisine"].value_counts().plot(kind="barh", figsize=(8, 6));
What are the most common ingredients?
In [19]:
from collections import Counter
In [20]:
# Build one ingredient Counter per cuisine
counters = {}
for cuisine in df_train["cuisine"].unique():
    counters[cuisine] = Counter()
    indices = df_train["cuisine"] == cuisine
    for ingredients in df_train[indices]["ingredients"]:
        counters[cuisine].update(ingredients)
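The same per-cuisine counts can be sketched on a small, made-up table, which makes the result easy to check by eye. The `toy` DataFrame and the `toy_counters` name below are hypothetical and only mimic the `cuisine` and `ingredients` columns of `train.json`. The sketch also assumes a pandas version that provides `DataFrame.explode` (0.25 or later), which is an alternative to the explicit loop, not what the notebook itself uses.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data with the same two columns as train.json
toy = pd.DataFrame(
    {
        "cuisine": ["italian", "italian", "mexican"],
        "ingredients": [
            ["olive oil", "garlic cloves", "salt"],
            ["olive oil", "parmesan cheese"],
            ["salt", "corn tortillas"],
        ],
    }
)

# One row per (recipe, ingredient) pair, then count ingredients within each cuisine
exploded = toy.explode("ingredients")
toy_counters = {
    cuisine: Counter(group["ingredients"])
    for cuisine, group in exploded.groupby("cuisine")
}

print(toy_counters["italian"].most_common(1))
```

Either approach ends with a dictionary that maps each cuisine to a `Counter`, so `most_common` works the same way afterwards.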
What are the most used ingredients in different cuisines?
In [21]:
counters["italian"].most_common(10)
Out[21]:
In [22]:
# One row per cuisine, one column per rank (top1 ... top10)
top10 = pd.DataFrame(
    [[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
    index=[cuisine for cuisine in counters],
    columns=["top{}".format(i) for i in range(1, 11)],
)
display(top10.head(8))
In [23]:
df_train["all_ingredients"] = df_train["ingredients"].map(";".join)
In [24]:
df_train.head()
Out[24]:
Check garlic cloves as a differentiator
In [29]:
# Recipes whose joined ingredient string mentions "garlic cloves"
indices = df_train["all_ingredients"].str.contains("garlic cloves")
df_train[indices]["cuisine"].value_counts().plot(
    kind="barh", title="garlic cloves per cuisine", figsize=(8, 6)
);
In [32]:
# Share of each cuisine's recipes that contain garlic cloves
relative_freq = (
    df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
)
relative_freq.sort_values(inplace=True)
relative_freq.plot(kind="barh", figsize=(8, 6));
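To see what the division of the two `value_counts()` results does, here is a minimal sketch on made-up data. The `toy` table, `has_garlic` and `share` are hypothetical names. Unlike the cell above, the sketch tests list membership instead of a substring, and it fills missing values with 0: a cuisine with no matching recipe is absent from the numerator, so the index-aligned division gives `NaN` for it.

```python
import pandas as pd

# Hypothetical toy data; "indian" has no recipe with garlic cloves
toy = pd.DataFrame(
    {
        "cuisine": ["italian", "italian", "mexican", "indian"],
        "ingredients": [
            ["garlic cloves", "olive oil"],
            ["olive oil", "salt"],
            ["garlic cloves", "salt"],
            ["cumin", "salt"],
        ],
    }
)

# True where the recipe lists exactly "garlic cloves"
has_garlic = toy["ingredients"].map(lambda items: "garlic cloves" in items)

# Matching recipes per cuisine divided by all recipes per cuisine;
# the division aligns on the cuisine index, leaving NaN where nothing matched
share = toy.loc[has_garlic, "cuisine"].value_counts() / toy["cuisine"].value_counts()
share = share.fillna(0).sort_values()

print(share)
```

Whether to show absent cuisines as 0 or to leave them out of the plot is a presentation choice. The relative view matters because the raw counts mostly reflect how many recipes each cuisine has in the dataset.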
In [35]:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
Out[35]:
In [59]:
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
# Plot the first four ingredients from the top-10 table, one per subplot
for ingredient, ax in zip(unique, axes.ravel()):
    # regex=False matches the ingredient name as a literal substring
    indices = df_train["all_ingredients"].str.contains(ingredient, regex=False)
    relative_freq = (
        df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
    )
    relative_freq.plot(kind="barh", ax=ax, fontsize=24, title=ingredient)
Out[1]:
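One caveat with matching on the joined `all_ingredients` string is that a substring test also hits longer ingredient names that contain the search term. The sketch below uses a hypothetical `joined` Series in the same `;`-separated format to compare a substring match with an exact match on the split list. It illustrates the difference and is not part of the original analysis.

```python
import pandas as pd

# Hypothetical joined ingredient strings, ";"-separated as in all_ingredients
joined = pd.Series(["unsalted butter;flour", "salt;pepper", "sea salt;olive oil"])

# Substring match: "salt" is also found inside "unsalted butter" and "sea salt"
substring_hits = joined.str.contains("salt", regex=False)

# Exact match: split back into a list and test membership
exact_hits = joined.str.split(";").map(lambda items: "salt" in items)

print(substring_hits.tolist())
print(exact_hits.tolist())
```

For multi-word ingredients such as "garlic cloves" the two approaches differ less, but for short names like "salt" or "oil" the substring counts can be noticeably inflated.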