2018-04-05
What's cooking
Exploratory data analysis of the dataset from Kaggle's "What's cooking?" competition, to understand what kind of data we are dealing with and to build intuition about the dependencies it contains.
In this post I explore the dataset from Kaggle's "What's cooking?" competition to understand what kind of data we are dealing with. Getting to know your data helps with cleaning it up during preprocessing, so that the dataset is ready for classification.
In [9]:
import pandas as pd
Reading the data
In [10]:
df_train = pd.read_json("train.json")
In [11]:
df_train.head()
Out[11]:
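Since the output of `head()` is not reproduced here, the following is a minimal sketch of the layout the rest of the post relies on: one row per recipe, a `cuisine` label and an `ingredients` list. The two sample recipes and the `id` column are assumptions made for illustration, not rows from the real `train.json`.

```python
import io

import pandas as pd

# Hypothetical two-recipe sample mimicking the assumed layout of train.json.
sample_json = """[
  {"id": 1, "cuisine": "italian", "ingredients": ["olive oil", "garlic cloves", "salt"]},
  {"id": 2, "cuisine": "mexican", "ingredients": ["salt", "corn tortillas"]}
]"""

# Wrapping the string in StringIO avoids passing a literal JSON string to read_json.
df_sample = pd.read_json(io.StringIO(sample_json))

# Each cell of "ingredients" is a plain Python list, so len() gives the recipe size.
df_sample["n_ingredients"] = df_sample["ingredients"].map(len)
print(df_sample)
```

The same `map(len)` call on `df_train["ingredients"]` would give the number of ingredients per recipe in the real data.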
How are the different cuisines represented in the dataset?
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
In [18]:
df_train["cuisine"].value_counts().plot(kind="barh", figsize=(8, 6));
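The bar chart shows absolute counts. For a classification task it can also be useful to look at each cuisine's share of the dataset, which makes class imbalance easier to read. A minimal sketch on a hypothetical toy series (the six labels below are made up for illustration):

```python
import pandas as pd

# Hypothetical cuisine labels standing in for df_train["cuisine"].
cuisines = pd.Series(
    ["italian", "italian", "italian", "mexican", "mexican", "indian"],
    name="cuisine",
)

# normalize=True returns fractions that sum to 1 instead of raw counts.
share = cuisines.value_counts(normalize=True)
print(share)
```

On the real data the equivalent call would be `df_train["cuisine"].value_counts(normalize=True)`.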
What are the most common ingredients?
In [19]:
from collections import Counter
In [20]:
counters = {}
for cuisine in df_train["cuisine"].unique():
    # One Counter per cuisine, filled from every recipe of that cuisine
    counters[cuisine] = Counter()
    indices = df_train["cuisine"] == cuisine
    for ingredients in df_train[indices]["ingredients"]:
        counters[cuisine].update(ingredients)
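The loop above filters the data frame once per cuisine. The same per-cuisine counts can also be built in a single pass with `defaultdict(Counter)`. This is only a sketch of an alternative, shown on hypothetical toy recipes rather than on `df_train`:

```python
from collections import Counter, defaultdict

# Hypothetical (cuisine, ingredients) pairs standing in for the rows of df_train.
recipes = [
    ("italian", ["olive oil", "salt", "garlic cloves"]),
    ("italian", ["salt", "basil"]),
    ("mexican", ["salt", "corn tortillas"]),
]

# A missing cuisine key creates an empty Counter on first access.
counters_sketch = defaultdict(Counter)
for cuisine, ingredients in recipes:
    counters_sketch[cuisine].update(ingredients)

print(counters_sketch["italian"].most_common(1))  # [('salt', 2)]
```

With the real data, the pairs could come from `zip(df_train["cuisine"], df_train["ingredients"])`.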
What are the most used ingredients in different cuisines?
In [21]:
counters["italian"].most_common(10)
Out[21]:
In [22]:
top10 = pd.DataFrame(
    [[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
    index=[cuisine for cuisine in counters],
    columns=["top{}".format(i) for i in range(1, 11)],
)
display(top10.head(8))
In [23]:
df_train["all_ingredients"] = df_train["ingredients"].map(";".join)
In [24]:
df_train.head()
Out[24]:
Check garlic cloves as a differentiator
In [29]:
indices = df_train["all_ingredients"].str.contains("garlic cloves")
df_train[indices]["cuisine"].value_counts().plot(
    kind="barh", title="garlic cloves per cuisine", figsize=(8, 6)
);
In [32]:
relative_freq = (
    df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
)
relative_freq.sort_values(inplace=True)
relative_freq.plot(kind="barh", figsize=(8, 6));
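Dividing two `value_counts()` results aligns them on the cuisine index, so a cuisine with no matching recipe is absent from the numerator and ends up as `NaN` rather than 0. A minimal sketch of one way to handle this, on a hypothetical four-recipe frame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the same two columns used in the post.
df = pd.DataFrame(
    {
        "cuisine": ["italian", "italian", "mexican", "indian"],
        "all_ingredients": [
            "olive oil;garlic cloves",
            "salt;basil",
            "garlic cloves;salt",
            "cumin;salt",
        ],
    }
)

mask = df["all_ingredients"].str.contains("garlic cloves", regex=False)
total = df["cuisine"].value_counts()

# reindex adds the cuisines with zero matches so they show up as 0, not NaN.
with_ingredient = df.loc[mask, "cuisine"].value_counts().reindex(total.index, fill_value=0)
relative_freq_sketch = (with_ingredient / total).sort_values()
print(relative_freq_sketch)
```

Here "indian" has no recipe with garlic cloves and gets a relative frequency of 0, so it still appears in the plot.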
In [35]:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
Out[35]:
In [59]:
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
for ingredient, ax_index in zip(unique, range(4)):
    indices = df_train["all_ingredients"].str.contains(ingredient)
    relative_freq = (
        df_train[indices]["cuisine"].value_counts() / df_train["cuisine"].value_counts()
    )
    relative_freq.plot(
        kind="barh", ax=axes.ravel()[ax_index], fontsize=24, title=ingredient
    );
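One caveat of `str.contains` on the joined string is that it matches substrings: searching for "salt" also hits "sea salt" and "unsalted butter". If exact ingredient names are what matters, a membership test on the original list is an alternative. A minimal sketch on hypothetical toy recipes (made up for illustration):

```python
import pandas as pd

# Hypothetical recipes chosen to show the difference between the two tests.
df = pd.DataFrame(
    {
        "cuisine": ["french", "italian", "british"],
        "ingredients": [
            ["sea salt", "pepper"],
            ["salt", "butter"],
            ["unsalted butter"],
        ],
    }
)
df["all_ingredients"] = df["ingredients"].map(";".join)

# Substring match on the joined string: every recipe contains the text "salt".
substring_mask = df["all_ingredients"].str.contains("salt", regex=False)

# Exact match: only recipes that list the ingredient "salt" itself.
exact_mask = df["ingredients"].map(lambda items: "salt" in items)

print(substring_mask.tolist(), exact_mask.tolist())
```

Which variant is appropriate depends on the question: the substring test groups related ingredients together, while the exact test keeps them apart.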
To cite this article:
@article{Saf2018What's,
  author = {Krystian Safjan},
  title = {What's cooking},
  journal = {Krystian's Safjan Blog},
  year = {2018},
}