2022-04-27

Pandas Dataframe Schema and Data Types Validation

The two libraries I have found most useful, and use most often, are Pandera and Dataenforce.

Pandera (515 stars) - column validation (columns, types), DataFrame Schema

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "height_in_cm": pa.Column(pa.Int),
        "age_category": pa.Column(pa.String),
    },
    index=pa.Index(pa.Int, name="person_id"),
)

# Validate a DataFrame against the schema (raises a SchemaError on failure)
schema(df_dataset)

Medium article: "Validate Your pandas DataFrame with Pandera"

Dataenforce (59 stars) - columns presence validation

It is used in two ways:

For type hinting (column name and dtype checks):

from dataenforce import Dataset
def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
  pass

To enforce validation at runtime:

from dataenforce import Dataset, validate

@validate
def process_data(data: Dataset["id", "name"]):
  pass
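
The idea behind runtime enforcement can be sketched in plain pandas. This is not Dataenforce's implementation, only an illustration of a column-presence check done by a decorator; `require_columns` and the sample frames are hypothetical names.

```python
import functools

import pandas as pd


def require_columns(*required):
    """Hypothetical decorator: reject a DataFrame that lacks any required column."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            # Collect every required column that the frame does not have.
            missing = [name for name in required if name not in data.columns]
            if missing:
                raise TypeError(f"{func.__name__}: missing columns {missing}")
            return func(data, *args, **kwargs)

        return wrapper

    return decorator


@require_columns("id", "name")
def process_data(data):
    # Placeholder body: return the number of rows.
    return len(data)


# Extra columns are allowed; only the required ones are checked.
complete = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "extra": [0.1, 0.2]})
incomplete = pd.DataFrame({"id": [1, 2]})
```

`process_data(complete)` runs normally, while `process_data(incomplete)` raises a `TypeError` before the function body is reached.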

GitHub link: dataenforce

Great Expectations - data validation

Expectations can be generated automatically from a profiling report. See "Great Expectations + Pandas Profiling": https://greatexpectations.io/blog/pandas-profiling-integration/

import pandas as pd
from pandas_profiling import ProfileReport

# Load your dataframe
df = pd.read_csv('yellow_tripdata_sample_2019-01.csv')

# Then run Pandas Profiling
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# And obtain an Expectation Suite from the profile report
suite = profile.to_expectation_suite(suite_name="my_pandas_profiling_suite")
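
Conceptually, deriving expectations from a profile means recording what a reference dataset looks like and then checking new data against that record. The following plain-pandas sketch illustrates the idea only. It is not how Great Expectations or Pandas Profiling work internally, and the helper names and sample columns are hypothetical.

```python
import pandas as pd


def infer_expectations(reference):
    """Record dtype, nullability and numeric range for every column."""
    expectations = {}
    for name in reference.columns:
        column = reference[name]
        rule = {"dtype": str(column.dtype), "nullable": bool(column.isna().any())}
        if pd.api.types.is_numeric_dtype(column):
            rule["min"] = column.min()
            rule["max"] = column.max()
        expectations[name] = rule
    return expectations


def check_expectations(df, expectations):
    """Return a list of readable violations (empty when everything passes)."""
    violations = []
    for name, rule in expectations.items():
        if name not in df.columns:
            violations.append(f"{name}: column missing")
            continue
        column = df[name]
        if str(column.dtype) != rule["dtype"]:
            violations.append(f"{name}: dtype {column.dtype}, expected {rule['dtype']}")
            continue
        if not rule["nullable"] and column.isna().any():
            violations.append(f"{name}: unexpected nulls")
        if "min" in rule and not column.dropna().between(rule["min"], rule["max"]).all():
            violations.append(f"{name}: values outside [{rule['min']}, {rule['max']}]")
    return violations


# Hypothetical reference data and a new batch with one out-of-range fare.
reference = pd.DataFrame({"fare": [5.0, 12.5, 30.0], "passengers": [1, 2, 4]})
new_batch = pd.DataFrame({"fare": [7.0, 250.0], "passengers": [1, 3]})

expectations = infer_expectations(reference)
violations = check_expectations(new_batch, expectations)
```

Ranges inferred this way are only as good as the reference sample, so in practice generated expectations are usually reviewed and loosened by hand.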

pandas_schema (135 stars)

Other Data Validation Libraries

Here are a few other alternatives for validating Python data structures.

Generic Python object data validation

  • voluptuous
  • schema

pandas-specific data validation

  • opulent-pandas
  • PandasSchema
  • pandas-validator (archived)
  • table_enforcer (13 stars)