2022-04-27
Pandas Dataframe Schema and Data Types Validation
Contents
- Pandera (515 stars) - column validation (columns, types), DataFrame Schema
- Dataenforce (59 stars) - columns presence validation
  - for type hinting (column names check, dtype check)
  - to enforce validation at runtime
- Great expectations - data validation
  - automated expectations from profiling
- pandas_schema (135 stars)
- Other Data Validation Libraries
  - Generic Python object data validation
  - pandas-specific data validation
The two libraries I found most useful, and use most often, are Pandera and Dataenforce.
Pandera (515 stars) - column validation (columns, types), DataFrame Schema
import pandera as pa
schema = pa.DataFrameSchema(
    columns={
        "height_in_cm": pa.Column(pa.Int),
        "age_category": pa.Column(pa.String),
    },
    index=pa.Index(pa.Int, name="person_id"),
)
# Usage for schema check
schema(df_dataset)
Medium article: "Validate Your pandas DataFrame with Pandera"
Dataenforce (59 stars) - columns presence validation
for type hinting (column names check, dtype check)
from dataenforce import Dataset
def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
    pass
to enforce validation at runtime
from dataenforce import Dataset, validate
@validate
def process_data(data: Dataset["id", "name"]):
    pass
GitHub link: dataenforce
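For intuition about what a column-presence and dtype check involves, here is a hand-written sketch in plain pandas. It is not Dataenforce's implementation; `check_columns` and its `expected` mapping (column name to dtype string, or `None` for "presence only") are hypothetical names chosen for this example.

```python
import pandas as pd


def check_columns(df, expected):
    """Raise TypeError if a column is missing or has an unexpected dtype.

    `expected` maps column names to a dtype string such as "int64",
    or to None when only the column's presence should be checked.
    """
    missing = [name for name in expected if name not in df.columns]
    if missing:
        raise TypeError(f"missing columns: {missing}")

    wrong = {
        name: str(df[name].dtype)
        for name, dtype in expected.items()
        if dtype is not None and str(df[name].dtype) != dtype
    }
    if wrong:
        raise TypeError(f"unexpected dtypes: {wrong}")
    return df


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "latitude": pd.Series([48.85, 51.5], dtype="float64"),
    }
)

# Passes: both columns exist with the expected dtypes.
checked = check_columns(df, {"id": "int64", "latitude": "float64"})

# Fails: "name" is not a column of df.
try:
    check_columns(df, {"id": "int64", "name": None})
    missing_detected = False
except TypeError:
    missing_detected = True
```

Libraries such as Dataenforce move this kind of check into the function signature, so the expectation is documented where the data is consumed rather than buried in the function body.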
Great expectations - data validation
automated expectations from profiling
Great Expectations + Pandas Profiling: https://greatexpectations.io/blog/pandas-profiling-integration/
import pandas as pd
from pandas_profiling import ProfileReport
# Load your dataframe
df = pd.read_csv('yellow_tripdata_sample_2019-01.csv')
# Then run Pandas Profiling
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
# And obtain an Expectation Suite from the profile report
suite = profile.to_expectation_suite(suite_name="my_pandas_profiling_suite")
pandas_schema (135 stars)
Other Data Validation Libraries
Here are a few other alternatives for validating Python data structures.
Generic Python object data validation
- voluptuous
- schema
pandas-specific data validation
- opulent-pandas
- PandasSchema
- pandas-validator (archived)
- table_enforcer (13 stars)