2021-01-16
Pandas Schema Validation
An overview of the available tools and methods for schema validation in pandas, with exemplary code snippets and recommendations for when to use each tool.
- Overview of Available Tools and Methods
- Built-in Attributes
- Pandas Schema
- Great Expectations
- Pandera
- Data-enforce
- Comparison and Discussion
Pandas is a widely used library for data manipulation and analysis in Python. To ensure the data is in the correct format and conforms to certain constraints, schema validation is crucial. This process can be useful in various situations such as when importing data from external sources or before performing further analysis or machine learning tasks.
There are several tools and methods available for schema validation in pandas, such as pandas_schema, great_expectations, pandera and data-enforce. pandas_schema and great_expectations are widely used libraries for pandas data validation; pandera and data-enforce are also popular choices.
In this article, we will review the available tools and methods for schema validation in pandas and provide example code snippets and links to further resources. We will also discuss the advantages and disadvantages of each tool and give recommendations for when to use them.
Overview of Available Tools and Methods
The tools and methods discussed below are accompanied by exemplary code snippets. You can use the following contents of a data.csv file, which complies with the schema used in this article:
name,age,gender,col1,col2,col3,col4,col5
Alice,25,female,1,2.5,text1,True,2022-01-01
Bob,30,male,2,3.5,text2,False,2022-02-01
Charlie,35,male,3,4.5,text3,True,2022-03-01
This file contains a dataframe with 8 columns: name, age, gender, col1, col2, col3, col4 and col5.
- name and gender are of type object
- age is of type int
- col1 and col2 are of type int and float respectively
- col3 is of type object
- col4 is of type boolean
- col5 is of type datetime
This file can be used in the examples below to perform data validation using the different libraries and methods.
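If you want to run the snippets in this article locally, you can first save the listing above as data.csv (a small convenience sketch, not part of the validation examples; the data2.csv file shown further below can be created the same way):

import pathlib

# Save the example data shown above as data.csv
csv_content = """name,age,gender,col1,col2,col3,col4,col5
Alice,25,female,1,2.5,text1,True,2022-01-01
Bob,30,male,2,3.5,text2,False,2022-02-01
Charlie,35,male,3,4.5,text3,True,2022-03-01
"""
pathlib.Path("data.csv").write_text(csv_content)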
Here is an example of the contents of a data2.csv file that does not comply with the schema used in this article:
name,age,gender,col1,col2,col3,col4,col5
Alice,25,female,1,2.5,text1,True,2022-01-01
Bob,30,male,2,3.5,text2,False,2022-02-01
Charlie,35,male,3,4.5,text3,True,2022-03-01
David,170,male,4,5.5,text4,True,2022-04-01
This file also contains a dataframe with the same 8 columns: name, age, gender, col1, col2, col3, col4 and col5.
- The age of David is 170, which is outside the range defined in the schema used in this article, range(0, 150)
- This file does not comply with the schema and will raise errors when validated using the code provided in this article (see the sketch below)
- This file can be used to demonstrate the validation process and how errors are raised for invalid data.
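For illustration, here is a minimal sketch of how the out-of-range age surfaces during validation. It reuses only the age constraint from the pandas_schema schema defined later in this article; the exact error message format may differ between library versions:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import InListValidation

# Only the age constraint from the full schema defined later in the article
age_schema = Schema([Column('age', [InListValidation(range(0, 150))])])
df2 = pd.read_csv("data2.csv")
for error in age_schema.validate(df2[['age']]):
    print(error)  # reports that the value 170 in the age column is not an allowed value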
Built-in Attributes
Pandas provides built-in attributes such as .dtypes and .shape that can be used to check the data types and dimensions of a DataFrame. Here's an example of using these attributes to check that a DataFrame has the expected number of rows and columns, and that the columns have the expected data types:
import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["col5"])
# Check that the DataFrame has the expected number of rows and columns
assert df.shape == (3, 8)
# Check that selected columns have the expected data types
expected_dtypes = {"col1": "int64", "col2": "float64", "col3": "object", "col4": "bool", "col5": "datetime64[ns]"}
assert all(df.dtypes[col] == dtype for col, dtype in expected_dtypes.items())
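Note that these built-in checks only cover shape and dtypes; a value-level problem such as David's age of 170 in data2.csv passes unnoticed. A minimal sketch of this limitation, assuming data2.csv was saved as shown earlier:

df2 = pd.read_csv("data2.csv", parse_dates=["col5"])
# Shape and dtypes look fine even though age=170 violates the intended range(0, 150)
assert df2.shape == (4, 8)
assert df2.dtypes["age"] == "int64"
# Value-level constraints require an explicit check, for example:
assert df2["age"].isin(range(0, 150)).all(), "age out of range"  # fails for David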
Pandas Schema
pandas_schema is a library that allows you to specify constraints on a DataFrame and then validate that the DataFrame conforms to those constraints. Here's an example of using the pandas_schema library to define a schema for a DataFrame and then validate that the DataFrame conforms to the schema:
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, IsDtypeValidation, InListValidation

df = pd.read_csv("data.csv")
# Define constraints for the name, age and gender columns
schema = Schema([
    Column('name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('age', [IsDtypeValidation(int), InListValidation(range(0, 150))]),
    Column('gender', [InListValidation(['male', 'female'])])
])
# Validate only the columns covered by the schema; returns a list of validation warnings
errors = schema.validate(df[['name', 'age', 'gender']])
for error in errors:
    print(error)
Great Expectations
Great Expectations is a library that allows you to define and validate expectations about your data using a human-readable syntax. Here's an example of using the great_expectations library to define type expectations for a DataFrame and then validate that the DataFrame meets them:
import great_expectations as ge

df = ge.read_csv("data.csv")
# Define expectations about the column types
# (each expectation is registered on the dataset when it is called)
df.expect_column_values_to_be_of_type("col1", "int64")
df.expect_column_values_to_be_of_type("col2", "float64")
df.expect_column_values_to_be_of_type("col3", "str")
df.expect_column_values_to_be_of_type("col4", "bool")
df.expect_column_values_to_match_strftime_format("col5", "%Y-%m-%d")
# Validate the DataFrame against all registered expectations
validation_result = df.validate()
# Check for any validation errors
if validation_result.success:
    print("Data validation successful")
else:
    print("Validation errors:", validation_result.results)
Pandera
pandera is a library that allows you to define and validate DataFrame schemas using a readable, declarative syntax, and it offers additional functionality such as checks on column values. Here's an example of using the pandera library to define a schema for a DataFrame and then validate that the DataFrame conforms to the schema:
import pandas as pd
import pandera as pa

df = pd.read_csv("data.csv", parse_dates=["col5"])
# Define the schema
schema = pa.DataFrameSchema({
    "col1": pa.Column(pa.Int),
    "col2": pa.Column(pa.Float),
    "col3": pa.Column(pa.String),
    "col4": pa.Column(pa.Bool),
    "col5": pa.Column(pa.DateTime)
})
# Validate the DataFrame against the schema (raises a SchemaError on failure)
schema.validate(df)
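pandera can also express the value constraints from the schema used in this article (the age range and the allowed gender values) via its Check API; a minimal sketch using pandera's built-in Check.in_range and Check.isin:

# Columns not listed in the schema are ignored by default (strict=False)
value_schema = pa.DataFrameSchema({
    "age": pa.Column(pa.Int, pa.Check.in_range(0, 149)),
    "gender": pa.Column(pa.String, pa.Check.isin(["male", "female"])),
})
value_schema.validate(pd.read_csv("data2.csv"))  # raises a SchemaError: age=170 fails the in_range check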
Data-enforce
data-enforce (installed as the dataenforce package) is a library that lets you declare the expected columns and dtypes of a DataFrame with Python type hints and validate them at function-call time using a @validate decorator. Here's an example of using dataenforce to define the expected structure of a DataFrame and then validate a DataFrame against it:
import pandas as pd
from dataenforce import Dataset, validate

# Declare the expected columns and dtypes as a type hint;
# @validate checks the DataFrame every time the function is called
@validate
def process_data(data: Dataset["col1": int, "col2": float, "col3": object, "col4": bool]):
    return data

df = pd.read_csv("data.csv")
# Pass the columns covered by the annotation; a mismatch in names or dtypes raises an error
process_data(df[["col1", "col2", "col3", "col4"]])
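Note that dataenforce validates column names and dtypes through the function signature only; value-level rules such as the age range used elsewhere in this article are outside its scope and still require one of the other tools.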
Comparison and Discussion
Each of these tools has its own advantages and disadvantages depending on the specific use case.
The built-in attributes such as .dtypes and .shape may be sufficient for simple validation tasks, but they lack advanced functionality such as custom validation logic and integration with other data pipeline tools.
The pandas_schema, great_expectations, pandera and data-enforce libraries offer more advanced functionality than the built-in attributes, such as custom validation logic and integration with other data pipeline tools, along with a more human-readable syntax.
The choice of tool will depend on the complexity of the schema and the specific requirements of the project. If the schema is simple and you only need to check data types and dimensions, the built-in attributes may be sufficient. However, if you need more advanced functionality such as custom validation logic or integration with other data pipeline tools, the pandas_schema, great_expectations, pandera or data-enforce libraries are better choices.
Overall, it is recommended to use great_expectations for more complex projects, as it has more functionality and a more human-readable syntax. However, if you're looking for a more lightweight solution, pandas_schema, pandera and data-enforce are also good options.
Any comments or suggestions? Let me know.
To cite this article:
@article{Saf2021Pandas,
    author  = {Krystian Safjan},
    title   = {Pandas Schema Validation},
    journal = {Krystian's Safjan Blog},
    year    = {2021},
}