I’ve been working on a pandas data validation package called pandera that enables users to:
- check the types and properties of each column in a DataFrame (or key-value pairs in a Series).
- do more complex statistical validation like hypothesis testing.
- seamlessly integrate with existing data analysis/processing pipelines via decorators.

Its main use cases are:
- Data munging: make ETL, data analysis, and data processing pipelines more reliable.
- Reproducibility: properties of DataFrames are validated at run-time, so collaborators can, for example, run unit tests on data more easily.
Before submitting a package proposal, I just wanted to get a sense of whether this package is appropriate for pyOpenSci. A collaborator and I put together this checklist that we’re working through, and below that is a brief summary of similar packages in Python and pandera’s differentiating features.