Candidate package: pandera - a flexible pandas data structure validation package

cosmicBboy · August 4, 2019, 3:25pm

Hey all,

I’ve been working on a pandas data validation package called pandera that enables users to:

check the types and properties of each column in a DataFrame (or key-value pairs in a Series);
do more complex statistical validation, like hypothesis testing;
seamlessly integrate with existing data analysis/processing pipelines via decorators.
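
To make the column-check and decorator points concrete, here is a rough sketch of the general idea in plain Python. It is not pandera's API: the dict-of-lists "table", `SCHEMA`, `validate`, and `validated_input` are all hypothetical stand-ins invented for illustration, and the hypothesis-testing feature is left out.

```python
from functools import wraps

# Hypothetical stand-in for a DataFrame: a dict mapping column names to lists.
# Hypothetical schema: column name -> (expected type, per-value property check).
SCHEMA = {
    "age": (int, lambda v: 0 <= v <= 120),
    "name": (str, lambda v: len(v) > 0),
}


def validate(table, schema):
    """Return the table unchanged, or raise ValueError on the first violation."""
    for column, (expected_type, check) in schema.items():
        if column not in table:
            raise ValueError(f"missing column: {column}")
        for value in table[column]:
            if not isinstance(value, expected_type):
                raise ValueError(
                    f"{column}: expected {expected_type.__name__}, got {value!r}"
                )
            if not check(value):
                raise ValueError(f"{column}: check failed for {value!r}")
    return table


def validated_input(schema):
    """Decorator that validates the first positional argument before the call."""
    def decorator(func):
        @wraps(func)
        def wrapper(table, *args, **kwargs):
            validate(table, schema)
            return func(table, *args, **kwargs)
        return wrapper
    return decorator


@validated_input(SCHEMA)
def mean_age(table):
    # An existing pipeline step; validation is added without touching its body.
    return sum(table["age"]) / len(table["age"])
```

Calling `mean_age({"age": [20, 40], "name": ["a", "b"]})` passes validation and runs as usual, while a table with a negative age raises `ValueError` before the function body executes. The decorator form is what lets validation be bolted onto an existing pipeline step without rewriting it.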

Package Scope

Data munging: make ETL/data analysis/data processing pipelines more reliable.
Reproducibility: properties of dataframes are validated at run-time, so collaborators can, for example, run unit tests on data more easily.

Before submitting a package proposal, I just wanted to get a sense of whether this package is appropriate for pyOpenSci. A collaborator and I put together this checklist that we’re working through, and below that is a brief summary of similar packages in Python and pandera’s differentiating features.

lwasser · August 6, 2019, 3:38pm

Hey @cosmicBboy!! Welcome to our community forum. I’ve been out of the office for a few days and am just catching up on messages!! I’ll get back to you on this submission in the next few days. It looks like a great package, as I imagine many people want a mechanism to properly validate dataframes!

In the meantime, I’m pinging a few friends for input on this submission.
@luizirber @leouieda @choldgraf @carsonfarmer @lheagy @marskar

marskar · August 7, 2019, 1:11am

Hi @lwasser,
DataFrame validation is hugely important. I think this package is a nice fit / of great interest.

cosmicBboy · August 8, 2019, 3:38pm

Also, for reference, I wrote a blog post that analyzes New York 311 data as a case study in using pandera on real-world data.

lwasser · August 8, 2019, 4:14pm

This is great! @cosmicBboy I suggest that you submit a formal review submission here: https://github.com/pyOpenSci/software-review/issues. Typically people would submit a presubmission inquiry there first, but given we’ve already had a look at your package, I suggest that you just submit the package for review (when you are ready) and then reference / link to this thread as your presubmission! Please let us know if you have any questions.

cosmicBboy · August 10, 2019, 2:45pm

Awesome! Thanks @lwasser, I’ll create a submission in the next few days.

cosmicBboy · August 18, 2019, 8:57pm

Just created a new submission issue here: https://github.com/pyOpenSci/software-review/issues/12
