Candidate package: pandera - a flexible pandas data structure validation package

Hey all,

I’ve been working on a pandas data validation package called pandera that enables users to:

  1. check the types and properties of each column in a DataFrame (or the key-value pairs in a Series),
  2. do more complex statistical validation, such as hypothesis testing, and
  3. integrate validation into existing data analysis/processing pipelines via decorators.
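
To make the decorator idea in point 3 concrete, here is a minimal plain-Python sketch of the general pattern. It is not pandera's API: `Column`, `validate`, `check_output`, and `load_people` are hypothetical names made up for illustration, and a dict of lists stands in for a DataFrame so the example runs without any dependencies.

```python
from functools import wraps


class Column:
    """Hypothetical column spec: an expected element type plus per-value checks."""

    def __init__(self, dtype, *checks):
        self.dtype = dtype
        self.checks = checks


def validate(table, schema):
    """Return `table` unchanged if it satisfies `schema`, else raise ValueError.

    `table` is a dict mapping column names to lists (a stand-in for a DataFrame).
    """
    for name, column in schema.items():
        if name not in table:
            raise ValueError(f"missing column: {name}")
        for value in table[name]:
            if not isinstance(value, column.dtype):
                raise ValueError(
                    f"column {name}: {value!r} is not {column.dtype.__name__}"
                )
            if not all(check(value) for check in column.checks):
                raise ValueError(f"column {name}: {value!r} failed a check")
    return table


def check_output(schema):
    """Decorator that validates whatever the wrapped pipeline step returns."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return validate(func(*args, **kwargs), schema)

        return wrapper

    return decorator


# Assumed example schema: non-negative integer ages and string names.
schema = {
    "age": Column(int, lambda v: v >= 0),
    "name": Column(str),
}


@check_output(schema)
def load_people():
    # A pipeline step whose output is validated every time it runs.
    return {"age": [31, 45], "name": ["Ada", "Grace"]}
```

Calling `load_people()` returns the table as usual when it matches the schema and raises `ValueError` otherwise, so the validation sits alongside the existing pipeline code rather than inside it.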

Package Scope

  • Data munging: make ETL/data analysis/data processing pipelines more reliable.
  • Reproducibility: properties of dataframes are validated at run-time, so e.g. collaborators can execute unit tests on data more easily.

Before submitting a package proposal, I just wanted to get a sense of whether this package is appropriate for pyOpenSci. A collaborator and I put together this checklist that we’re working through, and below that is a brief summary of similar packages in Python and pandera’s differentiating features.

Hey @cosmicBboy! Welcome to our community forum. I’ve been out of the office for a few days and am just catching up on messages, but I’ll get back to you on this submission in the next few days. It looks like a great package; I imagine many people want a mechanism to properly validate dataframes!

In the meantime, pinging a few friends for input on this submission:
@luizirber @leouieda @choldgraf @carsonfarmer @lheagy @marskar

Hi @lwasser,
DataFrame validation is hugely important. I think this package is a nice fit / of great interest.

Also, for reference, I wrote a blog post that analyzes New York 311 data as a case study of using pandera on real-world data.

This is great! @cosmicBboy, I suggest that you open a formal review submission at https://github.com/pyOpenSci/software-review/issues when you are ready. People typically submit a presubmission inquiry there first, but since we’ve already had a look at your package, you can go straight to the review and link to this thread as your presubmission. Please let us know if you have any questions.

Awesome! Thanks @lwasser, I’ll create a submission in the next few days.

Just created a new submission issue here: https://github.com/pyOpenSci/software-review/issues/12