Hey all,
I’ve been working on a pandas data validation package called pandera that enables users to:
- check the types and properties of each column in a DataFrame (or key-value pairs in a Series)
- do more complex statistical validation, like hypothesis testing
- seamlessly integrate with existing data analysis/processing pipelines via decorators
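To make the decorator idea concrete, here is a rough sketch in plain pandas. It is not pandera's actual API: `validate_columns`, `check_input`, `SCHEMA`, and `mean_height` are hypothetical names made up for illustration, and the schema format (a dtype predicate plus a value check per column) is an assumption.

```python
import functools

import pandas as pd
from pandas.api import types as ptypes


# NOTE: hypothetical helpers for illustration only; this is not pandera's API.
def validate_columns(df, schema):
    """Check each column's presence, dtype, and value-level property."""
    for name, (dtype_ok, value_check) in schema.items():
        if name not in df.columns:
            raise ValueError(f"missing column: {name!r}")
        if not dtype_ok(df[name]):
            raise TypeError(f"column {name!r} has unexpected dtype {df[name].dtype}")
        if not value_check(df[name]).all():
            raise ValueError(f"column {name!r} failed its value check")
    return df


def check_input(schema):
    """Decorator that validates the first argument before the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            validate_columns(df, schema)
            return func(df, *args, **kwargs)
        return wrapper
    return decorator


# Assumed schema format: column name -> (dtype predicate, element-wise check).
SCHEMA = {
    "height_cm": (ptypes.is_float_dtype, lambda s: s > 0),
    "age": (ptypes.is_integer_dtype, lambda s: s >= 0),
}


@check_input(SCHEMA)
def mean_height(df):
    """Existing pipeline step; validation is added without touching its body."""
    return float(df["height_cm"].mean())
```

The point of the decorator is that an existing pipeline function keeps its body unchanged, and bad input fails loudly at the function boundary instead of producing a silently wrong result downstream.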
Package Scope
- Data munging: make ETL/data analysis/data processing pipelines more reliable.
- Reproducibility: properties of dataframes are validated at run-time, so e.g. collaborators can execute unit tests on data more easily.
Before submitting a package proposal, I just wanted to get a sense of whether this package is appropriate for pyOpenSci. A collaborator and I put together this checklist that we’re working through, and below that is a brief summary of similar packages in Python and pandera’s differentiating features.
Hey @cosmicBboy! Welcome to our community forum. I’ve been out of the office for a few days and am just catching up on messages. I’ll get back to you on this submission in the next few days. It looks like a great package, and I imagine many people want a mechanism to properly validate dataframes!
In the meantime, pinging a few friends for input on this submission.
@luizirber @leouieda @choldgraf @carsonfarmer @lheagy @marskar
Hi @lwasser,
DataFrame validation is hugely important. I think this package is a nice fit / of great interest.
Also, for reference, I wrote a blog post that analyzes New York 311 data as a case study on using pandera with real-world data.
This is great! @cosmicBboy I suggest that you submit a formal review submission here: https://github.com/pyOpenSci/software-review/issues. Typically people submit a presubmission inquiry there first, but given that we’ve already had a look at your package, I suggest that you just submit the package for review (when you are ready) and then reference / link to this thread as your presubmission. Please let us know if you have any questions.
Awesome! Thanks @lwasser, I’ll create a submission in the next few days.