Real-world repro-packs -- looking for examples

Hey pyOpenSci,

I’m looking for real-world examples of repro-packs or research compendiums: bundles of code, figure plotting scripts, data, and whatever else is needed to reproduce the results in a computational paper.

Reproducibility packages, a.k.a. repro-packs, is the term Lorena Barba’s group uses (http://blogs.nature.com/naturejobs/2017/04/17/techblog-my-digital-toolbox-lorena-barba/).
As an example, here’s a link to a GitHub repository that houses code for one of their papers (https://github.com/barbagroup/pygbe_lspr_paper) – there’s a link in the README pointing to the final repro-pack on Figshare.

Research compendium is another name for this practice, which I know from hearing Karthik Ram talk about it. Slides from a talk he gave are here: https://github.com/karthik/rstudio2019, with links to several real-world examples.

I’m specifically looking for examples that use Python (obvs), and from as many different research areas as possible.

Mainly what I’d like to see is how people structure code across research domains.
Recently I led a session on “Python 102” about the basics of organizing code (https://python-102.readthedocs.io/en/latest/packaging.html#), and I couldn’t find a lot of different examples, but maybe that’s because I don’t know where to look.

So I’m hoping to crowdsource those examples.

My idea is to maybe make them part of a blog post for pyOpenSci: not the be-all and end-all on the subject, but a mini-review that a student could at least look at to get ideas on how to package their code and share their results in a reproducible way.

What I am not looking for is source code for libraries, tools, etc., even though that is of course within the scope of pyOpenSci. Fairly simple examples without a ton of development-related cruft (Docker / Travis / environment.yml, etc.) in the root of the repository would be great.

Of course I love to see a beautifully organized codebase for a library, but the idea here is to show someone a real-world example from their field of what the simulation / analysis / figure scripts for their papers might look like: a bunch of concrete examples of the structure suggested in “Good Enough Practices in Scientific Computing” (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510), plus some inspiration for the whole repro-pack itself (figures are def field-specific too!).
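For concreteness, the layout that paper suggests looks roughly like this (directory and file names are illustrative, not from any particular repo):

```
project-name/
├── README.md          # what this is and how to reproduce the results
├── LICENSE
├── CITATION
├── requirements.txt   # or environment.yml
├── data/              # raw data, treated as read-only
├── src/               # simulation / analysis / figure scripts
├── results/           # generated outputs: figures, tables
└── doc/               # manuscript and notes
```

The examples I’m hoping to collect would show how people adapt (or depart from) something like this in their own fields.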

Thanks, looking forward to seeing what people share and hearing any comments
–David

I’m not sure how reproducible it is, since I don’t have an environment.yml, but https://bitbucket.org/story645/libltf has all the scripts for making figures for a paper on ensemble forecast evaluation (climate), with a link to the paper (which is in a different, private repo that I have to fight with Bitbucket about making public).


Nice! Sounds perfect – will give this a look
Thank you @story645

Getting back to this now @story645

Probabilistic forecasting sounds cool and it’s impressive to do that with just pure numpy and scipy.

I’m looking at the tools you used to build it too.
It looks like Distribute and Buildout provide some of the same functionality as, say, Flit or Poetry?
And modern-package-template is sort of like cookiecutter?
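(For anyone reading along later: those tools predate today’s packaging standards. A rough modern equivalent with Flit would be a single pyproject.toml in the repo root; this is just a sketch, and the name and metadata are placeholders, not from the libltf repo.)

```toml
# Minimal Flit-based packaging config (placeholder metadata)
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = "mypackage"        # placeholder: must match your importable module
version = "0.1.0"
description = "Scripts and figures for a computational paper"
```

With something like this in place, `pip install .` or `flit build` should be enough to package the code.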

Thank you again for sharing this


(Slowly) posting some other links that people replied with on Twitter too.

Naty Clementi from Lorena Barba’s group gave an example of one of their recent papers, documenting an extension of their PyGBe library that enables applying computational nanoplasmonics to biomolecule detection.

Link to the pre-print on arXiv: https://arxiv.org/pdf/1812.10722.pdf

The repo (same one I posted above) itself includes a link to a repro-pack with figures on Figshare … but now I’m just noticing the brave blow-by-blow account of the submission process in the README, including drafting responses with GitHub issues.

Thanks, even though it’s mostly because the project predates xarray & pymc3. :wink:
And yeah, modern-package-template is a proto-cookiecutter, and Distribute & Buildout are part of the templating, so I dunno, but yes, they have to do with packaging.


Good to know – if you were starting now, you’d use xarray and pymc3? Are there probabilistic forecasting libraries that are wrappers around those? Just trying to get a feel for who uses what “core” libraries.

xarray is for n-dimensional labeled data, developed explicitly for climate data, so it would probably have made some of the data-alignment code simpler. PyMC3 is for probabilistic programming.
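To make the labeled-data point concrete, here’s a tiny sketch: with plain numpy you have to remember which axis is which, while xarray lets you select and reduce by name. (Toy values and coordinate names; nothing here is from the libltf repo.)

```python
import numpy as np
import xarray as xr

# 2 forecast times x 3 stations of made-up values
da = xr.DataArray(
    np.array([[0.1, 0.5, 0.9],
              [0.2, 0.4, 0.8]]),
    dims=("time", "station"),
    coords={"time": [2001, 2002], "station": ["a", "b", "c"]},
)

# Select by label instead of positional index
print(da.sel(time=2001, station="b").item())  # 0.5

# Reductions name the axis they collapse
print(da.mean(dim="time").values)  # [0.15 0.45 0.85]
```

The same alignment-by-name applies when combining arrays: xarray matches coordinates automatically rather than relying on axis order.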


Still slowly posting links from that Twitter thread (don’t mean to post over you @story645; thank you, that’s helpful to know. I need to take xarray for a test drive at some point).

@khinsen shared some examples of active papers on Zenodo:
https://zenodo.org/search?page=1&size=20&q=ActivePapers

but I think he got frustrated by the default limit on the number of links new members can post
(thanks to the googling skills of @chendaniely and the admin powers of @lwasser, we got this raised)

I wasn’t actually familiar with these, but am now.
Examples from Konrad’s own work:


And last but not least, Olivia Guest brought up ReScience C, which can be found here (saving you some cutting and pasting):
http://rescience.github.io/

if you haven’t seen ReScience C before, they describe themselves as “an open-access peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research is reproducible.”

Importantly, all the articles have a GitHub repo associated with them; cf. this paper, which I actually spent a bunch of time looking at during my post-doc and was surprised to find there:
http://rescience.github.io/bibliography/lemasson_2016.html

NB: they’ll soon be presenting results from their ten-year reproducibility challenge:
http://rescience.github.io/ten-years/