Interest in supporting non-packaged Python applications?

ketozhang · February 29, 2024, 3:16am

Many of you may know that a Python project can be as simple as a single script file, perhaps with a requirements file (or PEP 723 – Inline script metadata) if you have external dependencies.

When it comes to sharing your code, the commonly recommended route is to make a standard package^[1]. You guys here at the pyOpenSci forum are very familiar with this and its pain.

However, there’s a disparity between the first two paragraphs above. With just a script and requirements file, I can run my Python code. If I give someone else these two files, can’t they do the same? Why can’t this be a form of sharing code? Isn’t this much easier than learning the PyPA packaging ecosystem? We see evidence of this in repositories that have not yet adopted packaging: a repository of Python files and a requirements file. Alongside specifying the right Python version, is there no way I can have a tool that takes this information and just runs the code?

This idea is not for every type of Python code. A Python library is code that is meant for other code to use (e.g., import numpy), and packaging is a great solution for sharing this kind of code. On the other hand, packaging is not a good solution for most Python applications: code that is meant to be run (e.g., Python websites such as flask app.py, Jupyter notebooks, TUIs like Textual, WASM/pyodide websites) and, oftentimes, run reproducibly. In discussion elsewhere, this use case is sometimes called the “application project”, “application workflow”, “application-first projects”, “projects that aren’t meant to be packaged as a wheel”, etc.

This is getting too long and I haven’t gotten to what’s missing in the Python world and what can be done. I’d really like to open up this discussion beyond what’s being discussed at the Python forum and at PyCon. I’m looking around for communities that are interested in this problem. I’d love to open this up in person (SciPy?) or through other opportunities.

Upshot/TLDR: Python lacks support for application projects. Applications make up the majority of Python code out there, not libraries^[2]. Packaging does not solve this.

[1] There are many forms of packaging, but a standard package refers to the PyPA specification: a wheel file and/or a source distribution that can be uploaded to PyPI.
[2] I bet. If anyone has data, please come talk to me!

ctb · February 29, 2024, 4:32pm

I mostly like the idea, but I worry that it encourages people to publish packages that don’t have tests and are hard to test. (I hope this doesn’t come across as concern trolling! I’m serious!)

In brief, it’s very hard to test scripts that aren’t importable and don’t have an entry point defined.

I think it’s easier to signify to people that “this is a pile of untested scripts” when they have to clone a github repo to get them and they can look and see.

In my (extensive) experience, it’s pretty easy to make an importable .py file and define an entry point in pyproject.toml, which I think is all that’s needed for packaging. Heck, this could be (probably is) a github template somewhere; we provide something like that for our plugins via this plugin template.
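As a sketch of the testability point: once a script's logic lives in an importable function that takes its arguments explicitly, a test can call it in-process. The function name main, the fit-model program name, and the mean computation are hypothetical stand-ins, not code from any project mentioned in this thread.

```python
import argparse
import contextlib
import io


def main(argv=None):
    """Entry point: print the mean of the numbers given on the command line."""
    parser = argparse.ArgumentParser(prog="fit-model")
    parser.add_argument("values", nargs="+", type=float)
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv[1:]
    mean = sum(args.values) / len(args.values)
    print(f"mean={mean:.2f}")
    return 0


def test_main_prints_mean():
    """Call the entry point directly and check what it printed."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        exit_code = main(["1", "2", "3"])
    assert exit_code == 0
    assert captured.getvalue() == "mean=2.00\n"
```

The same main function is what a [project.scripts] entry in pyproject.toml would point at, so the command users run and the function the test calls are the same code path.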

that’s my hot take!

willingc · February 29, 2024, 5:37pm

Thanks for your thoughts and you are correct that you could distribute a script and requirements file (if needed).

Part of pyOpenSci’s mission includes sharing best practices for reproducibility and provenance of scientific packages. For now, a formal package is the best way to do that.

If in the future the PyPA adopts a versioned script with dependencies, I would be happy to revisit recommending this.

sneakers-the-rat · March 2, 2024, 2:00am

I think it was true that packaging python was a pain a few years ago, but it isn’t really anymore.

behold: a python package that’s as simple as the two-file script you’re talking about.

pyproject.toml

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "jonnys-single-file-package"
version = "v0.0.1"
description = "A single file package I gone and made"
requires-python = ">=3.8"
license = "GPL-3.0"
authors = [{ name = "jonny" }]
dependencies = [
	"numpy>=1.24.0"
]

[project.scripts]
run-this-package = "jonnys_single_file_package:main"

jonnys_single_file_package/__init__.py

import numpy as np

def main():
    number = np.random.randint(0, 100)
    print(f'hey whats up my favorite number is {number}')

That took all of 3 minutes entirely by hand, including time spent looking up the second link when searching “python packaging”

Advantages over script.py and requirements.txt

you can install it from pypi right now - pip install jonnys-single-file-package - that’s easier than emailing a file back and forth or cloning a git repository. If i were to version that package and not put it on pypi, you could also install it from pip directly from the repo. you could also email the tarball that is handily created by python -m build
entrypoints - if the argument is that “code is meant to be ran” then this is easier to run than a loose python script file. once you install it just do run-this-package
versions - if i want to change it, I can do that, and then people using it can get the new version
licenses - it’s actually legal for you to distribute and modify this since it’s GPL licensed via the metadata without needing the LICENSE file, but without a license then I retain all copyright and you can’t
specify python version - requirements.txt is just a list of pip install commands so it can’t really specify a python version except by convention
have the option of other packages using my code later if i want, including me reusing the code later in a structured way instead of copy and pasting between script files. I could also use this in a jupyter notebook if i wanted to without having to be in the same directory, etc.

So I’m not really sure what the benefit of script.py + requirements.txt is anymore after pyproject.toml which basically does requirements.txt and more.

I think the role of orgs like pyopensci should be to help programmers reach a minimal standard of maintainability and usability, so it would be a good idea to accept packages that are just scripts in a repo and then through the course of the review help them structure that into something more reusable. There should be easier pathways to understanding how to write code, and that’s one of the things I hope we do here.

I’ve seen too many new programmers tie themselves in knots dealing with scripts, finding themselves 6 months into a 3000 line long python script because they can’t split it up into separate files, and then when it’s time to publish their work it’s a great source of anxiety or another few months cleaning up the code. For me it’s not a matter of ‘correctness,’ but reducing time spent suffering from bad practices.

ketozhang · March 2, 2024, 5:46am

I’m very glad that’s pyOpenSci’s mission. I agree packaging provides a good way to add provenance. However, I don’t see the current status of pip ecosystem’s packaging (i.e., PyPA specs and tools) fitting for reproducibility. There are some key issues:

1. Lacks a standard mechanism to install the package reproducibly (e.g., lock files).
2. PyPI has not supported existing (3rd party) mechanisms, perhaps waiting for (1) to exist as a standard. In practice, the lock file is hosted elsewhere (e.g., git repo).
3. Existing tooling (e.g., Hatch, Rye, uv, pipx) does not put focus on supporting this workflow by bridging the two: package from PyPI and lock file from somewhere else.

(1) is very actively in the works.
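To make the lock file idea concrete, here is a rough sketch of what "pinning everything" means, using only the standard library. It is not any standardized lock file format and it records no hashes or platform markers; the function name pin_installed_distributions is an assumption for illustration. The output is similar in spirit to what pip freeze prints.

```python
from importlib.metadata import distributions


def pin_installed_distributions() -> list[str]:
    """List every installed distribution as an exact `name==version` pin."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pin_installed_distributions()))
```

Unlike a top-level requirements.txt, a listing like this includes transitive dependencies, which is the part a reader needs in order to rebuild the same environment later.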

ketozhang · March 2, 2024, 5:59am

For those who are convinced packaging alone^[1] is enough for a reproducible application for the science use case, please indulge me with this exercise:

This scientist wants to publish their paper with code (perhaps on Zenodo). Their code, a single script, runs some model fitting given some data (part of the publication). Their script depends on existing Python packages (e.g., pytorch, scipy). This scientist wants others to be able to run the code with the same input data and reproduce the same results published.

What instructions (e.g., README) should the scientist write for their readers to achieve this? You may assume the scientist has done all preparations so that the instructions are possible (e.g., if pip install from PyPI is needed; the scientist has uploaded a package).

[1] I’ve failed to convey that I am not against packaging. I agree with many of the benefits the previous comments here have outlined. What I am against is the idea that packaging’s status quo gets us to a good place with reproducibility.

sneakers-the-rat · March 2, 2024, 6:33am

if this was the argument i definitely misunderstood it from the OP, my mistake. I think packaging only helps here - poetry does lockfiles and that’s one of the reasons that I use it. If you wanted to archive a venv along with your code that would also be possible. then it would be genuinely possible to reproduce a computing environment down to the hashes of the distributions. having code structured as a package gives you access to the tooling that makes doing that a matter of a few cli commands instead of self-tooling your entire archival process.
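Reproducing an environment "down to the hashes" amounts to recording a digest for each distribution file next to its version pin. A minimal sketch, assuming the wheel or sdist is already on disk; the helper name hashed_requirement is hypothetical. The `--hash=sha256:...` suffix is the form pip accepts in requirements files for its hash-checking mode.

```python
import hashlib
from pathlib import Path


def hashed_requirement(name: str, version: str, artifact: Path) -> str:
    """Build a requirements line that pins both a version and a file hash."""
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return f"{name}=={version} --hash=sha256:{digest}"
```

Lock file tools such as poetry generate and check this kind of information for you; the sketch only shows what is being recorded.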

ketozhang · March 2, 2024, 7:08am

Not at all. This is a very difficult conversation for me to articulate well. Possibly it stems from our (alleged) negative experiences trying to install a project that contains just a top-level list of dependencies in its requirements.txt (i.e., not the best practice). It’s difficult to bring up requirements.txt without eliciting that response.

archive a venv along with your code that would also be possible

Besides this, you likely know of others. A good list I found is from a recent update to the PyPA packaging guide, which has many other possibilities.

However, those tools, I argue, are not friendly for many due to a lack of popularity and educational resources (e.g., these tools are not covered by pyOpenSci). What is often covered are the workflow tools (poetry, Hatch, PDM, Rye) and installer tools like pip, pipx, and uv. I’ve tried them all for this use case and they all have blind spots in supporting applications (and generally projects not meant to generate wheels).

Take pipx for example. You cannot do this (see GitHub issue):

pipx install cowsay==6.1 -r https://example.com/cowsay/release/6.1/requirements.lockfile

Imagine it’s not cowsay, and instead it is your science paper’s project where you’ve published a lock file.

There is, however, a way with pip, if you know how to cast the right incantations:

$ git clone $GIT_URI/myscienceproject --branch v1.1.0
$ pip install myscienceproject==1.1.0 -r myscienceproject/requirements.lockfile

Here a reproducible install involves two sources: PyPI and the place where you keep your lock file.
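The two-step incantation above could be wrapped in a small helper. This is only a sketch of the workflow described in this post, not a feature of pip or pipx: the function names are placeholders, and it assumes the lock file has already been fetched to a local path (for example by the git clone step).

```python
import subprocess
import sys
from pathlib import Path


def build_install_command(package: str, version: str, lockfile: Path) -> list[str]:
    """Compose a pip command installing one release plus its pinned requirements."""
    return [
        sys.executable, "-m", "pip", "install",
        f"{package}=={version}",
        "-r", str(lockfile),
    ]


def reproducible_install(package: str, version: str, lockfile: Path) -> None:
    """Run the install; raises CalledProcessError if pip fails."""
    if not lockfile.is_file():
        raise FileNotFoundError(f"lock file not found: {lockfile}")
    subprocess.run(build_install_command(package, version, lockfile), check=True)
```

Calling reproducible_install("myscienceproject", "1.1.0", Path("myscienceproject/requirements.lockfile")) would run the same pip command as the second line of the shell example.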

ketozhang · March 2, 2024, 7:14am

PS: I had to remove a link to Python Discourse because it was disallowed here. “projects not meant to generate wheels” is a topic title in Python Discourse that discusses applications and other project structures & workflows more broadly.

NickleDave · March 3, 2024, 5:18pm

Hi @ketozhang can you say a little bit more about what you mean by “application” here?

As your comment says, I think some data on this would help.

When I hear “application” I imagine something like an app with a Python backend. In my mind, there is a ton of support for applications.

See for example the distinction between “applications” and “libraries” in this 2021 Python Developer’s survey:

Notice how many questions there are about applications and the tools that application developers use.

I think you mean something like a “project”, judging from your other comments below.
By “project” I mean, “code that reproduces a computational research result”, like the code that accompanies a paper. Like the way the term is used here.

E.g., your comment asking “how would you share code with a paper and instruct someone using that code to install all the needed dependencies, including the libraries”.
You are of course right that there’s not a lot of great solutions for this right now.
But I don’t think “a way to declare dependencies at the top of a Python script” will provide a solution.
What you really need is an easy way to capture and reproduce the entire computational environment:
https://the-turing-way.netlify.app/reproducible-research/renv/renv-options

Lockfiles are a key element of better solutions for capturing the environment, I think, since those files can capture transitive dependencies better.
(That’s why things like lockfiles were first developed by some of the dev tools you mention, poetry and pdm, for programmers deploying applications!)
The core Python devs seem well aware this is a gap in the ecosystem and Brett Cannon in particular has put a lot of effort into developing a standard:

edit: I see you also brought up lockfiles. Glad we agree

Regardless, I still don’t think making it easier to install a script will make things better in the long run. We should teach people how to use a pyproject.toml, with a minimal example like @sneakers-the-rat gave. And, yes, we need better support for lockfiles that let us directly pin the whole environment.

re: your question about data on how many projects need this use case, I think it would be even better to have data about how many of those projects would benefit from specifically pinning the binaries used. Could there be numerical error due to native dependencies? Pinning the binaries is of course possible with conda in a way that we can’t do with pure Python.
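One reason native dependencies can matter: floating-point addition is not associative, so two builds of a numerical library that sum the same values in a different order can legitimately return slightly different results. A tiny pure-Python illustration (the helper names are made up for this example):

```python
from functools import reduce


def sum_left_to_right(values):
    """Accumulate from the first element to the last."""
    return reduce(lambda acc, x: acc + x, values)


def sum_right_to_left(values):
    """Accumulate from the last element to the first."""
    return reduce(lambda acc, x: x + acc, reversed(values))


values = [0.1, 0.2, 0.3]
print(sum_left_to_right(values))  # 0.6000000000000001
print(sum_right_to_left(values))  # 0.6
```

Whether a difference of this size affects a published result depends on the computation, which is why data on how often it matters would be useful.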

willingc · March 3, 2024, 11:03pm

Hi @ketozhang,

It appears that someone flagged your original post and reply to me.

However, I don’t see the current status of pip ecosystem’s packaging (i.e., PyPA specs and tools) fitting for reproducibility.

I have been working with reproducible scientific environments in Python since 2012. As a core maintainer of JupyterHub, BinderHub, repo2docker, and Papermill as well as a Python core developer (and former Steering Council member), I understand well the complexity of pip and scientific reproducibility.

Is pip the perfect tool for all cases? It is not. Yet pip serves a useful purpose in Python language development and pure Python libraries. Also, the distribution service of PyPI and the maintainers of the PyPA organization have devoted many volunteer hours to understanding and improving the ecosystem.

One of the things that makes reproducibility difficult is that many elements including code, data, third party libraries, hardware configuration, operating system versions come into play for something to be truly reproducible.

I’ve found over the years that the best way to drive improvement in open source is to partner on solutions and create constructive dialogue. Disparaging a library, project, or ecosystem rarely delivers substantive results.

willingc · March 4, 2024, 2:27am

We designed mybinder.org to cover this use case. Distribute a mybinder link and an ephemeral environment can be launched.

ketozhang · March 4, 2024, 6:26am

I’ve read your replies, but let me see if I can get an admin to look at why my messages are being marked as spam (by the @system bot). Can I get one of the admins to look at this (@Jesse , @lwasser , @cmarmo)?

lwasser · March 4, 2024, 10:48pm

ok things are (i think) unblocked now. Discourse doesn’t like when someone who is newer to posting here tries to post several things in a short span of time that have links. it gets flagged as spam. that is what i think happened. but ping me if this happens again.

lwasser · March 4, 2024, 10:56pm

A general note on this discussion. I think it would be useful and perhaps easier to receive if the various types of sharing were clearer.

for instance sharing a

web app (flask, django, or pyodide type of thing, i think) + associated requirements, would look very different than
sharing a script + requirements which could become the simplest version of a package (an installable thing) with minimal knowledge of the broader packaging ecosystem as Jonny pointed out above vs
sharing jupyter notebooks which thanks to the work that @willingc chris holdgraf yuvi and others in our amazing community have done can be shared as a runnable thing via tools such as binder (with docker running on the back end somewhere) vs
Sharing an entire research compendia (i am not sure exactly how that looks different from the script but it is a different thing that is often paper related as david notes above) vs
some other thing someone wants to share??

I think it’s hard to talk about all of these things in one thread effectively because they have different use cases and requirements. AND we haven’t even discussed data yet. But what jumps out to me is that i know the least about support for (web) apps and that is probably the furthest outside of our current capacity / scope but certainly of interest in the spirit of open science.
