Is MANIFEST.in still needed to include data in a package built with setuptools?

Hi all,

I’m posting a question here from a review.
Is MANIFEST.in still needed for including data files in a package built with setuptools?

My read of this page Data Files Support - setuptools 68.2.2.post20231016 documentation is that we can avoid adding a MANIFEST.in file by using the package_data options described in this section:
Data Files Support - setuptools 68.2.2.post20231016 documentation

E.g., with a pyproject.toml file we would add

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
mypkg = ["*.txt", "*.rst"]

So for the review where this question came up (rdata, read R datasets from Python · Issue #144 · pyOpenSci/software-submission · GitHub), you would do:

[tool.setuptools.packages.find]
where = ["rdata"]

[tool.setuptools.package-data]
mypkg = ["*.rda", "*.rds"]

Have I got that right, or is there something I’m missing?


i can’t answer this, but i wanted to share a post from one of the setuptools maintainers.

he writes:

My personal opinion is that MANIFEST.in is only needed if you want a high degree of customization and/or are not happy with using VCS (e.g. there are people who disagree on a conceptual level with using VCS info for builds).

There is some information about it on: Controlling files in the distribution - setuptools 69.0.2.post20231122 documentation.

Shall we see what he has to say about your question here?

I am happy to help; however, this question seems to come with a lot of context that I might not be aware of, and I may not have the time to read through all the previous issues, PRs, and discussions…

I can give my feedback about what would be the effect of the suggested configurations:

When I read this I imagine you have the following project structure:

.  #  project root
├── pyproject.toml
└── rdata/
    └── mypkg/
        ├── a.rda
        ├── b.rds
        ├── subpkg1/
        │   ├── mod1.py
        │   └── ... # does not include .rda, .rds files
        └── subpkg2/
            ├── mod2.py
            └── ... # does not include .rda, .rds files

If that is the case, the .whl file will contain the mypkg/a.rda and mypkg/b.rds files, and those will be installed in the site-packages directory. Later you can use importlib.resources to traverse the mypkg package and list all of its files.
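
As a rough illustration of that last step, here is a minimal sketch of listing data files with importlib.resources. The package name demo_mypkg, the helper list_data_files, and the throwaway package created on disk are all hypothetical; they only mimic the layout above so the snippet can run on its own.

```python
import sys
import tempfile
from importlib.resources import files
from pathlib import Path


def list_data_files(package, suffixes=(".rda", ".rds")):
    """Return the names of data files located directly inside `package`."""
    root = files(package)  # a Traversable pointing at the package contents
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.name.endswith(suffixes)
    )


# Demo only: create a throwaway package that imitates an installed one.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_mypkg"  # hypothetical package name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "a.rda").write_bytes(b"")
    (pkg_dir / "b.rds").write_bytes(b"")

    sys.path.insert(0, tmp)
    try:
        found = list_data_files("demo_mypkg")
    finally:
        sys.path.remove(tmp)

print(found)
```

With a real installation you would pass the actual import name of the package instead of creating one in a temporary directory.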


The actual structure is more like:

.  #  project root
├── pyproject.toml
└── rdata/
    ├── mod1.py
    ├── mod2.py
    └── tests/
        ├── test1.py
        ├── test2.py
        └── data/
            ├── a.rda
            └── b.rds

I am not using setuptools-scm. Is that the recommended best practice?

Instead, I have the following manifest file:

include MANIFEST.in
include LICENSE
include rdata/py.typed
include *.txt
global-include *.rda
global-include *.rds

I am not sure if all of these lines are necessary, or if there is a better way of doing it.

Finally, I made my project use importlib.resources. In order to do that, I use this code to have a global variable containing a path-like object that points to the data folder:

from importlib.resources import files
from typing import Final

from .parser._parser import Traversable


def _get_test_data_path() -> Traversable:
    return files(__name__) / "tests" / "data"


TESTDATA_PATH: Final[Traversable] = _get_test_data_path()

Here I have a few questions:

  1. Am I correct in assuming that the object created by files does not employ a file descriptor, and thus is fine to keep an object of this kind permanently alive?
  2. I had to copy the definition of the Traversable protocol, as it is not available in Python versions older than 3.11. Is there a better way to deal with this?
  3. When I originally typed the functions of my package that can receive file paths, I used the os.PathLike protocol. However, while doing this I learned that this protocol is just for paths in the file system, and thus zipfile.Path objects and the Traversable protocol are not compatible with it. I changed my code to be able to receive Traversable objects. However, I think that most of the Python community still thinks that the “generic” way to accept paths is to accept os.PathLike objects. I think that the Traversable protocol should probably be made the “general” path protocol, promoting it to the typing module and recommending that users update their code to accept Traversable objects wherever a pathlib.Path object would be accepted. What do you think?
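
To make the point in question 3 concrete, here is a small sketch (not taken from the rdata code base) showing that a zipfile.Path is not an os.PathLike, even though it can be read much like a pathlib.Path. The archive contents are made up for the example.

```python
import io
import os
import zipfile
from pathlib import Path

# Build a tiny zip archive in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/a.txt", "hello")
buffer.seek(0)

with zipfile.ZipFile(buffer) as archive:
    zipped = zipfile.Path(archive, "data/a.txt")
    # zipfile.Path does not define __fspath__, so it is not os.PathLike ...
    zip_is_pathlike = isinstance(zipped, os.PathLike)
    # ... but it can still be read through its own methods.
    zip_text = zipped.read_text(encoding="utf-8")

# A regular filesystem path does implement os.PathLike.
fs_is_pathlike = isinstance(Path("a.txt"), os.PathLike)

print(zip_is_pathlike, fs_is_pathlike, zip_text)
```

A function annotated with os.PathLike would therefore reject the zip-backed path at type-check time, which is the limitation described above.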

Personally, I like to use it and I recommend it (I don’t speak for the entire project here, but I see that the other setuptools maintainers also seem to like it in their projects).

setuptools-scm has 2 effects:

  1. Automatically compute version based on the git or hg tags (if you haven’t supplied any version configuration).
  2. Tell setuptools to add all files tracked by git or hg to the sdist.

In practice, if you are fine with having all files tracked in the repository included in the sdist (I am happy with that [1]), you don’t need to use MANIFEST.in, which streamlines the configuration and makes things easier…


The following questions are probably more related to importlib-resources and cpython… I did my best to comment on them, but maybe opening an issue/discussion in importlib-resources and/or cpython would be more effective.

On question 1: that is also my understanding (it seems to be confirmed in the importlib-resources docs, see https://importlib-resources.readthedocs.io/en/latest/using.html).

On question 2: maybe use the backport from importlib-resources (see https://importlib-resources.readthedocs.io/en/latest/api.html#importlib_resources.abc.Traversable)? From their docs, I have the impression that importlib_resources.abc.Traversable is public…
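
One possible way to avoid copying the protocol is a version-guarded import. The sketch below assumes, to the best of my knowledge, that Traversable lives in importlib.resources.abc from Python 3.11 on and was exposed as importlib.abc.Traversable in 3.9 and 3.10; the last branch relies on the third-party importlib_resources package being installed. The helper read_resource is hypothetical.

```python
import sys
import tempfile
from pathlib import Path

if sys.version_info >= (3, 11):
    from importlib.resources.abc import Traversable
elif sys.version_info >= (3, 9):
    # Assumption: 3.9 and 3.10 expose the protocol here.
    from importlib.abc import Traversable
else:
    # Third-party backport; must be declared as a dependency.
    from importlib_resources.abc import Traversable


def read_resource(resource: Traversable) -> bytes:
    """Read a resource through the Traversable interface only."""
    return resource.read_bytes()


# pathlib.Path offers the same methods, so it can stand in for the demo.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "a.rda"  # hypothetical file name
    sample.write_bytes(b"demo")
    payload = read_resource(sample)

print(payload)
```

The sys.version_info checks are used instead of try/except because static type checkers tend to understand them better.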

On question 3: that is probably something to be discussed in the Python typing council (see the specification section of PEP 729). I can see that there are already some open issues about this topic: Issues · python/cpython · GitHub.


Regarding @NickleDave’s suggestions in the first post, I don’t think those are appropriate for the package structure explained by @vnmabus.
The where = ["rdata"] setting tells setuptools to look inside the rdata folder, but does not include the folder itself… The existing configuration in the rdata GitHub repo seems more appropriate:

[tool.setuptools.packages.find]
include = ["rdata*"]

When I compare the MANIFEST.in with the suggestion:

and considering the following:

Then I would say that the following configuration will be enough to add the other files to the sdist without MANIFEST.in[2]:

[tool.setuptools.package-data]
"*" = ["*.rda", "*.rds"]

But it is always good to check the contents of the built sdist with tar tvf.
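
If you prefer to script that check, something along these lines could work. This is only a sketch: the helper sdist_data_files and the archive prefix rdata-0.0 are hypothetical, and the in-memory archive merely stands in for a real file under dist/.

```python
import io
import tarfile


def sdist_data_files(fileobj, suffixes=(".rda", ".rds")):
    """List archive members of a .tar.gz sdist that end with `suffixes`."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return sorted(name for name in tar.getnames() if name.endswith(suffixes))


# Demo only: build a fake sdist in memory instead of reading dist/*.tar.gz.
fake_sdist = io.BytesIO()
with tarfile.open(fileobj=fake_sdist, mode="w:gz") as tar:
    for name in (
        "rdata-0.0/pyproject.toml",  # hypothetical version prefix
        "rdata-0.0/rdata/tests/data/a.rda",
        "rdata-0.0/rdata/tests/data/b.rds",
    ):
        tar.addfile(tarfile.TarInfo(name), io.BytesIO(b""))
fake_sdist.seek(0)

data_files = sdist_data_files(fake_sdist)
print(data_files)
```

For a real project you would open the built archive in binary mode and pass the file object to the helper; an empty result would suggest the data files were left out.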


  1. Some people don’t like that (e.g. they don’t like their .github folder to be included in the sdist). ↩︎

  2. … if the project decides not to use setuptools-scm; with setuptools-scm these files should be included by default in the sdist anyway… ↩︎


thank you @abravalheri for your time. i wouldn’t expect you to dig into the full context so thank you for providing this overview. i also appreciate that you suggested opening issues in importlib-resources to ask questions in other package repos!! thank you!

@vnmabus FWIW i’m also a big fan of using setuptools_scm for versioning. As @abravalheri points out, once it’s set up it’s easy to automate your release / publish workflow on github without having to worry about manually bumping versions.

If you need help setting it up, we can likely get you there in a separate thread. @AlexanderJuestel actually just set this up with his package gemgis and can speak to the process as well.