Libraries to automate building datasets? like make + snakemake

NickleDave · July 20, 2019, 1:42pm

Hi all,

I’m just curious what suggestions people have for libraries to help automate building datasets.

I.e., I have a package I work on that takes a bunch of different repositories, mostly from figshare and under Creative Commons licenses, then takes a subset of files from each repository and makes them into a .tar.gz. So what the package would provide is a central place with many different small samples of the same general type of data that other packages/libraries can use, e.g. when they need example datasets.

I’ve tried working with make and snakemake but it feels like I’m kind of going outside of the abstraction of those tools. Maybe it’s just me not having enough experience with either. But to me it feels like both are focused on files that already exist locally, that then become another file, whereas I’m usually downloading a bunch of files, which then turn into directories, which then turn into .tar.gz. So I end up writing some functions to do conditional checks on directories that don’t feel very natural to me, esp. with bash code.
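For reference, the download, directory, then .tar.gz flow described above can be sketched in plain Python, with the existence checks written as ordinary `if` statements instead of bash. This is only a minimal sketch under assumptions: `build_sample_archive` and its arguments are hypothetical names, the URLs are assumed to point directly at individual files, and an existing archive is assumed to be up to date.

```python
import tarfile
import urllib.parse
import urllib.request
from pathlib import Path


def build_sample_archive(urls, work_dir, archive_path):
    """Download each URL into work_dir, then pack work_dir into a .tar.gz.

    Hypothetical helper: files and archives that already exist are skipped,
    so re-running it only does the missing work.
    """
    work_dir = Path(work_dir)
    archive_path = Path(archive_path)
    work_dir.mkdir(parents=True, exist_ok=True)

    for url in urls:
        # Name the local file after the last component of the URL path.
        target = work_dir / Path(urllib.parse.urlparse(url).path).name
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))

    # Assumption: an existing archive is current. Delete it to force a rebuild.
    if not archive_path.exists():
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(work_dir, arcname=work_dir.name)
    return archive_path
```

Calling `build_sample_archive(list_of_urls, "sample", "sample.tar.gz")` twice would download and archive on the first call and do nothing on the second.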

Any feedback would be welcome, thanks everybody

lwasser · July 22, 2019, 7:34pm

Hey @NickleDave, welcome to the pyOpenSci Discourse!! I somehow missed this post. I'm a bit unclear about what you are doing. It sounds like you are downloading data from figshare, which is typically already zipped, but I'm unsure why there is a tar compression step and how the output would be used. Any clarification? Also pinging @mbjoseph on this one, as he may better understand what you are trying to do.

mbjoseph · July 22, 2019, 7:58pm

Hi @NickleDave!

Snakemake has some support for remote files, so that you could include the URLs of the files on FigShare in a rule. From the example here: https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#read-only-web-http-s

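import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

# Provider object that lets rules refer to files served over HTTP(S)
HTTP = HTTPRemoteProvider()

rule all:
    input:
        # Remote file to fetch; keep_local=True keeps the downloaded copy on disk
        HTTP.remote("www.example.com/path/to/document.pdf", keep_local=True)
    run:
        # Move the downloaded file into the working directory under its base name
        outputName = os.path.basename(input[0])
        shell("mv {input} {outputName}")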

I haven’t done this personally, but perhaps it’s useful if you’re not already doing something similar.

lwasser · July 22, 2019, 10:15pm

thank you @mbjoseph !!

NickleDave · July 23, 2019, 2:40am

Thank you @mbjoseph and @lwasser!
That does look like what I had in mind. For some reason it hadn’t occurred to me that there might be support for that in Snakemake. Glad I asked.

Leah, this is the package in process in case it helps clarify what I’m trying to do:

Basically, the figshare repos often have hundreds of files, and I just want a handful from each repo so I can test that some packages I've written can parse different formats, and so I can use them as very-quick-to-download examples.
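As a rough illustration of the "handful from each repo" step, here is a minimal sketch that keeps a fixed number of files per file extension from a list of names. The function name `pick_sample`, the `per_extension` parameter, and the example file names are all hypothetical; how the file listing is obtained from figshare is not covered here.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pick_sample(filenames, per_extension=2):
    """Keep at most `per_extension` file names for each file extension."""
    by_ext = defaultdict(list)
    # Sort first so the same files are chosen on every run.
    for name in sorted(filenames):
        ext = PurePosixPath(name).suffix.lower()
        if len(by_ext[ext]) < per_extension:
            by_ext[ext].append(name)
    # Return the chosen names grouped by extension, in extension order.
    return [name for ext in sorted(by_ext) for name in by_ext[ext]]


# Hypothetical listing from one repository
files = ["song3.wav", "song1.wav", "song2.wav", "song1.mat", "notes.TextGrid"]
sample = pick_sample(files, per_extension=2)
```

The resulting list could then be turned into download URLs and archived, so each repository contributes only a few small files per format.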
