Libraries to automate building datasets? like make + snakemake

Hi all,

I’m just curious what suggestions people have for libraries to help automate building datasets.

I.e., I have a package I work on that takes a bunch of different repositories, mostly from figshare and under Creative Commons licenses, takes a subset of files from each repository, and makes them into a .tar.gz. So what the package I’m working on would provide is a central place with many different small samples of the same general type of data that other packages/libraries can use, e.g. when they need example datasets.

I’ve tried working with make and snakemake but it feels like I’m kind of going outside of the abstraction of those tools. Maybe it’s just me not having enough experience with either. But to me it feels like both are focused on files that already exist locally, that then become another file, whereas I’m usually downloading a bunch of files, which then turn into directories, which then turn into .tar.gz. So I end up writing some functions to do conditional checks on directories that don’t feel very natural to me, esp. with bash code.

Any feedback would be welcome. Thanks, everybody!


Hey @NickleDave, welcome to the pyOpenSci Discourse!! I somehow missed this post. I’m a bit unclear about what you are doing. It sounds like you are downloading data from figshare, which typically is already zipped, but I’m unsure why there is a tar compression step and how the output would be used. Any clarification? Pinging @mbjoseph on this one too, as he may better understand what you are trying to do.

Hi @NickleDave!

Snakemake has some support for remote files, so that you could include the URLs of the files on FigShare in a rule. From the example here: https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#read-only-web-http-s

import os
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

rule all:
    input:
        HTTP.remote("www.example.com/path/to/document.pdf", keep_local=True)
    run:
        outputName = os.path.basename(input[0])
        shell("mv {input} {outputName}")

I haven’t done this personally, but perhaps it’s useful if you’re not already doing something similar.


Thank you @mbjoseph!!

Thank you @mbjoseph and @lwasser!
That does look like what I had in mind. For some reason it hadn’t occurred to me that there might be support for that in Snakemake. Glad I asked.

Leah, this is the package in progress, in case it helps clarify what I’m trying to do:

Basically, the figshare repos often have hundreds of files, and I just want a handful from each repo so I can test that some packages I’ve written can parse different formats, and so I can use them as very-quick-to-download examples.
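
For readers following along, here is a minimal sketch of that "download a handful of files, then pack them into a .tar.gz" workflow in plain Python, using only the standard library. It is not the code from the package discussed above. The function names `fetch` and `build_archive`, the `urls` mapping, and the directory layout are all hypothetical, and the URLs you pass in would be whatever direct download links your repository exposes. The existence checks stand in for the conditional directory checks that felt awkward in bash.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the download if dest already exists."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest


def build_archive(name: str, urls: dict[str, str], workdir: Path) -> Path:
    """Fetch the given files into workdir/name and pack them as name.tar.gz.

    urls maps a local file name to its download URL (hypothetical layout).
    """
    archive = workdir / f"{name}.tar.gz"
    if archive.exists():  # already built, nothing to do
        return archive
    sample_dir = workdir / name
    sample_dir.mkdir(parents=True, exist_ok=True)
    for filename, url in urls.items():
        fetch(url, sample_dir / filename)
    # Store everything under a top-level folder named after the sample set.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(sample_dir, arcname=name)
    return archive
```

Calling `build_archive("some-sample", {"a.wav": "<direct download URL>"}, Path("build"))` would leave `build/some-sample.tar.gz` behind, and a second call returns immediately because the archive already exists. That is roughly the "only rebuild what is missing" behavior make and Snakemake provide, without their dependency tracking.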
