I’m just curious what suggestions people have for libraries to help automate building datasets.
i.e., I have a package I work on that pulls from a bunch of different repositories, mostly Figshare deposits under Creative Commons licenses, takes a subset of files from each repository, and bundles each subset into a .tar.gz. The idea is that the package provides a central place with many different small samples of the same general type of data, that other packages/libraries can use, e.g. when they need example datasets.
I’ve tried working with make and Snakemake, but it feels like I’m kind of going outside the abstraction of those tools. Maybe it’s just me not having enough experience with either. But to me it feels like both are focused on files that already exist locally, which then become another file, whereas I’m usually downloading a bunch of files, which turn into directories, which then turn into a .tar.gz. So I end up writing functions to do conditional checks on directories, which doesn’t feel very natural to me, esp. in bash.
Any feedback would be welcome. Thanks, everybody!