Telemetry and open source software

Friends - ages and ages ago we had a package, hamilton, submitted to us that used opt-out telemetry. We started to review it and only then realized that it collects telemetry.

I was able to talk with the author, Stefan, at SciPy more about this. I think opt-in is the way to go for telemetry IF a package has it. But I would LOVE your input on the language that we use moving forward as a policy around telemetry in our guide. We can also consider adding some implementation best practices (this will be hard, I think) to our packaging guide in the future.

Here is the pull request with text we’d add to our packaging-guide

And here is the early discussion on GitHub.

We want users to trust pyOpenSci-vetted tools, and as such I think opt-in telemetry that asks them before collecting data (IF data are collected) is best, following all of the input we’ve had about this topic!

Please share any feedback that you have here.

Lots of conversation already in that GitHub issue and in the Slack; copying my previous thoughts here just so we have something on the page :slight_smile:

To zoom out: what’s the point of us having standards for the software we review at all? [one reason] is to raise the standard for scientific code […] part of that is a technical standard, but part of that should also be ethical standards - we shouldn’t review and lend legitimacy/resources to a DIY guided missile package.

as @Leah notes, because of the surveillance age we live in, telemetry is not a neutral concept, but also as @yuvipanda notes, there are legitimate reasons why developers want to gather data.

I would encourage this group to take a clear ethical stance on telemetry in addition to a technical description of what is and isn’t allowed in packages we agree to review. To me, this would include a statement about providing tools that empower people, rather than treating code as a trojan horse that serves as an inducement to use but ultimately serves the developers in a way that isn’t obvious to the people using the tool. Tools shouldn’t put people at risk in ways the people using them - and even the people developing them - wouldn’t anticipate: people need to provide explicit, affirmative, informed consent.

some explicit ideas for a standard:

  • Opt-in should be a bare minimum expectation. We shouldn’t let the normalization of universal tracking get confused for acceptability!
  • It should be possible to see exactly what is collected - not a gestural description, but the entire raw contents of what is collected along with annotations about what it means in the case that it is something that wouldn’t be immediately recognizable to the person using the tool like some UUID
  • Not opting in to telemetry should not degrade the use of the tool - consent cannot be coerced
  • It should be clear exactly how the data will be used and by whom, including any transmission through intermediate platforms like PostHog,
    • how the data will be stored and
    • why it is useful to the developers to collect.
  • There should be sufficient tests to ensure that if someone does not opt in, no information is transmitted. For example, in the hamilton package the implementation is poor: if e.g. the configuration file or environment variable has an error, then telemetry defaults to “on” even if the user intended to opt out - silently, I believe
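
To make that last point concrete, here is a minimal sketch of a "fail closed" consent check. It is not how hamilton or any other package does it; the environment variable name `MYPKG_TELEMETRY` and the function `telemetry_enabled` are made up for illustration. The idea is that only an exact affirmative value turns telemetry on, so a missing or mistyped setting always means "off".

```python
import os

# Hypothetical variable name, for illustration only.
ENV_VAR = "MYPKG_TELEMETRY"

# Only these exact values (case-insensitive, whitespace-trimmed) count as consent.
AFFIRMATIVE = {"1", "true", "yes", "on"}


def telemetry_enabled(environ=None):
    """Return True only if the user explicitly opted in.

    A missing, empty, or unrecognized value (including typos such as
    "ture") is treated as "no consent", so errors fail closed.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        return False
    return raw.strip().lower() in AFFIRMATIVE
```

Passing the environment in as a plain dict keeps the check easy to test: a test suite can assert that an unset variable, `"false"`, and a typo like `"ture"` all return `False`, which is the kind of test the last bullet asks for.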

@sneakers-the-rat I really appreciate this, and I’m glad you ported the text over from Slack. I feel better having this conversation openly here. So I second the items in your carefully thought-out list above of what we need to think about:

  • Opt-in should be a bare minimum expectation. We shouldn’t let the normalization of universal tracking get confused for acceptability!
  • It should be possible to see exactly what is collected - not a gestural description, but the entire raw contents of what is collected along with annotations about what it means in the case that it is something that wouldn’t be immediately recognizable to the person using the tool like some UUID
  • Not opting in to telemetry should not degrade the use of the tool - consent cannot be coerced
  • It should be clear exactly how the data will be used and by whom, including any transmission through intermediate platforms like PostHog,
    • how the data will be stored and
    • why it is useful to the developers to collect.
  • There should be sufficient tests to ensure that if someone does not opt in, no information is transmitted. For example, in the hamilton package the implementation is poor: if e.g. the configuration file or environment variable has an error, then telemetry defaults to “on” even if the user intended to opt out - silently, I believe

This is a tricky area, so I also wonder: do we want to have language and guidance around how to implement some of the above? For instance, ways that data could be stored that would be appropriate, etc.? I.e., help devs who generally do want to collect data to improve their dev work do this well, with suggestions?
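
As a rough idea of what such guidance could look like, here is a small sketch covering two of the items in the list: showing the entire raw payload with an annotation for each field, and sending nothing unless the user opted in. Everything in it is hypothetical (the field names, `FIELD_NOTES`, `build_payload`, `describe_payload`, `maybe_send`), and the `send` argument stands in for whatever transport a package would really use.

```python
import json
import uuid

# Hypothetical plain-language notes for every field that could be collected.
FIELD_NOTES = {
    "install_id": "random UUID generated locally; not derived from your machine or account",
    "package_version": "version of the package that is running",
    "function_name": "name of the public function that was called",
}


def build_payload(function_name, package_version, install_id=None):
    """Build the exact dictionary that would be transmitted."""
    return {
        "install_id": install_id or str(uuid.uuid4()),
        "package_version": package_version,
        "function_name": function_name,
    }


def describe_payload(payload):
    """Return the raw payload as JSON, followed by a note for each field."""
    lines = [json.dumps(payload, indent=2, sort_keys=True), ""]
    for key in sorted(payload):
        lines.append(f"{key}: {FIELD_NOTES.get(key, 'UNDOCUMENTED FIELD')}")
    return "\n".join(lines)


def maybe_send(payload, opted_in, send):
    """Call send(payload) only when the user has opted in; return whether it was sent."""
    if not opted_in:
        return False
    send(payload)
    return True
```

A package could print `describe_payload(...)` as part of the opt-in prompt, so people see exactly what they are agreeing to, and a reviewer could check that any field reported as `UNDOCUMENTED FIELD` is treated as a bug.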