How to avoid repeating a long list of keyword options throughout a package?

Hello, I am working on a new package called sciform which is used to convert numbers into strings formatted according to a wide range of possible options. Many users will be familiar with options provided by the built-in format specification mini-language (FSML) which can be used to format numbers into e.g. standard and scientific notation with control over displayed precision. sciform can be thought of as providing extended functionality beyond that provided in the FSML.

I am the sole maintainer of the package and it is still young with a regularly changing API. A few questions have come up about the scope of functionality that the package should provide. I would like to ask for advice from folks here about how I should set the scope of this package. I’ve made a presubmission inquiry and review submission with pyOpenSci.

For reference, here is an example usage of the package:

from sciform import Formatter, RoundMode, GroupingSeparator, ExpMode
sform = Formatter(round_mode=RoundMode.PREC,
                  precision=6,
                  upper_separator=GroupingSeparator.SPACE,
                  lower_separator=GroupingSeparator.SPACE)
print(sform(51413.14159265359))
# 51 413.141 593

See Examples — sciform 0.21.0 documentation (and test-suite link therein) for more example usages. See Formatting Options — sciform 0.21.0 documentation for a listing of many of the available configuration options.

My issue is handling all of these options in code. My basic strategy is to store all of the possible options and their values in a FormatOptions dataclass https://github.com/jagerber48/sciform/blob/main/src/sciform/format_options.py#L14 which I think is a good move. The question is then how does this dataclass get constructed (from user input) and how does this dataclass get consumed? Currently there are 26 options in the dataclass and more than 10 places throughout the code where I need to re-write the entire list of these options (I try to explain why below). This (1) results in a major explosion of the length of the code and (2) is a very big pain when I want to add an option or slightly change the naming of an existing option since I have to go through all those different places in code.

This question is asking for help reducing the number of repetitions of this long list of options.

More details about why the options appear so many times.

  • The user never directly interfaces with the FormatOptions object. They either create a Formatter object (to which they can pass in all of the options) or they make a SciNum or SciNumUnc* which they format like f'{SciNum(123.456):!3e}'. During this formatting the format specification string !3e is parsed into a FormatOptions object.
  • The FormatOptions object is consumed by either the format_num or format_val_unc formatting functions
  • The user does not need to specify all values for the different formatting options. Any option which they do not specify will be populated by the default values. The default values are user configurable. This is a big part of why the complete options list appears so many times throughout the code. The default options are stored in a DEFAULT_PACKAGE_OPTIONS FormatOption instance which can be overwritten using a few helper functions with new user-configured default values.

And then one more note: Many of the places where the options appear could be collapsed if I used the **kwargs construction in some places. But I’m trying to avoid that because I prefer the explicitness of listing out the options and I also use static type checking.

Here are all the places where the complete list of options occurs.

  • In the FormatOptions dataclass definition
  • FormatOptions.make() signature. This function accepts a defaults: FormatOptions input along with a list of all options for a new FormatOptions. It then makes a new FormatOptions taking values from the user supplied kwargs if present, or from the passed in default FormatOptions if a specific option is not supplied.
  • FormatOptions.make() body. Here the input kwargs are being parsed to determine if they should be taken from user input kwargs or defaults passed in.
  • FormatOptions.make() return statement. Here the new FormatOptions() is actually constructed and returned.
  • FormatOptions.from_format_spec() return statement. Here a format specification string is parsed and a new FormatOptions is returned.
  • DEFAULT_PACKAGE_OPTIONS is the FormatOptions that is used as the default if the user doesn’t supply defaults. This is what is typically passed in as defaults to FormatOptions.make().
  • set_global_defaults() signature. This is a function that overwrites DEFAULT_PACKAGE_OPTIONS based on user input.
  • set_global_defaults() body.
  • GlobalDefaultsContext.__init__() signature. This is a context manager that temporarily replaces some default options based on user input.
  • GlobalDefaultsContext.__init__() body.
  • Formatter.__init__() signature. The Formatter accepts user input and constructs and stores a FormatOptions object that it uses for formatting.
  • Formatter.__init__() body.
  • Two occurrences in documentation doctests under usage

Basically any function that returns a FormatOptions will typically have two occurrences of the complete options list: one in the function signature and again during actual construction of the FormatOptions that is returned.
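To make the pattern concrete, here is a pared-down, hypothetical illustration of that repetition, using two stand-in options instead of 26 (none of these names are the actual sciform code):

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Options:
    precision: int   # 1) the option list in the dataclass definition
    separator: str

def make(defaults: Options,
         precision: int | None = None,   # 2) the list again in the signature
         separator: str | None = None) -> Options:
    if precision is None:                # 3) again in the body, merging defaults
        precision = defaults.precision
    if separator is None:
        separator = defaults.separator
    return Options(precision=precision,  # 4) again in the return statement
                   separator=separator)

defaults = Options(precision=6, separator=" ")
opts = make(defaults, precision=3)  # separator falls back to the default
```

With 26 options, each of those four sites repeats the full list, which is exactly the explosion described above.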

As I said above, the biggest pain (and possibility for errors) is when adding a new option or modifying an existing one. I have to go through ALL of these occurrences and update the code. I’m curious if there are other patterns I can use to help minimize the number of these occurrences.

*SciNum and SciNumUnc are newer names from a currently unreleased version of the code for the sfloat and vufloat objects referred to in the docs and source code.

3 Likes

Hi @jagerber48. Welcome and thanks for the questions.

I am the sole maintainer of the package and it is still young with a regularly changing API.

Congrats on the project. Just curious, is this the first project that you are maintaining? Totally cool if it is or isn’t. Being a maintainer is something that evolves over time. One suggestion to help your users while the API is changing is to add a note in the project’s README that is explicit that the API may change.

This (1) results in a major explosion of the length of the code and (2) is a very big pain when I want to add an option or slightly change the naming of an existing option since I have to go through all those different places in code.

Doing a quick read of your post, I think that you have identified 2 main issues. For (1), it’s not unusual for formatting projects to have a lot of options and variables. I wouldn’t worry about the length of the code itself. I would step back and ask if these options and variables, as named, are easily understandable by a new user of the library. Good human-understandable names reduce the maintenance/support question burden and documentation needed.

For (2), you are on the right track to avoid code duplication and the need to change different parts of the code when making an addition. One approach is to use kwargs in your functions to pass a bunch of options. As for using classes or dataclasses, they can help with this. One thing that I tend to do is to step back and consider if the options have a common relationship in the real world. If so, they may make a good class or collection.

Hope this helps a bit. Feel free to iterate here too since I’m sure others will have insights as well.

Carol

2 Likes

Welcome @jagerber48 and thank you for this question! I see this happen a lot in my own projects and others.

Just to echo @willingc

  • I think it might be good to add an “under construction” note on the README
  • “One thing that I tend to do is to step back and consider if the options have a common relationship in the real world. If so, they may make a good class or collection.” – sounds like you are already moving towards this with the FormatOptions class

In fact you kind of hint at this here:

So my question is:
why not make all of these functions take a single argument, FormatOptions, and let users work with FormatOptions directly?

That way you won’t need to rewrite / add / remove parameters painstakingly to each function, and you’ll also be able to do typing easier – this function takes an instance of FormatOptions

Is there a reason I’m missing why that wouldn’t work?

Again to quote @willingc

I would step back and ask if these options and variables, as named, are easily understandable by a new user of the library.

If you just let a user work with FormatOptions directly, you could also add a classmethod that lets them get different defaults from a string instead of using enums, which might feel unintuitive to some users. E.g. instead of

from sciform import Formatter, ExpMode, RoundMode
sform = Formatter(exp_mode=ExpMode.FIXEDPOINT)

what if you could do something like

import sciform
sform = sciform.Formatter(sciform.FormatOptions.from_default(default='fixedpoint'))

I find classmethods a really helpful way to provide this kind of thing to users so they don’t need to type out 1k parameters even though I need to have them around–as suggested by Hynek in the attrs docs: Initialization - attrs 23.1.0 documentation
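A minimal sketch of such a classmethod, assuming a hypothetical FormatOptions dataclass and made-up preset names (not sciform's actual API):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FormatOptions:
    exp_mode: str = "scientific"
    precision: int = 6

    @classmethod
    def from_default(cls, default: str, **overrides) -> "FormatOptions":
        # Map user-friendly preset names to pre-built option sets
        presets = {
            "fixedpoint": cls(exp_mode="fixed_point"),
            "scientific": cls(exp_mode="scientific"),
        }
        # Start from the named preset, then apply any keyword overrides
        return replace(presets[default], **overrides)

opts = FormatOptions.from_default("fixedpoint", precision=3)
```

Users get a plain-string entry point, while the full enum-based interface remains available for anyone who wants it.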

2 Likes

Thanks for your very detailed, thoughtfully considered, comprehensively explained and thoroughly researched question! While the typical answer to this sort of duplication issue would involve a custom class/dataclass, or possibly using **kwargs, it’s clear you’ve already carefully thought through those approaches and either employed them or (due to your constraints/requirements) had to reject them.

First, a few high-level thoughts on the problem space and what makes this difficult:

  • Fundamentally, there’s going to be a tradeoff between code being explicit, static and introspectable, and being as concise and non-duplicative as possible (which generally requires dynamism). However, with appropriately-structured code, the former should generally allow your type checker to tell you if and where you’re missing any options.

  • Likewise, there’s a tradeoff between exposing these as individual args for user convenience, and reducing duplication, complexity and potential bugs in your codebase by marshaling everything into a dataclass/custom class as early as possible and requiring users to pass that directly.

  • Finally, there’s a tradeoff between reduced user code duplication and increased flexibility (by relying on mutable global state and offering a variety of ways to configure it) vs. increased overall complexity and duplication in the library code. In particular, a big chunk of the complexity is the fact that the actual options used at any invocation depend on (and can be modified by) globally mutable default state, and can also be modified by helper functions, context managers and of course local arguments, rather than being set statically by the user creating a FormatOptions instance directly. This is the key bit that prevents the naive “just use a dataclass everywhere” approach others have suggested from working as-is (without the modifications suggested below).

Given all that, there’s not really an obvious magic bullet here; pretty much any meaningful, plausible improvement in favor of reduced duplication will require some sacrifice on the other side of those tradeoffs (i.e. either giving up static introspectability, modifying your user-facing API, or reducing the overall complexity of the config system), at least without something more exotic like building code-generation tooling or radically rethinking your design.

However, it seems to me that with some restructuring and modest relaxing of some of the constraints/requirements (particularly around your user interfaces/interaction patterns, since you’ve stated your API is still in flux and that trade seems the least directly harmful while producing the most benefit), you might be able to find a much more optimal point in terms of reducing duplication while not regressing or even improving elsewhere.

Basically, the high-level strategy I suggest is:

  • Get all your options into a dataclass (or custom class) as early, and in as few places, as possible and centralize the logic there, to minimize the number and separation of the different places the options are listed.
  • Have users construct that dataclass (possibly via some wrapper) and pass it around where needed, so that all your public-facing APIs, except for the one(s) that construct the dataclass itself, accept it directly
  • The key bit: Separate your input dataclass with just the values the user actually passes for each of the params (and otherwise some sentinel) from the rendering of the dataclass with its defaults, with a render() method called at format time on the input dataclass that takes a defaults argument with the defaults to insert, and returns a fully-filled dataclass—either of the same type, or (safer and more correct, though does require some duplication) a separate type with no sentinel and no default values.

This reduces the amount of duplication down to 1-2 places (the one or two dataclass definitions, depending on the options above) where you need to define all the parameter names, and one place (DEFAULT_PACKAGE_OPTIONS) where you define their defaults. For the shortest though not as static/type-correct approach, this looks like:

class Sentinel:
    pass

@dataclass
class FormatOptions:
    round_mode: RoundMode | Sentinel = Sentinel()
    precision: int | Sentinel = Sentinel()
    upper_separator: GroupingSeparator | Sentinel = Sentinel()
    ...

    def render(self, defaults: FormatOptions | None = None) -> FormatOptions:
        if defaults is None:
            defaults = DEFAULT_PACKAGE_OPTIONS
        # Start from the defaults, overriding with any non-sentinel user values
        return replace(
            defaults,
            **{k: v for k, v in asdict(self).items()
               if not isinstance(v, Sentinel)})

DEFAULT_PACKAGE_OPTIONS = FormatOptions(
    round_mode=RoundMode.PREC,
    precision=6,
    upper_separator=GroupingSeparator.SPACE,
    ...
)

Or for the more explicit, introspectable and correct approach I recommend:

@dataclass
class InputFormatOptions:
    round_mode: RoundMode | Sentinel = Sentinel()
    precision: int | Sentinel = Sentinel()
    upper_separator: GroupingSeparator | Sentinel = Sentinel()
    ...

    def render(self, defaults: RenderedFormatOptions | None = None) -> RenderedFormatOptions:
        if defaults is None:
            defaults = DEFAULT_PACKAGE_OPTIONS
        return RenderedFormatOptions(
            round_mode=defaults.round_mode if isinstance(self.round_mode, Sentinel) else self.round_mode,
            precision=defaults.precision if isinstance(self.precision, Sentinel) else self.precision,
            upper_separator=defaults.upper_separator if isinstance(self.upper_separator, Sentinel) else self.upper_separator,
            ...
        )

@dataclass
class RenderedFormatOptions:
    round_mode: RoundMode
    precision: int
    upper_separator: GroupingSeparator
    ...

DEFAULT_PACKAGE_OPTIONS = RenderedFormatOptions(
    round_mode=RoundMode.PREC,
    precision=6,
    upper_separator=GroupingSeparator.SPACE,
    ...
)

(There are also, of course, options in between these two, e.g. using the “shortcut” approach for render() and treating it as a typing blackbox, which is not fully statically or dynamically type safe and requires manually ensuring that InputFormatOptions and RenderedFormatOptions have matching fields and types, but cuts out a lot of the duplication).

This covers FormatOptions() and DEFAULT_PACKAGE_OPTIONS explicitly. For the others:

  • FormatOptions.make(..., defaults) can be replaced with the equivalent InputFormatOptions(...).render(defaults)
  • For FormatOptions.from_format_spec(), it will still need to pass the specific args that can be extracted from a format spec to InputFormatOptions() in the return, but there’s no way around that, and it does allow deferring render() if only the format spec args are desired, which may be useful.
  • set_global_defaults(), GlobalDefaultsContext() and Formatter() can just accept an InputFormatOptions and output/set a RenderedFormatOptions with input_options.render(DEFAULT_PACKAGE_OPTIONS)
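Putting the pieces together, here is a minimal runnable sketch of the input/rendered split, with made-up two-option classes standing in for the real 26-option ones:

```python
from __future__ import annotations
from dataclasses import dataclass

class Sentinel:
    pass

_SENTINEL = Sentinel()

@dataclass(frozen=True)
class RenderedOptions:
    precision: int
    separator: str

@dataclass(frozen=True)
class InputOptions:
    precision: int | Sentinel = _SENTINEL
    separator: str | Sentinel = _SENTINEL

    def render(self, defaults: RenderedOptions) -> RenderedOptions:
        # Take the user's value where set, otherwise fall back to defaults
        return RenderedOptions(
            precision=defaults.precision
            if isinstance(self.precision, Sentinel) else self.precision,
            separator=defaults.separator
            if isinstance(self.separator, Sentinel) else self.separator,
        )

DEFAULTS = RenderedOptions(precision=6, separator=" ")
rendered = InputOptions(precision=3).render(DEFAULTS)
```

The option list appears only in the two class bodies and the render() body; everything else passes whole objects around.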

So, seems like this might work while still mostly meeting your requirements and constraints?

4 Likes

Wow, thank you all for the responses! This is so helpful!

First off, I’ll definitely put a prominent note in the docs (readme probably) that the API is not stable yet.

@willingc Yes, this is the first project that I am maintaining, I’m learning a lot as I go and so far pyopensci has been very helpful!

@NickleDave Yes, I had considered having the user pass in FormatOptions, this is similar to @CAM-Gerlach’s suggestion so I’ll address it there.

@CAM-Gerlach wow, thank you for this response. You 100% captured what I’m trying to accomplish, what I’ve tried, and you’ve spelled out exactly the design constraints that I’ve been weighing in my head, but in a much more clear way than I was thinking about them.

I had previously toyed around with having the user construct and pass in FormatOptions to the Formatter and global settings configuration functions but there were two issues.

  1. It puts a little more work on the part of the user. They have to learn about the FormatOptions class in addition to the other functions and, in code, they have to perform the extra step of creating a FormatOptions class before they can create a Formatter or configure global options.
  2. I struggled a little bit getting the logic correct to have the default arguments injected from the global defaults. I think I had some ideas but they felt clunky.

Point 2 is thoroughly addressed by CAM-Gerlach’s suggestion about using a render method that converts a partially initialized set of options into a fully initialized set of options, patching with the global defaults. In fact the suggestion there is an improvement over what I have now because right now, for the Formatter, the global defaults are applied at initialization time rather than format time*. Of course either behavior could be correct depending on how I want it to be, but I think applying the defaults at format time makes more sense.

Regarding point 1. As you point out, there is a trade off between exposing the individual arguments for user convenience and reducing duplication/complexity in the codebase. But I think I’m agreeing with you and @NickleDave that this trade off is not really that bad for users. For the way I use the code and imagine a lot of others using it it results in an extra line of code to setup the FormatOptions before passing it into a single Formatter used for the whole script. But, I think it can also give the users nice flexibility in some ways. For example, right now the FormatOptions class is immutable so the Formatter is immutable. But maybe I could make the partially initialized UserFormatOptions mutable so that users could mutate it (or if I want it to also be immutable then users could have an easy way to duplicate + modify) and pass it around on the fly if they have reason to use multiple different sets of options (this needs to be thought about a little carefully). I could also make some sets of anticipated or user-requested “standard” option sets beyond just the package defaults that users could import and use in their formatters. I could even include Formatters pre-configured with these standard options so that users lucky enough to want to use one of these standard options sets can skip the “construct FormatOptions” step (of course something like this could probably be done the way the code is currently structured, but thinking about FormatOptions in this way made the idea feel a little more natural).

For the record, it feels like a little bit of convenience is lost on the user-facing end for some of the simple use cases, but, still on the user-facing end, a little bit might be gained by thinking in terms of passing around FormatOptions objects, and, as discussed, a lot will be gained on the back-end in terms of reduced complexity and bug potential.

I will definitely try out what’s been suggested here and I think there’s a good chance I’ll go with it. I’ll probably go with the recommended option involving two classes, one partially filled and the other totally filled.

@CAM-Gerlach one final general question coming from someone still moving out of the shallow end of the code development pool about a comment you made. You said “you might be able to find a much more optimal point in terms of reducing duplication while not regressing or even improving elsewhere.” What exactly do you mean by “regressing”? The suggestion to require users to pass in some sort of FormatOptions class instead of direct keyword arguments will allow all of the core functionality to remain intact, but, technically, my tests will fail under this change since they use the old API. The tests will need a trivial update to use the new API. Is this considered a regression or not? Maybe not, since it is an intentional API change, announced in the release, so the test API is expected to need to change?

*When using the SciNum class and formatting via the format specification mini language the global defaults are applied at format time.

3 Likes

Thanks so much for your kind words, and I’m really glad to hear it was helpful! It was a tricky problem and took a good bit of effort to design and iterate toward a better solution, but it was both an enjoyable and instructive exercise.

Don’t underrate yourself! We’re all learning just like you, and it’s very clear from your posts here that you’ve already learned a heck of a lot. I was very impressed by the depth and breadth of knowledge, thought and insight you put into both the design of your package and into your very detailed, clear and comprehensively explained problem description. I think you’re well into the deep end, haha!

I mean, heck, I’m somehow a Python core dev and yet I’m “actually” just an atmospheric scientist, not even a “real” programmer—I just play one on the internet :joy:

For those same simple cases where users only have a few options they want to pass and users don’t need to reuse their config (which wouldn’t have even been possible before), it can still easily be done in one line, i.e. instead of

Formatter(round_mode=RoundMode.PREC, precision=6)

they can just do

Formatter(FormatOptions(round_mode=RoundMode.PREC, precision=6))

I hadn’t considered mutability and the other dataclass options above to avoid overcomplicating the explanation or getting out of scope, but making UserFormatOptions (assuming that’s what you’re calling InputFormatOptions) mutable while keeping RenderedFormatOptions immutable could make sense to me. However, I think there might be a better approach—see the next section.

IMO, for cases like these as well as the case of users having one “base” set of options and then several sub-presets to be layered on top of it, it would make sense to have UserFormatOptions also offer an .update(other: UserFormatOptions) -> UserFormatOptions method that works like dict.update(), replacing the field values of the original with the non-Sentinel values of other, with something like:

    def update(self, other: UserFormatOptions) -> UserFormatOptions:
        return UserFormatOptions(
            **{key: val1 if isinstance(val2, Sentinel) else val2
               for (key, val1), (__, val2)
               in zip(asdict(self).items(), asdict(other).items())})

(The body might not 100% type check depending on whether and how dataclasses uses a TypedDict for asdict(), but since you know that two UserFormatOptions instances must have the same fields of the same types, and the type checker validates that self and other are both UserFormatOptions instances, a warning here should be safe to ignore. Alternatively, you could refactor it to share an underlying implementation with render() if you want to be fully statically type correct).

You could have it mutate the existing dataclass (like dict.update() does), but returning a new one keeps the original immutable, avoids accidentally mutating it (especially if it’s a user’s “base” set of options), and allows method chaining. In fact, with this you don’t need to make UserFormatOptions mutable at all, since instead of

original = UserFormatOptions(round_mode=RoundMode.PREC, precision=6)
new = original.copy()
new.precision = 3

users can do

new = original.update(UserFormatOptions(precision=3))

This also avoids common mistakes like misspelling an attribute name, or accidentally mutating the original.

To make this even simpler for users, you could define

    def __or__(self, other: UserFormatOptions):
        return self.update(other)

Then, the above becomes (just like with dicts in Python 3.9+):

new = original | UserFormatOptions(precision=3)

and you can also do, e.g.,

Formatter(base_preset | custom_preset)

Just to note, since the InputFormatOptions class is intended to be user-facing, I’d suggest keeping the shorter and friendlier FormatOptions name for that one (or going with UserFormatOptions, if you prefer). I used InputFormatOptions above for maximum clarity and minimum ambiguity in the examples.

Sorry for the confusion—here, by “regressing” I meant regressing in terms of one of the other qualities in the tradeoffs I’d outlined above, e.g. explicitness, type checkability, configurability, correctness, etc.

Nope, not at all (other than on backward compatibility of your user-facing APIs, as expected). Typically, tests aren’t seen as having their own external user-facing API, and are expected to change alongside the code as its functionality, behavior and implementation change, with zero expectation of backward compatibility for the tests themselves (unlike the code under test). Ultimately, their purpose is to verify that the package and its APIs function as expected; they are driven by that functionality rather than the other way around. Hope that answers your question!

2 Likes

@jagerber48 I’m very happy that you are getting insights that are useful. I completely agree with @CAM-Gerlach:

Don’t underrate yourself! We’re all learning just like you, and it’s very clear from your posts here that you’ve already learned a heck of a lot. I was very impressed by the depth and breadth of knowledge, thought and insight you put into both the design of your package and into your very detailed, clear and comprehensively explained problem description. I think you’re well into the deep end, haha!

I mean, heck, I’m somehow a Python core dev and yet I’m “actually” just an atmospheric scientist, not even a “real” programmer—I just play one on the internet :joy:

Being a maintainer is something that one continues to improve on over time, and we adapt to things that get put in front of us.

One general tip: When starting an API, it can be helpful to really pare back to the “must haves”. You can always go back and later add the “nice to haves”. It’s much more difficult to deprecate and remove features once users adopt them.

1 Like

@CAM-Gerlach @willingc, thank you both for the kind words of encouragement! I’m enjoying this learning process.

@willingc Yes, I’m finding identifying and targeting “must haves” and basically rejecting all “nice to haves” for the time being to be very helpful. Hopefully the “nice to haves” will come eventually, but they shouldn’t get in the way of the “must haves” for now.

As for the discussion topic. I implemented the changes discussed into sciform version 0.23.0. You can see the code here. Here’s what I did

  • I have a RenderedFormatOptions dataclass. This is just a straight dataclass with a non-optional attribute for every formatting option. This class is immutable. This class is only used within the actual formatting functions (e.g. format_num(), format_val_unc()). This is the “fully-fledged” version of the options which is guaranteed to have all options populated.
  • Then there is a FormatOptions class. This is the user-facing class which can have unpopulated options. This class accepts keyword arguments for each option, but each option is optional with a default None sentinel value. The first step during __init__ is some validation of the input arguments (certain combinations are invalid and raise errors). Next, if no template FormatOptions() is passed in then the input kwarg options (including default None values) are stored as instance attributes on the FormatOptions. If template is passed in then if any input option is None then the template value will be used instead. Of course, the template option may also be None.
  • The FormatOptions class has a render() method which creates a RenderedFormatOptions class from the FormatOptions class, patching any None values with values from the GLOBAL_DEFAULT_OPTIONS. GLOBAL_DEFAULT_OPTIONS is a RenderedFormatOptions class so it is guaranteed to have values available for all options.
  • the Formatter, set_global_defaults and GlobalDefaultsContext all accept FormatOptions as input.
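A simplified sketch of the design described above, using stand-in names and just two options (the real classes carry many more options plus validation logic):

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class RenderedFormatOptions:
    precision: int
    separator: str

# Guaranteed to have a value for every option
GLOBAL_DEFAULT_OPTIONS = RenderedFormatOptions(precision=6, separator=" ")

class FormatOptions:
    def __init__(self, template: FormatOptions | None = None, *,
                 precision: int | None = None,
                 separator: str | None = None):
        if template is not None:
            # Fall back to the template's value for any unset option
            precision = template.precision if precision is None else precision
            separator = template.separator if separator is None else separator
        self.precision = precision
        self.separator = separator

    def render(self) -> RenderedFormatOptions:
        # Patch any remaining None values from the global defaults
        return RenderedFormatOptions(
            precision=GLOBAL_DEFAULT_OPTIONS.precision
            if self.precision is None else self.precision,
            separator=GLOBAL_DEFAULT_OPTIONS.separator
            if self.separator is None else self.separator,
        )

opts = FormatOptions(template=FormatOptions(separator="_"), precision=3)
```

Here opts.render() yields the user’s precision, the template’s separator, and global defaults for anything neither of them set.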

This feels like an improvement. I think there are fewer repetitions of all the options and things are more modular. I think the main thing I’m happy about is, even if there is still repetition, all of the complexity and behavior is kept in the FormatOptions class alone. Previously the complexity was spread across multiple classes which made it even harder to keep track of. Here are the remaining repetitions:

  • RenderedFormatOptions definition: this can’t be avoided
  • FormatOptions.__init__() signature (and docstring): this can’t be avoided
  • FormatOptions.__init__() body. Once for template is None and once for template is not None. These two usages could technically be merged with a more complex in-line conditional for resolving where to get the attribute, but that was going to be messier. I wonder if these usages could be eliminated or reduced somehow.
  • FormatOptions.render() body. I don’t think this one can be avoided
  • DEFAULT_PACKAGE_OPTIONS definition: this can’t be avoided
  • format_options_from_format_spec() return statement: this can’t be avoided

This is a total of 7 repetitions in code, down from 12 before. Overall it feels like this is in pretty good shape. I first tried to make a version that didn’t include the template logic, but found out that I needed it in format_val_unc() because there I use the options that were passed in to generate two new derived format options, one for the value and one for the uncertainty. I could easily extend to make a FormatOptions.update() method that makes a new FormatOptions with self and the template to get the semantics from the post above.

Note that FormatOptions is NOT a dataclass. I guess I could make it a dataclass, but since I have a custom __init__ and I’ll probably implement a custom __repr__() it doesn’t seem too useful. It would be nice if FormatOptions could be a frozen dataclass so it’s immutable, but I require setting attributes after a little bit of logic and frozen dataclasses don’t seem to natively support setting attributes in __post_init__ or even a custom __init__.

I suggest that users consider importing FormatOptions as Fo as an abbreviation so that the usage looks like

>>> from sciform import FormatOptions as Fo
>>> from sciform import Formatter, RoundMode, GroupingSeparator, ExpMode
>>> sform = Formatter(Fo(round_mode=RoundMode.PREC,
...                      precision=6,
...                      upper_separator=GroupingSeparator.SPACE,
...                      lower_separator=GroupingSeparator.SPACE))
>>> print(sform(51413.14159265359))
51 413.141 593

I guess my remaining questions are:

  • Should I make FormatOptions a dataclass? Is there a way I can make FormatOptions an immutable dataclass that gets frozen after __init__? Alternatively, maybe I should just be ok with FormatOptions being mutable and users modifying it on-the-fly. I just don’t like that modifying the format options will change the behavior of a formatter that was configured to use those options. It may be surprising and I’m not really interested in creating documentation around this usage. (Feels like this is drifting into a different question than I originally asked).
  • Thoughts on import FormatOptions as Fo? Is this a good suggested idiom? I tried import FormatOptions as FO but my IDE (PyCharm) gave a warning about importing a CamelCase class as a constant… kind of annoying.

If you make this a frozen dataclass following the rubric outlined in my posts above (with a couple additional bits in the __post_init__ to handle your added requirements for validation and parameter setting), this will:

  • Greatly simplify and shorten the implementation, and be much cleaner, more elegant and more readable
  • Reduce option duplication in the class code from three repetitions to one
  • Take advantage of the built-in functionality and interoperability of dataclasses
  • Be statically introspectable by GitHub, IDEs, type checkers and other tooling
  • Allow freezing the field values to avoid the surprising effects of mutability
  • Be safer to maintain and use, and prevent whole classes of bugs in the code
  • Still allow all the exact same features and behaviors you have now

Using None instead of an explicit sentinel does simplify the code a fair bit, and allows users to more easily pass it if desired. Just be aware that, while as far as I can tell this isn’t currently an issue, you’ll need to make sure None is never a valid value for any of the final rendered options, or else you’ll need to switch to an explicit sentinel (as it will be impossible to distinguish the two).

If made a dataclass, this can all be done in the __post_init__, which also avoids mixing a bunch of attribute-setting boilerplate in with the validation logic.
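As a rough sketch (with hypothetical option names standing in for the real ~26), validation in the __post_init__ of a frozen dataclass could look like:

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionsSketch:
    # Hypothetical two-option stand-in for the real many-field FormatOptions
    precision: int | None = None
    upper_separator: str | None = None

    def __post_init__(self):
        # Validation only *reads* attributes, so frozen=True is no obstacle
        if self.precision is not None and self.precision < 1:
            raise ValueError(f"precision must be >= 1, not {self.precision}")


opts = OptionsSketch(precision=6)
```

The validation runs on every construction, and the instance is immutable afterwards (assigning to opts.precision raises FrozenInstanceError).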

Unless there’s something I’m missing, FormatOptions(template=defaults, ...) is equivalent to both defaults.update(FormatOptions(...)) and defaults | FormatOptions(...) outlined above. Therefore, I suggest just implementing that approach instead, which is a clear-cut improvement: it achieves the exact same thing with far shorter, simpler and less duplicative logic, allows making FormatOptions a frozen dataclass, and its usage is (IMO) more elegant, idiomatic and convenient (it also feels a bit jarring to me to throw template in with all the format option args).

I’d suggest this take an explicit defaults argument (with None or some sentinel as the default value, so you can retrieve DEFAULT_PACKAGE_OPTIONS dynamically as desired), as that makes it more explicit and obvious that this function accepts such an input (beyond just magically grabbing it from the global state), and also allows callers (e.g. your test suite, future functions, or users) to pass in their own explicit defaults if desired.

As mentioned, all three usages can be reduced to one by using a dataclass and simply statically enumerating the options as fields, without the need for any of this extra complexity.

Using a dataclass for FormatOptions reduces the repetition count to 5, with the remaining repetitions being only partially duplicative and doing clearly different and necessary things.

As mentioned above, the template logic appears to be unnecessary, and can be replaced with the much shorter, simpler and cleaner update() logic (and __or__). This allows replacing

    unc_format_options = FormatOptions(
        template=val_format_options,
        sign_mode=SignMode.NEGATIVE,
    )

with the (shorter, and IMO cleaner)

    unc_format_options = val_format_options.update(
        FormatOptions(sign_mode=SignMode.NEGATIVE))

or (with __or__)

    unc_format_options = val_format_options | FormatOptions(sign_mode=SignMode.NEGATIVE)
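A sketch of how such an __or__ could be implemented on a frozen dataclass (using a hypothetical two-field miniature of FormatOptions):

```python
from __future__ import annotations
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Opts:
    # Hypothetical miniature of FormatOptions
    sign_mode: str | None = None
    precision: int | None = None

    def __or__(self, other: Opts) -> Opts:
        # The right-hand side wins wherever it specifies a non-None value,
        # mirroring dict merge semantics under the | operator
        merged = {k: v for k, v in asdict(other).items() if v is not None}
        return Opts(**{**asdict(self), **merged})


val_format_options = Opts(precision=6)
unc_format_options = val_format_options | Opts(sign_mode="negative")
# unc_format_options == Opts(sign_mode="negative", precision=6)
```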

As I describe here, the custom __init__ can be replaced with following the approach above + a __post_init__ with the validation logic, so relying on the existing dataclass version greatly reduces duplication, code complexity and bug potential, among other benefits.

And while it’s unclear whether a custom __repr__ is even needed, with a dataclass it’s generally much easier to write one, as you can iterate over fields() for metadata (which, in addition to a bunch of built-in attributes, can also store custom metadata provided statically at class creation time) and/or asdict() for names/values, instead of verbosely repeating all the option names once or twice more.

And of course, a dataclass provides other “free” methods as well as the many other advantages enumerated above.

Sure they do. Just like the dataclass __init__, you can use super().__setattr__ (e.g. super().__setattr__("extra_si_prefixes", dict())) in __post_init__ (or wherever else you need it), as implied in the docs and described on Stack Overflow, since the method of achieving the illusion of “immutability” is overriding the base __setattr__.

You can also have the add_* parameters treated as init-only variables, so they aren’t added as fields but are still included as parameters in the generated __init__ and passed to __post_init__, by simply wrapping their type annotations in the InitVar type (e.g. add_c_prefix: InitVar[bool] = False).

I think you can guess my answer to this :slight_smile:

Not exactly per se, but as described above there’s a straightforward way around the “immutability” that should work just as well for your purposes.

Personally, I tend to avoid formally recommending such contractions (aside from widely-used and -understood conventions, i.e. those for NumPy, Pandas and Matplotlib), as they can make code much more difficult to introspect and understand for readers, as well as for less sophisticated tooling, for relatively little benefit (especially in this era of autocompletion). But it’s ultimately up to you as maintainer if you really want to encourage that.

NOTE: a bunch of my posts got flagged as spam, I guess because I was linking to the sciform github repo too much?

@CAM-Gerlach thanks for the response. Ok, I’ve made a version of the FormatOptions on feature/format_options_replace branch in sciform repo that I think you’ll like.

  • I’d seen the super().__setattr__() trick before to essentially defeat the frozen=True immutability, but it feels like a dirty trick. I guess it’s well-accepted though? As long as it’s being used in __init__ or __post_init__ it doesn’t seem too bad. And anyway, as you say, the immutability is an illusion.
  • I switched away from my template strategy. That was some pattern I came up with a while ago when I was hoping the render() and update() behaviors could share an implementation but it never panned out.
  • Also, I had been avoiding using asdict() and auto-generating kwargs for some reason, but I got over that and it cleaned stuff up a lot.
  • Now I’m using a replace method for what you’re calling update. I don’t like the name update because I expect it to be an in-place modification, just like the dict version. But dataclasses has a replace function that does pretty much what we’re describing: it takes one dataclass plus kwargs to change and returns a new dataclass of the same type. I use dataclasses.replace along with asdict to get FormatOptions.replace(). <<< Nonetheless, I’d like feedback on the name replace for this method.
  • We’ll see about __or__ in the future. It’s definitely slick, but going to hold off for now since I haven’t used that notation for merging objects elsewhere yet so I’m not used to it :slight_smile: . Will be easy to add at any time I think.
  • I use asdict() to generate the kwargs for RenderedFormatOptions in render and this eliminates a repetition of the set of options. As you point out, the downside here is that I need to manually confirm FormatOptions and RenderedFormatOptions have the same fields.
  • With this I’m down to 4 (really 3) in-code repetitions of the options: the RenderedFormatOptions definition, the FormatOptions definition, the PKG_DEFAULT_OPTIONS definition, and the return from the format specification mini-language parsing function. But the last one hardly counts because it’s not the full set of options, and all of those kwargs had to be parsed within that function and returned anyway.
  • Yes, I use None as the sentinel for nothing passed in for simplicity on the back and front end. Previously I was overloading None as a sentinel for what are now AutoExp and AutoPrec and it caused problems, hence the introduction of those two sentinels. For now I think I’ll keep None as the sentinel for nothing passed in because I think it will be familiar to users and it’s familiar to me.
  • Yes, makes sense to include defaults on render().
  • Maybe I want to keep __repr__ from dataclass and just define __str__. I want a printout that is one field per line. I want what is used in print_global_defaults().

Yup, in fact a bit cruder than that—you were just linking to the GitHub domain “too much”, which the spam filter flagged, blocked you from linking to it and removed your posts that did. Sorry about that! I’ve restored your posts, upgraded your account from Trust Level 0 (new user) to Trust Level 2 (member) which should prevent it from happening again (the spam filter only triggers on TL0, AFAIK) and proposed that we add github.com to the domain allowlist (and possibly others that may be linked frequently, like the pyOpenSci domain).

Indeed I do—looking pretty good!

Yeah, it’s not really that dirty, especially since it’s what the built-in __init__ also uses and seems to be fairly common, and most importantly because it’s internal to the class implementation. Per Python’s principles, there’s nothing wrong with consenting adults being as dirty as one wants behind the closed doors of the class’s own internal implementation.

render() and replace() can, in fact, share an implementation, as I’ve alluded to previously. I’d hesitated to recommend it before, since in my previous sketches it resulted in increased complexity/verbosity and raised some type safety issues, whereas the differing type safety considerations for replace() vs render() allowed a (de facto) safe shortcut asdict() implementation for the former (but not the latter); thus, combining the naive implementations tended to add more complexity than it saved (or else sacrificed explicitness and type safety).

However, after working it out, I came up with an implementation which avoids more than one duplication, doesn’t increase complexity that much and should be fully explicit, static and type-safe for both:

T = TypeVar("T", bound=Union[FormatOptions, RenderedFormatOptions])

def _replace(base: T, new: FormatOptions) -> T:
    return type(base)(
        round_mode=base.round_mode if new.round_mode is None else new.round_mode,
        precision=base.precision if new.precision is None else new.precision,
        upper_separator=base.upper_separator if new.upper_separator is None else new.upper_separator,
        ...
    )

class FormatOptions:
    def replace(self, other: FormatOptions) -> FormatOptions:
        return _replace(self, other)

    def render(self, defaults: RenderedFormatOptions | None = None) -> RenderedFormatOptions:
        return _replace(defaults or get_global_defaults(), self)

Alternatively, while I don’t necessarily recommend it (since render() will no longer be type-safe, and you can no longer statically verify that the two dataclasses’ field names and types match), you can apply a similar strategy to your existing asdict-based approach; only the _replace implementation needs to change, to just:

def _replace(base: T, new: FormatOptions) -> T:
    return type(base)(**{
        **asdict(base),
        **{key: val for key, val in asdict(new).items() if val is not None}})

Or, in Python 3.9+ (where | is now the standard way to merge dicts):

def _replace(base: T, new: FormatOptions) -> T:
    return type(base)(**(
        asdict(base)
        | {key: val for key, val in asdict(new).items() if val is not None}))

However, since the only non-trivial duplicated logic is merging the dicts ignoring None, it seems to make more sense (and is simpler, shorter, avoids the TypeVar, etc) to just factor this into a helper function instead, and just use that in the method implementations:

def _merge_dicts(left: dict, right: dict) -> dict:
    return {**left, **{key: val for key, val in right.items() if val is not None}}

class FormatOptions:
    def replace(self, other: FormatOptions) -> FormatOptions:
        return FormatOptions(**_merge_dicts(asdict(self), asdict(other)))

    def render(self, defaults: RenderedFormatOptions | None = None) -> RenderedFormatOptions:
        return RenderedFormatOptions(**_merge_dicts(
            asdict(defaults or get_global_defaults()), asdict(self)))

That ends up being only 4 lines, vs. 15 in your current implementation.

It’s good to be wary about using asdict(), as it can easily negate the benefits of using a dataclass if you’re not careful, but as you’ve found, it can greatly reduce duplication and verbosity in your code if used carefully in specific, narrow contexts within the dataclass implementation.

In replace(), the implementation guarantees that calling asdict on two FormatOptions dataclasses will produce dicts with the same keys and compatible values, so absent some obvious mistake in its minimal asdict-based implementation, there isn’t really much else that can go wrong (whereas a mistake, possibly even one that isn’t easily caught, is more likely with the much more verbose and copy/pasted code that would otherwise be required).

In render(), by contrast, there is no guarantee that the asdicts of two different dataclasses have equal keys or compatible value types, so explicitly writing it out rather than relying on asdict allows statically verifying both of those key properties and maintaining type safety. There’s still an argument to be made for the much more concise dynamic approach, depending on the trades you want to make, but it’s much less clear-cut than the previous case.

However, the calculus does change if considering factoring them into a shared implementation, as you mention and I do above, as that enhances the advantages of both approaches and skews the cost/benefit toward picking the same approach for both the replace() and render() methods. Versus the split decision suggested previously, the explicit shared implementation has minimal extra code and complexity but adds full static type safety for replace(), while sharing the asdict() dynamic approach sacrifices safety/correctness for render() too in exchange for a much more concise implementation and avoiding code duplication. Overall, it seems to me that either one or the other is generally preferable to the previous split approach I recommended.

Yeah, I didn’t really like update either for this exact reason, and was just too indecisive to change it given I wasn’t sure of another good option. While replace is probably better (due to lacking that connection), it still implies to me that the original is going to be mutated, rather than merged with another FormatOptions to produce a new FormatOptions.

What about .merge() instead? It seems closer to describing what this actually does and implying that it produces a new FormatOptions, and matches the semantics for the built-in Python dict merging (via the | operator) and the default behavior for df.merge() in Pandas (AFAIK).
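For illustration, these are the dict merge semantics the name would evoke (with None filtered out as “unset”, and the right-hand side winning otherwise):

```python
base = {"precision": 2, "sign_mode": "negative"}
new = {"precision": 6, "sign_mode": None}  # None means "not specified"

# Filter out None before merging so the base value survives for unset keys;
# for the keys that remain, the right-hand operand of | wins (Python 3.9+)
merged = base | {k: v for k, v in new.items() if v is not None}
# merged == {"precision": 6, "sign_mode": "negative"}
```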

Entirely reasonable, of course. It’s used with built-in sets, for instance (where it’s used frequently), but it’s not common yet for dicts because Python 3.9 isn’t the minimum required version for most projects yet; in a year or two, once 3.8 goes EOL and projects start to drop <3.9 support, you’ll likely start to see it a lot more.

I would personally tend to err on the side of safety, correctness and explicitness at the cost of one more repetition of the options, as this one is the key bit that allows statically verifying that your dataclasses are in sync with matching options and compatible types, and in turn be fully confident that code using them is doing so correctly as well. But it’s a perfectly legitimate choice to balance the relevant trades differently, as this does allow a dramatically more concise implementation.

I do suggest, though, that if you are taking the latter approach, you actually take full advantage of this to factor out common logic and make the implementations much shorter and simpler—my suggestion above reduces your ≈15 lines of body logic for this down to only ≈4 (by contrast, the factored-out explicit approach has ≈27 total body lines, making the current asdict() approach only ≈1.8x shorter rather than a much more compelling ≈7x).

In that case, you could do essentially the same thing you do with print_global_defaults:

    def __str__(self):
        return pprint.pformat(asdict(self), sort_dicts=False)

@CAM-Gerlach Thanks for dealing with the user/spam stuff for me!

Ok, I’m trying out your two suggestions. I like both of them. It’s starting to seem to me like the differences are more and more coming down to personal preference as you’re also pointing out. First off, yes, merge is a good name for this operation. It describes exactly what’s going on, and it does seem like users are more likely to expect it to return a new object rather than mutate in place.

I’d like to go with your suggestion of the shared implementation using _replace and creating a FormatOptions or RenderedFormatOptions. I’m experiencing two downsides though.

  • Unfortunately when I try that using type(base)(...) I don’t get autocompletion on the keyword arguments in my IDE (Windows Pycharm 2023.2 community edition, this is using base: T with T being the TypeVar you suggest). I’m actually considering this to be a pretty big deal when I have to type in a list of over 20 options. I haven’t tried to see if another IDE can get autocompletion here or if there’s something I could change to make it work.
  • The constructions upper_separator=base.upper_separator if new.upper_separator is None else new.upper_separator result in lines that are very long, much longer than the project max line length. This means I either need to break up these lines or ignore the warnings in my editor and integration pipelines. Normally I’m a stickler for cleaning up most warnings, but in this case I’d lean towards ignoring them, since I’m just doing the same thing 20+ times in a row and it feels cleaner to keep each on its own line even if it runs off the side of the page. The (likely solvable) issue is I just need to get my editor and automation setup to ignore the line-too-long error just on this function. Pretty sure I can do that in PyCharm (using a # noinspection PyPep8 comment at the top of the function or something), but I’m not sure if flake8 can be configured to do that or if I’ll just have to ignore the warnings myself.

For these reasons I’m leaning towards your second suggestion right now using _merge_dicts() and calling that in replace() and render() using asdict().

Shoot, and sadly I think the lack of autocompletion is related to the usage of @dataclass.

class Foo:
    def __init__(self, bar=0):
        pass

foo = Foo()
type(foo)(...)

In this case when I’m typing in the ... block I get autocompletion for bar.

@dataclass
class Foo:
    bar: int = 0

foo = Foo()
type(foo)(...)

Here I get autocompletion for object.__init__, which is no good.

I know pycharm has had some challenges supporting things like this with dataclasses. I’ll check with pycharm about it.
edit: I made a bugtracker issue for this. https://youtrack.jetbrains.com/issue/PY-62052/Autocompletion-doesnt-work-with-typesomedataclass-constructor


Leah also helped add GitHub to the allowed domains, so hopefully it all worked! Let us know if you do run into further issues.

I’m not a Pycharm user (I’m a developer and user of the Spyder open-source scientific IDE) so I don’t have a ton of insight on that specifically, but I’d be curious where exactly this leads to autocompletion being missing—what’s done in _replace() shouldn’t affect static introspection for user-facing functions, assuming it’s actually using the static type hints to retrieve the parameter types rather than trying to infer things from the code flow, and it should work with type checkers (though in practice it might not be properly implemented with all of them).

You can ignore them line by line or in the whole file, but there’s not a way to ignore them for a block or a function, and Anthony refuses to implement it despite loads of requests (shocker…). However, there are plans for Ruff to support it; most major projects have switched to Ruff from Flake8, due to it being a drop-in replacement with much better speed, integration and maintainership (i.e. not Anthony). Pylint does support disables by block, function, class, etc.

Also, my preferred approach nowadays, especially for new projects, is instead of having linters nag me about style, just use an autoformatter (i.e. Black) integrated with my IDE or triggered automatically via pre-commit. Black can be told not to format a particular block using # fmt: off and # fmt: on comments.
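For instance (a hypothetical snippet; Black leaves everything between the pragmas untouched, so hand-aligned tables like this survive reformatting):

```python
# fmt: off
ROUND_MODE_NAMES = {
    "SIG_FIG":  "significant figures",
    "PREC":     "precision",
}
# fmt: on
```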

Ah, so it seems the issue is specifically in the type(base)(...) call? That’s a pretty advanced case, but do be advised that the issue is not just dataclass support there; in the actual function, PyCharm cannot statically know whether base will be a FormatOptions or a RenderedFormatOptions, which at least as far as the type system is concerned are two disjoint types, so it cannot know which one’s parameters to show, dataclass or not.

You could try making FormatOptions a subtype of RenderedFormatOptions, which would be valid from a Liskov perspective (even if it seems like the opposite conceptually), but that might be tricky and I don’t know if it would help PyCharm in this case, though it might also provide an additional validation of type correctness—you’d have to experiment, since it will depend on the tooling.

As a hacky workaround, you could write FormatOptions( and fill in the params, and then change it to type(base)( after you’re done.

Right, I think autocompletion for user-facing functions is fine. It’s just autocompletion within _replace, because of the type(base)(...) construction.

Yeah right… the dataclass issue exists but is sort of independent. So yeah, even if the dataclass issue was solved it would not be possible for autocomplete to work.

Yeah this is an interesting idea. I did some quick tests and I think on pycharm this will reduce the issue to the dataclass problem above. And I’ll have to figure out how to do dataclass inheritance right, something I’ve struggled with a lot in the past. I can look into this more.

meh :slight_smile:

So yeah, with all of this I’m leaning towards the asdict approach, which leverages the knowledge that FormatOptions and RenderedFormatOptions have the same signature. It seems to me that any strategy to optionally construct either a FormatOptions or RenderedFormatOptions that doesn’t include FormatOptions(exp=..., exp_mode=...) and RenderedFormatOptions(exp=..., exp_mode=...) is going to have to somehow implicitly take advantage of this fact. The asdict strategy shamelessly leverages this, and I can just explain that in the docstring. Even if one is a subclass of the other, I don’t think LSP enforces that constructor signatures are exactly the same (or even compatible, based on some things I’ve read which say LSP only applies to instantiated objects), and I don’t think it would help me catch signature mismatches between the two classes anyways. Those would always need to come up at run time.

So going in circles a little bit here: one way would be to have RenderedFormatOptions and FormatOptions be the same class. But the problem here is that the “rendered” form should have values for all options, and it should type check as such. But to support the unrendered form the options would all have to be annotated as optional so that None (or some other sentinel) value is accepted.

It feels kind of like a puzzle that should have a solution… but I think the constraint of “I don’t want to explicitly type out the options because there are just so many of them” is kind of outside the scope of static type checking or something… so it can only be accommodated so well while also satisfying static type checking constraints. E.g. your approach with the shared _replace implementation currently adds one occurrence of the options list. But if we were ok with two more occurrences then there are a number of ways that could be worked out pretty easily, I think, while satisfying explicitness/type checking etc.

Thanks for the info on linters and autoformatters. I’ll have to look into these options.

Pretty sure PEP 692 can neatly solve all the problems brought up above. I’ll need to research this. My main question is if there’s a way to adopt this typing feature on python versions older than 3.12. Like something about __future__ (which I’ve never used before)?

If not then it’s kind of sad. Probably not worth dropping support for everything before 3.12 just because of the nice static typing feature in 3.12…
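From a bit of searching, it looks like Unpack has been in typing since 3.11 and typing_extensions backports it further, so something like the following might work without waiting for 3.12 (the option names below are a hypothetical subset; note PEP 692 checking also requires a sufficiently recent type checker, independent of the runtime version):

```python
from typing import TypedDict

try:
    from typing import Unpack  # in the stdlib since Python 3.11
except ImportError:
    from typing_extensions import Unpack  # backport for older versions


class OptionKwargs(TypedDict, total=False):
    # Hypothetical subset of the real option names
    precision: int
    upper_separator: str


def make_options(**kwargs: Unpack[OptionKwargs]) -> dict:
    # A PEP 692-aware type checker validates the keyword names/types here;
    # at runtime the annotation has no effect
    return dict(kwargs)


opts = make_options(precision=6)
```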

Hello,

I’ve read this thread with interest and have a related question. I am working on a project, THzTools, which supports analysis of waveforms that are sampled at a regular time interval dt. All of the analysis functions include dt as a parameter, but normally the value of dt is fixed for all waveforms in a set of measurements, so I would like to provide a global sampling_time constant that a user can set and forget. This is the only global option that I would like to provide right now, but I have made it an attribute of a GlobalOptions dataclass in case that changes. I’ve then added the constant global_options to the module with the sampling_time attribute initialized to None, to indicate that it is undetermined. I’d like to get advice on whether the code below will be sufficient for this purpose. Unlike the case presented by @jagerber48, there is currently only one option, I expect users to change it, and my preferred default is just None.

Thanks,
Steve

@dataclass
class GlobalOptions:
    r"""
    Class for global options.

    Currently, this just includes the ``sampling_time`` parameter.

    Parameters
    ----------
    sampling_time : float | None, optional
        Global sampling time, normally in picoseconds. When set to None, the
        default, times and frequencies are treated as dimensionless quantities
        that are scaled by the (undetermined) sampling time.
    """
    sampling_time: float | None


global_options = GlobalOptions(
    sampling_time=None,
)

Hi @jsdodge! THzTools sounds cool! By the way, I recognize you from this numpy PR! https://github.com/numpy/numpy/pull/20889

There are similarities and differences between what sciform does and what you’re describing. In sciform I have a UserOptions class which is configured with user input (and some inputs may be None); then, at format time, the UserOptions class is converted to a RenderedOptions class (all options are explicitly configured in this class; no None values) using a merge process which works as follows. GLOBAL_DEFAULT_OPTIONS is a module attribute within sciform which is itself a RenderedOptions dataclass. When a UserOptions class is being rendered, any None values for UserOptions attributes are replaced with the corresponding value from GLOBAL_DEFAULT_OPTIONS.

So for sciform, if users decide not to set a certain option then it defaults to None, or they can explicitly set an option to None. In both cases sciform will overwrite this option with the corresponding option configured in the GLOBAL_DEFAULT_OPTIONS at format time.

The key for sciform is that I wanted the “rendering” of options (in other words the population of unspecified options) to happen at format time rather than object creation time. This way a user can create a Formatter with certain options, then change the global options and have these changes to the global options affect the behavior of the already-created Formatter.

So, the code you present is pretty simple. GlobalOptions is just a container for your options and it is configured in a certain way when you define global_options (which I assume you’ll access elsewhere in your code base). That is basically how sciform does it, see https://github.com/jagerber48/sciform/blob/main/src/sciform/format_options.py#L354. In sciform the global settings are modified through setter functions. I definitely worried about using a module attribute as a global variable within the package since I know globals in python can be kind of weird. I did have to learn more about the global keyword and use it in e.g. the setter functions.

To say if your code will totally suffice or not, however, I’d need to know details about what the user workflow is for controlling the global options. Do you want a sort of deferred evaluation of options like sciform? Are there objects that store user defined options that “fall-back” to the global options if no user input is supplied when needed? Do you want a context manager that allows users to step off of default settings within the scope of a context manager but then return to defaults (pandas definitely has a feature like this and that motivated sciform to have similar).


Thanks for the feedback! I don’t have a ready answer to all of your implementation questions, but we’ll consider them as we develop it. For now, we’re planning to fall back to the global option when the user does not supply one, with no context manager. I also don’t think we’ll need to defer the evaluation as you have done, but I could be wrong. I think we’ll need to see how it works in practice before making a final decision.