You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@streams.apache.org by Steve Blackmon <sb...@apache.org> on 2018/02/01 01:47:43 UTC
[DISCUSS] Namespacing of configuration inputs to components

 TL;DR I think we should start aligning the default JVM config path with
the package/class of the code being configured

Hello All,

I’ve been working on two providers (TwitterEngagersProvider and
InstagramEngagersProvider) which share significant code with other
providers.

This is sort of complex but here’s the nature of the implementation:
a) Instantiate and run provider A, which generates user objects or user ids
b) Instantiate and run provider B, using the output of provider A as the
input
c) All of this happens in provider C, which then transforms the output of
provider B, exploding each item into a list as it’s own output.

Due mostly to conventions that have been in the project for a while,
providers A, B, and C all expect to find their configuration at the same
place in the JVM properties - either ‘twitter’ or ‘instagram’.
Additionally, providers A, B, and C all expect to be configured with a
max_items parameter.

So you can probably see the challenge - configuring provider C with
max_items = 1000 winds up giving unhelpful direction to providers A and B
about how much work they should do.  We don’t need or want 1000 user ids,
we really want 1000 engager ids which can typically be derived from ~50
user ids.

While on the surface its nice to be able to reuse a single conf file to run
a variety of providers, I see now that creates problems when running code
that integrates a variety of components that expect to be configured from
overlapping paths.  It’s possible to get around these problems by writing
glue code, but it seems like it would be smarter to have all classes by
default source their configuration using a simple and obvious namespace
strategy based on package/class.

So instead of every twitter provider picking up:
twitter.max_items = 1000

They would all be configured separately like:
org.apache.streams.twitter.providers.ProviderA.max_items = 10
org.apache.streams.twitter.providers.ProviderB.max_items = 100
org.apache.streams.twitter.providers.ProviderC.max_items = 1000

Since there are other configuration properties that are required but do not
differ between the three providers, a more typical configuration file would
really look like:
org.apache.streams.twitter.Twitter = { oauth = { … } }
org.apache.streams.twitter.providers.ProviderA =
${org.apache.streams.twitter.Twitter}
org.apache.streams.twitter.providers.ProviderA.max_items = 10
org.apache.streams.twitter.providers.ProviderB =
${org.apache.streams.twitter.Twitter}
org.apache.streams.twitter.providers.ProviderB.max_items = 100
org.apache.streams.twitter.providers.ProviderC =
${org.apache.streams.twitter.Twitter}
org.apache.streams.twitter.providers.ProviderC.max_items = 1000

This would be a breaking refactoring, targeted for the 0.6.0 release.

I’m interested in others feedback on the idea in general, and
specifically.  For example whether fully qualified class name or just
simple name is preferable as a prefix.  And whether the class name used
should be that of the configuration bean i.e. TwitterConfiguration or of
the class that operates on it i.e. Twitter

Thanks for reading,
Steve