Posted to dev@drill.apache.org by James Turton <dz...@apache.org> on 2022/01/14 04:45:44 UTC

[DISCUSS] Drill 2 and plug-in organisation

Hello dev community

Discussions about reorganising the Drill source code to better position 
the project to support plug-ins for the "long tail" of weird and 
wonderful systems and data formats have been coming up here and there 
for a few months, e.g. in https://github.com/apache/drill/pull/2359.

A view which I personally share is that adding too large a number and 
variety of plug-ins to the main tree would create a lethal maintenance 
burden for developers working there and lead down a road of accumulating 
technical debt.  The Maven tricks we must employ to harmonise the 
growing set of dependencies of the main tree to keep it buildable are 
already enough, as is the size of our distributable and the count of 
open bug reports.


Thus, the idea of splitting out "/contrib" into a new 
apache/drill-contrib repo after selecting a subset of plugins to remain 
in apache/drill.  I'll now volunteer a set of criteria to decide whether 
a plug-in should live in this notional apache/drill-contrib.

 1. The plug-in queries an unstructured data format (even if it only
    reads metadata fields) e.g. Image format plug-in.
 2. The plug-in queries a data format that was designed for human
    consumption e.g. Excel format plug-in.
 3. The plug-in cannot be expected to run with speed and reliability
    comparable to querying structured data on the local network e.g.
    Dropbox storage plugin.
 4. The plug-in queries an obscure system or format e.g. we receive a
    plug-in for some data format used only on old Cray supercomputers.
 5. The plug-in can for some reason not be well supported by the Drill
    devs e.g. it has a JNI dependency on some difficult native libs.


Any one of those suggests that an apache/drill-contrib is the better 
home to me, but what is your view?  Would we apply significantly more 
relaxed standards when reviewing PRs to apache/drill-contrib?  Would we 
tag, build and test apache/drill-contrib with every release of 
apache/drill, or would it run on its own schedule, perhaps with users 
downloading builds made continuously from snapshots of HEAD?


Regards
James



Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Paul Rogers <pa...@gmail.com>.
Hi James,

My experience might be a bit old. I seem to recall, way back when, we did
try to build some plugins outside of Drill itself and that there were
issues. Maybe it was just the inconvenience of debugging? Perhaps the test
libraries were not available? Development is fastest when you can write a
unit test that fires up Drill and exercises your plugin. You can then step
through the code, see an error, fix it, and try again in a matter of
seconds. Without that, you have to rebuild your jar, copy it to Drill,
restart Drill, submit a query, and hope to figure out what is wrong when
things blow up.

So, I wonder if we also publish test jars? If not, that would be a big help.
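
As an illustration, a plugin unit test of the kind described above might look
roughly like the following, assuming the test jars (with the ClusterFixture /
ClusterTest harness from Drill's test sources) were published to Maven. Class
and method names here are from memory and only indicative, not a guaranteed API:

  import org.apache.drill.test.ClusterFixture;
  import org.apache.drill.test.ClusterTest;
  import org.junit.BeforeClass;
  import org.junit.Test;

  public class TestMyFormatPlugin extends ClusterTest {

    @BeforeClass
    public static void setup() throws Exception {
      // Start an embedded Drillbit once for the whole test class.
      startCluster(ClusterFixture.builder(dirTestWatcher));
      // Register the format/storage plugin under test with the embedded
      // Drillbit here before running any queries against it.
    }

    @Test
    public void testSimpleQuery() throws Exception {
      // Run a query through the embedded Drillbit; a failure can be stepped
      // through in the IDE debugger in seconds rather than minutes.
      queryBuilder().sql("SELECT * FROM dfs.`data/example.myformat`").run();
    }
  }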

UDFs also have issues since Drill doesn't actually run your code: Drill
copies it. And, unless you know about the magic thingie, Drill won't even
load your UDF. (Have to tell Drill not to load from cache, if I recall.)

To test all this out, just build a demo plugin and demo UDF using the
libraries. If it is smooth sailing, we're good to go. If not, figure out
what's missing and fix it.

Oh, and another issue: class loader isolation. As Drill includes ever more
plugins, dependencies will conflict. That's why Presto/Trino loads plugins
in a separate class loader: Trino may use version 5 of library X, but I
might use 7. With class loader isolation, stuff just works. Without it, one
lives in Maven dependency hell for a while.
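
To make the isolation idea concrete, here is a minimal plain-Java sketch (not
Drill's or Trino's actual mechanism): give each plugin jar its own class
loader so its dependencies cannot collide with Drill's or with another
plugin's. A real plugin loader would add child-first delegation rules for the
plugin's own packages on top of this.

  import java.net.URL;
  import java.net.URLClassLoader;
  import java.nio.file.Path;

  public class IsolatedPluginLoader {
    // Load the plugin's entry class from its jar inside a dedicated class
    // loader whose parent exposes only the host's (Drill's) classes.
    public static Object instantiate(Path pluginJar, String entryClass)
        throws Exception {
      URLClassLoader loader = new URLClassLoader(
          new URL[] { pluginJar.toUri().toURL() },
          IsolatedPluginLoader.class.getClassLoader());
      Class<?> cls = Class.forName(entryClass, true, loader);
      return cls.getDeclaredConstructor().newInstance();
    }
  }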

Thanks,

- Paul


On Tue, Jan 18, 2022 at 12:29 AM James Turton <dz...@apache.org> wrote:

> For my part, I'd forgotten that GitHub does give users the opportunity
> to attach binary distributables to releases.  So my first thought of
> "GitHub would mean using Git repositories to host Jar files" was off the
> mark.
>
> Paul, setting aside the hosting and distribution for a moment, may I ask
> about the statement "ensure plugins can be built outside of the Drill
> repo"?  Released versions of Drill's own libs are already published to
> Maven.  E.g.
>
>
> https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.19.0
>
> Can a plugin writer not create a new project which lists the required
> Drill libs in its pom.xml deps and proceed to build a plugin away from
> the main tree?  Interactive debugging without the Drill main tree should
> even be possible by attaching a debugger to a running embedded Drill
> with the storage plugin deployed to it, or am I wrong here?
>
> On 2022/01/18 00:32, Paul Rogers wrote:
> > Hi Ted,
> >
> > Thanks for the explanation, makes sense.
> >
> > Ideally, the client side would be somewhat agnostic about the repo it
> pulls
> > from. In a corporate setting, it should pull from the "JFrog Repository"
> > that everyone seems to use (but which I know basically nothing.) Oh,
> lord,
> > a plugin architecture for the repo for the plugin architecture?
> >
> > - Paul
> >
> > On Mon, Jan 17, 2022 at 1:46 PM Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Paul,
> >>
> >> I understood your suggestion.  My point is that publishing to Maven
> >> central is a bit of a pain while publishing by posting to Github is
> nearly
> >> painless.  In particular, because Github inherently produces a
> relatively
> >> difficult to fake hash for each commit, referring to a dependency using
> >> that hash is relatively safe which saves a lot of agony regarding keys
> and
> >> trust.
> >>
> >> Further, Github or any comparable service provides the same "already
> >> exists" benefit as does Maven.
> >>
> >>
> >>
> >> On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers <pa...@gmail.com> wrote:
> >>
> >>> Hi Ted,
> >>>
> >>> Well said. Just to be clear, I wasn't suggesting that we use
> >>> Maven-the-build-tool to distribute plugins. Rather, I was simply
> observing
> >>> that building a global repo is a bit of a project and asked, "what
> could we
> >>> use that already exists?" The Python repo? No. The
> Ubuntu/RedHat/whatever
> >>> Linux repos? Maybe. Maven's repo? Why not?
> >>>
> >>> The idea would be that Drill might have a tool that says, "install the
> >>> FooBlaster" plugin. It downloads from a repo (Maven central, say) and
> puts
> >>> the plugin in the proper plugins directory. In a cluster, either it
> does
> >>> that on every node, or the work is done as part of preparing a Docker
> >>> container which is then pushed to every node.
> >>>
> >>> The key thought is just to make the problem simpler by avoiding the
> need
> >>> to create and maintain a Drill-specific repo when we barely have
> enough
> >>> resources to keep Drill itself afloat.
> >>>
> >>> None of this can happen, however, unless we clean up the plugin APIs
> and
> >>> ensure plugins can be built outside of the Drill repo. (That means,
> say,
> >>> that Drill needs an API library that resides in Maven.)
> >>>
> >>> There are probably many ways this has been done. Anyone know of any
> good
> >>> examples we can learn from?
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>
> >>> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>
> >>>> I don't think that Maven is a forced move just because Drill is in
> Java.
> >>>> It may be a good move, but it isn't a foregone conclusion. For one
> thing,
> >>>> the conventions that Maven uses are pretty hard-wired and it may be
> >>>> difficult to have a reliable deny-list of known problematic plugins.
> >>>> Publishing to Maven is more of a pain than simply pushing to github.
> >>>>
> >>>> The usability here is paramount, both for the ultimate Drill user and
> >>>> for the writer of plugins.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org>
> wrote:
> >>>>
> >>>>> Thank you Ted and Paul for the feedback.  Since Java is compiled,
> Maven
> >>>>> is probably a better fit than GitHub for distribution?  If Drillbits
> can
> >>>>> write to their jars/3rdparty directory then I can imagine Drill
> gaining
> >>>>> the ability to fetch and install plugins itself without too much
> >>>>> trouble, at least for Drill clusters with Internet access.
> >>>>> "Sideloading" by downloading from Maven and copying manually would
> >>>>> always remain possible.
> >>>>>
> >>>>> @Paul I'll try to get a little time with you to get some ideas about
> >>>>> designing a plugin API.
> >>>>>
> >>>>> On 2022/01/14 23:20, Paul Rogers wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>> James raises an important issue, I've noticed that it used to be
> easy
> >>>>> to
> >>>>>> build and test Drill, now it is a struggle, because of the many odd
> >>>>>> external dependencies we have introduced. That acts as a big damper
> on
> >>>>>> contributions: none of us get paid enough to spend more time
> fighting
> >>>>>> builds than developing the code...
> >>>>>>
> >>>>>> Ted is right that we need a good way to install plugins. There are
> two
> >>>>>> parts. Ted is talking about the high-level part: make it easy to
> >>>>> point to
> >>>>>> some repo and use the plugin. Since Drill is Java, the Maven repo
> >>>>> could be
> >>>>>> a good mechanism. In-house stuff is often in an internal repo that
> >>>>> does
> >>>>>> whatever Maven needs.
> >>>>>>
> >>>>>> The reason that plugins are in the Drill project now is that Drill's
> >>>>> "API"
> >>>>>> is all of Drill. Plugins can (and some do) access all of Drill
> through
> >>>>> the
> >>>>>> fragment context. The API to Calcite and other parts of Drill are
> >>>>> wide, and
> >>>>>> tend to be tightly coupled with Drill internals. By contrast, other
> >>>>> tools,
> >>>>>> such as Presto/Trino, have defined very clean APIs that extensions
> >>>>> use. In
> >>>>>> Druid, everything is integrated via Google Guice and an extension
> can
> >>>>>> replace any part of Druid (though, I'm not convinced that's actually
> >>>>> a good
> >>>>>> idea.) I'm sure there are others we can learn from.
> >>>>>>
> >>>>>> So, we need to define a plugin API for Drill. I started down that
> >>>>> route a
> >>>>>> while back: the first step was to refactor the plugin registry so it
> >>>>> is
> >>>>>> ready for extensions. The idea was to use the same mechanism for all
> >>>>> kinds
> >>>>>> of extensions (security, UDFs, metastore, etc.) The next step was to
> >>>>> build
> >>>>>> something that roughly followed Presto, but that kind of stalled
> out.
> >>>>>>
> >>>>>> In terms of ordering, we'd first need to define the plugin API.
> Then,
> >>>>> we
> >>>>>> can shift plugins to use that. Once that is done, we can move
> plugins
> >>>>> to
> >>>>>> separate projects. (The metastore implementation can also move, if
> we
> >>>>>> want.) Finally, figure out a solution for Ted's suggestion to make
> it
> >>>>> easy
> >>>>>> to grab new extensions. Drill is distributed, so adding a new plugin
> >>>>> has to
> >>>>>> happen on all nodes, which is a bit more complex than the typical
> >>>>>> Julia/Python/R kind of extension.
> >>>>>>
> >>>>>> The reason we're where we're at is that it is the path of least
> >>>>> resistance.
> >>>>>> Creating a good extension mechanism is hard, but valuable, as Ted
> >>>>> noted.
> >>>>>> Thanks,
> >>>>>>
> >>>>>> - Paul
> >>>>>>
> >>>>>> On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning<te...@gmail.com>
> >>>>> wrote:
> >>>>>>> The bigger reason for a separate plug-in world is the enhancement
> of
> >>>>>>> community.
> >>>>>>>
> >>>>>>> I would recommend looking at the Julia community for examples of
> >>>>>>> effective ways to drive plug-in structure.
> >>>>>>>
> >>>>>>> At the core, for any pure Julia package, you can simply add a
> >>>>> package by
> >>>>>>> referring to the github repository where the package is stored. For
> >>>>>>> packages that are "registered" (i.e. a path and a checksum is
> >>>>> recorded in a
> >>>>>>> well known data store), you can add a package by simply naming it
> >>>>> without
> >>>>>>> knowing the path.  All such plugins are tested by the authors and
> the
> >>>>>>> project records all dependencies with version constraints so that
> >>>>> cascading
> >>>>>>> additions are easy. The community leaders have made tooling
> >>>>> available so
> >>>>>>> that you can test your package against a range of versions of Julia
> >>>>> by
> >>>>>>> pretty simple (to use) Github actions.
> >>>>>>>
> >>>>>>> The result has been an absolute explosion in the number of pure
> Julia
> >>>>>>> packages.
> >>>>>>>
> >>>>>>> For packages that include C or Fortran (or whatever) code, there is
> >>>>> some
> >>>>>>> amazing tooling available that lets you record a build process on
> >>>>> any of
> >>>>>>> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows,
> >>>>> BSD, OSX
> >>>>>>> and so on). When you register such a package, it is automagically
> >>>>> built on
> >>>>>>> all the platforms you indicate and the binary results are checked
> >>>>> into a
> >>>>>>> central repository known as Yggdrasil.
> >>>>>>>
> >>>>>>> All of these registration events for different packages are
> recorded
> >>>>> in a
> >>>>>>> central registry as I mentioned. That registry is recorded in
> Github
> >>>>> as
> >>>>>>> well which makes it easy to propagate changes.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jan 13, 2022 at 8:45 PM James Turton<dz...@apache.org>
> >>>>> wrote:
> >>>>>>>> Hello dev community
> >>>>>>>>
> >>>>>>>> Discussions about reorganising the Drill source code to better
> >>>>> position
> >>>>>>>> the project to support plug-ins for the "long tail" of weird and
> >>>>>>>> wonderful systems and data formats have been coming up here and
> >>>>> there
> >>>>>>>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >>>>>>>>
> >>>>>>>> A view which I personally share is that adding too large a number
> >>>>> and
> >>>>>>>> variety of plug-ins to the main tree would create a lethal
> >>>>> maintenance
> >>>>>>>> burden for developers working there and lead down a road of
> >>>>> accumulating
> >>>>>>>> technical debt.  The Maven tricks we must employ to harmonise the
> >>>>>>>> growing set of dependencies of the main tree to keep it buildable
> >>>>> are
> >>>>>>>> already enough, as is the size of our distributable and the count
> of
> >>>>>>>> open bug reports.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thus, the idea of splitting out "/contrib" into a new
> >>>>>>>> apache/drill-contrib repo after selecting a subset of plugins to
> >>>>> remain
> >>>>>>>> in apache/drill.  I'll now volunteer a set of criteria to decide
> >>>>> whether
> >>>>>>>> a plug-in should live in this notional apache/drill-contrib.
> >>>>>>>>
> >>>>>>>>    1. The plug-in queries an unstructured data format (even if it
> >>>>> only
> >>>>>>>>       reads metadata fields) e.g. Image format plug-in.
> >>>>>>>>    2. The plug-in queries a data format that was designed for
> human
> >>>>>>>>       consumption e.g. Excel format plug-in.
> >>>>>>>>    3. The plug-in cannot be expected to run with speed and
> >>>>> reliability
> >>>>>>>>       comparable to querying structured data on the local network
> >>>>> e.g.
> >>>>>>>>       Dropbox storage plugin.
> >>>>>>>>    4. The plug-in queries an obscure system or format e.g. we
> >>>>> receive a
> >>>>>>>>       plug-in for some data format used only on old Cray
> >>>>> supercomputers.
> >>>>>>>>    5. The plug-in can for some reason not be well supported by the
> >>>>> Drill
> >>>>>>>>       devs e.g. it has a JNI dependency on some difficult native
> >>>>> libs.
> >>>>>>>>
> >>>>>>>> Any one of those suggests that an apache/drill-contrib is the
> better
> >>>>>>>> home to me, but what is your view?  Would we apply significantly
> >>>>> more
> >>>>>>>> relaxed standards when reviewing PRs to apache/drill-contrib?
> >>>>> Would we
> >>>>>>>> tag, build and test apache/drill-contrib with every release of
> >>>>>>>> apache/drill, or would it run on its own schedule, perhaps with
> >>>>> users
> >>>>>>>> downloading builds made continuously from snapshots of HEAD?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> James
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>
>
>

Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by James Turton <dz...@apache.org>.
For my part, I'd forgotten that GitHub does give users the opportunity 
to attach binary distributables to releases.  So my first thought of 
"GitHub would mean using Git repositories to host Jar files" was off the 
mark.

Paul, setting aside the hosting and distribution for a moment, may I ask 
about the statement "ensure plugins can be built outside of the Drill 
repo"?  Released versions of Drill's own libs are already published to 
Maven.  E.g.

https://mvnrepository.com/artifact/org.apache.drill.exec/drill-java-exec/1.19.0

Can a plugin writer not create a new project which lists the required 
Drill libs in its pom.xml deps and proceed to build a plugin away from 
the main tree?  Interactive debugging without the Drill main tree should 
even be possible by attaching a debugger to a running embedded Drill 
with the storage plugin deployed to it, or am I wrong here?
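
For concreteness, the kind of pom.xml dependency block meant above would be
something along these lines (version and scopes purely illustrative; exactly
which Drill artifacts a plugin needs, and whether test jars are published, is
part of what this thread is trying to settle):

  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>1.19.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- Test harness, if/when the test jars are published -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>1.19.0</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
  </dependencies>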



Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Paul Rogers <pa...@gmail.com>.
Hi Ted,

Thanks for the explanation, makes sense.

Ideally, the client side would be somewhat agnostic about the repo it pulls
from. In a corporate setting, it should pull from the "JFrog Repository"
that everyone seems to use (but which I know basically nothing.) Oh, lord,
a plugin architecture for the repo for the plugin architecture?

- Paul


Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Ted Dunning <te...@gmail.com>.
Paul,

I understood your suggestion.  My point is that publishing to Maven central
is a bit of a pain while publishing by posting to Github is nearly
painless.  In particular, because Github inherently produces a relatively
difficult to fake hash for each commit, referring to a dependency using
that hash is relatively safe which saves a lot of agony regarding keys and
trust.

Further, Github or any comparable service provides the same "already
exists" benefit as does Maven.




Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Paul Rogers <pa...@gmail.com>.
Hi Ted,

Well said. Just to be clear, I wasn't suggesting that we use
Maven-the-build-tool to distribute plugins. Rather, I was simply observing
that building a global repo is a bit of a project and asked, "what could we
use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
Linux repos? Maybe. Maven's repo? Why not?

The idea would be that Drill might have a tool that says, "install the
FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
the plugin in the proper plugins directory. In a cluster, either it does
that on every node, or the work is done as part of preparing a Docker
container which is then pushed to every node.
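
As a hedged sketch of how small the happy path of such an "install the
FooBlaster" helper could be on a single node, assuming a Maven-style repo
layout and the jars/3rdparty directory mentioned elsewhere in this thread
(this is not an existing Drill command, just an illustration):

  import java.io.InputStream;
  import java.net.URI;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardCopyOption;

  public class PluginInstaller {
    // Resolve group:artifact:version against a Maven-style repo URL and
    // drop the jar into Drill's third-party jar directory on this node.
    public static void install(String repoBase, String group, String artifact,
                               String version, Path drillHome) throws Exception {
      String jarName = artifact + "-" + version + ".jar";
      String path = group.replace('.', '/') + "/" + artifact + "/" + version
          + "/" + jarName;
      Path target = drillHome.resolve("jars").resolve("3rdparty").resolve(jarName);
      try (InputStream in = URI.create(repoBase + "/" + path).toURL().openStream()) {
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }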

The key thought is just to make the problem simpler by avoiding the need to
create and maintain a Drill-specific repo when we barely have enough
resources to keep Drill itself afloat.

None of this can happen, however, unless we clean up the plugin APIs and
ensure plugins can be built outside of the Drill repo. (That means, say,
that Drill needs an API library that resides in Maven.)
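
Purely as a strawman for what "an API library that resides in Maven" might
contain, even something as small as the interface below (none of these types
exist in Drill today) would already let a simple plugin compile and be unit
tested away from the full Drill tree:

  import java.util.Iterator;
  import java.util.List;

  public interface DrillStoragePluginApi {
    // Name under which the plugin is registered, e.g. "fooblaster".
    String name();

    // Schema discovery: the tables this plugin can expose.
    List<String> listTables(String schema) throws Exception;

    // Deliberately simplified row-at-a-time scan; a real API would hand
    // back vectorised record batches instead.
    Iterator<List<Object>> scan(String schema, String table) throws Exception;
  }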

There are probably many ways this has been done. Anyone know of any good
examples we can learn from?

Thanks,

- Paul



Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Ted Dunning <te...@gmail.com>.
I don't think that Maven is a forced move just because Drill is in Java. It
may be a good move, but it isn't a foregone conclusion. For one thing, the
conventions that Maven uses are pretty hard-wired, and it may be difficult
to maintain a reliable deny-list of known problematic plugins. Publishing to
Maven is also more of a pain than simply pushing to GitHub.

The usability here is paramount, both for the ultimate Drill user and for
the writer of plugins.



On Mon, Jan 17, 2022 at 5:06 AM James Turton <dz...@apache.org> wrote:

> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
> is probably better fit than GitHub for distribution?  If Drillbits can
> write to their jars/3rdparty directory then I can imagine Drill gaining
> the ability to fetch and install plugins itself without too much
> trouble, at least for Drill clusters with Internet access.
> "Sideloading" by downloading from Maven and copying manually would
> always remain possible.
>
> @Paul I'll try to get a little time with you to get some ideas about
> designing a plugin API.
>
> On 2022/01/14 23:20, Paul Rogers wrote:
> > Hi All,
> >
> > James raises an important issue, I've noticed that it used to be easy to
> > build and test Drill, now it is a struggle, because of the many odd
> > external dependencies we have introduced. That acts as a big damper on
> > contributions: none of us get paid enough to spend more time fighting
> > builds than developing the code...
> >
> > Ted is right that we need a good way to install plugins. There are two
> > parts. Ted is talking about the high-level part: make it easy to point to
> > some repo and use the plugin. Since Drill is Java, the Maven repo could
> be
> > a good mechanism. In-house stuff is often in an internal repo that does
> > whatever Maven needs.
> >
> > The reason that plugins are in the Drill project now is that Drill's
> "API"
> > is all of Drill. Plugins can (and some do) access all of Drill though the
> > fragment context. The API to Calcite and other parts of Drill are wide,
> and
> > tend to be tightly coupled with Drill internals. By contrast, other
> tools,
> > such as Presto/Trino, have defined very clean APIs that extensions use.
> In
> > Druid, everything is integrated via Google Guice and an extension can
> > replace any part of Druid (though, I'm not convinced that's actually a
> good
> > idea.) I'm sure there are others we can learn from.
> >
> > So, we need to define a plugin API for Drill. I started down that route a
> > while back: the first step was to refactor the plugin registry so it is
> > ready for extensions. The idea was to use the same mechanism for all
> kinds
> > of extensions (security, UDFs, metastore, etc.) The next step was to
> build
> > something that roughly followed Presto, but that kind of stalled out.
> >
> > In terms of ordering, we'd first need to define the plugin API. Then, we
> > can shift plugins to use that. Once that is done, we can move plugins to
> > separate projects. (The metastore implementation can also move, if we
> > want.) Finally, figure out a solution for Ted's suggestion to make it
> easy
> > to grab new extensions. Drill is distributed, so adding a new plugin has
> to
> > happen on all nodes, which is a bit more complex than the typical
> > Julia/Python/R kind of extension.
> >
> > The reason we're where we're at is that it is the path of least
> resistance.
> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning<te...@gmail.com>
> wrote:
> >
> >> The bigger reason for a separate plug-in world is the enhancement of
> >> community.
> >>
> >> I would recommend looking at the Julia community for examples of
> >> effective ways to drive plug in structure.
> >>
> >> At the core, for any pure julia package, you can simply add a package by
> >> referring to the github repository where the package is stored. For
> >> packages that are "registered" (i.e. a path and a checksum is recorded
> in a
> >> well known data store), you can add a package by simply naming it
> without
> >> knowing the path.  All such plugins are tested by the authors and the
> >> project records all dependencies with version constraints so that
> cascading
> >> additions are easy. The community leaders have made tooling available so
> >> that you can test your package against a range of versions of Julia by
> >> pretty simple (to use) Github actions.
> >>
> >> The result has been an absolute explosion in the number of pure Julia
> >> packages.
> >>
> >> For packages that include C or Fortran (or whatever) code, there is some
> >> amazing tooling available that lets you record a build process on any of
> >> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD,
> OSX
> >> and so on). WHen you register such a package, it is automagically built
> on
> >> all the platforms you indicate and the binary results are checked into a
> >> central repository known as Yggdrasil.
> >>
> >> All of these registration events for different packages are recorded in
> a
> >> central registry as I mentioned. That registry is recorded in Github as
> >> well which makes it easy to propagate changes.
> >>
> >>
> >>
> >> On Thu, Jan 13, 2022 at 8:45 PM James Turton<dz...@apache.org>  wrote:
> >>
> >>> Hello dev community
> >>>
> >>> Discussions about reorganising the Drill source code to better position
> >>> the project to support plug-ins for the "long tail" of weird and
> >>> wonderful systems and data formats have been coming up here and there
> >>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >>>
> >>> A view which I personally share is that adding too large a number and
> >>> variety of plug-ins to the main tree would create a lethal maintenance
> >>> burden for developers working there and lead down a road of
> accumulating
> >>> technical debt.  The Maven tricks we must employ to harmonise the
> >>> growing set of dependencies of the main tree to keep it buildable are
> >>> already enough, as is the size of our distributable and the count of
> >>> open bug reports.
> >>>
> >>>
> >>> Thus, the idea of splitting out "/contrib" into a new
> >>> apache/drill-contrib repo after selecting a subset of plugins to remain
> >>> in apache/drill.  I'll now volunteer a set of criteria to decide
> whether
> >>> a plug-in should live in this notional apache/drill-contrib.
> >>>
> >>>   1. The plug-in queries an unstructured data format (even if it only
> >>>      reads metadata fields) e.g. Image format plug-in.
> >>>   2. The plug-in queries a data format that was designed for human
> >>>      consumption e.g. Excel format plug-in.
> >>>   3. The plug-in cannot be expected to run with speed and reliability
> >>>      comparable to querying structured data on the local network e.g.
> >>>      Dropbox storage plugin.
> >>>   4. The plug-in queries an obscure system or format e.g. we receive a
> >>>      plug-in for some data format used only on old Cray supercomputers.
> >>>   5. The plug-in can for some reason not be well supported by the Drill
> >>>      devs e.g. it has a JNI dependency on some difficult native libs.
> >>>
> >>>
> >>> Any one of those suggests that an apache/drill-contrib is the better
> >>> home to me, but what is your view?  Would we apply significantly more
> >>> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
> >>> tag, build and test apache/drill-contrib with every release of
> >>> apache/drill, or would it run on its own schedule, perhaps with users
> >>> downloading builds made continuously from snapshots of HEAD?
> >>>
> >>>
> >>> Regards
> >>> James
> >>>
> >>>
> >>>
>
>

Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by James Turton <dz...@apache.org>.
Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven 
is probably a better fit than GitHub for distribution?  If Drillbits can 
write to their jars/3rdparty directory then I can imagine Drill gaining 
the ability to fetch and install plugins itself without too much 
trouble, at least for Drill clusters with Internet access.  
"Sideloading" by downloading from Maven and copying manually would 
always remain possible.
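
To make that concrete, here is a rough sketch (plain JDK, no extra
dependencies) of what "fetch and install" could look like.  The coordinate
org.example.drill:drill-foo-plugin:1.0.0 is made up, and a real
implementation would of course also need checksum verification and some way
to repeat the install on every Drillbit.

// Hypothetical sketch only: downloads a plugin jar from a Maven-layout
// repository into the local Drillbit's jars/3rdparty directory.
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PluginFetcher {
  public static void main(String[] args) throws Exception {
    String group = "org.example.drill";     // hypothetical coordinates
    String artifact = "drill-foo-plugin";
    String version = "1.0.0";

    // Standard Maven repository layout: dots in the groupId become slashes.
    String url = String.format(
        "https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar",
        group.replace('.', '/'), artifact, version, artifact, version);

    Path target = Path.of(System.getenv("DRILL_HOME"), "jars", "3rdparty",
        artifact + "-" + version + ".jar");

    try (InputStream in = URI.create(url).toURL().openStream()) {
      Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
  }
}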

@Paul I'll try to get a little time with you to get some ideas about 
designing a plugin API.

On 2022/01/14 23:20, Paul Rogers wrote:
> Hi All,
>
> James raises an important issue, I've noticed that it used to be easy to
> build and test Drill, now it is a struggle, because of the many odd
> external dependencies we have introduced. That acts as a big damper on
> contributions: none of us get paid enough to spend more time fighting
> builds than developing the code...
>
> Ted is right that we need a good way to install plugins. There are two
> parts. Ted is talking about the high-level part: make it easy to point to
> some repo and use the plugin. Since Drill is Java, the Maven repo could be
> a good mechanism. In-house stuff is often in an internal repo that does
> whatever Maven needs.
>
> The reason that plugins are in the Drill project now is that Drill's "API"
> is all of Drill. Plugins can (and some do) access all of Drill though the
> fragment context. The API to Calcite and other parts of Drill are wide, and
> tend to be tightly coupled with Drill internals. By contrast, other tools,
> such as Presto/Trino, have defined very clean APIs that extensions use. In
> Druid, everything is integrated via Google Guice and an extension can
> replace any part of Druid (though, I'm not convinced that's actually a good
> idea.) I'm sure there are others we can learn from.
>
> So, we need to define a plugin API for Drill. I started down that route a
> while back: the first step was to refactor the plugin registry so it is
> ready for extensions. The idea was to use the same mechanism for all kinds
> of extensions (security, UDFs, metastore, etc.) The next step was to build
> something that roughly followed Presto, but that kind of stalled out.
>
> In terms of ordering, we'd first need to define the plugin API. Then, we
> can shift plugins to use that. Once that is done, we can move plugins to
> separate projects. (The metastore implementation can also move, if we
> want.) Finally, figure out a solution for Ted's suggestion to make it easy
> to grab new extensions. Drill is distributed, so adding a new plugin has to
> happen on all nodes, which is a bit more complex than the typical
> Julia/Python/R kind of extension.
>
> The reason we're where we're at is that it is the path of least resistance.
> Creating a good extension mechanism is hard, but valuable, as Ted noted.
>
> Thanks,
>
> - Paul
>
> On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning<te...@gmail.com>  wrote:
>
>> The bigger reason for a separate plug-in world is the enhancement of
>> community.
>>
>> I would recommend looking at the Julia community for examples of
>> effective ways to drive plug in structure.
>>
>> At the core, for any pure julia package, you can simply add a package by
>> referring to the github repository where the package is stored. For
>> packages that are "registered" (i.e. a path and a checksum is recorded in a
>> well known data store), you can add a package by simply naming it without
>> knowing the path.  All such plugins are tested by the authors and the
>> project records all dependencies with version constraints so that cascading
>> additions are easy. The community leaders have made tooling available so
>> that you can test your package against a range of versions of Julia by
>> pretty simple (to use) Github actions.
>>
>> The result has been an absolute explosion in the number of pure Julia
>> packages.
>>
>> For packages that include C or Fortran (or whatever) code, there is some
>> amazing tooling available that lets you record a build process on any of
>> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, OSX
>> and so on). WHen you register such a package, it is automagically built on
>> all the platforms you indicate and the binary results are checked into a
>> central repository known as Yggdrasil.
>>
>> All of these registration events for different packages are recorded in a
>> central registry as I mentioned. That registry is recorded in Github as
>> well which makes it easy to propagate changes.
>>
>>
>>
>> On Thu, Jan 13, 2022 at 8:45 PM James Turton<dz...@apache.org>  wrote:
>>
>>> Hello dev community
>>>
>>> Discussions about reorganising the Drill source code to better position
>>> the project to support plug-ins for the "long tail" of weird and
>>> wonderful systems and data formats have been coming up here and there
>>> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>>>
>>> A view which I personally share is that adding too large a number and
>>> variety of plug-ins to the main tree would create a lethal maintenance
>>> burden for developers working there and lead down a road of accumulating
>>> technical debt.  The Maven tricks we must employ to harmonise the
>>> growing set of dependencies of the main tree to keep it buildable are
>>> already enough, as is the size of our distributable and the count of
>>> open bug reports.
>>>
>>>
>>> Thus, the idea of splitting out "/contrib" into a new
>>> apache/drill-contrib repo after selecting a subset of plugins to remain
>>> in apache/drill.  I'll now volunteer a set of criteria to decide whether
>>> a plug-in should live in this notional apache/drill-contrib.
>>>
>>>   1. The plug-in queries an unstructured data format (even if it only
>>>      reads metadata fields) e.g. Image format plug-in.
>>>   2. The plug-in queries a data format that was designed for human
>>>      consumption e.g. Excel format plug-in.
>>>   3. The plug-in cannot be expected to run with speed and reliability
>>>      comparable to querying structured data on the local network e.g.
>>>      Dropbox storage plugin.
>>>   4. The plug-in queries an obscure system or format e.g. we receive a
>>>      plug-in for some data format used only on old Cray supercomputers.
>>>   5. The plug-in can for some reason not be well supported by the Drill
>>>      devs e.g. it has a JNI dependency on some difficult native libs.
>>>
>>>
>>> Any one of those suggests that an apache/drill-contrib is the better
>>> home to me, but what is your view?  Would we apply significantly more
>>> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
>>> tag, build and test apache/drill-contrib with every release of
>>> apache/drill, or would it run on its own schedule, perhaps with users
>>> downloading builds made continuously from snapshots of HEAD?
>>>
>>>
>>> Regards
>>> James
>>>
>>>
>>>


Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Paul Rogers <pa...@gmail.com>.
Hi All,

James raises an important issue. I've noticed that it used to be easy to
build and test Drill; now it is a struggle because of the many odd
external dependencies we have introduced. That acts as a big damper on
contributions: none of us get paid enough to spend more time fighting
builds than developing the code...

Ted is right that we need a good way to install plugins. There are two
parts. Ted is talking about the high-level part: make it easy to point to
some repo and use the plugin. Since Drill is Java, the Maven repo could be
a good mechanism. In-house stuff is often in an internal repo that does
whatever Maven needs.

The reason that plugins are in the Drill project now is that Drill's "API"
is all of Drill. Plugins can (and some do) access all of Drill through the
fragment context. The APIs to Calcite and other parts of Drill are wide and
tend to be tightly coupled with Drill internals. By contrast, other tools,
such as Presto/Trino, have defined very clean APIs that extensions use. In
Druid, everything is integrated via Google Guice and an extension can
replace any part of Druid (though I'm not convinced that's actually a good
idea). I'm sure there are others we can learn from.

So, we need to define a plugin API for Drill. I started down that route a
while back: the first step was to refactor the plugin registry so it is
ready for extensions. The idea was to use the same mechanism for all kinds
of extensions (security, UDFs, metastore, etc.). The next step was to build
something that roughly followed Presto, but that kind of stalled out.
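
For illustration only, a narrow SPI in that spirit might look something like
the sketch below.  None of these names exist in Drill today; the point is
just that a plugin would compile against a small, versioned surface rather
than against all of Drill.

// Sketch, not Drill's actual API: one module type carrying every kind of
// extension, loosely in the spirit of the Presto/Trino connector SPI.
public interface DrillPluginModule {

  // Marker interfaces for the things a module can contribute; in a real
  // design these would be small, stable abstractions owned by the SPI,
  // not Drill internals.
  interface StoragePluginFactory {}
  interface FormatPluginFactory {}
  interface FunctionFactory {}

  // Stable identifier such as "kafka" or "excel".
  String name();

  // SPI version the plugin was compiled against, so the engine can refuse
  // to load incompatible binaries instead of failing mid-query.
  int spiVersion();

  // A module contributes only what it needs and returns empty lists for
  // the rest; the same mechanism covers storage plugins, format plugins,
  // UDFs and so on.
  java.util.List<StoragePluginFactory> storagePlugins();
  java.util.List<FormatPluginFactory> formatPlugins();
  java.util.List<FunctionFactory> functions();
}

Modules like that could be discovered with java.util.ServiceLoader, so
installing a plugin would amount to dropping a jar where the class loader
can see it.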

In terms of ordering, we'd first need to define the plugin API. Then, we
can shift plugins to use that. Once that is done, we can move plugins to
separate projects. (The metastore implementation can also move, if we
want.) Finally, figure out a solution for Ted's suggestion to make it easy
to grab new extensions. Drill is distributed, so adding a new plugin has to
happen on all nodes, which is a bit more complex than the typical
Julia/Python/R kind of extension.

The reason we're where we're at is that it is the path of least resistance.
Creating a good extension mechanism is hard, but valuable, as Ted noted.

Thanks,

- Paul

On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning <te...@gmail.com> wrote:

> The bigger reason for a separate plug-in world is the enhancement of
> community.
>
> I would recommend looking at the Julia community for examples of
> effective ways to drive plug in structure.
>
> At the core, for any pure julia package, you can simply add a package by
> referring to the github repository where the package is stored. For
> packages that are "registered" (i.e. a path and a checksum is recorded in a
> well known data store), you can add a package by simply naming it without
> knowing the path.  All such plugins are tested by the authors and the
> project records all dependencies with version constraints so that cascading
> additions are easy. The community leaders have made tooling available so
> that you can test your package against a range of versions of Julia by
> pretty simple (to use) Github actions.
>
> The result has been an absolute explosion in the number of pure Julia
> packages.
>
> For packages that include C or Fortran (or whatever) code, there is some
> amazing tooling available that lets you record a build process on any of
> the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, OSX
> and so on). WHen you register such a package, it is automagically built on
> all the platforms you indicate and the binary results are checked into a
> central repository known as Yggdrasil.
>
> All of these registration events for different packages are recorded in a
> central registry as I mentioned. That registry is recorded in Github as
> well which makes it easy to propagate changes.
>
>
>
> On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:
>
> > Hello dev community
> >
> > Discussions about reorganising the Drill source code to better position
> > the project to support plug-ins for the "long tail" of weird and
> > wonderful systems and data formats have been coming up here and there
> > for a few months, e.g. in https://github.com/apache/drill/pull/2359.
> >
> > A view which I personally share is that adding too large a number and
> > variety of plug-ins to the main tree would create a lethal maintenance
> > burden for developers working there and lead down a road of accumulating
> > technical debt.  The Maven tricks we must employ to harmonise the
> > growing set of dependencies of the main tree to keep it buildable are
> > already enough, as is the size of our distributable and the count of
> > open bug reports.
> >
> >
> > Thus, the idea of splitting out "/contrib" into a new
> > apache/drill-contrib repo after selecting a subset of plugins to remain
> > in apache/drill.  I'll now volunteer a set of criteria to decide whether
> > a plug-in should live in this notional apache/drill-contrib.
> >
> >  1. The plug-in queries an unstructured data format (even if it only
> >     reads metadata fields) e.g. Image format plug-in.
> >  2. The plug-in queries a data format that was designed for human
> >     consumption e.g. Excel format plug-in.
> >  3. The plug-in cannot be expected to run with speed and reliability
> >     comparable to querying structured data on the local network e.g.
> >     Dropbox storage plugin.
> >  4. The plug-in queries an obscure system or format e.g. we receive a
> >     plug-in for some data format used only on old Cray supercomputers.
> >  5. The plug-in can for some reason not be well supported by the Drill
> >     devs e.g. it has a JNI dependency on some difficult native libs.
> >
> >
> > Any one of those suggests that an apache/drill-contrib is the better
> > home to me, but what is your view?  Would we apply significantly more
> > relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
> > tag, build and test apache/drill-contrib with every release of
> > apache/drill, or would it run on its own schedule, perhaps with users
> > downloading builds made continuously from snapshots of HEAD?
> >
> >
> > Regards
> > James
> >
> >
> >
>

Re: [DISCUSS] Drill 2 and plug-in organisation

Posted by Ted Dunning <te...@gmail.com>.
The bigger reason for a separate plug-in world is the enhancement of
community.

I would recommend looking at the Julia community for examples of
effective ways to drive plug-in structure.

At the core, for any pure Julia package, you can simply add a package by
referring to the GitHub repository where the package is stored. For
packages that are "registered" (i.e. a path and a checksum are recorded in a
well-known data store), you can add a package simply by naming it, without
knowing the path.  All such packages are tested by their authors, and the
project records all dependencies with version constraints so that cascading
additions are easy. The community leaders have made tooling available so
that you can test your package against a range of Julia versions with
pretty simple (to use) GitHub Actions.
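
Translating that registry idea into Drill terms, an entry might not need to
carry much more than the following (a hypothetical Java sketch; no such
registry exists for Drill today).

// Illustrative only: what a registered-plugin record could hold so that a
// Drillbit can locate, verify and version-check an extension.
public record PluginRegistryEntry(
    String name,               // e.g. "drill-excel-format"
    String repositoryUrl,      // where the source or artifact lives
    String sha256,             // checksum used to verify the download
    String drillVersionRange   // supported engine versions, e.g. "[1.20,2.0)"
) {}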

The result has been an absolute explosion in the number of pure Julia
packages.

For packages that include C or Fortran (or whatever) code, there is some
amazing tooling available that lets you record a build process on any of
the supported platforms (Linux, Linux on ARM, 32- or 64-bit, Windows, BSD,
OS X and so on). When you register such a package, it is automagically built
on all the platforms you indicate and the binary results are checked into a
central repository known as Yggdrasil.

All of these registration events for different packages are recorded in a
central registry, as I mentioned. That registry is kept in GitHub as well,
which makes it easy to propagate changes.



On Thu, Jan 13, 2022 at 8:45 PM James Turton <dz...@apache.org> wrote:

> Hello dev community
>
> Discussions about reorganising the Drill source code to better position
> the project to support plug-ins for the "long tail" of weird and
> wonderful systems and data formats have been coming up here and there
> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>
> A view which I personally share is that adding too large a number and
> variety of plug-ins to the main tree would create a lethal maintenance
> burden for developers working there and lead down a road of accumulating
> technical debt.  The Maven tricks we must employ to harmonise the
> growing set of dependencies of the main tree to keep it buildable are
> already enough, as is the size of our distributable and the count of
> open bug reports.
>
>
> Thus, the idea of splitting out "/contrib" into a new
> apache/drill-contrib repo after selecting a subset of plugins to remain
> in apache/drill.  I'll now volunteer a set of criteria to decide whether
> a plug-in should live in this notional apache/drill-contrib.
>
>  1. The plug-in queries an unstructured data format (even if it only
>     reads metadata fields) e.g. Image format plug-in.
>  2. The plug-in queries a data format that was designed for human
>     consumption e.g. Excel format plug-in.
>  3. The plug-in cannot be expected to run with speed and reliability
>     comparable to querying structured data on the local network e.g.
>     Dropbox storage plugin.
>  4. The plug-in queries an obscure system or format e.g. we receive a
>     plug-in for some data format used only on old Cray supercomputers.
>  5. The plug-in can for some reason not be well supported by the Drill
>     devs e.g. it has a JNI dependency on some difficult native libs.
>
>
> Any one of those suggests that an apache/drill-contrib is the better
> home to me, but what is your view?  Would we apply significantly more
> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
> tag, build and test apache/drill-contrib with every release of
> apache/drill, or would it run on its own schedule, perhaps with users
> downloading builds made continuously from snapshots of HEAD?
>
>
> Regards
> James
>
>
>