Posted to dev@drill.apache.org by Paul Rogers <pa...@yahoo.com.INVALID> on 2020/05/03 21:42:12 UTC

Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

Hi Tug,

Glad to hear from you again. Ted's summary is pretty good; here's a bit more detail.


Presto is another alternative which seems to have gained the most traction outside of the Cloud ecosystem on the one hand, and the Cloudera/HortonWorks ecosystem on the other. Presto does, however, demand that you have a schema, which is often an obstacle for many applications.

Most folks I've talked to who tried to use Spark for this use case came away disappointed. Unlike Drill (or Presto or Impala), Spark wants to start new Java processes for each query. Makes great sense for large, complex map/reduce jobs, but is a non-starter for small, interactive queries.

Hive also is trying to be an "uber query layer" and has integrations with multiple systems. But, Hive's complexity makes Drill look downright simple by comparison. Hive also needs an up-front schema.


I've had the opportunity to integrate Drill with two different noSQL engines. Getting started is easy, especially if a REST or similar API is available. Filter push-down is the next step, as otherwise Drill will simply suck all data from your DB as if it were a file. We've added some structure in the new HTTP reader to make it a bit easier than it used to be to create this kind of filter push-down. (The other kind of filter push-down is for partition pruning used for files, which you probably won't need.)
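To make the idea concrete, here is a toy sketch of what filter push-down buys you for a REST-backed source. All class and method names below are invented for illustration; Drill's real HTTP plugin works through Calcite rules and its own config classes. The point is only the split: predicates the API can evaluate become query parameters, and whatever can't be pushed is left as a residual filter for the engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of filter push-down for a REST-backed data source.
// Names are invented for illustration, not Drill's actual API.
public class FilterPushDownSketch {

    // A simple equality predicate: column = value.
    record EqFilter(String column, String value) {}

    // Split filters into those the REST API can evaluate (pushed down
    // as query parameters) and those Drill must still apply itself.
    static String buildUrl(String baseUrl, List<EqFilter> filters,
                           List<String> pushableColumns,
                           List<EqFilter> residual) {
        Map<String, String> params = new LinkedHashMap<>();
        for (EqFilter f : filters) {
            if (pushableColumns.contains(f.column())) {
                params.put(f.column(), f.value());   // becomes ?col=value
            } else {
                residual.add(f);                     // engine filters this one
            }
        }
        StringBuilder url = new StringBuilder(baseUrl);
        String sep = "?";
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(sep).append(e.getKey()).append('=').append(e.getValue());
            sep = "&";
        }
        return url.toString();
    }

    public static void main(String[] args) {
        List<EqFilter> residual = new ArrayList<>();
        String url = buildUrl("http://api.example.com/logs",
            List.of(new EqFilter("level", "ERROR"), new EqFilter("msg", "x")),
            List.of("level"), residual);
        System.out.println(url); // only `level` was pushed down
        System.out.println(residual.size() + " filter(s) left for the engine");
    }
}
```

Without the push-down step, every query reads the whole endpoint; with it, the source only returns rows matching the pushed predicates.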

Aside from the current MapR repo issues, Drill tends to be much easier to build than other systems. Pretty much set up Java and the correct Maven and you're good to go. If you run unit tests, there is one additional library to install, but the tests themselves tell you exactly what is needed when they fail the first time (which is how I learned about it).


After that, performance will point the way. For example, does your DB have indexes? If so, then you can leverage the work originally done for MapR-DB to convey index information to Calcite so it can pick the best execution plan. There are specialized operators for index key lookup as well.

All this will get you the basic one-table scan, which is often all that no-SQL DBs ever need. (Any structure usually appears within each document, rather than as joined tables as in the RDBMS world.) However, if your DB does need joins, you will need something like Calcite to work out the tradeoffs of the various join+filter-push plans possible, especially if your DB supports multiple indexes. There is no escaping the plan-time complexity of these cases. Calcite is big and complex, but it does give you the tools needed to solve these problems.
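As a toy illustration of the kind of trade-off the planner has to resolve, here is a made-up cost model for the scan-versus-index decision. The constants and formulas are invented for this sketch; Calcite's real cost model is far more detailed, but the shape of the decision is the same: an index probe wins when the predicate is selective, and a full scan wins when it is not.

```java
// Toy cost model for the scan-vs-index decision a planner makes.
// Constants are invented for illustration only.
public class PlanCostSketch {

    static final double ROW_SCAN_COST = 1.0;    // cost to scan one row sequentially
    static final double INDEX_PROBE_COST = 5.0; // cost per index lookup + row fetch

    static double fullScanCost(long tableRows) {
        return tableRows * ROW_SCAN_COST;
    }

    static double indexCost(long tableRows, double selectivity) {
        // rows matched by the predicate, each fetched via the index
        return tableRows * selectivity * INDEX_PROBE_COST;
    }

    static String choosePlan(long tableRows, double selectivity) {
        return indexCost(tableRows, selectivity) < fullScanCost(tableRows)
            ? "index" : "scan";
    }

    public static void main(String[] args) {
        // Selective predicate: 0.1% of a million rows -> index wins.
        System.out.println(choosePlan(1_000_000, 0.001)); // index
        // Unselective predicate: 80% of the rows -> full scan wins.
        System.out.println(choosePlan(1_000_000, 0.8));   // scan
    }
}
```

Once joins enter the picture, the planner is comparing many such costed alternatives (which index, which join order, which filters to push), which is exactly the search Calcite automates.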

If your DB is to be used to power dashboards (summaries of logs, time series, click streams, sales or whatever), you'll soon find you need to provide a caching/aggregation layer to avoid banging on your DB each time the dashboard refreshes. (Imagine a 1-week dashboard, updated every minute, where only the last hour has new data.) Drill becomes very handy as a way of combining data from a mostly-static caching layer (data for the last 6 days, say) with your live DB (for the last one day, say.)
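A sketch of the split that makes this work (the class names and the one-day cutoff below are invented for illustration): the dashboard's time range is partitioned at a cutoff, the older part answered from the cache and the newer part from the live store, and the two result sets are then combined — in Drill, typically as a UNION ALL across two storage plugins.

```java
import java.time.Duration;
import java.time.Instant;

// Toy sketch of the cache-plus-live split for a dashboard query.
// Names and the cutoff policy are invented for illustration.
public class DashboardSplitSketch {

    record TimeRange(Instant start, Instant end) {}

    // Everything older than `cutoff` is served from the cache;
    // everything newer comes from the live database.
    static TimeRange[] split(TimeRange query, Instant cutoff) {
        if (!query.end().isAfter(cutoff)) {
            return new TimeRange[] { query, null };   // entirely cached
        }
        if (!query.start().isBefore(cutoff)) {
            return new TimeRange[] { null, query };   // entirely live
        }
        return new TimeRange[] {
            new TimeRange(query.start(), cutoff),     // cached part
            new TimeRange(cutoff, query.end())        // live part
        };
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2020-05-03T21:00:00Z");
        Instant cutoff = now.minus(Duration.ofDays(1));
        TimeRange week = new TimeRange(now.minus(Duration.ofDays(7)), now);
        TimeRange[] parts = split(week, cutoff);
        System.out.println("cached: " + parts[0]);
        System.out.println("live:   " + parts[1]);
    }
}
```

On each one-minute refresh, only the live part of the range (the last day here) hits the DB; the cached six days never change between refreshes.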

If you provide a "writer" as well as a "reader", you can use Drill to load your DB as well as query it.


Happy to share whatever else I might have learned if you can describe your goals in a bit more detail.

Thanks,
- Paul

 

    On Sunday, May 3, 2020, 11:25:11 AM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
 The compile problem is a problem with the MapR repo (I think). I have
reported it to the folks who can fix it.

Regarding the generic question, I think that Drill is very much a good
choice for putting a SQL layer on a noSQL database.

It is definitely the case that the community is much broader than it used
to be. A number of companies now use Drill in their products, which is
one of the best ways to build long-term community.

There are alternatives, of course. All have trade-offs (because we live in
the world):

- Calcite itself (what Drill uses as a SQL parser and optimizer) can be
used, but you have to provide an execution framework and you wind up with
something that only works for your engine and is unlikely to support
parallel operations. Calcite is used by lots of projects, though, so it
has a very broad base of support.

- Spark SQL is fairly easy to extend (from what I hear from friends) but
the optimizer doesn't deal well with complicated tradeoffs (precisely
because it is fairly simple). You also wind up with the baggage of Spark,
which could be good or bad. You would get some parallelism, though. I don't
think that Spark SQL handles complex objects, however.

- Postgres has a long history of having odd things grafted onto it. I know
little about this other than seeing the results. Extending Postgres would
not likely give you any parallelism, but there might be a way to support
complex objects through Postgres JSON object support.




On Sun, May 3, 2020 at 11:09 AM Tugdual Grall <tu...@gmail.com> wrote:

> Hello
>
> It has been a long time since I used Drill!
>
> I wanted to build it to start work on a new data source.
>
> But when I run "mvn clean install", I hit the exception below.
>
> => Can somebody help?
>
> => This brings me to a generic question: if I want to expose a NoSQL
> database using SQL/JDBC/ODBC for analytics purposes, is Drill the best
> option, or should I look at something else?
>
>
> Thanks!
>
> ====
> [INFO] exec/Java Execution Engine ......................... FAILURE [
>  0.676 s]
>
> [ERROR] Failed to execute goal on project drill-java-exec: Could not
> resolve dependencies
> for project org.apache.drill.exec:drill-java-exec:jar:1.18.0-SNAPSHOT:
> Failed to collect dependencies at org.kohsuke:libpam4j:jar:1.8-rev2: Failed
> to read artifact descriptor for org.kohsuke:libpam4j:jar:1.8-rev2: Could
> not transfer artifact org.kohsuke:libpam4j:pom:1.8-rev2 from/to
> mapr-releases (http://repository.mapr.com/maven/): Transfer failed for
>
> http://repository.mapr.com/maven/org/kohsuke/libpam4j/1.8-rev2/libpam4j-1.8-rev2.pom
> 500 Proxy Error -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
>
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]  mvn <args> -rf :drill-java-exec
>
  

Re: Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Thanks for the update; I hadn't picked up on that bit of confusion about Presto.

I just did a Drill build, seemed to work, thanks for the fix. However, I don't know if I had the needed dependency cached, so my build might have worked anyway...

Thanks,
- Paul

 

    On Sunday, May 3, 2020, 3:09:58 PM PDT, Ted Dunning <te...@gmail.com> wrote:  
 
 I didn't mention Presto on purpose. It is a fine tool, but the community is
plagued lately by a fork. That can be expected to substantially inhibit
adoption and I think that is just what I have seen. It used to be that
people asked about Presto every other time I was on a call and I haven't
heard even one such question in over a year. The community may recover from
this, but it is hard to say whether they can regain their momentum.

In case anybody wants to sample the confusion, here are the two "official"
homes on github:

https://github.com/prestodb/presto
https://github.com/prestosql/presto

The worst part is that neither fork seems to dominate the other. With the
Hudson/Jenkins fork, at least, Hudson basically died while Jenkins continued
with full momentum. Here, both sides seem to be splitting things much too
evenly.



