
Posted to user@drill.apache.org by Ramasamy Javakar <ra...@ezeeinfosolutions.com> on 2020/04/08 17:59:06 UTC

Apache Drill Support concurrent parallel Request

Hi, I built an analytics web application on Drill, with the data set in JSON
files. We are facing issues when handling multiple parallel requests. Does
Apache Drill support concurrent requests? Please let me know.


Thanks & Regards
Ramasamy

Product Manager
EzeeInfo Cloud Solutions
+91 95000 07269

Re: Apache Drill Support concurrent parallel Request

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Ted,

I echo the question about workload: I started with the simplest possible explanation, hoping that would spur a bit more of a use case description.

Good point on planner cost. Drill uses Apache Calcite for planning. Calcite is a monster: an interpreter, a rules engine, a generic SQL parser and analyzer. Calcite is great for what it was designed for: very complex queries against huge data sets, such as those which Hive queries. (Calcite is used by Hive for its planning, where it replaced a home-grown planner.)

A simpler planner might be faster, but would have its limits. For example, from my time on Impala, I learned that Impala's query planning is basically a sprint from a SQL parse tree to a Thrift query plan with very little optimization other than Parquet partition pruning. As a result, Impala planning is quick, but extremely limited: the further your data strays from TPC-H, the worse the plans that Impala produces. (I spent over a month trying to fix a really bad plan caused by naive assumptions in the Impala planner.)

Last I heard (a year ago), Impala planned to abandon their home-grown planner and move to Hive's Calcite-based planner (as part of merging Impala into Hive.) Don't know how that is going or if it is still the plan. We can guess that Impala will suffer the same planning overhead as Drill once that work is done.

By contrast, Presto completely rewrote their ad-hoc planner to create a dynamic planner that can use costs, somewhat like the Calcite planner does, but specific to Presto (that is, not based on Calcite.) I've not heard how well Presto handles complex queries, or the quality of its plans. Anyone have experience with this aspect of Presto?

The challenge is, each new planner (Cockroach DB, Hive when moving to Calcite, Impala when moving to Hive/Calcite, Presto with their new planner) takes multiple person-years of effort. Unfortunately, MapR did not have that kind of time to invest during the MapR-DB project. The frantic, chaotic hacks that were done could not overcome Calcite's fundamental design limitation of being very heavy-weight. Of course, the major benefit of MapR-DB is secondary indexes, which actually require a cost-based, rule-driven planner such as Calcite because of the large number of potential plans to evaluate. There is no free lunch. (Aman, who did much of the index-plan work, contributed it to Drill, so it is available for anyone else with a similar data source.)

All that said, I agree with your point that Drill would clearly benefit from a faster, simpler planner for the kinds of queries most people seem to do: a simple query against one or two data sources, with no indexes, on a single embedded Drillbit. If anyone knows of such a thing as an open source project, it would be great to hear about it. We could use the "mini-planner" for simple queries, but switch to Calcite for the heavy-weight queries where the extra planning cost would be worthwhile. (This idea was tossed around during the MapR-DB project, but as noted, there simply wasn't time to build a mini-planner.)
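
The routing idea can be sketched roughly as follows. This is only an illustration of the decision described above, not anything that exists in Drill or Calcite: QueryShape, choose_planner and the field names are all hypothetical, and the "simple" test just encodes the description in the paragraph (one or two sources, no indexes, single embedded Drillbit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryShape:
    """Hypothetical summary of a parsed query; not a Drill or Calcite type."""
    num_sources: int    # how many data sources the query touches
    uses_indexes: bool  # whether secondary indexes are candidates
    distributed: bool   # False for a single embedded Drillbit


def choose_planner(shape: QueryShape) -> str:
    """Route simple queries to a light planner, everything else to Calcite."""
    simple = (
        shape.num_sources <= 2
        and not shape.uses_indexes
        and not shape.distributed
    )
    return "mini" if simple else "calcite"


# A single-file query on an embedded Drillbit takes the light path.
print(choose_planner(QueryShape(1, False, False)))
# An index-backed query on a cluster still goes through Calcite.
print(choose_planner(QueryShape(3, True, True)))
```

The point of the split is that the cheap check runs on every query, while the expensive cost-based search is paid only where there are enough candidate plans to justify it.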

All this said, planning never has been (to the best of my recollection) the bottleneck in a multi-user Drill environment. Yes, it slows each individual query. But it is the run-time costs (CPU and memory contention) that tend to become an issue as the number of concurrent queries increases. The team has added some good basic throttling and queuing to handle intense usage spikes. More can be done (see Teradata for what 40 years of tinkering can get you.)
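
For readers unfamiliar with the idea, throttling of this kind can be sketched as a simple admission queue: only a fixed number of queries run at once, and the rest wait for a slot. The sketch below is generic and is not Drill's actual implementation; AdmissionQueue, max_running and fake_query are hypothetical names, and the sleep stands in for real query execution.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class AdmissionQueue:
    """Let at most max_running queries execute at once; others wait."""

    def __init__(self, max_running):
        self._slots = threading.BoundedSemaphore(max_running)
        self._lock = threading.Lock()
        self.running = 0  # queries executing right now
        self.peak = 0     # highest concurrency observed

    def run(self, query_fn, timeout_s=None):
        # Wait for a free slot; give up if the wait exceeds timeout_s.
        if not self._slots.acquire(timeout=timeout_s):
            raise TimeoutError("query waited too long in the queue")
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        try:
            return query_fn()
        finally:
            with self._lock:
                self.running -= 1
            self._slots.release()


def fake_query():
    """Stand-in for a real query; sleeps briefly and returns."""
    time.sleep(0.01)
    return "ok"


queue = AdmissionQueue(max_running=2)
# Eight clients arrive at once, but only two queries run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda _: queue.run(fake_query), range(8)))

print(results.count("ok"), "queries completed, peak concurrency", queue.peak)
```

Capping concurrency this way trades some queueing delay for bounded CPU and memory contention, which is the trade-off described above.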

In fact, in this day of K8s, an emerging new design is to run multiple clusters, each reading data from S3, etc. (So-called "separation of compute and storage.") In this model, each cluster is (dynamically) sized for a certain workload; much simpler than the old-school model of a single Drill cluster handling, say, both TB-sized ETL jobs and sub-second MapR-DB queries. It seems that Snowflake uses this model. I'm looking forward to trying out Abhishek's work with K8s to see what we can do.

Thanks,
- Paul


Re: Apache Drill Support concurrent parallel Request

Posted by Ted Dunning <te...@gmail.com>.

Another thing that users will see when they start trying to use Drill for
concurrent queries is that Drill assumes that it is OK to spend quite a bit
of time optimizing a query before running it. Taking 500 ms to optimize the
query can be a really bad trade-off if your query only takes 100ms to run.
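
As a quick worked example of that trade-off, the snippet below computes what share of end-to-end latency goes to planning. The 500 ms and 100 ms figures are the ones from the paragraph above; the 60-second query is a hypothetical contrast case, and planning_share is just an illustrative helper.

```python
def planning_share(plan_ms, run_ms):
    """Fraction of end-to-end latency spent in planning."""
    return plan_ms / (plan_ms + run_ms)


# Figures from the message: 500 ms to plan, 100 ms to run.
short = planning_share(500, 100)
# Hypothetical long-running query (60 s) for contrast.
long_running = planning_share(500, 60_000)

print(f"short query: {short:.0%} of latency is planning")        # 83%
print(f"long query:  {long_running:.1%} of latency is planning")  # 0.8%
```

The same 500 ms is negligible for a long query but dominates a short one, which is why the planning budget matters most for interactive, high-concurrency workloads.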

It is possible to tune this very differently, but that exercise is
definitely not a task for a user (or even a less-than-advanced developer).
In the MapR connection between the OJAI API and MapR DB, for instance, the
clear assumption is that queries will be relatively simple and all that
really needs to be done is look for good join ordering and make sure that
secondary indexes are used reasonably well. This meant that retuning for
fast optimization was very worthwhile.

A similar thing was done by Alibaba in their time series query engine.
There, the primary data source is a variant of Open TSDB and query costs
are dominated by the primary facts (the time series itself). Tuning the
optimizer to not think too much is a good thing.

So, could you say more about your workload so that the Drill community can
say more about what Drill will (or won't) do for you?


Re: Apache Drill Support concurrent parallel Request

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Ramasamy,

Let's define some terms. By "parallel requests" do you mean multiple people submitting queries at the same time? If so, then Drill handles this just fine: Drill is designed to run multiple queries from multiple users concurrently.

There is a caveat. Many people run Drill in embedded mode when they get started. Embedded mode is a single-user, single-machine setup that is great for testing Drill, exploring small data sets and so on. However, to support multiple concurrent queries, the proper way to run Drill is as a service, preferably across multiple machines. Further, if you are running a cluster of two or more machines, you need some kind of distributed file system: S3, Hadoop, etc.
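
As a rough sketch of what several clients querying a Drill service at once might look like, the code below posts SQL to Drill's REST endpoint (query.json) from a thread pool. It assumes a Drillbit running as a service with the web port at its default of 8047 on localhost; the helper names (build_request, submit, run_concurrently) are hypothetical, and a real web application would more likely use the JDBC driver with a connection pool.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: a Drillbit running as a service, web port at the default 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_request(sql, url=DRILL_URL):
    """Build the POST request that Drill's REST API expects for a SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(sql):
    """Send one query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(sql), timeout=60) as resp:
        return json.load(resp)


def run_concurrently(queries, submit_fn=submit, max_workers=4):
    """Submit queries in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(submit_fn, queries))
```

With a Drillbit running, something like run_concurrently(["SELECT * FROM dfs.`/data/events.json` LIMIT 10"] * 4) would issue four requests in parallel; the file path here is made up for illustration.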


Once you start running concurrent queries, memory becomes an important consideration, especially if your JSON files are large and you are doing memory-intensive operations such as sorting and joins. The Drill documentation explains the correct configuration steps.

Thanks,
- Paul

 
