Posted to user@drill.apache.org by Avner Levy <av...@gmail.com> on 2020/06/28 22:16:59 UTC

exec.queue.enable in drill-embedded

Hi,
I'm using the Drill 1.18 (master) Docker image and trying to configure its
memory after getting out-of-heap-memory errors:
"RESOURCE ERROR: There is not enough heap memory to run this query using
the web interface."
The container is serving remote clients through the REST API.
The queries are simple selects over tiny parquet files stored in S3.
It is running in a 16GB container, configured with 8GB of heap and 8GB of
direct memory.
I tried to use:
  exec.queue.enable=true
  exec.queue.large=1
  exec.queue.small=1

and verified it was configured correctly, but I still see queries running
concurrently.
In addition, the "drill.queries.enqueued" counter remains zero.
Is this mechanism supported in drill-embedded?
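For reference, a minimal sketch of how these options can be set with
ALTER SYSTEM statements over the REST API (the localhost URL is a
placeholder for a real drillbit; only the request bodies are built here,
nothing is sent):

```python
import json

# Sketch only: builds the JSON bodies that Drill's /query.json REST
# endpoint expects for ALTER SYSTEM statements. The URL below is a
# placeholder; no request is actually made in this snippet.
DRILL_QUERY_URL = "http://localhost:8047/query.json"

def alter_system_payload(option, value):
    # Drill option names go in backticks inside the SQL statement.
    sql = f"ALTER SYSTEM SET `{option}` = {value}"
    return {"queryType": "SQL", "query": sql}

for opt, val in [("exec.queue.enable", "true"),
                 ("exec.queue.large", 1),
                 ("exec.queue.small", 1)]:
    print(json.dumps(alter_system_payload(opt, val)))
```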

In addition, there seems to be a memory leak: after a while, even with no
queries running, a single tiny query still gives the same error.
Any insight would be highly appreciated :)
Thanks,
  Avner

Re: exec.queue.enable in drill-embedded

Posted by Rafael Jaimes III <ra...@gmail.com>.
What do you consider tiny for a parquet file? A few KB, or a few MB?

16 GB may seem like a lot, but it is quite small for a query engine. If
possible, increase the container to 32 GB and give Drill as much heap as
possible, since heap is what it will use for the REST API. Try more than
16 GB of heap and see if you still have issues.

Drill 1.18 is also starting to support Docker/Kubernetes a little better,
so you might be able to set up parallel workers (drillbits) with ZooKeeper
and spread the memory usage that way. That could take a bit of setting up,
though; I'd try increasing the resources on your single drillbit before
parallelizing it.
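If you go the bigger-container route, heap and direct memory are set in
conf/drill-env.sh. A sketch for a 32 GB container (variable names per the
Drill docs; the exact split is just an example, not a recommendation):

```shell
# conf/drill-env.sh -- example split for a 32 GB container.
# DRILL_HEAP feeds -Xms/-Xmx; DRILL_MAX_DIRECT_MEMORY feeds
# -XX:MaxDirectMemorySize. Leave headroom for the OS and metaspace.
export DRILL_HEAP="20G"
export DRILL_MAX_DIRECT_MEMORY="8G"
```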

- Rafael


Re: exec.queue.enable in drill-embedded

Posted by Rafael Jaimes III <ra...@gmail.com>.
Hi Avner,

By cluster, that's exactly what I was thinking: multiple deployed
containers. But if you do this, I don't recommend standalone embedded
drillbits. You can run Drill in distributed mode instead and use ZooKeeper
to manage the drillbits. It's pretty straightforward, except that last
time I checked it doesn't really work out of the box when scaling up pods
from the same container image. There has been some progress on
Docker/Kubernetes support recently, though; if you search the mailing list
you'll probably find it (see also the git link below). It's possible the
improvements will land in 1.18. There may be an easier way to do all this
with Amazon, but I don't have experience with that.
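For distributed mode, each drillbit points at the same ZooKeeper quorum in
conf/drill-override.conf. A minimal sketch (the ZooKeeper host below is a
placeholder):

```
# conf/drill-override.conf -- minimal distributed-mode sketch.
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "zookeeper-host:2181"
}
```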

Agirish has done some great work with drill and k8s support:
https://github.com/Agirish

Rafael


Re: exec.queue.enable in drill-embedded

Posted by Avner Levy <av...@gmail.com>.
Hi Paul & Rafael, I really appreciate your assistance.
My Parquet files are really small (less than 1 MB in most cases), and the
returned JSON is usually less than a few MB.
Moving to a 28GB heap helped (although even with 28GB I get heap issues
once in a while).
Since my queries are on a small amount of data (usually a join between two
or three 1MB parquet files), I was thinking of deploying a bunch of
standalone Drill containers with an auto-scaling policy and an ELB in
front.
This reduces the complexity of managing a cluster and dealing with
downtime (if there are problems, you just restart the container).
It provides a SQL engine over S3 parquet files to other microservices over
REST (limited to small-scale queries, which suits my requirements).
But my Drill knowledge is very limited, so any feedback is appreciated.
Thanks,
Avner


Re: exec.queue.enable in drill-embedded

Posted by Paul Rogers <pa...@gmail.com>.
Hi Avner,

Query queueing is not available in embedded mode: it uses ZooKeeper to
throttle the number of concurrent queries across a cluster, but embedded
mode has no cluster and does not use ZooKeeper. (If you are running more
than a few concurrent queries, embedded mode is likely the wrong
deployment model anyway.)
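Since there is no server-side queue in embedded mode, one workaround is to
throttle on the client side. A minimal sketch (run_query is a hypothetical
stand-in for whatever function posts SQL to Drill):

```python
import threading

# Client-side throttle: allow at most MAX_CONCURRENT queries in flight.
# Extra callers block on the semaphore until a slot frees up, which is
# roughly what the server-side exec.queue.* options would do.
MAX_CONCURRENT = 2
_gate = threading.BoundedSemaphore(MAX_CONCURRENT)

def throttled(run_query, sql):
    # run_query is a hypothetical callable that submits SQL to Drill.
    with _gate:
        return run_query(sql)
```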

The problem here is the use of the REST API. It has terrible performance:
it buffers the entire result set in memory in a way that overwhelms the
heap. The REST API was designed to power the Web UI for small queries of a
few hundred rows; Drill was designed assuming "real" queries would use the
ODBC, JDBC, or native APIs.

That said, there is an in-flight PR designed to fix the heap memory issue
for REST queries. However, even with that fix, your client must still be
capable of handling a very large JSON document, since rows are not
returned in a "jsonlines" format or in batches. If you retrieve a million
rows, they will arrive in a single huge JSON document.
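To illustrate the difference (a toy sketch, not Drill's actual response
shape): a single JSON document must be parsed whole, while jsonlines can
be consumed one row at a time:

```python
import io
import json

# Toy illustration: the same three rows as one JSON document vs. as
# jsonlines. The single document must be held and parsed in full;
# jsonlines can be read and decoded line by line.
rows = [{"n": i} for i in range(3)]

single_doc = json.dumps({"rows": rows})      # one big document
parsed_all = json.loads(single_doc)["rows"]  # whole thing in memory

jsonlines = "\n".join(json.dumps(r) for r in rows)
streamed = [json.loads(line) for line in io.StringIO(jsonlines)]

assert parsed_all == streamed == rows
```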

How many rows does the query return? If a few thousand or less, we can
perhaps finish up the REST fix to solve the issue. Otherwise, consider
switching to a more scalable API.

How many rows are read from S3, and with what kind of processing? A simple
WHERE clause, or is there an ORDER BY, GROUP BY, or join that would drive
memory use? If it's just a scan and a WHERE clause, then the memory you
have should be plenty - once the REST problem is fixed.

Thanks,

- Paul

