Posted to user@impala.apache.org by Bin Wang <wb...@gmail.com> on 2017/04/05 16:52:47 UTC

Memory limit exceeded even with a very simple count query

Hi,

I've been using Impala in production for a while, but since yesterday some
queries have been failing with "memory limit exceeded". Then I tried a very
simple count query, and it also failed with "memory limit exceeded".

The query is:

select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
day<='2017-04-06';

And the response in the Impala shell is:

Query submitted at: 2017-04-06 00:41:00 (Coordinator:
http://szq7.appadhoc.com:25000)
Query progress can be monitored at:
http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000

WARNINGS:
Memory limit exceeded
GzipDecompressor failed to allocate 54525952000 bytes.

I have many nodes and each of them has plenty of memory available (~60 GB).
The query fails very quickly after I execute it, and the nodes show almost no
memory usage.

The table "adhoc_data_fast.log" is an AVRO table and is encoded with gzip
and is partitioned by the field "day". And each partition has no more than
one billion rows.

My Impala version is:

hdfs@szq7:/home/ubuntu$ impalad --version
impalad version 2.7.0-cdh5.9.1 RELEASE (build
24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
Built on Wed Jan 11 13:39:25 PST 2017

Can anyone help with this? Thanks very much!

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
And when I try to dump some partitions of this table into another table, it
says I need a lot of memory per node. I don't understand why so much memory is
needed just to read and write the data.

[szq7.appadhoc.com:21000] > insert overwrite adhoc_data_fast.tmplog
partition (day)
                         >       select `timestamp`, appid, clientid,
statkey, expid, modid, value, summary, custom, uploadtime, day
                         >       from adhoc_data_fast.log where day >=
"2017-04-03" and day <= "2017-04-06"
                         >     ;
Query: insert overwrite adhoc_data_fast.tmplog partition (day)
     select `timestamp`, appid, clientid, statkey, expid, modid, value,
summary, custom, uploadtime, day
     from adhoc_data_fast.log where day >= "2017-04-03" and day <=
"2017-04-06"
Query submitted at: 2017-04-06 01:18:14 (Coordinator:
http://szq7.appadhoc.com:25000)
ERROR: Rejected query from pool default-pool : request memory needed 188.85
GB per node is greater than process mem limit 128.00 GB.

Use the MEM_LIMIT query option to indicate how much memory is required per
node.
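
For reference, MEM_LIMIT can be set per session from impala-shell before
running the statement; a minimal sketch, where the 100gb value is only a
placeholder and not a recommendation:

[szq7.appadhoc.com:21000] > set mem_limit=100gb;
[szq7.appadhoc.com:21000] > insert overwrite adhoc_data_fast.tmplog partition (day)
                         >       select `timestamp`, appid, clientid, statkey, expid, modid,
                         >              value, summary, custom, uploadtime, day
                         >       from adhoc_data_fast.log
                         >       where day >= "2017-04-03" and day <= "2017-04-06";

This only raises the per-node limit the query is admitted with; it does not
create memory the hosts do not actually have.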


Bin Wang <wb...@gmail.com> wrote on Thu, Apr 6, 2017 at 1:14 AM:

> Will Impala load the whole file into memory? That sounds horrible. And with
> "show partitions adhoc_data_fast.log", the compressed files are no bigger
> than 4 GB:
>
> | 2017-04-04 | -1    | 46     | 2.69GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> | 2017-04-05 | -1    | 25     | 3.42GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>
>
> Marcel Kornacker <ma...@cloudera.com> wrote on Thu, Apr 6, 2017 at 12:58 AM:
>
> Apparently you have a gzipped file that is >=50GB. You either need to
> break up those files, or run on larger machines.
>

Re: Memory limit exceeded even with a very simple count query

Posted by Alex Behm <al...@cloudera.com>.
Our snappy decompressor for Avro is not streaming. We decompress one block
at a time.

Re: Memory limit exceeded even with a very simple count query

Posted by Alex Behm <al...@cloudera.com>.
Bin, for your original problem, have you confirmed that your files are
highly (10x or more) compressed? Maybe the original problem was due to
corrupt files, and they don't actually decompress to such a huge size.
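
One quick way to check the per-file sizes on disk is SHOW FILES (available in
Impala 2.2 and later); the partition value below is just an example:

show files in adhoc_data_fast.log partition (day='2017-04-05');

Comparing those sizes against the ~50 GB allocation in the error message should
show whether a single file really expands that much, or whether something like
a corrupt block length is driving the allocation.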

Re: Memory limit exceeded even with a very simple count query

Posted by Silvius Rus <sr...@cloudera.com>.
There are some known outstanding issues that reference the word "corrupt":

https://issues.apache.org/jira/issues/?jql=project%20%3D%20impala%20and%20resolution%20is%20empty%20and%20text%20~%20corrupt

Feel free to post a new JIRA if you believe you are running into a new bug.

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
Hi,

I've dug into it and found two files that trigger this problem. After
removing them from the partition, I can query it again. But these two files
are small both before and after compression (smaller than 400 MB). The only
problem is that they have some corrupt data at the end. Is this a bug in
Impala?

Regards,
Bin Wang

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
I converted these Avro files to JSON with avro-tools, and the JSON files are
no larger than 1 GB, so Impala should be able to read them. Some of the Avro
files are corrupt.

16M     log.2017-04-05.1491321605834.avro.json
308M    log.2017-04-05.1491323647211.avro.json
103M    log.2017-04-05.1491327241311.avro.json
150M    log.2017-04-05.1491330839609.avro.json
397M    log.2017-04-05.1491334439092.avro.json
297M    log.2017-04-05.1491338038503.avro.json
160M    log.2017-04-05.1491341639694.avro.json
95M     log.2017-04-05.1491345239969.avro.json
360M    log.2017-04-05.1491348843931.avro.json
338M    log.2017-04-05.1491352442955.avro.json
71M     log.2017-04-05.1491359648079.avro.json
161M    log.2017-04-05.1491363247597.avro.json
628M    log.2017-04-05.1491366845827.avro.json
288M    log.2017-04-05.1491370445873.avro.json
162M    log.2017-04-05.1491374045830.avro.json
90M     log.2017-04-05.1491377650935.avro.json
269M    log.2017-04-05.1491381249597.avro.json
620M    log.2017-04-05.1491384846366.avro.json
70M     log.2017-04-05.1491388450262.avro.json
30M     log.2017-04-05.1491392047694.avro.json
114M    log.2017-04-05.1491395648818.avro.json
370M    log.2017-04-05.1491399246407.avro.json
359M    log.2017-04-05.1491402846469.avro.json
218M    log.2017-04-05.1491406180615.avro.json
29M     log.2017-04-05.1491409790105.avro.json
3.9M    log.2017-04-05.1491413385884.avro.json
9.3M    log.2017-04-05.1491416981829.avro.json
8.3M    log.2017-04-05.1491420581588.avro.json
2.3M    log.2017-04-05.1491424180191.avro.json
25M     log.2017-04-05.1491427781339.avro.json
24M     log.2017-04-05.1491431382552.avro.json
5.7M    log.2017-04-05.1491434984679.avro.json
35M     log.2017-04-05.1491438586674.avro.json
5.8M    log.2017-04-05.1491442192541.avro.json
23M     log.2017-04-05.1491445789230.avro.json
4.3M    log.2017-04-05.1491449386630.avro.json
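
For reference, a conversion like the one above can be done with the avro-tools
jar roughly as follows (the jar path and version are just examples). getmeta
prints the codec recorded in each file header, and tojson will usually fail
partway through a file whose blocks are corrupt, which is one way to spot the
bad files:

java -jar avro-tools-1.8.1.jar getmeta log.2017-04-05.1491321605834.avro
java -jar avro-tools-1.8.1.jar tojson log.2017-04-05.1491321605834.avro \
    > log.2017-04-05.1491321605834.avro.json
du -h *.avro.json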

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
And here is another question: how does Impala estimate the uncompressed file
size? All the gzipped files are no bigger than 300 MB, so I think they should
be fine to decompress.

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
Is the snappy decompressor for Avro or Parquet streaming?

Re: Memory limit exceeded even with a very simple count query

Posted by Alex Behm <al...@cloudera.com>.
I'd say following the best practices with Parquet should work fine. Create
snappy-compressed Parquet files of roughly 256MB in size.
If you want to stick with Avro, then yes, you'll just have to create
smaller files.
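
A rough sketch of that conversion done directly in Impala (the new table name
is an example; COMPRESSION_CODEC already defaults to snappy, PARQUET_FILE_SIZE
caps the size of each written file, and partitioned CTAS needs Impala 2.5 or
later). Note that this still has to read the existing gzipped Avro files, so
the oversized or corrupt ones have to be dealt with first:

set compression_codec=snappy;
set parquet_file_size=256m;

create table adhoc_data_fast.log_parquet
  partitioned by (day)
  stored as parquet
as
  select `timestamp`, appid, clientid, statkey, expid, modid, value,
         summary, custom, uploadtime, day
  from adhoc_data_fast.log;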

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
So the best I can do to work around this for now is to split the files into
smaller files?

Re: Memory limit exceeded even with a very simple count query

Posted by Alex Behm <al...@cloudera.com>.
Parquet makes more sense particularly for that kind of query you have.

Still, you might want to be careful with creating huge gzipped files.
Impala's gzip decompressor for Parquet is also not streaming.

Re: Memory limit exceeded even with a very simple count query

Posted by Bin Wang <wb...@gmail.com>.
So as a workaround, does it make sense to convert it to a Parquet table with
Hive?

And I think it would be better to mention this in the Avro table documentation,
because it is unexpected behavior for many users.
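
A minimal sketch of the Hive route, under a couple of assumptions: the table
name is an example, a plain CTAS produces an unpartitioned copy, and on some
Hive versions the compression setting has to go into TBLPROPERTIES
('parquet.compression'='SNAPPY') rather than a session setting:

SET parquet.compression=SNAPPY;

CREATE TABLE adhoc_data_fast.log_parquet
STORED AS PARQUET
AS
SELECT `timestamp`, appid, clientid, statkey, expid, modid, value,
       summary, custom, uploadtime, day
FROM adhoc_data_fast.log
WHERE day >= '2017-04-03' AND day <= '2017-04-06';

Afterwards, INVALIDATE METADATA adhoc_data_fast.log_parquet in Impala makes the
new table visible there.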

Re: Memory limit exceed even with very simple count query

Posted by Alex Behm <al...@cloudera.com>.
Gzip supports streaming decompression, but we currently only implement that
for text tables.

Doing streaming decompression certainly makes sense for Avro as well.

I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
improvement.

On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <ma...@cloudera.com>
wrote:

> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wb...@gmail.com> wrote:
> > Will Impala load the whole file into memory? That sounds horrible. And
> > with "show partitions adhoc_data_fast.log", the compressed files are no
> > bigger than 4GB:
>
> The *uncompressed* size of one of your files is 50GB. Gzip needs to
> allocate memory for that.

Re: Memory limit exceed even with very simple count query

Posted by Marcel Kornacker <ma...@cloudera.com>.
On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wb...@gmail.com> wrote:
> Will Impala load the whole file into memory? That sounds horrible. And
> with "show partitions adhoc_data_fast.log", the compressed files are no
> bigger than 4GB:

The *uncompressed* size of one of your files is 50GB. Gzip needs to
allocate memory for that.
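
For reference, the failed allocation of 54,525,952,000 bytes works out to
roughly 50 GiB for a single file. A quick way to see which physical files sit
behind a partition, and how large each one is on disk, is Impala's SHOW FILES
statement, which this Impala version should support; it reports compressed
sizes, so the uncompressed footprint still has to be estimated from the
compression ratio. A sketch against one of the partitions listed below:

    -- Lists every data file in the partition with its on-disk (compressed) size.
    SHOW FILES IN adhoc_data_fast.log PARTITION (day='2017-04-04');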

>
> | 2017-04-04 | -1    | 46     | 2.69GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> | 2017-04-05 | -1    | 25     | 3.42GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>

Re: Memory limit exceed even with very simple count query

Posted by Bin Wang <wb...@gmail.com>.
Will Impala load the whole file into memory? That sounds horrible. And with
"show partitions adhoc_data_fast.log", the compressed files are no bigger
than 4GB:

| 2017-04-04 | -1    | 46     | 2.69GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
| 2017-04-05 | -1    | 25     | 3.42GB   | NOT CACHED   | NOT CACHED   | AVRO   | false             | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |


Marcel Kornacker <ma...@cloudera.com> wrote on Thu, Apr 6, 2017 at 12:58 AM:

> Apparently you have a gzipped file that is >=50GB. You either need to
> break up those files, or run on larger machines.

Re: Memory limit exceed even with very simple count query

Posted by Marcel Kornacker <ma...@cloudera.com>.
Apparently you have a gzipped file that is >=50GB. You either need to
break up those files, or run on larger machines.
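
As a rough sketch of the "break up those files" route, done in Hive so the
rewrite does not go through Impala's gzip decompressor: log_split below is a
hypothetical staging table with the same schema and storage format as
adhoc_data_fast.log, and 16 is an arbitrary target for roughly how many
smaller files to produce per partition; DISTRIBUTE BY forces a reduce stage
so that the reducer setting takes effect.

    -- Run in Hive; adjust the table name, day range, and file count as needed.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    -- Target number of reducers, and therefore output files, per job.
    SET mapreduce.job.reduces=16;
    INSERT OVERWRITE TABLE adhoc_data_fast.log_split PARTITION (day)
    SELECT * FROM adhoc_data_fast.log
    WHERE day >= '2017-04-04' AND day <= '2017-04-06'
    DISTRIBUTE BY day, floor(rand() * 16);

In production a stable, high-cardinality column would be a safer distribution
key than rand(); once the smaller files look right, the original partitions
can be swapped for the rewritten ones.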

On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <wb...@gmail.com> wrote:
> Hi,
>
> I'm using Impala on production for a while. But since yesterday, some
> queries reports memory limit exceeded. Then I try a very simple count query,
> it still have memory limit exceeded.
>
> The query is:
>
> select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> day<='2017-04-06';
>
> And the response in the Impala shell is:
>
> Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> http://szq7.appadhoc.com:25000)
> Query progress can be monitored at:
> http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> WARNINGS:
> Memory limit exceeded
> GzipDecompressor failed to allocate 54525952000 bytes.
>
> I have many nodes and each of them have lots of memory avaliable (~ 60 GB).
> And the query failed very fast after I execute it and the nodes have almost
> no memory usage.
>
> The table "adhoc_data_fast.log" is an AVRO table and is encoded with gzip
> and is partitioned by the field "day". And each partition has no more than
> one billion rows.
>
> My Impala version is:
>
> hdfs@szq7:/home/ubuntu$ impalad --version
> impalad version 2.7.0-cdh5.9.1 RELEASE (build
> 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> Built on Wed Jan 11 13:39:25 PST 2017
>
> Any one can help for this? Thanks very much!
>