Posted to user@impala.apache.org by Bin Wang <wb...@gmail.com> on 2017/04/05 16:52:47 UTC
Memory limit exceed even with very simple count query
Hi,
I've been using Impala in production for a while, but since yesterday some
queries have been failing with "memory limit exceeded". Even a very simple
count query hits the error.
The query is:
select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
day<='2017-04-06';
And the response in the Impala shell is:
Query submitted at: 2017-04-06 00:41:00 (Coordinator:
http://szq7.appadhoc.com:25000)
Query progress can be monitored at:
http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
WARNINGS:
Memory limit exceeded
GzipDecompressor failed to allocate 54525952000 bytes.
I have many nodes and each of them has plenty of memory available (~60 GB).
The query fails very quickly after I execute it, and the nodes show almost
no memory usage.
The table "adhoc_data_fast.log" is an Avro table, compressed with gzip and
partitioned by the field "day". Each partition has no more than one billion
rows.
My Impala version is:
hdfs@szq7:/home/ubuntu$ impalad --version
impalad version 2.7.0-cdh5.9.1 RELEASE (build
24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
Built on Wed Jan 11 13:39:25 PST 2017
Can anyone help with this? Thanks very much!
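For scale, the allocation that fails in the warning above works out to just under 51 GiB, which no single ~60 GB node can spare for one decompression buffer. A quick sanity check (the byte count is copied verbatim from the error message):

```python
# Size of the failed GzipDecompressor allocation, from the warning above.
failed_alloc = 54_525_952_000  # bytes

# Convert to GiB to see why a ~60 GB node cannot satisfy it.
print(f"{failed_alloc / 2**30:.2f} GiB")  # 50.78 GiB
```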
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
And when I try to dump some partitions of this table into another table, it
says I need a lot of memory per node. I don't understand why so much memory
is needed just to read and write?
[szq7.appadhoc.com:21000] > insert overwrite adhoc_data_fast.tmplog
partition (day)
> select `timestamp`, appid, clientid,
statkey, expid, modid, value, summary, custom, uploadtime, day
> from adhoc_data_fast.log where day >=
"2017-04-03" and day <= "2017-04-06"
> ;
Query: insert overwrite adhoc_data_fast.tmplog partition (day)
select `timestamp`, appid, clientid, statkey, expid, modid, value,
summary, custom, uploadtime, day
from adhoc_data_fast.log where day >= "2017-04-03" and day <=
"2017-04-06"
Query submitted at: 2017-04-06 01:18:14 (Coordinator:
http://szq7.appadhoc.com:25000)
ERROR: Rejected query from pool default-pool : request memory needed 188.85
GB per node is greater than process mem limit 128.00 GB.
Use the MEM_LIMIT query option to indicate how much memory is required per
node.
Bin Wang <wb...@gmail.com> wrote on Thu, Apr 6, 2017 at 1:14 AM:
> Will Impala load the whole file into memory? That sounds horrible. And
> with "show partitions adhoc_data_fast.log", the compressed files are no
> bigger than 4GB:
>
> | 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED
> | AVRO | false |
> hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
> | 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED
> | AVRO | false |
> hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
>
>
> Marcel Kornacker <ma...@cloudera.com> wrote on Thu, Apr 6, 2017 at 12:58 AM:
>
> Apparently you have a gzipped file that is >=50GB. You either need to
> break up those files, or run on larger machines.
Re: Memory limit exceed even with very simple count query
Posted by Alex Behm <al...@cloudera.com>.
Our snappy decompressor for Avro is not streaming. We decompress one block
at a time.
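The difference Alex describes can be sketched with Python's gzip module (an illustration of the general principle, not Impala's actual code): one-shot decompression must materialize the entire output at once, while a streaming reader keeps peak memory near the chunk size no matter how large the file inflates.

```python
import gzip
import io

# Highly compressible data: the compressed size tells you almost nothing
# about how much memory decompression will need.
raw = b"\x00" * (16 * 1024 * 1024)          # 16 MiB of zeros
compressed = gzip.compress(raw)
print(len(compressed), "compressed bytes")   # a few KiB

# Non-streaming: the full 16 MiB output buffer is allocated in one shot.
assert gzip.decompress(compressed) == raw

# Streaming: read 64 KiB at a time; peak memory stays near the chunk size.
total = 0
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
    while chunk := f.read(64 * 1024):
        total += len(chunk)
assert total == len(raw)
```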
On Wed, Apr 5, 2017 at 6:31 PM, Bin Wang <wb...@gmail.com> wrote:
> Is the snappy decompressor for Avro or Parquet streaming?
>
> Alex Behm <al...@cloudera.com> wrote on Thu, Apr 6, 2017 at 9:27 AM:
>
>> I'd say following the best practices with Parquet should work fine.
>> Create snappy-compressed Parquet files of roughly 256MB in size.
>> If you want to stick with Avro, then yes, you'll just have to create
>> smaller files.
>>
>> On Wed, Apr 5, 2017 at 6:23 PM, Bin Wang <wb...@gmail.com> wrote:
>>
>> So the best I can do to work around this for now is to split the files
>> into smaller ones?
>>
>> Alex Behm <al...@cloudera.com> wrote on Thu, Apr 6, 2017 at 9:14 AM:
>>
>> Parquet makes more sense particularly for that kind of query you have.
>>
>> Still, you might want to be careful with creating huge gzipped files.
>> Impala's gzip decompressor for Parquet is also not streaming.
>>
>> On Wed, Apr 5, 2017 at 6:09 PM, Bin Wang <wb...@gmail.com> wrote:
>>
>> So as a workaround, does it make sense to convert it to a Parquet table
>> with Hive?
>>
>> And I think it's better to mention this in the Avro table documentation,
>> because it is unexpected behavior for many users.
>>
>> Alex Behm <al...@cloudera.com> wrote on Thu, Apr 6, 2017 at 02:52:
>>
>> Gzip supports streaming decompression, but we currently only implement
>> that for text tables.
>>
>> Doing streaming decompression certainly makes sense for Avro as well.
>>
>> I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
>> improvement.
>>
>> On Wed, Apr 5, 2017 at 10:37 AM, Marcel Kornacker <ma...@cloudera.com>
>> wrote:
>>
>> On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wb...@gmail.com> wrote:
>> > Will Impala load the whole file into memory? That sounds horrible. And
>> > with "show partitions adhoc_data_fast.log", the compressed files are no
>> > bigger than 4GB:
>>
>> The *uncompressed* size of one of your files is 50GB. Gzip needs to
>> allocate memory for that.
Re: Memory limit exceed even with very simple count query
Posted by Alex Behm <al...@cloudera.com>.
Bin, for your original problem, have you confirmed that your files are
highly (10x or more) compressed? Maybe the original problem was due to
corrupt files, and they don't actually decompress to such a huge size.
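One plausible mechanism for the corrupt-file hypothesis (a hypothetical illustration, not Impala's or Avro's actual code): Avro stores per-block object counts and byte lengths as zigzag-encoded varints, so a handful of garbled bytes in a length field can decode to an enormous number that the reader then hands straight to the allocator.

```python
def zigzag_varint_encode(n: int) -> bytes:
    """Encode a signed integer the way Avro stores longs (zigzag + varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small magnitudes to small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def zigzag_varint_decode(buf: bytes) -> int:
    z = shift = 0
    for b in buf:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (z >> 1) ^ -(z & 1)

# A sane block length round-trips:
assert zigzag_varint_decode(zigzag_varint_encode(262_144)) == 262_144

# But a mere six garbled bytes decode to the exact size from the error
# message, and a non-streaming reader would try to allocate all of it:
assert zigzag_varint_decode(b"\x80\x80\x80\xa0\x96\x03") == 54_525_952_000
```

These bytes are crafted for the example; nothing here claims they are what was actually in the corrupt files.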
Re: Memory limit exceed even with very simple count query
Posted by Silvius Rus <sr...@cloudera.com>.
There are some known outstanding issues that reference the word "corrupt":
https://issues.apache.org/jira/issues/?jql=project%20%3D%20impala%20and%20resolution%20is%20empty%20and%20text%20~%20corrupt
Feel free to post a new JIRA if you believe you are running into a new bug.
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
Hi,
I've dig it and found there are two files that will trigger this problem.
After remove them from the partition, I can query on it. But these two
files are small both before and after compressed (smaller than 400 MB). The
only problem is they have some corrupt data at the end. Is it a bug in
Impala?
Regards,
Bin Wang
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
I converted these Avro files to JSON with avro-tools, and the JSON files are
no larger than 1GB, so Impala should be able to read them. Some of the Avro
files are corrupt.
16M log.2017-04-05.1491321605834.avro.json
308M log.2017-04-05.1491323647211.avro.json
103M log.2017-04-05.1491327241311.avro.json
150M log.2017-04-05.1491330839609.avro.json
397M log.2017-04-05.1491334439092.avro.json
297M log.2017-04-05.1491338038503.avro.json
160M log.2017-04-05.1491341639694.avro.json
95M log.2017-04-05.1491345239969.avro.json
360M log.2017-04-05.1491348843931.avro.json
338M log.2017-04-05.1491352442955.avro.json
71M log.2017-04-05.1491359648079.avro.json
161M log.2017-04-05.1491363247597.avro.json
628M log.2017-04-05.1491366845827.avro.json
288M log.2017-04-05.1491370445873.avro.json
162M log.2017-04-05.1491374045830.avro.json
90M log.2017-04-05.1491377650935.avro.json
269M log.2017-04-05.1491381249597.avro.json
620M log.2017-04-05.1491384846366.avro.json
70M log.2017-04-05.1491388450262.avro.json
30M log.2017-04-05.1491392047694.avro.json
114M log.2017-04-05.1491395648818.avro.json
370M log.2017-04-05.1491399246407.avro.json
359M log.2017-04-05.1491402846469.avro.json
218M log.2017-04-05.1491406180615.avro.json
29M log.2017-04-05.1491409790105.avro.json
3.9M log.2017-04-05.1491413385884.avro.json
9.3M log.2017-04-05.1491416981829.avro.json
8.3M log.2017-04-05.1491420581588.avro.json
2.3M log.2017-04-05.1491424180191.avro.json
25M log.2017-04-05.1491427781339.avro.json
24M log.2017-04-05.1491431382552.avro.json
5.7M log.2017-04-05.1491434984679.avro.json
35M log.2017-04-05.1491438586674.avro.json
5.8M log.2017-04-05.1491442192541.avro.json
23M log.2017-04-05.1491445789230.avro.json
4.3M log.2017-04-05.1491449386630.avro.json
Bin Wang <wb...@gmail.com> wrote on Thu, Apr 6, 2017 at 11:34 AM:
> And here is another question: how does Impala estimate the unzipped file
> size? All the gzipped files are no bigger than 300MB, so I think it should
> be OK to unzip them.
> >> > Hi,
> >> >
> >> > I'm using Impala on production for a while. But since yesterday, some
> >> > queries reports memory limit exceeded. Then I try a very simple count
> >> > query,
> >> > it still have memory limit exceeded.
> >> >
> >> > The query is:
> >> >
> >> > select count(0) from adhoc_data_fast.log where day>='2017-04-04' and
> >> > day<='2017-04-06';
> >> >
> >> > And the response in the Impala shell is:
> >> >
> >> > Query submitted at: 2017-04-06 00:41:00 (Coordinator:
> >> > http://szq7.appadhoc.com:25000)
> >> > Query progress can be monitored at:
> >> >
> >> >
> http://szq7.appadhoc.com:25000/query_plan?query_id=4947a3fecd146df4:734bcc1d00000000
> >> > WARNINGS:
> >> > Memory limit exceeded
> >> > GzipDecompressor failed to allocate 54525952000 bytes.
> >> >
> >> > I have many nodes and each of them have lots of memory avaliable (~ 60
> >> > GB).
> >> > And the query failed very fast after I execute it and the nodes have
> >> > almost
> >> > no memory usage.
> >> >
> >> > The table "adhoc_data_fast.log" is an AVRO table and is encoded with
> >> > gzip
> >> > and is partitioned by the field "day". And each partition has no more
> >> > than
> >> > one billion rows.
> >> >
> >> > My Impala version is:
> >> >
> >> > hdfs@szq7:/home/ubuntu$ impalad --version
> >> > impalad version 2.7.0-cdh5.9.1 RELEASE (build
> >> > 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
> >> > Built on Wed Jan 11 13:39:25 PST 2017
> >> >
> >> > Any one can help for this? Thanks very much!
> >> >
>
>
>
>
>
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
And here is another question: how does Impala estimate the unzipped file
size? All the gzipped files are no bigger than 300MB, so I would expect
unzipping them to be fine.
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
Is the Snappy decompressor for Avro or Parquet streaming?
Re: Memory limit exceed even with very simple count query
Posted by Alex Behm <al...@cloudera.com>.
I'd say following the best practices with Parquet should work fine. Create
snappy-compressed Parquet files of roughly 256MB in size.
If you want to stick with Avro, then yes, you'll just have to create
smaller files.
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
So the best I can do to work around this for now is to split the files into
smaller ones?
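Splitting line-oriented gzipped logs into smaller pieces can be done with a short script. A hedged sketch (the part-file naming and the size target are illustrative assumptions, not anything Impala requires; it caps each output file by *uncompressed* bytes, which is what matters for the decompressor's allocation):

```python
import gzip

def split_gzip_lines(src_path, dst_prefix, max_bytes=256 * 1024 * 1024):
    """Re-pack a line-oriented .gz file into several smaller .gz files,
    each holding at most roughly max_bytes of uncompressed data.
    Returns the number of part files written."""
    part, written, out = 0, 0, None
    with gzip.open(src_path, "rb") as src:
        for line in src:
            # rotate to a new output file before exceeding the cap
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                out = gzip.open(f"{dst_prefix}.{part:04d}.gz", "wb")
                part += 1
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return part
```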
Re: Memory limit exceed even with very simple count query
Posted by Alex Behm <al...@cloudera.com>.
Parquet makes more sense, particularly for the kind of query you have.
Still, you might want to be careful with creating huge gzipped files.
Impala's gzip decompressor for Parquet is also not streaming.
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
So as a workaround, does it make sense to convert it to a Parquet table
with Hive?
And I think this should be mentioned in the Avro table documentation,
because it is unexpected behavior for many users.
Re: Memory limit exceed even with very simple count query
Posted by Alex Behm <al...@cloudera.com>.
Gzip supports streaming decompression, but we currently only implement that
for text tables.
Doing streaming decompression certainly makes sense for Avro as well.
I filed https://issues.apache.org/jira/browse/IMPALA-5170 for this
improvement.
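The streaming vs. whole-buffer distinction can be illustrated with Python's zlib: a streaming decompressor consumes compressed input in chunks and emits output incrementally, so memory stays bounded by the chunk size rather than by the full uncompressed size. A sketch of the concept only, not Impala's implementation:

```python
import gzip
import zlib

def stream_decompress(chunks, wbits=zlib.MAX_WBITS | 16):
    """Decompress a gzip stream incrementally. wbits = MAX_WBITS | 16
    tells zlib to expect a gzip wrapper. Memory use is bounded by the
    chunk size, not by the total uncompressed size."""
    d = zlib.decompressobj(wbits)
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail

# feed compressed bytes in fixed-size pieces
blob = gzip.compress(b"abc" * 100_000)
pieces = (blob[i:i + 4096] for i in range(0, len(blob), 4096))
total = sum(len(part) for part in stream_decompress(pieces))
print(total)  # -> 300000
```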
Re: Memory limit exceed even with very simple count query
Posted by Marcel Kornacker <ma...@cloudera.com>.
On Wed, Apr 5, 2017 at 10:14 AM, Bin Wang <wb...@gmail.com> wrote:
> Will Impala load the whole file into memory? That sounds horrible. And
> with "show partitions adhoc_data_fast.log", the compressed files are no
> bigger than 4GB:
The *uncompressed* size of one of your files is 50GB. Gzip needs to
allocate memory for that.
Re: Memory limit exceed even with very simple count query
Posted by Bin Wang <wb...@gmail.com>.
Will Impala load the whole file into memory? That sounds horrible. And
according to "show partitions adhoc_data_fast.log", the compressed files
are no bigger than 4GB:
| 2017-04-04 | -1 | 46 | 2.69GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-04 |
| 2017-04-05 | -1 | 25 | 3.42GB | NOT CACHED | NOT CACHED | AVRO | false | hdfs://hfds-service/user/hive/warehouse/adhoc_data_fast.db/log/2017-04-05 |
Marcel Kornacker <ma...@cloudera.com> wrote on Thu, Apr 6, 2017 at 12:58 AM:
> Apparently you have a gzipped file that is >=50GB. You either need to
> break up those files, or run on larger machines.
Re: Memory limit exceed even with very simple count query
Posted by Marcel Kornacker <ma...@cloudera.com>.
Apparently you have a gzipped file that is >=50GB. You either need to
break up those files, or run on larger machines.
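[Editorial note: "break up those files" can be sketched as rewriting one large .gz file as several smaller, independently decompressible gzip files, splitting on row boundaries; file names and the chunk size below are hypothetical. In practice you could also simply INSERT ... SELECT the partition through Hive/Impala so the data is rewritten as more, smaller files:]

```python
import gzip
import os
import tempfile

def split_gzip(src_path, out_dir, chunk_bytes):
    """Rewrite one large .gz file as several smaller .gz files, splitting
    on line boundaries so each piece remains a valid row-oriented file."""
    os.makedirs(out_dir, exist_ok=True)
    parts, buf, size, part = [], [], 0, 0
    with gzip.open(src_path, "rb") as src:
        for line in src:
            buf.append(line)
            size += len(line)
            if size >= chunk_bytes:
                parts.append(_flush(out_dir, part, buf))
                buf, size, part = [], 0, part + 1
    if buf:
        parts.append(_flush(out_dir, part, buf))
    return parts

def _flush(out_dir, part, lines):
    path = os.path.join(out_dir, "part-%05d.gz" % part)
    with gzip.open(path, "wb") as out:
        out.writelines(lines)
    return path

# Demo: split a 1000-row file into pieces of at most ~4 KB of raw text each.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big.gz")
with gzip.open(src, "wb") as f:
    f.writelines(b"row %d\n" % i for i in range(1000))
parts = split_gzip(src, os.path.join(tmp, "out"), 4096)
merged = b"".join(gzip.open(p, "rb").read() for p in parts)
print(len(parts) > 1, merged == gzip.open(src, "rb").read())  # True True
```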
On Wed, Apr 5, 2017 at 9:52 AM, Bin Wang <wb...@gmail.com> wrote: