You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/06/15 06:23:45 UTC
OutOfMemory when doing map-side join
Hi all,
we had a query joining two tables, one of which had about 1 billions pieces
of records while the other had less than 20k. below is our query:
set hive.mapjoin.cache.numrows=20000;
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join
web_log w where and w.url rlike a.url_pattern
here is the stack trace
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1809)
at java.io.ObjectOutputStream.write(ObjectOutputStream.java:681)
at org.apache.hadoop.io.Text.write(Text.java:282)
at org.apache.hadoop.hive.ql.exec.MapJoinObjectValue.writeExternal(MapJoinObjectValue.java:126)
at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at org.apache.hadoop.hive.ql.util.jdbm.htree.HashBucket.writeExternal(HashBucket.java:292)
at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at org.apache.hadoop.hive.ql.util.jdbm.helper.Serialization.serialize(Serialization.java:93)
at org.apache.hadoop.hive.ql.util.jdbm.helper.DefaultSerializer.serialize(DefaultSerializer.java:101)
at org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.insert(BaseRecordManager.java:242)
at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:176)
at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:159)
at org.apache.hadoop.hive.ql.util.jdbm.htree.HashDirectory.put(HashDirectory.java:249)
at org.apache.hadoop.hive.ql.util.jdbm.htree.HTree.put(HTree.java:147)
at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:305)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:76)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:121)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
java.lang.OutOfMemoryError: GC overhead limit exceeded at
java.util.Arrays.copyOfRange(Arrays.java:3209) at
java.lang.String.(String.java:216) at
java.lang.StringBuffer.toString(StringBuffer.java:585) at
org.apache.log4j.PatternLayout.format(PatternLayout.java:505) at
org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302) at
org.apache.log4j.WriterAppender.append(WriterAppender.java:160) at
org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:55) at
org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206) at
org.apache.log4j.Category.forcedLog(Category.java:391) at
org.apache.log4j.Category.log(Category.java:856) at
org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:133) at
org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:387) at
org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:173) at
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at
org.apache.hadoop.mapred.Child.main(Child.java:158)
Actually, we used to do the same thing by loading small tables into memory
of each map node in normal map-reduce job. OOM exceptions never happened in
that only 1MB would be spent to load those 20k pieces of records. Can you
give me an explanation?
Thanks,
Min
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
I even didn't know what your meaning about those 100k you two refered .
On Thu, Jun 18, 2009 at 9:32 AM, Ashish Thusoo <at...@facebook.com> wrote:
> my bad. yes it is 100K. That sounds a bit too much as well.
>
> Ashish
>
> ------------------------------
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Wednesday, June 17, 2009 6:25 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: OutOfMemory when doing map-side join
>
> what's your 100kb standed for?
>
> On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <aa...@cloudera.com> wrote:
>
>> hmm, that is a 100KB per my math.
>>
>> 20K * 100K = 2GB
>>
>> -- amr
>>
>>
>> Ashish Thusoo wrote:
>>
>> That does not sound right. Each row is 100MB - that sounds too much...
>>
>> Ashish
>>
>> ------------------------------
>> *From:* Min Zhou [mailto:coderplay@gmail.com <co...@gmail.com>]
>> *Sent:* Monday, June 15, 2009 7:16 PM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: OutOfMemory when doing map-side join
>>
>> 20k rows need 2G memory? so terrible. The whole small table of mine is
>> less than 4MB, what about yours?
>>
>> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com> wrote:
>>
>>> Set mapred.child.java.opts to increase mapper memory.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Namit Jain [mailto:njain@facebook.com]
>>> *Sent:* Monday, June 15, 2009 3:53 PM
>>> *To:* hive-user@hadoop.apache.org
>>> *Subject:* RE: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> There are multiple things going on.
>>>
>>>
>>>
>>> Column pruning is not working with map-joins. It is being tracked at:
>>>
>>>
>>>
>>> https://issues.apache.org/jira/browse/HIVE-560
>>>
>>>
>>>
>>>
>>>
>>> Also, since it is a Cartesian product, jdbm does not help - because a
>>> single key can be very large.
>>>
>>>
>>>
>>>
>>>
>>> For now, you can do the column pruning yourself – create a new table with
>>> only the columns needed and then
>>>
>>> join with the bigger table.
>>>
>>>
>>>
>>> You may still need to increase the mapper memory - I was able to load
>>> about 20k rows with about 2G mapper.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Min Zhou [mailto:coderplay@gmail.com]
>>> *Sent:* Sunday, June 14, 2009 11:02 PM
>>> *To:* hive-user@hadoop.apache.org
>>> *Subject:* Re: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> btw, that small table 'application' has only one partition right now,
>>> 20k rows.
>>>
>>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com> wrote:
>>>
>>> failed with null pointer exception.
>>> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
>>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>>> FAILED: Unknown exception : null
>>>
>>>
>>> $cat /tmp/hive/hive.log | tail...
>>>
>>> 2009-06-15 13:57:02,933 ERROR ql.Driver
>>> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>>> java.lang.NullPointerException
>>> at
>>> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>> at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>>> at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>>> at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>>> at
>>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>>
>>>
>>>
>>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> The problem seems to be in partition pruning – The small table
>>> ‘application’ is partitioned – and probably, there are 20k rows in the
>>> partition
>>> 20090609.
>>>
>>> Due to a bug, the pruning is not happening, and all partitions of
>>> ‘application’ are being loaded – which may be too much for map-join to
>>> handle.
>>> This is a serious bug, but for now can you put in a subquery and try -
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
>>> from application x where x.dt = ‘20090609’) a join web_log w where
>>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>>
>>>
>>> Please file a JIRA for the above.
>>>
>>>
>>>
>>>
>>>
>>> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>>> application a join web_log w where w.logdate='20090611' and w.url rlike
>>> a.url_pattern and a.dt='20090609';
>>> OK
>>> ABSTRACT SYNTAX TREE:
>>> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>>> '20090609')))))
>>>
>>> STAGE DEPENDENCIES:
>>> Stage-1 is a root stage
>>> Stage-2 depends on stages: Stage-1
>>> Stage-0 is a root stage
>>>
>>> STAGE PLANS:
>>> Stage: Stage-1
>>> Map Reduce
>>> Alias -> Map Operator Tree:
>>> w
>>> Select Operator
>>> expressions:
>>> expr: url
>>> type: string
>>> expr: logdate
>>> type: string
>>> Common Join Operator
>>> condition map:
>>> Inner Join 0 to 1
>>> condition expressions:
>>> 0 {0} {1}
>>> 1 {0} {1}
>>> keys:
>>> 0
>>> 1
>>> Position of Big Table: 1
>>> File Output Operator
>>> compressed: false
>>> GlobalTableId: 0
>>> table:
>>> input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>> output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>> Local Work:
>>> Map Reduce Local Work
>>> Alias -> Map Local Tables:
>>> a
>>> Fetch Operator
>>> limit: -1
>>> Alias -> Map Local Operator Tree:
>>> a
>>> Select Operator
>>> expressions:
>>> expr: url_pattern
>>> type: string
>>> expr: dt
>>> type: string
>>> Common Join Operator
>>> condition map:
>>> Inner Join 0 to 1
>>> condition expressions:
>>> 0 {0} {1}
>>> 1 {0} {1}
>>> keys:
>>> 0
>>> 1
>>> Position of Big Table: 1
>>> File Output Operator
>>> compressed: false
>>> GlobalTableId: 0
>>> table:
>>> input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>> output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>> Stage: Stage-2
>>> Map Reduce
>>> Alias -> Map Operator Tree:
>>> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>>> Select Operator
>>> Filter Operator
>>> predicate:
>>> expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>>> '20090609'))
>>> type: boolean
>>> Select Operator
>>> expressions:
>>> expr: 0
>>> type: string
>>> expr: 2
>>> type: string
>>> File Output Operator
>>> compressed: true
>>> GlobalTableId: 0
>>> table:
>>> input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>> output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>
>>> Stage: Stage-0
>>> Fetch Operator
>>> limit: -1
>>>
>>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> I was looking at the code – and there may be a bug in cartesian product
>>> codepath for map-join.
>>>
>>> Can you do a explain plan and send it ?
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>>
>>> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
>>> same exception.
>>> 2. Actually, we used to do the same thing by loading small tables into
>>> memory of each map node in normal map-reduce with the same cluster, where
>>> same heap size is guranteed between running hive map-side join and our
>>> map-reduce job. OOM exceptions never happened in that only 1MB would be
>>> spent to load those 20k pieces of records while mapred.child.java.opts was
>>> set to be -Xmx200m.
>>>
>>> here is the schema of our small table:
>>> > describe application;
>>> transaction_id string
>>> subclass_id string
>>> class_id string
>>> memo string
>>> url_alias string
>>> url_pattern string
>>> dt string (daily partitioned)
>>>
>>> Thanks,
>>> Min
>>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> 1. Can you reduce the number of cached rows and try ?
>>>
>>> 2. Were you using default memory settings of the mapper ? If yes, can can
>>> increase it and try ?
>>>
>>> It would be useful to try both of them independently – it would give a
>>> good idea of memory consumption of JDBM also.
>>>
>>>
>>> Can you send the exact schema/data of the small table if possible ? You
>>> can file a jira and load it there if it not a security issue.
>>>
>>> Thanks,
>>> -namit
>>>
>>>
>>>
>>> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>> 20k
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>>
>>>
>>>
>>> --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
RE: OutOfMemory when doing map-side join
Posted by Ashish Thusoo <at...@facebook.com>.
my bad. yes it is 100K. That sounds a bit too much as well.
Ashish
________________________________
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Wednesday, June 17, 2009 6:25 PM
To: hive-user@hadoop.apache.org
Subject: Re: OutOfMemory when doing map-side join
what's your 100kb standed for?
On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <aa...@cloudera.com>> wrote:
hmm, that is a 100KB per my math.
20K * 100K = 2GB
-- amr
Ashish Thusoo wrote:
That does not sound right. Each row is 100MB - that sounds too much...
Ashish
________________________________
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Monday, June 15, 2009 7:16 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: OutOfMemory when doing map-side join
20k rows need 2G memory? so terrible. The whole small table of mine is less than 4MB, what about yours?
On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com>> wrote:
Set mapred.child.java.opts to increase mapper memory.
From: Namit Jain [mailto:njain@facebook.com<ma...@facebook.com>]
Sent: Monday, June 15, 2009 3:53 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: OutOfMemory when doing map-side join
There are multiple things going on.
Column pruning is not working with map-joins. It is being tracked at:
https://issues.apache.org/jira/browse/HIVE-560
Also, since it is a Cartesian product, jdbm does not help - because a single key can be very large.
For now, you can do the column pruning yourself - create a new table with only the columns needed and then
join with the bigger table.
You may still need to increase the mapper memory - I was able to load about 20k rows with about 2G mapper.
From: Min Zhou [mailto:coderplay@gmail.com<ma...@gmail.com>]
Sent: Sunday, June 14, 2009 11:02 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: OutOfMemory when doing map-side join
btw, that small table 'application' has only one partition right now, 20k rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com>> wrote:
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com>> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com>> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com>> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com>> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com>> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com>> wrote:
20k
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
what's your 100kb standed for?
On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <aa...@cloudera.com> wrote:
> hmm, that is a 100KB per my math.
>
> 20K * 100K = 2GB
>
> -- amr
>
>
> Ashish Thusoo wrote:
>
> That does not sound right. Each row is 100MB - that sounds too much...
>
> Ashish
>
> ------------------------------
> *From:* Min Zhou [mailto:coderplay@gmail.com <co...@gmail.com>]
> *Sent:* Monday, June 15, 2009 7:16 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: OutOfMemory when doing map-side join
>
> 20k rows need 2G memory? so terrible. The whole small table of mine is
> less than 4MB, what about yours?
>
> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com> wrote:
>
>> Set mapred.child.java.opts to increase mapper memory.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Namit Jain [mailto:njain@facebook.com]
>> *Sent:* Monday, June 15, 2009 3:53 PM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* RE: OutOfMemory when doing map-side join
>>
>>
>>
>> There are multiple things going on.
>>
>>
>>
>> Column pruning is not working with map-joins. It is being tracked at:
>>
>>
>>
>> https://issues.apache.org/jira/browse/HIVE-560
>>
>>
>>
>>
>>
>> Also, since it is a Cartesian product, jdbm does not help - because a
>> single key can be very large.
>>
>>
>>
>>
>>
>> For now, you can do the column pruning yourself – create a new table with
>> only the columns needed and then
>>
>> join with the bigger table.
>>
>>
>>
>> You may still need to increase the mapper memory - I was able to load
>> about 20k rows with about 2G mapper.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Min Zhou [mailto:coderplay@gmail.com]
>> *Sent:* Sunday, June 14, 2009 11:02 PM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: OutOfMemory when doing map-side join
>>
>>
>>
>> btw, that small table 'application' has only one partition right now, 20k
>> rows.
>>
>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com> wrote:
>>
>> failed with null pointer exception.
>> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>> FAILED: Unknown exception : null
>>
>>
>> $cat /tmp/hive/hive.log | tail...
>>
>> 2009-06-15 13:57:02,933 ERROR ql.Driver
>> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>> java.lang.NullPointerException
>> at
>> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>> at
>> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>> at
>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>>
>>
>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> The problem seems to be in partition pruning – The small table
>> ‘application’ is partitioned – and probably, there are 20k rows in the
>> partition
>> 20090609.
>>
>> Due to a bug, the pruning is not happening, and all partitions of
>> ‘application’ are being loaded – which may be too much for map-join to
>> handle.
>> This is a serious bug, but for now can you put in a subquery and try -
>>
>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
>> from application x where x.dt = ‘20090609’) a join web_log w where
>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>
>>
>> Please file a JIRA for the above.
>>
>>
>>
>>
>>
>> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>> application a join web_log w where w.logdate='20090611' and w.url rlike
>> a.url_pattern and a.dt='20090609';
>> OK
>> ABSTRACT SYNTAX TREE:
>> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>> '20090609')))))
>>
>> STAGE DEPENDENCIES:
>> Stage-1 is a root stage
>> Stage-2 depends on stages: Stage-1
>> Stage-0 is a root stage
>>
>> STAGE PLANS:
>> Stage: Stage-1
>> Map Reduce
>> Alias -> Map Operator Tree:
>> w
>> Select Operator
>> expressions:
>> expr: url
>> type: string
>> expr: logdate
>> type: string
>> Common Join Operator
>> condition map:
>> Inner Join 0 to 1
>> condition expressions:
>> 0 {0} {1}
>> 1 {0} {1}
>> keys:
>> 0
>> 1
>> Position of Big Table: 1
>> File Output Operator
>> compressed: false
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> Local Work:
>> Map Reduce Local Work
>> Alias -> Map Local Tables:
>> a
>> Fetch Operator
>> limit: -1
>> Alias -> Map Local Operator Tree:
>> a
>> Select Operator
>> expressions:
>> expr: url_pattern
>> type: string
>> expr: dt
>> type: string
>> Common Join Operator
>> condition map:
>> Inner Join 0 to 1
>> condition expressions:
>> 0 {0} {1}
>> 1 {0} {1}
>> keys:
>> 0
>> 1
>> Position of Big Table: 1
>> File Output Operator
>> compressed: false
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>
>> Stage: Stage-2
>> Map Reduce
>> Alias -> Map Operator Tree:
>> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>> Select Operator
>> Filter Operator
>> predicate:
>> expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>> '20090609'))
>> type: boolean
>> Select Operator
>> expressions:
>> expr: 0
>> type: string
>> expr: 2
>> type: string
>> File Output Operator
>> compressed: true
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.TextInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>
>> Stage: Stage-0
>> Fetch Operator
>> limit: -1
>>
>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> I was looking at the code – and there may be a bug in cartesian product
>> codepath for map-join.
>>
>> Can you do a explain plan and send it ?
>>
>>
>>
>>
>>
>>
>> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>>
>> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
>> same exception.
>> 2. Actually, we used to do the same thing by loading small tables into
>> memory of each map node in normal map-reduce with the same cluster, where
>> same heap size is guranteed between running hive map-side join and our
>> map-reduce job. OOM exceptions never happened in that only 1MB would be
>> spent to load those 20k pieces of records while mapred.child.java.opts was
>> set to be -Xmx200m.
>>
>> here is the schema of our small table:
>> > describe application;
>> transaction_id string
>> subclass_id string
>> class_id string
>> memo string
>> url_alias string
>> url_pattern string
>> dt string (daily partitioned)
>>
>> Thanks,
>> Min
>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> 1. Can you reduce the number of cached rows and try ?
>>
>> 2. Were you using default memory settings of the mapper ? If yes, can can
>> increase it and try ?
>>
>> It would be useful to try both of them independently – it would give a
>> good idea of memory consumption of JDBM also.
>>
>>
>> Can you send the exact schema/data of the small table if possible ? You
>> can file a jira and load it there if it not a security issue.
>>
>> Thanks,
>> -namit
>>
>>
>>
>> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>> 20k
>>
>>
>>
>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Amr Awadallah <aa...@cloudera.com>.
hmm, that is a 100KB per my math.
20K * 100K = 2GB
-- amr
Ashish Thusoo wrote:
> That does not sound right. Each row is 100MB - that sounds too much...
>
> Ashish
>
> ------------------------------------------------------------------------
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Monday, June 15, 2009 7:16 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: OutOfMemory when doing map-side join
>
> 20k rows need 2G memory? so terrible. The whole small table of mine
> is less than 4MB, what about yours?
>
> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <njain@facebook.com
> <ma...@facebook.com>> wrote:
>
> Set mapred.child.java.opts to increase mapper memory.
>
>
>
>
>
>
>
>
>
> *From:* Namit Jain [mailto:njain@facebook.com
> <ma...@facebook.com>]
> *Sent:* Monday, June 15, 2009 3:53 PM
>
> *To:* hive-user@hadoop.apache.org <ma...@hadoop.apache.org>
> *Subject:* RE: OutOfMemory when doing map-side join
>
>
>
> There are multiple things going on.
>
>
>
> Column pruning is not working with map-joins. It is being tracked at:
>
>
>
> https://issues.apache.org/jira/browse/HIVE-560
>
>
>
>
>
> Also, since it is a Cartesian product, jdbm does not help -
> because a single key can be very large.
>
>
>
>
>
> For now, you can do the column pruning yourself -- create a new
> table with only the columns needed and then
>
> join with the bigger table.
>
>
>
> You may still need to increase the mapper memory - I was able to
> load about 20k rows with about 2G mapper.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Min Zhou [mailto:coderplay@gmail.com
> <ma...@gmail.com>]
> *Sent:* Sunday, June 14, 2009 11:02 PM
> *To:* hive-user@hadoop.apache.org <ma...@hadoop.apache.org>
> *Subject:* Re: OutOfMemory when doing map-side join
>
>
>
> btw, that small table 'application' has only one partition right
> now, 20k rows.
>
> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <coderplay@gmail.com
> <ma...@gmail.com>> wrote:
>
> failed with null pointer exception.
> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
> x.url_pattern from application x where x.dt = '20090609') a join
> web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
> FAILED: Unknown exception : null
>
>
> $cat /tmp/hive/hive.log | tail...
>
> 2009-06-15 13:57:02,933 ERROR ql.Driver
> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
> at
> org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
>
>
> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <njain@facebook.com
> <ma...@facebook.com>> wrote:
>
> The problem seems to be in partition pruning -- The small table
> 'application' is partitioned -- and probably, there are 20k rows
> in the partition
> 20090609.
>
> Due to a bug, the pruning is not happening, and all partitions of
> 'application' are being loaded -- which may be too much for
> map-join to handle.
> This is a serious bug, but for now can you put in a subquery and try -
>
> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
> x.url_pattern from application x where x.dt = '20090609') a join
> web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
>
>
> Please file a JIRA for the above.
>
>
>
>
>
> On 6/14/09 10:20 PM, "Min Zhou" <coderplay@gmail.com
> <ma...@gmail.com>> wrote:
>
> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url
> from application a join web_log w where w.logdate='20090611'
> and w.url rlike a.url_pattern and a.dt='20090609';
> OK
> ABSTRACT SYNTAX TREE:
> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a)
> (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
> TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN
> (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a)
> url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url)))
> (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate)
> '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a)
> dt) '20090609')))))
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> w
> Select Operator
> expressions:
> expr: url
> type: string
> expr: logdate
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Local Work:
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> Select Operator
> expressions:
> expr: url_pattern
> type: string
> expr: dt
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Stage: Stage-2
> Map Reduce
> Alias -> Map Operator Tree:
>
> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
> Select Operator
> Filter Operator
> predicate:
> expr: (((3 = '20090611') and (2 regexp 0))
> and (1 = '20090609'))
> type: boolean
> Select Operator
> expressions:
> expr: 0
> type: string
> expr: 2
> type: string
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain
> <njain@facebook.com <ma...@facebook.com>> wrote:
>
> I was looking at the code -- and there may be a bug in
> cartesian product codepath for map-join.
>
> Can you do a explain plan and send it ?
>
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <coderplay@gmail.com
> <ma...@gmail.com>> wrote:
>
>
> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed
> with the same exception.
> 2. Actually, we used to do the same thing by loading small
> tables into memory of each map node in normal map-reduce with
> the same cluster, where same heap size is guranteed between
> running hive map-side join and our map-reduce job. OOM
> exceptions never happened in that only 1MB would be spent to
> load those 20k pieces of records while mapred.child.java.opts
> was set to be -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id string
> subclass_id string
> class_id string
> memo string
> url_alias string
> url_pattern string
> dt string (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain
> <njain@facebook.com <ma...@facebook.com>> wrote:
>
> 1. Can you reduce the number of cached rows and try ?
>
> 2. Were you using default memory settings of the mapper ? If
> yes, can can increase it and try ?
>
> It would be useful to try both of them independently -- it
> would give a good idea of memory consumption of JDBM also.
>
>
> Can you send the exact schema/data of the small table if
> possible ? You can file a jira and load it there if it not a
> security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <coderplay@gmail.com
> <ma...@gmail.com>> wrote:
>
> 20k
>
>
>
>
>
>
>
> --
> My research interests are distributed systems, parallel computing
> and bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
>
>
>
> --
> My research interests are distributed systems, parallel computing
> and bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
RE: OutOfMemory when doing map-side join
Posted by Ashish Thusoo <at...@facebook.com>.
That does not sound right. Each row is 100MB - that sounds too much...
Ashish
________________________________
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Monday, June 15, 2009 7:16 PM
To: hive-user@hadoop.apache.org
Subject: Re: OutOfMemory when doing map-side join
20k rows need 2G memory? so terrible. The whole small table of mine is less than 4MB, what about yours?
On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com>> wrote:
Set mapred.child.java.opts to increase mapper memory.
From: Namit Jain [mailto:njain@facebook.com<ma...@facebook.com>]
Sent: Monday, June 15, 2009 3:53 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: OutOfMemory when doing map-side join
There are multiple things going on.
Column pruning is not working with map-joins. It is being tracked at:
https://issues.apache.org/jira/browse/HIVE-560
Also, since it is a Cartesian product, jdbm does not help - because a single key can be very large.
For now, you can do the column pruning yourself - create a new table with only the columns needed and then
join with the bigger table.
You may still need to increase the mapper memory - I was able to load about 20k rows with about 2G mapper.
From: Min Zhou [mailto:coderplay@gmail.com<ma...@gmail.com>]
Sent: Sunday, June 14, 2009 11:02 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: OutOfMemory when doing map-side join
btw, that small table 'application' has only one partition right now, 20k rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com>> wrote:
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com>> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com>> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com>> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com>> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com>> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com>> wrote:
20k
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
20k rows need 2G memory? so terrible. The whole small table of mine is
less than 4MB, what about yours?
On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com> wrote:
> Set mapred.child.java.opts to increase mapper memory.
>
>
>
>
>
>
>
>
>
> *From:* Namit Jain [mailto:njain@facebook.com]
> *Sent:* Monday, June 15, 2009 3:53 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* RE: OutOfMemory when doing map-side join
>
>
>
> There are multiple things going on.
>
>
>
> Column pruning is not working with map-joins. It is being tracked at:
>
>
>
> https://issues.apache.org/jira/browse/HIVE-560
>
>
>
>
>
> Also, since it is a Cartesian product, jdbm does not help - because a
> single key can be very large.
>
>
>
>
>
> For now, you can do the column pruning yourself – create a new table with
> only the columns needed and then
>
> join with the bigger table.
>
>
>
> You may still need to increase the mapper memory - I was able to load
> about 20k rows with about 2G mapper.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Sunday, June 14, 2009 11:02 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: OutOfMemory when doing map-side join
>
>
>
> btw, that small table 'application' has only one partition right now, 20k
> rows.
>
> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com> wrote:
>
> failed with null pointer exception.
> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
> x.url_pattern from application x where x.dt = '20090609') a join web_log w
> where w.logdate='20090611' and w.url rlike a.url_pattern;
> FAILED: Unknown exception : null
>
>
> $cat /tmp/hive/hive.log | tail...
>
> 2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279))
> - FAILED: Unknown exception : null
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
>
>
> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
>
> The problem seems to be in partition pruning – The small table
> ‘application’ is partitioned – and probably, there are 20k rows in the
> partition
> 20090609.
>
> Due to a bug, the pruning is not happening, and all partitions of
> ‘application’ are being loaded – which may be too much for map-join to
> handle.
> This is a serious bug, but for now can you put in a subquery and try -
>
> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
> from application x where x.dt = ‘20090609’) a join web_log w where
> w.logdate='20090611' and w.url rlike a.url_pattern;
>
>
> Please file a JIRA for the above.
>
>
>
>
>
> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
> application a join web_log w where w.logdate='20090611' and w.url rlike
> a.url_pattern and a.dt='20090609';
> OK
> ABSTRACT SYNTAX TREE:
> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
> '20090609')))))
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> w
> Select Operator
> expressions:
> expr: url
> type: string
> expr: logdate
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Local Work:
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> Select Operator
> expressions:
> expr: url_pattern
> type: string
> expr: dt
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Stage: Stage-2
> Map Reduce
> Alias -> Map Operator Tree:
> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
> Select Operator
> Filter Operator
> predicate:
> expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
> '20090609'))
> type: boolean
> Select Operator
> expressions:
> expr: 0
> type: string
> expr: 2
> type: string
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>
> I was looking at the code – and there may be a bug in cartesian product
> codepath for map-join.
>
> Can you do a explain plan and send it ?
>
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>
>
> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into
> memory of each map node in normal map-reduce with the same cluster, where
> same heap size is guranteed between running hive map-side join and our
> map-reduce job. OOM exceptions never happened in that only 1MB would be
> spent to load those 20k pieces of records while mapred.child.java.opts was
> set to be -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id string
> subclass_id string
> class_id string
> memo string
> url_alias string
> url_pattern string
> dt string (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>
> 1. Can you reduce the number of cached rows and try ?
>
> 2. Were you using default memory settings of the mapper ? If yes, can can
> increase it and try ?
>
> It would be useful to try both of them independently – it would give a good
> idea of memory consumption of JDBM also.
>
>
> Can you send the exact schema/data of the small table if possible ? You can
> file a jira and load it there if it not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
>
>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
RE: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
Set mapred.child.java.opts to increase mapper memory.
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, June 15, 2009 3:53 PM
To: hive-user@hadoop.apache.org
Subject: RE: OutOfMemory when doing map-side join
There are multiple things going on.
Column pruning is not working with map-joins. It is being tracked at:
https://issues.apache.org/jira/browse/HIVE-560
Also, since it is a Cartesian product, jdbm does not help - because a single key can be very large.
For now, you can do the column pruning yourself - create a new table with only the columns needed and then
join with the bigger table.
You may still need to increase the mapper memory - I was able to load about 20k rows with about 2G mapper.
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Sunday, June 14, 2009 11:02 PM
To: hive-user@hadoop.apache.org
Subject: Re: OutOfMemory when doing map-side join
btw, that small table 'application' has only one partition right now, 20k rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com>> wrote:
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com>> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com>> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com>> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com>> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com>> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com>> wrote:
20k
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
RE: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
There are multiple things going on.
Column pruning is not working with map-joins. It is being tracked at:
https://issues.apache.org/jira/browse/HIVE-560
Also, since it is a Cartesian product, jdbm does not help - because a single key can be very large.
For now, you can do the column pruning yourself - create a new table with only the columns needed and then
join with the bigger table.
You may still need to increase the mapper memory - I was able to load about 20k rows with about 2G mapper.
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Sunday, June 14, 2009 11:02 PM
To: hive-user@hadoop.apache.org
Subject: Re: OutOfMemory when doing map-side join
btw, that small table 'application' has only one partition right now, 20k rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com>> wrote:
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com>> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com>> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com>> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com>> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com>> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com>> wrote:
20k
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
--
My research interests are distributed systems, parallel computing and bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
btw, that small table 'application' has only one partition right now, 20k
rows.
On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com> wrote:
> failed with null pointer exception.
> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
> x.url_pattern from application x where x.dt = '20090609') a join web_log w
> where w.logdate='20090611' and w.url rlike a.url_pattern;
> FAILED: Unknown exception : null
>
>
> $cat /tmp/hive/hive.log | tail...
>
> 2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279))
> - FAILED: Unknown exception : null
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
> at
> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
>
> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
>
>> The problem seems to be in partition pruning – The small table
>> ‘application’ is partitioned – and probably, there are 20k rows in the
>> partition
>> 20090609.
>>
>> Due to a bug, the pruning is not happening, and all partitions of
>> ‘application’ are being loaded – which may be too much for map-join to
>> handle.
>> This is a serious bug, but for now can you put in a subquery and try -
>>
>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
>> from application x where x.dt = ‘20090609’) a join web_log w where
>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>
>>
>> Please file a JIRA for the above.
>>
>>
>>
>>
>> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>> application a join web_log w where w.logdate='20090611' and w.url rlike
>> a.url_pattern and a.dt='20090609';
>> OK
>> ABSTRACT SYNTAX TREE:
>> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>> '20090609')))))
>>
>> STAGE DEPENDENCIES:
>> Stage-1 is a root stage
>> Stage-2 depends on stages: Stage-1
>> Stage-0 is a root stage
>>
>> STAGE PLANS:
>> Stage: Stage-1
>> Map Reduce
>> Alias -> Map Operator Tree:
>> w
>> Select Operator
>> expressions:
>> expr: url
>> type: string
>> expr: logdate
>> type: string
>> Common Join Operator
>> condition map:
>> Inner Join 0 to 1
>> condition expressions:
>> 0 {0} {1}
>> 1 {0} {1}
>> keys:
>> 0
>> 1
>> Position of Big Table: 1
>> File Output Operator
>> compressed: false
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>> Local Work:
>> Map Reduce Local Work
>> Alias -> Map Local Tables:
>> a
>> Fetch Operator
>> limit: -1
>> Alias -> Map Local Operator Tree:
>> a
>> Select Operator
>> expressions:
>> expr: url_pattern
>> type: string
>> expr: dt
>> type: string
>> Common Join Operator
>> condition map:
>> Inner Join 0 to 1
>> condition expressions:
>> 0 {0} {1}
>> 1 {0} {1}
>> keys:
>> 0
>> 1
>> Position of Big Table: 1
>> File Output Operator
>> compressed: false
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.SequenceFileInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>
>> Stage: Stage-2
>> Map Reduce
>> Alias -> Map Operator Tree:
>> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>> Select Operator
>> Filter Operator
>> predicate:
>> expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>> '20090609'))
>> type: boolean
>> Select Operator
>> expressions:
>> expr: 0
>> type: string
>> expr: 2
>> type: string
>> File Output Operator
>> compressed: true
>> GlobalTableId: 0
>> table:
>> input format:
>> org.apache.hadoop.mapred.TextInputFormat
>> output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>
>> Stage: Stage-0
>> Fetch Operator
>> limit: -1
>>
>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> I was looking at the code – and there may be a bug in cartesian product
>> codepath for map-join.
>>
>> Can you do a explain plan and send it ?
>>
>>
>>
>>
>>
>>
>> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>>
>> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
>> same exception.
>> 2. Actually, we used to do the same thing by loading small tables into
>> memory of each map node in normal map-reduce with the same cluster, where
>> same heap size is guranteed between running hive map-side join and our
>> map-reduce job. OOM exceptions never happened in that only 1MB would be
>> spent to load those 20k pieces of records while mapred.child.java.opts was
>> set to be -Xmx200m.
>>
>> here is the schema of our small table:
>> > describe application;
>> transaction_id string
>> subclass_id string
>> class_id string
>> memo string
>> url_alias string
>> url_pattern string
>> dt string (daily partitioned)
>>
>> Thanks,
>> Min
>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> 1. Can you reduce the number of cached rows and try ?
>>
>> 2. Were you using default memory settings of the mapper ? If yes, can can
>> increase it and try ?
>>
>> It would be useful to try both of them independently – it would give a
>> good idea of memory consumption of JDBM also.
>>
>>
>> Can you send the exact schema/data of the small table if possible ? You
>> can file a jira and load it there if it not a security issue.
>>
>> Thanks,
>> -namit
>>
>>
>>
>> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>>
>> 20k
>>
>>
>>
>>
>>
>>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
I will take a look and get back to you
On 6/14/09 10:59 PM, "Min Zhou" <co...@gmail.com> wrote:
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
20k
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
x.url_pattern from application x where x.dt = '20090609') a join web_log w
where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null
$cat /tmp/hive/hive.log | tail...
2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279))
- FAILED: Unknown exception : null
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
at
org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
> The problem seems to be in partition pruning – The small table
> ‘application’ is partitioned – and probably, there are 20k rows in the
> partition
> 20090609.
>
> Due to a bug, the pruning is not happening, and all partitions of
> ‘application’ are being loaded – which may be too much for map-join to
> handle.
> This is a serious bug, but for now can you put in a subquery and try -
>
> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
> from application x where x.dt = ‘20090609’) a join web_log w where
> w.logdate='20090611' and w.url rlike a.url_pattern;
>
>
> Please file a JIRA for the above.
>
>
>
>
> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
> application a join web_log w where w.logdate='20090611' and w.url rlike
> a.url_pattern and a.dt='20090609';
> OK
> ABSTRACT SYNTAX TREE:
> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
> '20090609')))))
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> w
> Select Operator
> expressions:
> expr: url
> type: string
> expr: logdate
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Local Work:
> Map Reduce Local Work
> Alias -> Map Local Tables:
> a
> Fetch Operator
> limit: -1
> Alias -> Map Local Operator Tree:
> a
> Select Operator
> expressions:
> expr: url_pattern
> type: string
> expr: dt
> type: string
> Common Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {0} {1}
> 1 {0} {1}
> keys:
> 0
> 1
> Position of Big Table: 1
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Stage: Stage-2
> Map Reduce
> Alias -> Map Operator Tree:
> hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
> Select Operator
> Filter Operator
> predicate:
> expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
> '20090609'))
> type: boolean
> Select Operator
> expressions:
> expr: 0
> type: string
> expr: 2
> type: string
> File Output Operator
> compressed: true
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>
> I was looking at the code – and there may be a bug in cartesian product
> codepath for map-join.
>
> Can you do a explain plan and send it ?
>
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>
>
> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into
> memory of each map node in normal map-reduce with the same cluster, where
> same heap size is guranteed between running hive map-side join and our
> map-reduce job. OOM exceptions never happened in that only 1MB would be
> spent to load those 20k pieces of records while mapred.child.java.opts was
> set to be -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id string
> subclass_id string
> class_id string
> memo string
> url_alias string
> url_pattern string
> dt string (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>
> 1. Can you reduce the number of cached rows and try ?
>
> 2. Were you using default memory settings of the mapper ? If yes, can can
> increase it and try ?
>
> It would be useful to try both of them independently – it would give a good
> idea of memory consumption of JDBM also.
>
>
> Can you send the exact schema/data of the small table if possible ? You can
> file a jira and load it there if it not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
>
>
>
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.
Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
Please file a JIRA for the above.
On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
20k
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application
a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern
and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
(TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
'20090609')))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
w
Select Operator
expressions:
expr: url
type: string
expr: logdate
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Local Work:
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
Select Operator
expressions:
expr: url_pattern
type: string
expr: dt
type: string
Common Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {0} {1}
1 {0} {1}
keys:
0
1
Position of Big Table: 1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
Select Operator
Filter Operator
predicate:
expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
'20090609'))
type: boolean
Select Operator
expressions:
expr: 0
type: string
expr: 2
type: string
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
> I was looking at the code – and there may be a bug in cartesian product
> codepath for map-join.
>
> Can you do a explain plan and send it ?
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>
>
> 1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into
> memory of each map node in normal map-reduce with the same cluster, where
> same heap size is guranteed between running hive map-side join and our
> map-reduce job. OOM exceptions never happened in that only 1MB would be
> spent to load those 20k pieces of records while mapred.child.java.opts was
> set to be -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id string
> subclass_id string
> class_id string
> memo string
> url_alias string
> url_pattern string
> dt string (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>
> 1. Can you reduce the number of cached rows and try ?
>
> 2. Were you using default memory settings of the mapper ? If yes, can can
> increase it and try ?
>
> It would be useful to try both of them independently – it would give a good
> idea of memory consumption of JDBM also.
>
>
> Can you send the exact schema/data of the small table if possible ? You can
> file a jira and load it there if it not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
>
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.
Can you do a explain plan and send it ?
On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job. OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
20k
Re: OutOfMemory when doing map-side join
Posted by Min Zhou <co...@gmail.com>.
1. tried setting hive.mapjoin.cache.numrows to be 100, failed with the same
exception.
2. Actually, we used to do the same thing by loading small tables into
memory of each map node in normal map-reduce with the same cluster, where
same heap size is guranteed between running hive map-side join and our
map-reduce job. OOM exceptions never happened in that only 1MB would be
spent to load those 20k pieces of records while mapred.child.java.opts was
set to be -Xmx200m.
here is the schema of our small table:
> describe application;
transaction_id string
subclass_id string
class_id string
memo string
url_alias string
url_pattern string
dt string (daily partitioned)
Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
> 1. Can you reduce the number of cached rows and try ?
>
> 2. Were you using default memory settings of the mapper ? If yes, can can
> increase it and try ?
>
> It would be useful to try both of them independently – it would give a good
> idea of memory consumption of JDBM also.
>
>
> Can you send the exact schema/data of the small table if possible ? You can
> file a jira and load it there if it not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
Re: OutOfMemory when doing map-side join
Posted by Namit Jain <nj...@facebook.com>.
1. Can you reduce the number of cached rows and try ?
2. Were you using default memory settings of the mapper ? If yes, can can increase it and try ?
It would be useful to try both of them independently - it would give a good idea of memory consumption of JDBM also.
Can you send the exact schema/data of the small table if possible ? You can file a jira and load it there if it not a security issue.
Thanks,
-namit
On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
20k