Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/06/15 06:23:45 UTC

OutOfMemory when doing map-side join

Hi all,
we had a query joining two tables, one of which had about 1 billion records
while the other had fewer than 20k. Below is our query:

set hive.mapjoin.cache.numrows=20000;
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join
web_log w where w.url rlike a.url_pattern

Here are the stack traces:

java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1838)
	at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1809)
	at java.io.ObjectOutputStream.write(ObjectOutputStream.java:681)
	at org.apache.hadoop.io.Text.write(Text.java:282)
	at org.apache.hadoop.hive.ql.exec.MapJoinObjectValue.writeExternal(MapJoinObjectValue.java:126)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
	at org.apache.hadoop.hive.ql.util.jdbm.htree.HashBucket.writeExternal(HashBucket.java:292)
	at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1421)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1390)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
	at org.apache.hadoop.hive.ql.util.jdbm.helper.Serialization.serialize(Serialization.java:93)
	at org.apache.hadoop.hive.ql.util.jdbm.helper.DefaultSerializer.serialize(DefaultSerializer.java:101)
	at org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.insert(BaseRecordManager.java:242)
	at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:176)
	at org.apache.hadoop.hive.ql.util.jdbm.recman.CacheRecordManager.insert(CacheRecordManager.java:159)
	at org.apache.hadoop.hive.ql.util.jdbm.htree.HashDirectory.put(HashDirectory.java:249)
	at org.apache.hadoop.hive.ql.util.jdbm.htree.HTree.put(HTree.java:147)
	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java:305)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:76)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:492)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:49)
	at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:121)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)

java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOfRange(Arrays.java:3209)
	at java.lang.String.<init>(String.java:216)
	at java.lang.StringBuffer.toString(StringBuffer.java:585)
	at org.apache.log4j.PatternLayout.format(PatternLayout.java:505)
	at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
	at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
	at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:55)
	at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
	at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
	at org.apache.log4j.Category.callAppenders(Category.java:206)
	at org.apache.log4j.Category.forcedLog(Category.java:391)
	at org.apache.log4j.Category.log(Category.java:856)
	at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:133)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:387)
	at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:173)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)


Actually, we used to do the same thing by loading the small table into memory
on each map node in a normal map-reduce job. OOM exceptions never happened
there, since only about 1MB was needed to load those 20k records. Can you
give me an explanation?

Thanks,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
I still don't understand what the 100K you two referred to stands for.

On Thu, Jun 18, 2009 at 9:32 AM, Ashish Thusoo <at...@facebook.com> wrote:

>  my bad. yes it is 100K. That sounds a bit too much as well.
>
> Ashish
>
>  ------------------------------
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Wednesday, June 17, 2009 6:25 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: OutOfMemory when doing map-side join
>
> What does your 100KB stand for?
>
> On Thu, Jun 18, 2009 at 6:26 AM, Amr Awadallah <aa...@cloudera.com> wrote:
>
>> hmm, that is 100KB per row, by my math.
>>
>> 20K * 100K = 2GB
>>
>> -- amr
>>
>>
>> Ashish Thusoo wrote:
>>
>> That does not sound right. Each row is 100MB - that sounds too much...
>>
>> Ashish
>>
>>  ------------------------------
>> *From:* Min Zhou [mailto:coderplay@gmail.com <co...@gmail.com>]
>> *Sent:* Monday, June 15, 2009 7:16 PM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: OutOfMemory when doing map-side join
>>
>> 20k rows need 2G of memory? That is terrible. My whole small table is
>> less than 4MB; what about yours?
>>
>> On Tue, Jun 16, 2009 at 6:59 AM, Namit Jain <nj...@facebook.com> wrote:
>>
>>>  Set  mapred.child.java.opts to increase mapper memory.
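>>>
>>> For example, something like this before running the join (a sketch; the
>>> -Xmx value here is only an assumption, size it to what your task nodes
>>> can spare):
>>>
>>> set mapred.child.java.opts=-Xmx2048m;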
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Namit Jain [mailto:njain@facebook.com]
>>> *Sent:* Monday, June 15, 2009 3:53 PM
>>> *To:* hive-user@hadoop.apache.org
>>> *Subject:* RE: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> There are multiple things going on.
>>>
>>>
>>>
>>> Column pruning is not working with map-joins. It is being tracked at:
>>>
>>>
>>>
>>> https://issues.apache.org/jira/browse/HIVE-560
>>>
>>>
>>>
>>>
>>>
>>> Also, since it is a Cartesian product, jdbm does not help  - because a
>>> single key can be very large.
>>>
>>>
>>>
>>>
>>>
>>> For now, you can do the column pruning yourself – create a new table with
>>> only the columns needed and then
>>>
>>> join with the bigger table.
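>>>
>>> A sketch of that workaround (the application_pruned table name is made
>>> up; the columns follow the 'application' schema quoted later in this
>>> thread):
>>>
>>> create table application_pruned (url_pattern string);
>>>
>>> from application x
>>> insert overwrite table application_pruned
>>> select x.url_pattern
>>> where x.dt = '20090609';
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url
>>> from application_pruned a join web_log w
>>> where w.logdate = '20090611' and w.url rlike a.url_pattern;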
>>>
>>>
>>>
>>> You may still need to increase the mapper memory - I was able to load
>>> about 20k rows with about a 2G mapper heap.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Min Zhou [mailto:coderplay@gmail.com]
>>> *Sent:* Sunday, June 14, 2009 11:02 PM
>>> *To:* hive-user@hadoop.apache.org
>>> *Subject:* Re: OutOfMemory when doing map-side join
>>>
>>>
>>>
>>> btw, that small table 'application' has only one partition right now,
>>> 20k rows.
>>>
>>> On Mon, Jun 15, 2009 at 1:59 PM, Min Zhou <co...@gmail.com> wrote:
>>>
>>> failed with null pointer exception.
>>> hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select
>>> x.url_pattern from application x where x.dt = '20090609') a join web_log w
>>> where w.logdate='20090611' and w.url rlike a.url_pattern;
>>> FAILED: Unknown exception : null
>>>
>>>
>>> $cat /tmp/hive/hive.log | tail...
>>>
>>> 2009-06-15 13:57:02,933 ERROR ql.Driver
>>> (SessionState.java:printError(279)) - FAILED: Unknown exception : null
>>> java.lang.NullPointerException
>>>         at
>>> org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
>>>         at
>>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
>>>         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
>>>         at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>>>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>>
>>>
>>>
>>> On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> The problem seems to be in partition pruning – The small table
>>> ‘application’ is partitioned – and probably, there are 20k rows in the
>>> partition
>>> 20090609.
>>>
>>> Due to a bug, the pruning is not happening, and all partitions of
>>> ‘application’ are being loaded – which may be too much for map-join to
>>> handle.
>>> This is a serious bug, but for now can you put in a subquery and try -
>>>
>>> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern
>>> from application x where x.dt = '20090609') a join web_log w where
>>> w.logdate='20090611' and w.url rlike a.url_pattern;
>>>
>>>
>>> Please file a JIRA for the above.
>>>
>>>
>>>
>>>
>>>
>>> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
>>> application a join web_log w where w.logdate='20090611' and w.url rlike
>>> a.url_pattern and a.dt='20090609';
>>> OK
>>> ABSTRACT SYNTAX TREE:
>>>   (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
>>> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
>>> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
>>> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
>>> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
>>> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
>>> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
>>> '20090609')))))
>>>
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-2 depends on stages: Stage-1
>>>   Stage-0 is a root stage
>>>
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         w
>>>             Select Operator
>>>               expressions:
>>>                     expr: url
>>>                     type: string
>>>                     expr: logdate
>>>                     type: string
>>>               Common Join Operator
>>>                 condition map:
>>>                      Inner Join 0 to 1
>>>                 condition expressions:
>>>                   0 {0} {1}
>>>                   1 {0} {1}
>>>                 keys:
>>>                   0
>>>                   1
>>>                 Position of Big Table: 1
>>>                 File Output Operator
>>>                   compressed: false
>>>                   GlobalTableId: 0
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>       Local Work:
>>>         Map Reduce Local Work
>>>           Alias -> Map Local Tables:
>>>             a
>>>               Fetch Operator
>>>                 limit: -1
>>>           Alias -> Map Local Operator Tree:
>>>             a
>>>                 Select Operator
>>>                   expressions:
>>>                         expr: url_pattern
>>>                         type: string
>>>                         expr: dt
>>>                         type: string
>>>                   Common Join Operator
>>>                     condition map:
>>>                          Inner Join 0 to 1
>>>                     condition expressions:
>>>                       0 {0} {1}
>>>                       1 {0} {1}
>>>                     keys:
>>>                       0
>>>                       1
>>>                     Position of Big Table: 1
>>>                     File Output Operator
>>>                       compressed: false
>>>                       GlobalTableId: 0
>>>                       table:
>>>                           input format:
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>                           output format:
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>>   Stage: Stage-2
>>>     Map Reduce
>>>       Alias -> Map Operator Tree:
>>>         hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>>>           Select Operator
>>>             Filter Operator
>>>               predicate:
>>>                   expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
>>> '20090609'))
>>>                   type: boolean
>>>               Select Operator
>>>                 expressions:
>>>                       expr: 0
>>>                       type: string
>>>                       expr: 2
>>>                       type: string
>>>                 File Output Operator
>>>                   compressed: true
>>>                   GlobalTableId: 0
>>>                   table:
>>>                       input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>                       output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>
>>>   Stage: Stage-0
>>>     Fetch Operator
>>>       limit: -1
>>>
>>> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> I was looking at the code – and there may be a bug in cartesian product
>>> codepath for map-join.
>>>
>>> Can you do an explain plan and send it?
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>>
>>> 1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the
>>> same exception.
>>> 2. Actually, we used to do the same thing by loading small tables into
>>> memory on each map node in a normal map-reduce job on the same cluster,
>>> where the same heap size was guaranteed between the hive map-side join and
>>> our map-reduce job. OOM exceptions never happened there, since only about
>>> 1MB was needed to load those 20k records while mapred.child.java.opts was
>>> set to -Xmx200m.
>>>
>>> here is the schema of our small table:
>>> > describe application;
>>> transaction_id    string
>>> subclass_id       string
>>> class_id          string
>>> memo              string
>>> url_alias         string
>>> url_pattern       string
>>> dt                string    (daily partitioned)
>>>
>>> Thanks,
>>> Min
>>> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>>>
>>> 1. Can you reduce the number of cached rows and try ?
>>>
>>> 2. Were you using the default memory settings of the mapper ? If yes, can
>>> you increase them and try ?
>>>
>>> It would be useful to try both of them independently – it would give a
>>> good idea of memory consumption of JDBM also.
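>>>
>>> For instance, two separate runs along these lines (the values are only
>>> illustrative):
>>>
>>> set hive.mapjoin.cache.numrows=5000;   -- run 1: smaller cache, default heap
>>> set mapred.child.java.opts=-Xmx1024m;  -- run 2: default cache, bigger heap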
>>>
>>>
>>> Can you send the exact schema/data of the small table if possible ? You
>>> can file a jira and load it there if it is not a security issue.
>>>
>>> Thanks,
>>> -namit
>>>
>>>
>>>
>>> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>>>
>>> 20k
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>  --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>>
>>>
>>>
>>> --
>>> My research interests are distributed systems, parallel computing and
>>> bytecode based virtual machine.
>>>
>>> My profile:
>>> http://www.linkedin.com/in/coderplay
>>> My blog:
>>> http://coderplay.javaeye.com
>>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>>
>
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>



-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

RE: OutOfMemory when doing map-side join

Posted by Ashish Thusoo <at...@facebook.com>.
my bad. yes it is 100K. That sounds a bit too much as well.

Ashish


Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
What does your 100KB stand for?


Re: OutOfMemory when doing map-side join

Posted by Amr Awadallah <aa...@cloudera.com>.
hmm, that is 100KB per row, by my math.

20K * 100K = 2GB
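
Going the other way: the raw table is reportedly under 4MB for those 20K
rows, i.e. roughly 4MB / 20K = 200 bytes per row, so the map-join cache
would be inflating each row by a factor of about 500.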

-- amr


RE: OutOfMemory when doing map-side join

Posted by Ashish Thusoo <at...@facebook.com>.
That does not sound right. Each row is 100MB - that sounds too much...

Ashish


Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
20k rows need 2G of memory? That is terrible. Our whole small table is
less than 4MB - what about yours?


RE: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
Set  mapred.child.java.opts to increase mapper memory.
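For example - a minimal sketch from the Hive CLI, assuming a 2G heap like the one mentioned later in the thread (how much you can afford depends on your task nodes); the session override is passed into the job conf of queries run afterwards:

set mapred.child.java.opts=-Xmx2048m;
select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';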





RE: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
There are multiple things going on.

Column pruning is not working with map-joins. It is being tracked at:

https://issues.apache.org/jira/browse/HIVE-560


Also, since it is a Cartesian product, jdbm does not help - with no equi-join key, every row of the small table hashes to the same key, so a single key's value list can be very large.


For now, you can do the column pruning yourself - create a new table with only the columns needed and then
join with the bigger table.
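A minimal sketch of that workaround, assuming a hypothetical staging table name (application_url_patterns), and using CREATE TABLE plus INSERT OVERWRITE rather than CTAS so it does not depend on newer syntax:

create table application_url_patterns (url_pattern string);
insert overwrite table application_url_patterns
select x.url_pattern from application x where x.dt = '20090609';

select /*+ MAPJOIN(p) */ p.url_pattern, w.url from application_url_patterns p join web_log w where w.logdate='20090611' and w.url rlike p.url_pattern;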

You may still need to increase the mapper memory - I was able to load about 20k rows with about a 2G mapper heap.

Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
btw, that small table 'application' has only one partition right now,  20k
rows.


Re: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
I will take a look and get back to you



On 6/14/09 10:59 PM, "Min Zhou" <co...@gmail.com> wrote:

failed with null pointer exception.
hive>select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null


$cat /tmp/hive/hive.log | tail...

2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Unknown exception : null
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:
The problem seems to be in partition pruning - The small table 'application' is partitioned - and probably, there are 20k rows in the partition
20090609.

Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try -

select /*+ MAPJOIN(a) */ a.url_pattern, w.url from  (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;


Please file a JIRA for the above.




On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:

hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
I was looking at the code - and there may be a bug in cartesian product codepath for map-join.

Can you do a explain plan and send it ?






On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:


1. tried setting hive.mapjoin.cache.numrows to be 100,  failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into memory of each map node in normal map-reduce with the same cluster, where same heap size is guranteed between running hive map-side join and our map-reduce job.  OOM exceptions never happened in that only 1MB would be spent to load those 20k pieces of records while mapred.child.java.opts was set to be -Xmx200m.

here is the schema of our small table:
> describe application;
transaction_id    string
subclass_id     string
class_id        string
memo string
url_alias    string
url_pattern     string
dt      string  (daily partitioned)

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try?

2. Were you using the default memory settings of the mapper? If yes, can you increase them and try?

It would be useful to try both of them independently - it would give a good idea of the memory consumption of JDBM as well.


Can you send the exact schema/data of the small table if possible? You can file a jira and load it there if it is not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:

20k







Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
Failed with a null pointer exception.
hive> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select
x.url_pattern from application x where x.dt = '20090609') a join web_log w
where w.logdate='20090611' and w.url rlike a.url_pattern;
FAILED: Unknown exception : null


$cat /tmp/hive/hive.log | tail...

2009-06-15 13:57:02,933 ERROR ql.Driver (SessionState.java:printError(279))
- FAILED: Unknown exception : null
java.lang.NullPointerException
        at
org.apache.hadoop.hive.ql.parse.QBMetaData.getTableForAlias(QBMetaData.java:76)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.getTableColumnDesc(PartitionPruner.java:284)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:217)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.genExprNodeDesc(PartitionPruner.java:231)
        at
org.apache.hadoop.hive.ql.parse.PartitionPruner.addExpression(PartitionPruner.java:377)
        at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPartitionPruners(SemanticAnalyzer.java:608)
        at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:3785)
        at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:76)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:209)
        at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
        at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:309)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

On Mon, Jun 15, 2009 at 1:52 PM, Namit Jain <nj...@facebook.com> wrote:

>  The problem seems to be in partition pruning - the small table
> 'application' is partitioned, and probably there are 20k rows in the
> partition 20090609.
>
> Due to a bug, the pruning is not happening, and all partitions of
> 'application' are being loaded - which may be too much for map-join to
> handle.
> This is a serious bug, but for now can you put in a subquery and try:
>
> select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern
> from application x where x.dt = '20090609') a join web_log w where
> w.logdate='20090611' and w.url rlike a.url_pattern;
>
>
> Please file a JIRA for the above.
>
>
>
>
> On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from
> application a join web_log w where w.logdate='20090611' and w.url rlike
> a.url_pattern and a.dt='20090609';
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
> web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
> (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
> (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
> logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
> (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
> '20090609')))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         w
>             Select Operator
>               expressions:
>                     expr: url
>                     type: string
>                     expr: logdate
>                     type: string
>               Common Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 condition expressions:
>                   0 {0} {1}
>                   1 {0} {1}
>                 keys:
>                   0
>                   1
>                 Position of Big Table: 1
>                 File Output Operator
>                   compressed: false
>                   GlobalTableId: 0
>                   table:
>                       input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>       Local Work:
>         Map Reduce Local Work
>           Alias -> Map Local Tables:
>             a
>               Fetch Operator
>                 limit: -1
>           Alias -> Map Local Operator Tree:
>             a
>                 Select Operator
>                   expressions:
>                         expr: url_pattern
>                         type: string
>                         expr: dt
>                         type: string
>                   Common Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     condition expressions:
>                       0 {0} {1}
>                       1 {0} {1}
>                     keys:
>                       0
>                       1
>                     Position of Big Table: 1
>                     File Output Operator
>                       compressed: false
>                       GlobalTableId: 0
>                       table:
>                           input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                           output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
>   Stage: Stage-2
>     Map Reduce
>       Alias -> Map Operator Tree:
>         hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
>           Select Operator
>             Filter Operator
>               predicate:
>                   expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
> '20090609'))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: 0
>                       type: string
>                       expr: 2
>                       type: string
>                 File Output Operator
>                   compressed: true
>                   GlobalTableId: 0
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
> On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
>
> I was looking at the code - and there may be a bug in the cartesian
> product codepath for map-join.
>
> Can you do an explain plan and send it?
>
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>
>
> 1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into the
> memory of each map node in a plain map-reduce job on the same cluster,
> with the same heap size guaranteed for both the Hive map-side join and our
> map-reduce job. OOM exceptions never happened there, since only about 1 MB
> was needed to load those 20k records while mapred.child.java.opts was set
> to -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id    string
> subclass_id     string
> class_id        string
> memo string
> url_alias    string
> url_pattern     string
> dt      string  (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>
> 1. Can you reduce the number of cached rows and try?
>
> 2. Were you using the default memory settings of the mapper? If yes, can
> you increase them and try?
>
> It would be useful to try both of them independently - it would give a
> good idea of the memory consumption of JDBM as well.
>
>
> Can you send the exact schema/data of the small table if possible? You
> can file a jira and load it there if it is not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
>
>
>
>


-- 
My research interests are distributed systems, parallel computing and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
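
A possible workaround, not tried in this thread, is to sidestep both the pruning bug and the subquery NPE by first materializing the one needed partition into a plain, unpartitioned staging table and then map-joining against that. The table name tmp_application below is only illustrative:

-- materialize just the 20090609 partition of the small table
create table tmp_application (url_pattern string);
insert overwrite table tmp_application
select url_pattern from application where dt = '20090609';

-- map-join against the unpartitioned staging table, so no pruning is involved
select /*+ MAPJOIN(t) */ t.url_pattern, w.url
from tmp_application t join web_log w
where w.logdate = '20090611' and w.url rlike t.url_pattern;

Since tmp_application has no partitions, only the 20k url_pattern rows should reach the map-side cache.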

Re: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
The problem seems to be in partition pruning - the small table 'application' is partitioned, and probably there are 20k rows in the partition 20090609.

Due to a bug, the pruning is not happening, and all partitions of 'application' are being loaded - which may be too much for map-join to handle.
This is a serious bug, but for now can you put in a subquery and try:

select /*+ MAPJOIN(a) */ a.url_pattern, w.url from (select x.url_pattern from application x where x.dt = '20090609') a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern;


Please file a JIRA for the above.
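
One rough way to confirm that all partitions are being loaded, assuming the dt-partitioned layout shown earlier (early Hive accepts count(1) rather than count(*)), is to compare row counts:

-- rows in the single partition the query needs
select count(1) from application where dt = '20090609';
-- rows across all partitions; if this is far larger than 20k,
-- the map-join is caching much more than intended
select count(1) from application;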



On 6/14/09 10:20 PM, "Min Zhou" <co...@gmail.com> wrote:

hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w) logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (. (TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt) '20090609')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 = '20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:
I was looking at the code - and there may be a bug in the cartesian product codepath for map-join.

Can you do an explain plan and send it?






On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:


1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into the memory of each map node in a plain map-reduce job on the same cluster, with the same heap size guaranteed for both the Hive map-side join and our map-reduce job. OOM exceptions never happened there, since only about 1 MB was needed to load those 20k records while mapred.child.java.opts was set to -Xmx200m.

here is the schema of our small table:
> describe application;
transaction_id    string
subclass_id     string
class_id        string
memo string
url_alias    string
url_pattern     string
dt      string  (daily partitioned)

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try?

2. Were you using the default memory settings of the mapper? If yes, can you increase them and try?

It would be useful to try both of them independently - it would give a good idea of the memory consumption of JDBM as well.


Can you send the exact schema/data of the small table if possible? You can file a jira and load it there if it is not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:

20k





Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
hive> explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application
a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern
and a.dt='20090609';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF application a) (TOK_TABREF
web_log w))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT (TOK_HINTLIST (TOK_HINT TOK_MAPJOIN (TOK_HINTARGLIST a)))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) url_pattern)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL w) url))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL w)
logdate) '20090611') (rlike (. (TOK_TABLE_OR_COL w) url) (.
(TOK_TABLE_OR_COL a) url_pattern))) (= (. (TOK_TABLE_OR_COL a) dt)
'20090609')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        w
            Select Operator
              expressions:
                    expr: url
                    type: string
                    expr: logdate
                    type: string
              Common Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {0} {1}
                  1 {0} {1}
                keys:
                  0
                  1
                Position of Big Table: 1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
      Local Work:
        Map Reduce Local Work
          Alias -> Map Local Tables:
            a
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            a
                Select Operator
                  expressions:
                        expr: url_pattern
                        type: string
                        expr: dt
                        type: string
                  Common Join Operator
                    condition map:
                         Inner Join 0 to 1
                    condition expressions:
                      0 {0} {1}
                      1 {0} {1}
                    keys:
                      0
                      1
                    Position of Big Table: 1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn.cm3:9000/group/taobao/hive/hive-tmp/220575636/10004
          Select Operator
            Filter Operator
              predicate:
                  expr: (((3 = '20090611') and (2 regexp 0)) and (1 =
'20090609'))
                  type: boolean
              Select Operator
                expressions:
                      expr: 0
                      type: string
                      expr: 2
                      type: string
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On Mon, Jun 15, 2009 at 1:14 PM, Namit Jain <nj...@facebook.com> wrote:

>  I was looking at the code - and there may be a bug in the cartesian
> product codepath for map-join.
>
> Can you do an explain plan and send it?
>
>
>
>
>
> On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:
>
>
> 1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the
> same exception.
> 2. Actually, we used to do the same thing by loading small tables into the
> memory of each map node in a plain map-reduce job on the same cluster,
> with the same heap size guaranteed for both the Hive map-side join and our
> map-reduce job. OOM exceptions never happened there, since only about 1 MB
> was needed to load those 20k records while mapred.child.java.opts was set
> to -Xmx200m.
>
> here is the schema of our small table:
> > describe application;
> transaction_id    string
> subclass_id     string
> class_id        string
> memo string
> url_alias    string
> url_pattern     string
> dt      string  (daily partitioned)
>
> Thanks,
> Min
> On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
>
> 1. Can you reduce the number of cached rows and try?
>
> 2. Were you using the default memory settings of the mapper? If yes, can
> you increase them and try?
>
> It would be useful to try both of them independently - it would give a
> good idea of the memory consumption of JDBM as well.
>
>
> Can you send the exact schema/data of the small table if possible? You
> can file a jira and load it there if it is not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>
>
>


-- 
My research interests are distributed systems, parallel computing and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
I was looking at the code - and there may be a bug in the cartesian product codepath for map-join.

Can you do an explain plan and send it?





On 6/14/09 10:06 PM, "Min Zhou" <co...@gmail.com> wrote:


1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the same exception.
2. Actually, we used to do the same thing by loading small tables into the memory of each map node in a plain map-reduce job on the same cluster, with the same heap size guaranteed for both the Hive map-side join and our map-reduce job. OOM exceptions never happened there, since only about 1 MB was needed to load those 20k records while mapred.child.java.opts was set to -Xmx200m.

here is the schema of our small table:
> describe application;
transaction_id    string
subclass_id     string
class_id        string
memo string
url_alias    string
url_pattern     string
dt      string  (daily partitioned)

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:
1. Can you reduce the number of cached rows and try?

2. Were you using the default memory settings of the mapper? If yes, can you increase them and try?

It would be useful to try both of them independently - it would give a good idea of the memory consumption of JDBM as well.


Can you send the exact schema/data of the small table if possible? You can file a jira and load it there if it is not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:

20k



Re: OutOfMemory when doing map-side join

Posted by Min Zhou <co...@gmail.com>.
1. Tried setting hive.mapjoin.cache.numrows to 100; it failed with the same
exception.
2. Actually, we used to do the same thing by loading small tables into the
memory of each map node in a plain map-reduce job on the same cluster, with
the same heap size guaranteed for both the Hive map-side join and our
map-reduce job. OOM exceptions never happened there, since only about 1 MB
was needed to load those 20k records while mapred.child.java.opts was set
to -Xmx200m.

here is the schema of our small table:
> describe application;
transaction_id    string
subclass_id     string
class_id        string
memo string
url_alias    string
url_pattern     string
dt      string  (daily partitioned)

Thanks,
Min
On Mon, Jun 15, 2009 at 12:51 PM, Namit Jain <nj...@facebook.com> wrote:

>  1. Can you reduce the number of cached rows and try?
>
> 2. Were you using the default memory settings of the mapper? If yes, can
> you increase them and try?
>
> It would be useful to try both of them independently - it would give a
> good idea of the memory consumption of JDBM as well.
>
>
> Can you send the exact schema/data of the small table if possible? You
> can file a jira and load it there if it is not a security issue.
>
> Thanks,
> -namit
>
>
>
> On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> 20k
>
>


-- 
My research interests are distributed systems, parallel computing and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: OutOfMemory when doing map-side join

Posted by Namit Jain <nj...@facebook.com>.
1. Can you reduce the number of cached rows and try?

2. Were you using the default memory settings of the mapper? If yes, can you increase them and try?

It would be useful to try both of them independently - it would give a good idea of the memory consumption of JDBM as well.


Can you send the exact schema/data of the small table if possible? You can file a jira and load it there if it is not a security issue.

Thanks,
-namit



On 6/14/09 9:23 PM, "Min Zhou" <co...@gmail.com> wrote:

20k
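
Both suggestions can be tried per session from the Hive CLI. The values below are illustrative starting points only, not figures from this thread:

-- cap how many rows JDBM caches in memory for the map join
set hive.mapjoin.cache.numrows=1000;
-- raise the task heap; this property is passed through to the launched map-reduce job
set mapred.child.java.opts=-Xmx512m;

select /*+ MAPJOIN(a) */ a.url_pattern, w.url
from application a join web_log w
where w.logdate = '20090611' and w.url rlike a.url_pattern and a.dt = '20090609';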