You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sudipto Das <su...@cs.ucsb.edu> on 2009/09/09 23:01:14 UTC

Directing Hive to perform Hash Join for small inner tables

Hi,

I am new to hive so pardon me if this is something very obvious which I
might have missed in the documentation.

I have an application where I am joining a small inner table with a really
large outer table. The inner table is small enough to fit into memory at
each mapper. In such a case, putting the inner table into an in-memory hash
table and performing a hash based join is much more efficient than
performing the sort-merge join which the JOIN operator selects. Is there a
way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
The entire data is about 3G. I will have to find a way to put it up. Will
get back to you tomorrow..

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:

>  If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Caused by: java.lang.NullPointerException
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>

Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
I am seeing a very weird behavior. When the inner table is small (~17k rows
each with two columns (INT,DOUBLE)), the join executes in about 2 mins
(outer table is 500 million rows). While when the inner table is a bit
bigger (~480k rows (INT, DOUBLE)), the join takes about 20 mins for the same
outer table. Does the MapJoin do something fancy if it sees that the inner
table is big? In my case, both the tables together fit in memory, and each
task is provided 1024MB heap, with 256 MB sort buffer for 256MB block size.

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 4:32 PM, Sudipto Das <su...@gmail.com> wrote:

> Thanks. I switched to branch 0.4, and the hash join is working, even though
> it is running much slower that I expected. I will try to figure out the
> reason.
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
> On Fri, Sep 11, 2009 at 3:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
>>  Your join keys are different – in that case, we don’t merge currently
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Friday, September 11, 2009 3:10 PM
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> I checked join26.q. That wasn't what I meant. My query looks something
>> like this:
>>
>> insert overwrite table join_result
>> select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate,
>> (r.rate - c.cparam*m.mparam)
>> from mparam m JOIN data r ON (m.mid = r.mid)
>> JOIN cparam c ON (c.cid = r.cid);
>>
>> mparam and cparam are two inner tables, which are small enough to be
>> loaded into memory together, whereas data is a really huge table, so I want
>> to avoid the cost of making two passes over it. So I want a plan where
>> mparam and cparam are hashed in memory, and then data is streamed through.
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>>  On Fri, Sep 11, 2009 at 2:22 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> That is not true – look at the unit test join26.q
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Friday, September 11, 2009 2:07 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> But this creates a join where each join is performed in a single Map only
>> MR join, which is as good as specifying the query as MAPJOIN(x) followed by
>> another query as MAPJOIN(y) with the result of the previous join. Is there
>> as way to make it pick just one MR job, where both the small inner tables
>> are put in memory, and only one pass is done over the data?
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> Yes, you can specify the list of tables in the hint
>>
>>
>>
>>
>>
>> MAPJOIN(x,y,z)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Friday, September 11, 2009 1:17 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Is there any performance difference between 0.4 and trunk? I can then
>> temporarily switch to 0.4 while the problem is being fixed.
>>
>> Also, can I specify two tables in the MAPJOIN hint when doing a 3 way
>> join, where two tables can fit in memory. I tried some intuitive
>> alternatives but did not work. The only option I have now is to have 2
>> separate JOINs. Since this is a map only join, and the user knows that the
>> inner table(s) will fit in memory, so allowing multiple tables will result
>> in the JOIN being done in a single Map only job, and will also allow
>> Combiners and Reducers to be put in if the query is doing some aggregation
>> on the JOINed result.
>>
>> Thanks
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> This is some problem in trunk – it runs fine in 0.4 – I will take a look
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Thursday, September 10, 2009 6:16 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Hi Namit,
>>
>> The join column is the first column of the data. Here are the queries to
>> recreate the error.
>>
>> create table mydata (mid INT, cid INT,
>> rating INT)
>> stored as SEQUENCEFILE;
>>
>> create external table mydata_stg(mid INT, cid INT,
>> rating INT, date1 STRING)
>> row format delimited fields terminated by '\054' lines terminated by
>> '\012'
>> stored as textfile
>> location '/user/sudipto/data/hive/text1';
>>
>> from mydata_stg stg
>> insert overwrite table mydata
>> select stg.mid, stg.cid, stg.rating;
>>
>> create table param(mid INT, param DOUBLE)
>> stored as SEQUENCEFILE;
>>
>> insert overwrite table param
>> select mid, pow(avg(rating), 0.5)
>> from mydata
>> group by mid;
>>
>> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
>> stored as SEQUENCEFILE;
>>
>> insert overwrite table result
>> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
>> from mydata r JOIN param m ON (r.mid = m.mid);
>>
>> Thanks
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> The data you have sent has 4 columns –
>>
>>
>>
>> insert overwrite table join_result
>> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
>> from data r JOIN param m ON (r.mid = m.mid);
>>
>>
>>
>> Which is the join column -
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Thursday, September 10, 2009 3:26 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Hi Namit,
>>
>> I was able to create a smaller data set to replicate the error. Please see
>> the attached file. Put the file in HDFS, and run the queries (make sure you
>> update the path for the external table to point to directory in HDFS that
>> contains this file). I hope you can see the same error. I am using Hadoop
>> 0.18.3, and Hive trunk r812721.
>>
>> Thanks
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> If possible, can you send the entire dataset ?
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Wednesday, September 09, 2009 7:10 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Hi,
>>
>> This seems to be one of those sneaky bugs... the query works for the
>> smaller sets of data which I tried creating from the original data set. But
>> as soon as I move to the entire data set, the error comes back. I tried it
>> with another table in my data set, but the error persists. It only vanishes
>> for the subset.. I can't think of a good way of creating a test data set. I
>> was just taking the first 100 - 1000 rows..
>>
>> I am attaching all the queries and the explain. Let me know if it helps:
>>
>> Thanks
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> Can you provide a small testcase ?
>>
>>
>>
>>
>>
>> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
>> *Sent:* Wednesday, September 09, 2009 2:20 PM
>>
>>
>> *To:* hive-user@hadoop.apache.org
>>
>> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Hi,
>>
>> Thanks for the quick response. I tried the query:
>>
>> insert overwrite table join_result
>> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
>> from data r JOIN param m ON (r.mid = m.mid);
>>
>> param has only 17k rows with 2 columns.
>>
>> I got this exception
>>
>> java.lang.RuntimeException
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Caused by: java.lang.NullPointerException
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         ... 3 more
>>
>>
>> Additionally, the query compiled into two MR jobs. The 2nd one didn't
>> start because the first failed, but I couldn't reason about the 2nd job.
>>
>> I am using Hive trunk, revision 811082 updated on 09/03.
>>
>> Thanks
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>>
>> You can specify it as a hint in the select list:
>>
>>
>>
>>
>>
>> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>>
>>
>>
>>
>>
>> In the example above, T2 is the small table which can be cached in memory
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
>> Das
>> *Sent:* Wednesday, September 09, 2009 2:01 PM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Directing Hive to perform Hash Join for small inner tables
>>
>>
>>
>> Hi,
>>
>> I am new to hive so pardon me if this is something very obvious which I
>> might have missed in the documentation.
>>
>> I have an application where I am joining a small inner table with a really
>> large outer table. The inner table is small enough to fit into memory at
>> each mapper. In such a case, putting the inner table into an in-memory hash
>> table and performing a hash based join is much more efficient than
>> performing the sort-merge join which the JOIN operator selects. Is there a
>> way in Hive where I can instruct it perform the hash based join?
>>
>> Thanks
>>
>> Sudipto
>>
>> PhD Candidate
>> CS @ UCSB
>> Santa Barbara, CA 93106, USA
>> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Thanks. I switched to branch 0.4, and the hash join is working, even though
it is running much slower that I expected. I will try to figure out the
reason.

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 3:13 PM, Namit Jain <nj...@facebook.com> wrote:

>  Your join keys are different – in that case, we don’t merge currently
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 3:10 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> I checked join26.q. That wasn't what I meant. My query looks something like
> this:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate, (r.rate
> - c.cparam*m.mparam)
> from mparam m JOIN data r ON (m.mid = r.mid)
> JOIN cparam c ON (c.cid = r.cid);
>
> mparam and cparam are two inner tables, which are small enough to be loaded
> into memory together, whereas data is a really huge table, so I want to
> avoid the cost of making two passes over it. So I want a plan where mparam
> and cparam are hashed in memory, and then data is streamed through.
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Fri, Sep 11, 2009 at 2:22 PM, Namit Jain <nj...@facebook.com> wrote:
>
> That is not true – look at the unit test join26.q
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 2:07 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> But this creates a join where each join is performed in a single Map only
> MR join, which is as good as specifying the query as MAPJOIN(x) followed by
> another query as MAPJOIN(y) with the result of the previous join. Is there
> as way to make it pick just one MR job, where both the small inner tables
> are put in memory, and only one pass is done over the data?
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Yes, you can specify the list of tables in the hint
>
>
>
>
>
> MAPJOIN(x,y,z)
>
>
>
>
>
>
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 1:17 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Is there any performance difference between 0.4 and trunk? I can then
> temporarily switch to 0.4 while the problem is being fixed.
>
> Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join,
> where two tables can fit in memory. I tried some intuitive alternatives but
> did not work. The only option I have now is to have 2 separate JOINs. Since
> this is a map only join, and the user knows that the inner table(s) will fit
> in memory, so allowing multiple tables will result in the JOIN being done in
> a single Map only job, and will also allow Combiners and Reducers to be put
> in if the query is doing some aggregation on the JOINed result.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> This is some problem in trunk – it runs fine in 0.4 – I will take a look
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 6:16 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to
> recreate the error.
>
> create table mydata (mid INT, cid INT,
> rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT,
> rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:
>
> The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
Your join keys are different - in that case, we don't merge currently


From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Friday, September 11, 2009 3:10 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

I checked join26.q. That wasn't what I meant. My query looks something like this:

insert overwrite table join_result
select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate, (r.rate - c.cparam*m.mparam)
from mparam m JOIN data r ON (m.mid = r.mid)
JOIN cparam c ON (c.cid = r.cid);

mparam and cparam are two inner tables, which are small enough to be loaded into memory together, whereas data is a really huge table, so I want to avoid the cost of making two passes over it. So I want a plan where mparam and cparam are hashed in memory, and then data is streamed through.

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Fri, Sep 11, 2009 at 2:22 PM, Namit Jain <nj...@facebook.com>> wrote:

That is not true - look at the unit test join26.q





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Friday, September 11, 2009 2:07 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



But this creates a join where each join is performed in a single Map only MR join, which is as good as specifying the query as MAPJOIN(x) followed by another query as MAPJOIN(y) with the result of the previous join. Is there as way to make it pick just one MR job, where both the small inner tables are put in memory, and only one pass is done over the data?

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com>> wrote:

Yes, you can specify the list of tables in the hint





MAPJOIN(x,y,z)











From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Friday, September 11, 2009 1:17 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Is there any performance difference between 0.4 and trunk? I can then temporarily switch to 0.4 while the problem is being fixed.

Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join, where two tables can fit in memory. I tried some intuitive alternatives but did not work. The only option I have now is to have 2 separate JOINs. Since this is a map only join, and the user knows that the inner table(s) will fit in memory, so allowing multiple tables will result in the JOIN being done in a single Map only job, and will also allow Combiners and Reducers to be put in if the query is doing some aggregation on the JOINed result.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com>> wrote:

This is some problem in trunk - it runs fine in 0.4 - I will take a look



From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 6:16 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

The join column is the first column of the data. Here are the queries to recreate the error.

create table mydata (mid INT, cid INT,
rating INT)
stored as SEQUENCEFILE;

create external table mydata_stg(mid INT, cid INT,
rating INT, date1 STRING)
row format delimited fields terminated by '\054' lines terminated by '\012'
stored as textfile
location '/user/sudipto/data/hive/text1';

from mydata_stg stg
insert overwrite table mydata
select stg.mid, stg.cid, stg.rating;

create table param(mid INT, param DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table param
select mid, pow(avg(rating), 0.5)
from mydata
group by mid;

create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com>> wrote:

The data you have sent has 4 columns -



insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);



Which is the join column -





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 3:26 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com>> wrote:

If possible, can you send the entire dataset ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 7:10 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>

Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException
































































































































































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
































































































































































        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
































































































































































        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)


































































































































































































































































































































        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
































































































































































Caused by: java.lang.NullPointerException
































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)


































































































































































































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)


































































































































































































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
































































































































































        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


































































































































































































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
































































































































































        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


































































































































































































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
































































































































































        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
































































































































































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)


































































































































































































































































































































        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>














Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
I checked join26.q. That wasn't what I meant. My query looks something like
this:

insert overwrite table join_result
select /*+ MAPJOIN(c,m)*/ c.cid, m.mid, c.cparam, m.mparam, r.rate, (r.rate
- c.cparam*m.mparam)
from mparam m JOIN data r ON (m.mid = r.mid)
JOIN cparam c ON (c.cid = r.cid);

mparam and cparam are two inner tables, which are small enough to be loaded
into memory together, whereas data is a really huge table, so I want to
avoid the cost of making two passes over it. So I want a plan where mparam
and cparam are hashed in memory, and then data is streamed through.

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 2:22 PM, Namit Jain <nj...@facebook.com> wrote:

>  That is not true – look at the unit test join26.q
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 2:07 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> But this creates a join where each join is performed in a single Map only
> MR join, which is as good as specifying the query as MAPJOIN(x) followed by
> another query as MAPJOIN(y) with the result of the previous join. Is there
> as way to make it pick just one MR job, where both the small inner tables
> are put in memory, and only one pass is done over the data?
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Yes, you can specify the list of tables in the hint
>
>
>
>
>
> MAPJOIN(x,y,z)
>
>
>
>
>
>
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 1:17 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Is there any performance difference between 0.4 and trunk? I can then
> temporarily switch to 0.4 while the problem is being fixed.
>
> Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join,
> where two tables can fit in memory. I tried some intuitive alternatives but
> did not work. The only option I have now is to have 2 separate JOINs. Since
> this is a map only join, and the user knows that the inner table(s) will fit
> in memory, so allowing multiple tables will result in the JOIN being done in
> a single Map only job, and will also allow Combiners and Reducers to be put
> in if the query is doing some aggregation on the JOINed result.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> This is some problem in trunk – it runs fine in 0.4 – I will take a look
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 6:16 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to
> recreate the error.
>
> create table mydata (mid INT, cid INT,
> rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT,
> rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:
>
> The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>
>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
That is not true - look at the unit test join26.q


From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Friday, September 11, 2009 2:07 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

But this creates a join where each join is performed in a single Map only MR join, which is as good as specifying the query as MAPJOIN(x) followed by another query as MAPJOIN(y) with the result of the previous join. Is there as way to make it pick just one MR job, where both the small inner tables are put in memory, and only one pass is done over the data?

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com>> wrote:

Yes, you can specify the list of tables in the hint





MAPJOIN(x,y,z)











From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Friday, September 11, 2009 1:17 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Is there any performance difference between 0.4 and trunk? I can then temporarily switch to 0.4 while the problem is being fixed.

Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join, where two tables can fit in memory. I tried some intuitive alternatives but did not work. The only option I have now is to have 2 separate JOINs. Since this is a map only join, and the user knows that the inner table(s) will fit in memory, so allowing multiple tables will result in the JOIN being done in a single Map only job, and will also allow Combiners and Reducers to be put in if the query is doing some aggregation on the JOINed result.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com>> wrote:

This is some problem in trunk - it runs fine in 0.4 - I will take a look



From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 6:16 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

The join column is the first column of the data. Here are the queries to recreate the error.

create table mydata (mid INT, cid INT,
rating INT)
stored as SEQUENCEFILE;

create external table mydata_stg(mid INT, cid INT,
rating INT, date1 STRING)
row format delimited fields terminated by '\054' lines terminated by '\012'
stored as textfile
location '/user/sudipto/data/hive/text1';

from mydata_stg stg
insert overwrite table mydata
select stg.mid, stg.cid, stg.rating;

create table param(mid INT, param DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table param
select mid, pow(avg(rating), 0.5)
from mydata
group by mid;

create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com>> wrote:

The data you have sent has 4 columns -



insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);



Which is the join column -





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 3:26 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com>> wrote:

If possible, can you send the entire dataset ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 7:10 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>

Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException
















































































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
















































































        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
















































































        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)


































































































































































        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
















































































Caused by: java.lang.NullPointerException
















































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)


































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
















































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
















































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)


































































































































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
















































































        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
















































































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
















































































        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
















































































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


































































































































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
















































































        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
















































































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)


































































































































































        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>












Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
But this creates a join where each join is performed in a single Map only MR
join, which is as good as specifying the query as MAPJOIN(x) followed by
another query as MAPJOIN(y) with the result of the previous join. Is there
as way to make it pick just one MR job, where both the small inner tables
are put in memory, and only one pass is done over the data?

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 1:29 PM, Namit Jain <nj...@facebook.com> wrote:

>  Yes, you can specify the list of tables in the hint
>
>
>
>
>
> MAPJOIN(x,y,z)
>
>
>
>
>
>
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Friday, September 11, 2009 1:17 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Is there any performance difference between 0.4 and trunk? I can then
> temporarily switch to 0.4 while the problem is being fixed.
>
> Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join,
> where two tables can fit in memory. I tried some intuitive alternatives but
> did not work. The only option I have now is to have 2 separate JOINs. Since
> this is a map only join, and the user knows that the inner table(s) will fit
> in memory, so allowing multiple tables will result in the JOIN being done in
> a single Map only job, and will also allow Combiners and Reducers to be put
> in if the query is doing some aggregation on the JOINed result.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> This is some problem in trunk – it runs fine in 0.4 – I will take a look
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 6:16 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to
> recreate the error.
>
> create table mydata (mid INT, cid INT,
> rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT,
> rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:
>
> The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
Yes, you can specify the list of tables in the hint


MAPJOIN(x,y,z)





From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Friday, September 11, 2009 1:17 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Is there any performance difference between 0.4 and trunk? I can then temporarily switch to 0.4 while the problem is being fixed.

Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join, where two tables can fit in memory. I tried some intuitive alternatives but did not work. The only option I have now is to have 2 separate JOINs. Since this is a map only join, and the user knows that the inner table(s) will fit in memory, so allowing multiple tables will result in the JOIN being done in a single Map only job, and will also allow Combiners and Reducers to be put in if the query is doing some aggregation on the JOINed result.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com>> wrote:

This is some problem in trunk - it runs fine in 0.4 - I will take a look



From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 6:16 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

The join column is the first column of the data. Here are the queries to recreate the error.

create table mydata (mid INT, cid INT,
rating INT)
stored as SEQUENCEFILE;

create external table mydata_stg(mid INT, cid INT,
rating INT, date1 STRING)
row format delimited fields terminated by '\054' lines terminated by '\012'
stored as textfile
location '/user/sudipto/data/hive/text1';

from mydata_stg stg
insert overwrite table mydata
select stg.mid, stg.cid, stg.rating;

create table param(mid INT, param DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table param
select mid, pow(avg(rating), 0.5)
from mydata
group by mid;

create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com>> wrote:

The data you have sent has 4 columns -



insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);



Which is the join column -





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 3:26 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com>> wrote:

If possible, can you send the entire dataset ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 7:10 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>

Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException








































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)








































        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)








































        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)


















































































        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)








































Caused by: java.lang.NullPointerException








































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)


















































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)








































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)








































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)


















































































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)








































        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)








































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


















































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)








































        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)








































        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)


















































































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)








































        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)








































        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)


















































































        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>










Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Is there any performance difference between 0.4 and trunk? I can then
temporarily switch to 0.4 while the problem is being fixed.

Also, can I specify two tables in the MAPJOIN hint when doing a 3 way join,
where two tables can fit in memory. I tried some intuitive alternatives but
did not work. The only option I have now is to have 2 separate JOINs. Since
this is a map only join, and the user knows that the inner table(s) will fit
in memory, so allowing multiple tables will result in the JOIN being done in
a single Map only job, and will also allow Combiners and Reducers to be put
in if the query is doing some aggregation on the JOINed result.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Fri, Sep 11, 2009 at 12:12 PM, Namit Jain <nj...@facebook.com> wrote:

>  This is some problem in trunk – it runs fine in 0.4 – I will take a look
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 6:16 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> The join column is the first column of the data. Here are the queries to
> recreate the error.
>
> create table mydata (mid INT, cid INT,
> rating INT)
> stored as SEQUENCEFILE;
>
> create external table mydata_stg(mid INT, cid INT,
> rating INT, date1 STRING)
> row format delimited fields terminated by '\054' lines terminated by '\012'
> stored as textfile
> location '/user/sudipto/data/hive/text1';
>
> from mydata_stg stg
> insert overwrite table mydata
> select stg.mid, stg.cid, stg.rating;
>
> create table param(mid INT, param DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table param
> select mid, pow(avg(rating), 0.5)
> from mydata
> group by mid;
>
> create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
> stored as SEQUENCEFILE;
>
> insert overwrite table result
> select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
> from mydata r JOIN param m ON (r.mid = m.mid);
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:
>
> The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
>
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
This is some problem in trunk - it runs fine in 0.4 - I will take a look

From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Thursday, September 10, 2009 6:16 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi Namit,

The join column is the first column of the data. Here are the queries to recreate the error.

create table mydata (mid INT, cid INT,
rating INT)
stored as SEQUENCEFILE;

create external table mydata_stg(mid INT, cid INT,
rating INT, date1 STRING)
row format delimited fields terminated by '\054' lines terminated by '\012'
stored as textfile
location '/user/sudipto/data/hive/text1';

from mydata_stg stg
insert overwrite table mydata
select stg.mid, stg.cid, stg.rating;

create table param(mid INT, param DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table param
select mid, pow(avg(rating), 0.5)
from mydata
group by mid;

create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com>> wrote:

The data you have sent has 4 columns -



insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);



Which is the join column -





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Thursday, September 10, 2009 3:26 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi Namit,

I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com>> wrote:

If possible, can you send the entire dataset ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 7:10 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>

Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException




















        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)




















        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)




















        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)










































        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)




















Caused by: java.lang.NullPointerException




















        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)










































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)




















        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)




















        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)










































        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)




















        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)




















        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)










































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)




















        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)




















        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)










































        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)




















        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)




















        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)










































        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>








Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Hi Namit,

The join column is the first column of the data. Here are the queries to
recreate the error.

create table mydata (mid INT, cid INT,
rating INT)
stored as SEQUENCEFILE;

create external table mydata_stg(mid INT, cid INT,
rating INT, date1 STRING)
row format delimited fields terminated by '\054' lines terminated by '\012'
stored as textfile
location '/user/sudipto/data/hive/text1';

from mydata_stg stg
insert overwrite table mydata
select stg.mid, stg.cid, stg.rating;

create table param(mid INT, param DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table param
select mid, pow(avg(rating), 0.5)
from mydata
group by mid;

create table result(cid INT, mid INT, param DOUBLE, rating DOUBLE)
stored as SEQUENCEFILE;

insert overwrite table result
select /*+ MAPJOIN(m)*/ r.cid, m.mid, m.param, r.rating
from mydata r JOIN param m ON (r.mid = m.mid);

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Thu, Sep 10, 2009 at 6:05 PM, Namit Jain <nj...@facebook.com> wrote:

>  The data you have sent has 4 columns –
>
>
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
>
>
> Which is the join column -
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Thursday, September 10, 2009 3:26 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi Namit,
>
> I was able to create a smaller data set to replicate the error. Please see
> the attached file. Put the file in HDFS, and run the queries (make sure you
> update the path for the external table to point to directory in HDFS that
> contains this file). I hope you can see the same error. I am using Hadoop
> 0.18.3, and Hive trunk r812721.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:
>
> If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
>
>
> Caused by: java.lang.NullPointerException
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
The data you have sent has 4 columns -

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);


Which is the join column -


From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Thursday, September 10, 2009 3:26 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi Namit,

I was able to create a smaller data set to replicate the error. Please see the attached file. Put the file in HDFS, and run the queries (make sure you update the path for the external table to point to directory in HDFS that contains this file). I hope you can see the same error. I am using Hadoop 0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com>> wrote:

If possible, can you send the entire dataset ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 7:10 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>

Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException










        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)










        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)










        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)






















        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)










Caused by: java.lang.NullPointerException










        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)






















        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)










        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)










        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)






















        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)










        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)










        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)






















        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)










        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)










        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)






















        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)










        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)










        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)






















        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>






Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Hi Namit,

I was able to create a smaller data set to replicate the error. Please see
the attached file. Put the file in HDFS, and run the queries (make sure you
update the path for the external table to point to directory in HDFS that
contains this file). I hope you can see the same error. I am using Hadoop
0.18.3, and Hive trunk r812721.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Wed, Sep 9, 2009 at 7:12 PM, Namit Jain <nj...@facebook.com> wrote:

>  If possible, can you send the entire dataset ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 7:10 PM
>
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> This seems to be one of those sneaky bugs... the query works for the
> smaller sets of data which I tried creating from the original data set. But
> as soon as I move to the entire data set, the error comes back. I tried it
> with another table in my data set, but the error persists. It only vanishes
> for the subset.. I can't think of a good way of creating a test data set. I
> was just taking the first 100 - 1000 rows..
>
> I am attaching all the queries and the explain. Let me know if it helps:
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
>
>
> *To:* hive-user@hadoop.apache.org
>
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Caused by: java.lang.NullPointerException
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
> On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
If possible, can you send the entire dataset ?


From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Wednesday, September 09, 2009 7:10 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi,

This seems to be one of those sneaky bugs... the query works for the smaller sets of data which I tried creating from the original data set. But as soon as I move to the entire data set, the error comes back. I tried it with another table in my data set, but the error persists. It only vanishes for the subset.. I can't think of a good way of creating a test data set. I was just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com>> wrote:

Can you provide a small testcase ?





From: Sudipto Das [mailto:sudiptdas@gmail.com<ma...@gmail.com>]
Sent: Wednesday, September 09, 2009 2:20 PM

To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Directing Hive to perform Hash Join for small inner tables



Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException





        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)





        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)





        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)












        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)





Caused by: java.lang.NullPointerException





        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)












        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)





        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)





        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)












        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)





        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)





        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)












        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)





        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)





        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)












        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)





        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)





        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)












        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>




Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Hi,

This seems to be one of those sneaky bugs... the query works for the smaller
sets of data which I tried creating from the original data set. But as soon
as I move to the entire data set, the error comes back. I tried it with
another table in my data set, but the error persists. It only vanishes for
the subset.. I can't think of a good way of creating a test data set. I was
just taking the first 100 - 1000 rows..

I am attaching all the queries and the explain. Let me know if it helps:

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Wed, Sep 9, 2009 at 6:13 PM, Namit Jain <nj...@facebook.com> wrote:

>  Can you provide a small testcase ?
>
>
>
>
>
> *From:* Sudipto Das [mailto:sudiptdas@gmail.com]
> *Sent:* Wednesday, September 09, 2009 2:20 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> Thanks for the quick response. I tried the query:
>
> insert overwrite table join_result
> select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
> from data r JOIN param m ON (r.mid = m.mid);
>
> param has only 17k rows with 2 columns.
>
> I got this exception
>
> java.lang.RuntimeException
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
>
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>
>
>
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
>
> Caused by: java.lang.NullPointerException
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
>
>         at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
>
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)
>
>
>
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
>
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
>
>         at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)
>
>
>
>         ... 3 more
>
>
> Additionally, the query compiled into two MR jobs. The 2nd one didn't start
> because the first failed, but I couldn't reason about the 2nd job.
>
> I am using Hive trunk, revision 811082 updated on 09/03.
>
> Thanks
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>  On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:
>
> You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>
>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
Can you provide a small testcase ?


From: Sudipto Das [mailto:sudiptdas@gmail.com]
Sent: Wednesday, September 09, 2009 2:20 PM
To: hive-user@hadoop.apache.org
Subject: Re: Directing Hive to perform Hash Join for small inner tables

Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException

        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)

        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)







        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Caused by: java.lang.NullPointerException

        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)







        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)

        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)

        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)







        at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)

        at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)

        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)







        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)

        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)

        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)







        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)

        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)

        at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)







        ... 3 more

Additionally, the query compiled into two MR jobs. The 2nd one didn't start because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto

On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com>> wrote:

You can specify it as a hint in the select list:





select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...





In the example above, T2 is the small table which can be cached in memory









From: sudiptdas@gmail.com<ma...@gmail.com> [mailto:sudiptdas@gmail.com<ma...@gmail.com>] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Directing Hive to perform Hash Join for small inner tables



Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto<http://www.cs.ucsb.edu/%7Esudipto>


Re: Directing Hive to perform Hash Join for small inner tables

Posted by Sudipto Das <su...@gmail.com>.
Hi,

Thanks for the quick response. I tried the query:

insert overwrite table join_result
select /*+ MAPJOIN(m)*/ m.mid, m.param, r.rating
from data r JOIN param m ON (r.mid = m.mid);

param has only 17k rows with 2 columns.

I got this exception

java.lang.RuntimeException
	at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:182)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)

	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:283)

	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:530)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)
	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:519)

	at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:560)
	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:299)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)

	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:42)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:374)

	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:580)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:320)
	at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:165)

	... 3 more


Additionally, the query compiled into two MR jobs. The 2nd one didn't start
because the first failed, but I couldn't reason about the 2nd job.

I am using Hive trunk, revision 811082 updated on 09/03.

Thanks
Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto


On Wed, Sep 9, 2009 at 2:03 PM, Namit Jain <nj...@facebook.com> wrote:

>  You can specify it as a hint in the select list:
>
>
>
>
>
> select /*+ MAPJOIN(b) */  …   from T a JOIN T2 b on …
>
>
>
>
>
> In the example above, T2 is the small table which can be cached in memory
>
>
>
>
>
>
>
>
>
> *From:* sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] *On Behalf Of *Sudipto
> Das
> *Sent:* Wednesday, September 09, 2009 2:01 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Directing Hive to perform Hash Join for small inner tables
>
>
>
> Hi,
>
> I am new to hive so pardon me if this is something very obvious which I
> might have missed in the documentation.
>
> I have an application where I am joining a small inner table with a really
> large outer table. The inner table is small enough to fit into memory at
> each mapper. In such a case, putting the inner table into an in-memory hash
> table and performing a hash based join is much more efficient than
> performing the sort-merge join which the JOIN operator selects. Is there a
> way in Hive where I can instruct it perform the hash based join?
>
> Thanks
>
> Sudipto
>
> PhD Candidate
> CS @ UCSB
> Santa Barbara, CA 93106, USA
> http://www.cs.ucsb.edu/~sudipto <http://www.cs.ucsb.edu/%7Esudipto>
>

RE: Directing Hive to perform Hash Join for small inner tables

Posted by Namit Jain <nj...@facebook.com>.
You can specify it as a hint in the select list:


select /*+ MAPJOIN(b) */  ...   from T a JOIN T2 b on ...


In the example above, T2 is the small table which can be cached in memory




From: sudiptdas@gmail.com [mailto:sudiptdas@gmail.com] On Behalf Of Sudipto Das
Sent: Wednesday, September 09, 2009 2:01 PM
To: hive-user@hadoop.apache.org
Subject: Directing Hive to perform Hash Join for small inner tables

Hi,

I am new to hive so pardon me if this is something very obvious which I might have missed in the documentation.

I have an application where I am joining a small inner table with a really large outer table. The inner table is small enough to fit into memory at each mapper. In such a case, putting the inner table into an in-memory hash table and performing a hash based join is much more efficient than performing the sort-merge join which the JOIN operator selects. Is there a way in Hive where I can instruct it perform the hash based join?

Thanks

Sudipto

PhD Candidate
CS @ UCSB
Santa Barbara, CA 93106, USA
http://www.cs.ucsb.edu/~sudipto