You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Viral Bajaria <vi...@gmail.com> on 2011/01/18 11:04:54 UTC
partitioned column join does not work as expected
I am facing issues with a query where I am joining two fairly large tables
on the partitioned column along with other common columns. The expected
output is not in line with what I expect it to be. Since the query is very
complex, I will simplify it so that people can provide inputs if they have
faced similar issues or if I am doing something totally wrong.
TABLE A:
a_id bigint
common_id bigint
some_string string
total_count bigint
part_col string <---- this is the partitioned column
TABLE B:
b_int bigint
common_id bigint
some_string string
total_sum bigint
part_col string <---- this is the partitioned column
now the query is as follows:
SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
Does HIVE not like to join on the partitioned columns ? because when i
create a join on just the partitioned column the reduce step never finishes.
I am using HIVE 0.5.0
Thanks,
Viral
Re: partitioned column join does not work as expected
Posted by Viral Bajaria <vi...@gmail.com>.
Thanks again.
I think I figured out the bug (not sure if it's a bug or whether that's
a known limitation when creating a third-level join) .... we need another
table c to re-create my scenario.
table_a
create table table_a(a_id bigint, common_id bigint, int_a int, int_b int,
int_c int, int_d int, string_a string, total_count bigint) partitioned by
(part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE;
table_b
create table table_b(b_id bigint, common_id bigint, int_a int, int_b int,
int_c int, int_d int, string_b string, total_count bigint) partitioned by
(part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE;
table_c
create table table_c(int_c int, string_c string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
query
explain select c.string_c, sum(a.total_count), sum(b.total_count) from
table_a a join table_b b on a.common_id = b.common_id and a.int_c = b.int_c
and a.int_d = b.int_d join table_c c on a.int_c = c.int_c where a.part_col
>= "blah1" and b.part_col >= "blah1" group by c.string_c;
If you look at the query plan for table_b you will see the value expression
does not project the column int_c, and if you look at the join operator in
the query plan it show a join operating between 3 columns from table_a and 2
columns from table_b which is not the intention of the query.
I think hive should output all the columns from table_b which are part of
the join conditions and not look to see if the column is going to be
consumed in the later stages.
Do you think I am not writing the hive query in a right way ? the query
would return results as expected in a mysql or sql-server environment.
Thanks,
Viral
On Wed, Jan 19, 2011 at 11:06 AM, Appan Thirumaligai <
athirumaligai@ngmoco.com> wrote:
> EXPLAIN select
> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
> table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id =
> t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group
> by t1.some_string,t2.some_string;
>
> OK
> ABSTRACT SYNTAX TREE:
> (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF table_a t1) (TOK_TABREF
> table_b t2) (and (= (. (TOK_TABLE_OR_COL t1) part_col) (. (TOK_TABLE_OR_COL
> t2) part_col)) (= (. (TOK_TABLE_OR_COL t1) common_id) (. (TOK_TABLE_OR_COL
> t2) common_id))))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
> (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) some_string)) (TOK_SELEXPR
> (. (TOK_TABLE_OR_COL t2) some_string)) (TOK_SELEXPR (TOK_FUNCTION sum (.
> (TOK_TABLE_OR_COL t1) total_count))) (TOK_SELEXPR (TOK_FUNCTION sum (.
> (TOK_TABLE_OR_COL t2) total_count)))) (TOK_WHERE (and (>= (.
> (TOK_TABLE_OR_COL t1) part_col) 'mypart') (>= (. (TOK_TABLE_OR_COL t2)
> part_col) 'mypart'))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL t1) some_string) (.
> (TOK_TABLE_OR_COL t2) some_string))))
>
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-2 depends on stages: Stage-1
> Stage-0 is a root stage
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t1
> TableScan
> alias: t1
> Filter Operator
> predicate:
> expr: (part_col >= 'mypart')
> type: boolean
> Reduce Output Operator
> key expressions:
> expr: part_col
> type: string
> expr: common_id
> type: bigint
> sort order: ++
> Map-reduce partition columns:
> expr: part_col
> type: string
> expr: common_id
> type: bigint
> tag: 0
> value expressions:
> expr: some_string
> type: string
> expr: total_count
> type: bigint
> expr: part_col
> type: string
> t2
> TableScan
> alias: t2
> Filter Operator
> predicate:
> expr: (part_col >= 'mypart')
> type: boolean
> Reduce Output Operator
> key expressions:
> expr: part_col
> type: string
> expr: common_id
> type: bigint
> sort order: ++
> Map-reduce partition columns:
> expr: part_col
> type: string
> expr: common_id
> type: bigint
> tag: 1
> value expressions:
> expr: some_string
> type: string
> expr: total_count
> type: bigint
> expr: part_col
> type: string
> Reduce Operator Tree:
> Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {VALUE._col2} {VALUE._col3} {VALUE._col4}
> 1 {VALUE._col2} {VALUE._col3} {VALUE._col4}
> handleSkewJoin: false
> outputColumnNames: _col2, _col3, _col4, _col9, _col10, _col11
> Filter Operator
> predicate:
> expr: ((_col4 >= 'mypart') and (_col11 >= 'mypart'))
> type: boolean
> Select Operator
> expressions:
> expr: _col2
> type: string
> expr: _col9
> type: string
> expr: _col3
> type: bigint
> expr: _col10
> type: bigint
> outputColumnNames: _col2, _col9, _col3, _col10
> Group By Operator
> aggregations:
> expr: sum(_col3)
> expr: sum(_col10)
> bucketGroup: false
> keys:
> expr: _col2
> type: string
> expr: _col9
> type: string
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>
> Stage: Stage-2
> Map Reduce
> Alias -> Map Operator Tree:
>
> hdfs://localhost:8022/tmp/hive-training/hive_2011-01-19_11-05-34_526_453408324928472657/-mr-10002
>
> Reduce Output Operator
> key expressions:
> expr: _col0
> type: string
> expr: _col1
> type: string
> sort order: ++
> Map-reduce partition columns:
> expr: _col0
> type: string
> expr: _col1
> type: string
> tag: -1
> value expressions:
> expr: _col2
> type: bigint
> expr: _col3
> type: bigint
> Reduce Operator Tree:
> Group By Operator
> aggregations:
> expr: sum(VALUE._col0)
> expr: sum(VALUE._col1)
> bucketGroup: false
> keys:
> expr: KEY._col0
> type: string
> expr: KEY._col1
> type: string
> mode: mergepartial
> outputColumnNames: _col0, _col1, _col2, _col3
> Select Operator
> expressions:
> expr: _col0
> type: string
> expr: _col1
> type: string
> expr: _col2
> type: bigint
> expr: _col3
> type: bigint
> outputColumnNames: _col0, _col1, _col2, _col3
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
>
> On Jan 19, 2011, at 10:47 AM, Viral Bajaria wrote:
>
> Thanks Appan for verifying. I will do some more tests on my side too and
> let you know the results.
>
> I tried a different version of the query where I join'ed two sub-queries
> for the same partitions and the data comes out to be correct.
>
> I will see if I can post the real-world example to the list, because that
> might sound like a more practical example.
>
> If you still have your example(s) do you mind sending me your query-plan
> for
>
> select
> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
> table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id =
> t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group
> by t1.some_string,t2.some_string;
>
> -Viral
> On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai <
> athirumaligai@ngmoco.com> wrote:
>
>> Viral,
>>
>> I tried the queries below (similar to yours) and I get the expected
>> results when I do the join. I ran my queries after building hive from the
>> latest source and hadoop 0.20+.
>> create table table_a(a_id bigint, common_id bigint, some_string
>> string,total_count bigint) partitioned by (part_col string) ROW FORMAT
>> DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
>> TEXTFILE;
>> create table table_b(b_id bigint, common_id bigint, some_string
>> string,total_count bigint) partitioned by (part_col string) ROW FORMAT
>> DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
>> TEXTFILE;
>> dfs -mkdir /user/data/table_a;
>> dfs -mkdir /user/data/table_b;
>> dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
>> dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
>> alter table table_a add partition (part_col = 'mypart') location
>> '/user/data/table_a';
>> alter table table_b add partition (part_col = 'mypart') location
>> '/user/data/table_b';
>> select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
>> -->> Returns expected result
>> select
>> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
>> table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
>> 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
>> --->>Works fine.
>> select
>> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
>> table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
>> 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st*
>> from table_a t1 join table_b t2 on t1.part_col = t2.part_col where
>> t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
>> --->Works fine.
>>
>> I created the two files with sample data in them and copied it to hdfs
>>
>> I'll try later on your hive 0.5.0 but looks like there might be something
>> wrong in your query.
>>
>> On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:
>>
>> Can you try this with a dummy table with very few rows ... to see if
>> the reason the script doesn't finish is a computational issue?
>>
>> One other thing is to try with a combined partition, to see if it is a
>> problem with the partitioning.
>>
>> Also, take a look at the results of an EXPLAIN statement, see if
>> there are any hints there.
>>
>> NOTE: I'm new to hive too.
>>
>> -Ajo
>>
>>
>> On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <vi...@gmail.com>
>> wrote:
>>
>> I haven't heard back from any on the list and am still struggling to join
>>
>> two tables on partitioned column
>>
>>
>> Has anyone every tried joining two tables on a paritioned column and the
>>
>> results are not as expected ?
>>
>> On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>
>>
>> wrote:
>>
>>
>> I am facing issues with a query where I am joining two fairly large
>> tables
>>
>> on the partitioned column along with other common columns. The expected
>>
>> output is not in line with what I expect it to be. Since the query is
>> very
>>
>> complex, I will simplify it so that people can provide inputs if they
>> have
>>
>> faced similar issues or if I am doing something totally wrong.
>>
>> TABLE A:
>>
>> a_id bigint
>>
>> common_id bigint
>>
>> some_string string
>>
>> total_count bigint
>>
>> part_col string <---- this is the partitioned column
>>
>> TABLE B:
>>
>> b_int bigint
>>
>> common_id bigint
>>
>> some_string string
>>
>> total_sum bigint
>>
>> part_col string <---- this is the partitioned column
>>
>> now the query is as follows:
>>
>> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
>>
>> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
>>
>> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1'
>> AND
>>
>> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>>
>> Does HIVE not like to join on the partitioned columns ? because when i
>>
>> create a join on just the partitioned column the reduce step never
>> finishes.
>>
>> I am using HIVE 0.5.0
>>
>> Thanks,
>>
>> Viral
>>
>>
>>
>> Appan Thirumaligai
>> appan@ngmoco.com
>> Ph:1-818-472-8427
>> ngmoco:)
>>
>>
>
> Appan Thirumaligai
> appan@ngmoco.com
> Ph:1-818-472-8427
> ngmoco:)
>
>
Re: partitioned column join does not work as expected
Posted by Appan Thirumaligai <at...@ngmoco.com>.
EXPLAIN select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF table_a t1) (TOK_TABREF table_b t2) (and (= (. (TOK_TABLE_OR_COL t1) part_col) (. (TOK_TABLE_OR_COL t2) part_col)) (= (. (TOK_TABLE_OR_COL t1) common_id) (. (TOK_TABLE_OR_COL t2) common_id))))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) some_string)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) some_string)) (TOK_SELEXPR (TOK_FUNCTION sum (. (TOK_TABLE_OR_COL t1) total_count))) (TOK_SELEXPR (TOK_FUNCTION sum (. (TOK_TABLE_OR_COL t2) total_count)))) (TOK_WHERE (and (>= (. (TOK_TABLE_OR_COL t1) part_col) 'mypart') (>= (. (TOK_TABLE_OR_COL t2) part_col) 'mypart'))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL t1) some_string) (. (TOK_TABLE_OR_COL t2) some_string))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t1
TableScan
alias: t1
Filter Operator
predicate:
expr: (part_col >= 'mypart')
type: boolean
Reduce Output Operator
key expressions:
expr: part_col
type: string
expr: common_id
type: bigint
sort order: ++
Map-reduce partition columns:
expr: part_col
type: string
expr: common_id
type: bigint
tag: 0
value expressions:
expr: some_string
type: string
expr: total_count
type: bigint
expr: part_col
type: string
t2
TableScan
alias: t2
Filter Operator
predicate:
expr: (part_col >= 'mypart')
type: boolean
Reduce Output Operator
key expressions:
expr: part_col
type: string
expr: common_id
type: bigint
sort order: ++
Map-reduce partition columns:
expr: part_col
type: string
expr: common_id
type: bigint
tag: 1
value expressions:
expr: some_string
type: string
expr: total_count
type: bigint
expr: part_col
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col2} {VALUE._col3} {VALUE._col4}
1 {VALUE._col2} {VALUE._col3} {VALUE._col4}
handleSkewJoin: false
outputColumnNames: _col2, _col3, _col4, _col9, _col10, _col11
Filter Operator
predicate:
expr: ((_col4 >= 'mypart') and (_col11 >= 'mypart'))
type: boolean
Select Operator
expressions:
expr: _col2
type: string
expr: _col9
type: string
expr: _col3
type: bigint
expr: _col10
type: bigint
outputColumnNames: _col2, _col9, _col3, _col10
Group By Operator
aggregations:
expr: sum(_col3)
expr: sum(_col10)
bucketGroup: false
keys:
expr: _col2
type: string
expr: _col9
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://localhost:8022/tmp/hive-training/hive_2011-01-19_11-05-34_526_453408324928472657/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
Map-reduce partition columns:
expr: _col0
type: string
expr: _col1
type: string
tag: -1
value expressions:
expr: _col2
type: bigint
expr: _col3
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
expr: sum(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: string
expr: KEY._col1
type: string
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: bigint
expr: _col3
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
On Jan 19, 2011, at 10:47 AM, Viral Bajaria wrote:
> Thanks Appan for verifying. I will do some more tests on my side too and let you know the results.
>
> I tried a different version of the query where I join'ed two sub-queries for the same partitions and the data comes out to be correct.
>
> I will see if I can post the real-world example to the list, because that might sound like a more practical example.
>
> If you still have your example(s) do you mind sending me your query-plan for
>
> select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
>
> -Viral
> On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai <at...@ngmoco.com> wrote:
> Viral,
>
> I tried the queries below (similar to yours) and I get the expected results when I do the join. I ran my queries after building hive from the latest source and hadoop 0.20+.
> create table table_a(a_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
> create table table_b(b_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
> dfs -mkdir /user/data/table_a;
> dfs -mkdir /user/data/table_b;
> dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
> dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
> alter table table_a add partition (part_col = 'mypart') location '/user/data/table_a';
> alter table table_b add partition (part_col = 'mypart') location '/user/data/table_b';
> select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
> -->> Returns expected result
> select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
> --->>Works fine.
> select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st* from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
> --->Works fine.
>
> I created the two files with sample data in them and copied it to hdfs
>
> I'll try later on your hive 0.5.0 but looks like there might be something wrong in your query.
>
> On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:
>
>> Can you try this with a dummy table with very few rows ... to see if
>> the reason the script doesn't finish is a computational issue?
>>
>> One other thing is to try with a combined partition, to see if it is a
>> problem with the partitioning.
>>
>> Also, take a look at the results of an EXPLAIN statement, see if
>> there are any hints there.
>>
>> NOTE: I'm new to hive too.
>>
>> -Ajo
>>
>>
>> On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <vi...@gmail.com> wrote:
>>> I haven't heard back from any on the list and am still struggling to join
>>> two tables on partitioned column
>>>
>>> Has anyone every tried joining two tables on a paritioned column and the
>>> results are not as expected ?
>>> On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>
>>> wrote:
>>>>
>>>> I am facing issues with a query where I am joining two fairly large tables
>>>> on the partitioned column along with other common columns. The expected
>>>> output is not in line with what I expect it to be. Since the query is very
>>>> complex, I will simplify it so that people can provide inputs if they have
>>>> faced similar issues or if I am doing something totally wrong.
>>>> TABLE A:
>>>> a_id bigint
>>>> common_id bigint
>>>> some_string string
>>>> total_count bigint
>>>> part_col string <---- this is the partitioned column
>>>> TABLE B:
>>>> b_int bigint
>>>> common_id bigint
>>>> some_string string
>>>> total_sum bigint
>>>> part_col string <---- this is the partitioned column
>>>> now the query is as follows:
>>>> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
>>>> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
>>>> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
>>>> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>>>> Does HIVE not like to join on the partitioned columns ? because when i
>>>> create a join on just the partitioned column the reduce step never finishes.
>>>> I am using HIVE 0.5.0
>>>> Thanks,
>>>> Viral
>>>
>
> Appan Thirumaligai
> appan@ngmoco.com
> Ph:1-818-472-8427
> ngmoco:)
>
>
Appan Thirumaligai
appan@ngmoco.com
Ph:1-818-472-8427
ngmoco:)
Re: partitioned column join does not work as expected
Posted by Viral Bajaria <vi...@gmail.com>.
Thanks Appan for verifying. I will do some more tests on my side too and let
you know the results.
I tried a different version of the query where I join'ed two sub-queries for
the same partitions and the data comes out to be correct.
I will see if I can post the real-world example to the list, because that
might sound like a more practical example.
If you still have your example(s) do you mind sending me your query-plan for
select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count)
from table_a t1 join table_b t2 on t1.part_col = t2.part_col and
t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >=
'mypart' group by t1.some_string,t2.some_string;
-Viral
On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai <
athirumaligai@ngmoco.com> wrote:
> Viral,
>
> I tried the queries below (similar to yours) and I get the expected results
> when I do the join. I ran my queries after building hive from the latest
> source and hadoop 0.20+.
> create table table_a(a_id bigint, common_id bigint, some_string
> string,total_count bigint) partitioned by (part_col string) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
> TEXTFILE;
> create table table_b(b_id bigint, common_id bigint, some_string
> string,total_count bigint) partitioned by (part_col string) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
> TEXTFILE;
> dfs -mkdir /user/data/table_a;
> dfs -mkdir /user/data/table_b;
> dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
> dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
> alter table table_a add partition (part_col = 'mypart') location
> '/user/data/table_a';
> alter table table_b add partition (part_col = 'mypart') location
> '/user/data/table_b';
> select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
> -->> Returns expected result
> select
> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
> table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
> 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
> --->>Works fine.
> select
> t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
> table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
> 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st*
> from table_a t1 join table_b t2 on t1.part_col = t2.part_col where
> t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
> --->Works fine.
>
> I created the two files with sample data in them and copied it to hdfs
>
> I'll try later on your hive 0.5.0 but looks like there might be something
> wrong in your query.
>
> On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:
>
> Can you try this with a dummy table with very few rows ... to see if
> the reason the script doesn't finish is a computational issue?
>
> One other thing is to try with a combined partition, to see if it is a
> problem with the partitioning.
>
> Also, take a look at the results of an EXPLAIN statement, see if
> there are any hints there.
>
> NOTE: I'm new to hive too.
>
> -Ajo
>
>
> On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <vi...@gmail.com>
> wrote:
>
> I haven't heard back from any on the list and am still struggling to join
>
> two tables on partitioned column
>
>
> Has anyone every tried joining two tables on a paritioned column and the
>
> results are not as expected ?
>
> On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>
>
> wrote:
>
>
> I am facing issues with a query where I am joining two fairly large
> tables
>
> on the partitioned column along with other common columns. The expected
>
> output is not in line with what I expect it to be. Since the query is
> very
>
> complex, I will simplify it so that people can provide inputs if they
> have
>
> faced similar issues or if I am doing something totally wrong.
>
> TABLE A:
>
> a_id bigint
>
> common_id bigint
>
> some_string string
>
> total_count bigint
>
> part_col string <---- this is the partitioned column
>
> TABLE B:
>
> b_int bigint
>
> common_id bigint
>
> some_string string
>
> total_sum bigint
>
> part_col string <---- this is the partitioned column
>
> now the query is as follows:
>
> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
>
> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
>
> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1'
> AND
>
> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>
> Does HIVE not like to join on the partitioned columns ? because when i
>
> create a join on just the partitioned column the reduce step never
> finishes.
>
> I am using HIVE 0.5.0
>
> Thanks,
>
> Viral
>
>
>
> Appan Thirumaligai
> appan@ngmoco.com
> Ph:1-818-472-8427
> ngmoco:)
>
>
Re: partitioned column join does not work as expected
Posted by Appan Thirumaligai <at...@ngmoco.com>.
Viral,
I tried the queries below (similar to yours) and I get the expected results when I do the join. I ran my queries after building hive from the latest source and hadoop 0.20+.
create table table_a(a_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
create table table_b(b_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
dfs -mkdir /user/data/table_a;
dfs -mkdir /user/data/table_b;
dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
alter table table_a add partition (part_col = 'mypart') location '/user/data/table_a';
alter table table_b add partition (part_col = 'mypart') location '/user/data/table_b';
select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
-->> Returns expected result
select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
--->>Works fine.
select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st* from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
--->Works fine.
I created the two files with sample data in them and copied it to hdfs
I'll try later on your hive 0.5.0 but looks like there might be something wrong in your query.
On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:
> Can you try this with a dummy table with very few rows ... to see if
> the reason the script doesn't finish is a computational issue?
>
> One other thing is to try with a combined partition, to see if it is a
> problem with the partitioning.
>
> Also, take a look at the results of an EXPLAIN statement, see if
> there are any hints there.
>
> NOTE: I'm new to hive too.
>
> -Ajo
>
>
> On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <vi...@gmail.com> wrote:
>> I haven't heard back from any on the list and am still struggling to join
>> two tables on partitioned column
>>
>> Has anyone every tried joining two tables on a paritioned column and the
>> results are not as expected ?
>> On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>
>> wrote:
>>>
>>> I am facing issues with a query where I am joining two fairly large tables
>>> on the partitioned column along with other common columns. The expected
>>> output is not in line with what I expect it to be. Since the query is very
>>> complex, I will simplify it so that people can provide inputs if they have
>>> faced similar issues or if I am doing something totally wrong.
>>> TABLE A:
>>> a_id bigint
>>> common_id bigint
>>> some_string string
>>> total_count bigint
>>> part_col string <---- this is the partitioned column
>>> TABLE B:
>>> b_int bigint
>>> common_id bigint
>>> some_string string
>>> total_sum bigint
>>> part_col string <---- this is the partitioned column
>>> now the query is as follows:
>>> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
>>> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
>>> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
>>> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>>> Does HIVE not like to join on the partitioned columns ? because when i
>>> create a join on just the partitioned column the reduce step never finishes.
>>> I am using HIVE 0.5.0
>>> Thanks,
>>> Viral
>>
Appan Thirumaligai
appan@ngmoco.com
Ph:1-818-472-8427
ngmoco:)
Re: partitioned column join does not work as expected
Posted by Ajo Fod <aj...@gmail.com>.
Can you try this with a dummy table with very few rows ... to see if
the reason the script doesn't finish is a computational issue?
One other thing is to try with a combined partition, to see if it is a
problem with the partitioning.
Also, take a look at the results of an EXPLAIN statement, see if
there are any hints there.
NOTE: I'm new to hive too.
-Ajo
On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <vi...@gmail.com> wrote:
> I haven't heard back from any on the list and am still struggling to join
> two tables on partitioned column
>
> Has anyone every tried joining two tables on a paritioned column and the
> results are not as expected ?
> On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>
> wrote:
>>
>> I am facing issues with a query where I am joining two fairly large tables
>> on the partitioned column along with other common columns. The expected
>> output is not in line with what I expect it to be. Since the query is very
>> complex, I will simplify it so that people can provide inputs if they have
>> faced similar issues or if I am doing something totally wrong.
>> TABLE A:
>> a_id bigint
>> common_id bigint
>> some_string string
>> total_count bigint
>> part_col string <---- this is the partitioned column
>> TABLE B:
>> b_int bigint
>> common_id bigint
>> some_string string
>> total_sum bigint
>> part_col string <---- this is the partitioned column
>> now the query is as follows:
>> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
>> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
>> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
>> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>> Does HIVE not like to join on the partitioned columns ? because when i
>> create a join on just the partitioned column the reduce step never finishes.
>> I am using HIVE 0.5.0
>> Thanks,
>> Viral
>
Re: partitioned column join does not work as expected
Posted by Viral Bajaria <vi...@gmail.com>.
I haven't heard back from any on the list and am still struggling to join
two tables on partitioned column
Has anyone every tried joining two tables on a paritioned column and the
results are not as expected ?
On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <vi...@gmail.com>wrote:
> I am facing issues with a query where I am joining two fairly large tables
> on the partitioned column along with other common columns. The expected
> output is not in line with what I expect it to be. Since the query is very
> complex, I will simplify it so that people can provide inputs if they have
> faced similar issues or if I am doing something totally wrong.
>
> TABLE A:
> a_id bigint
> common_id bigint
> some_string string
> total_count bigint
> part_col string <---- this is the partitioned column
>
> TABLE B:
> b_int bigint
> common_id bigint
> some_string string
> total_sum bigint
> part_col string <---- this is the partitioned column
>
> now the query is as follows:
> SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
> sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
> t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
> t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
>
> Does HIVE not like to join on the partitioned columns ? because when i
> create a join on just the partitioned column the reduce step never finishes.
>
> I am using HIVE 0.5.0
>
> Thanks,
> Viral
>