You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Gabi Kazav <Ga...@pursway.com> on 2013/05/29 20:41:10 UTC

Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

Hi,

We are working on hive DB with our Hadoop cluster.
We now facing an issue about joining a big partition with more than 2^31 rows.

When the partition has more than 2147483648 rows (even 2147483649) the output of the join is a single row.
When the partition has less than 2147483648 rows (event 2147483647) the output is correct.
Our test case:

create a table with 2147483649 rows in a partition with the value : "1" , join this table to another table with a single row,single column with the value "1" on the partition_key.
later delete 2 rows and run the same join.
1st : only a single row is created
2nd : 2147483647 rows
the query we run for test the case is:

create table output_rows_over as
select a.s1
from  max_sint_rows a join small_table b
on (a.p1=b.p1);

on more than 2^31 rows we got the following on reducer log:

2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
On less than 2^31 rows we got the following reducer log:

2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647


Anyone faced this issue?
Does hive has workaround for that?
I have huge partitions I need to work on and I cannot use hive for that..

Thanks,


Gabi Kazav
Infrastructure Team Leader, Pursway.com



RE: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

Posted by Gabi Kazav <Ga...@pursway.com>.
Hi,
I took it one step ahead and dropped the partition:

hive> desc max_int;
OK
s1      int
Time taken: 0.238 seconds

hive> desc small_table;
OK
s1      int
Time taken: 0.187 seconds

Max_int has 2^31+1 records, which them all the number 1:
hive> select * from max_int limit 5;
OK
1
1
1
1
1
Time taken: 0.813 seconds

Small_table has 1 record , the number 1:
hive> select * from small_table;
OK
1
Time taken: 0.122 seconds

I am running:
create table output_rows as select a.s1 from max_int a join small_table b on (a.s1=b.s1);

and getting 1 row at the end...:
hive> select count (*) from output_rows;                                        
OK
1
Time taken: 14.892 seconds

if I filling the table max_int with 2^31-1 records of the int 1, I am getting the right data at the end.. (output_rows 2 is the table I created with the join query, on the 2^31-1 record table):
hive> select count (*) from output_rows2;
OK
2147483647
Time taken: 472.449 seconds




-----Original Message-----
From: John Meagher [mailto:john.meagher@gmail.com] 
Sent: Thursday, May 30, 2013 12:06 AM
To: user@hive.apache.org
Subject: Re: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

What is the data type of the p1 column?  I've used hive with partitions containing far above 2 billion rows without having any problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav <Ga...@pursway.com> wrote:
> Hi,
>
>
>
> We are working on hive DB with our Hadoop cluster.
>
> We now facing an issue about joining a big partition with more than 
> 2^31 rows.
>
> When the partition has more than 2147483648 rows (even 2147483649) the 
> output of the join is a single row.
> When the partition has less than 2147483648 rows (event 2147483647) 
> the output is correct.
>
> Our test case:
>
> create a table with 2147483649 rows in a partition with the value : 
> "1" , join this table to another table with a single row,single column 
> with the value "1" on the partition_key.
> later delete 2 rows and run the same join.
> 1st : only a single row is created
> 2nd : 2147483647 rows
>
> the query we run for test the case is:
>
>
>
> create table output_rows_over as
>
> select a.s1
>
> from  max_sint_rows a join small_table b
>
> on (a.p1=b.p1);
>
>
>
> on more than 2^31 rows we got the following on reducer log:
>
> 2013-05-27 21:51:14,186 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
>
> On less than 2^31 rows we got the following reducer log:
>
> 2013-05-27 23:43:14,681 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator:
> TABLE_ID_1_ROWCOUNT:2147483647
>
>
>
>
>
> Anyone faced this issue?
>
> Does hive has workaround for that?
>
> I have huge partitions I need to work on and I cannot use hive for that..
>
>
>
> Thanks,
>
>
>
>
>
> Gabi Kazav
>
> Infrastructure Team Leader, Pursway.com
>
>
>
>

 
 
************************************************************************************
This footnote confirms that this email message has been scanned by PineApp Mail-SeCure for the presence of malicious code, vandals & computer viruses.
************************************************************************************




RE: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

Posted by Gabi Kazav <Ga...@pursway.com>.
Thanks for helping.

Here is some more data:

create table max_sint_rows (s1 string) 
partitioned by (p1 string)
ROW FORMAT DELIMITED
   LINES TERMINATED BY  '\n';

Create table small_table (p1 string)
ROW FORMAT DELIMITED
   LINES TERMINATED BY  '\n';

alter table max_sint_rows add partition (p1="1");



-----Original Message-----
From: John Meagher [mailto:john.meagher@gmail.com] 
Sent: Thursday, May 30, 2013 12:06 AM
To: user@hive.apache.org
Subject: Re: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

What is the data type of the p1 column?  I've used hive with partitions containing far above 2 billion rows without having any problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav <Ga...@pursway.com> wrote:
> Hi,
>
>
>
> We are working on hive DB with our Hadoop cluster.
>
> We now facing an issue about joining a big partition with more than 
> 2^31 rows.
>
> When the partition has more than 2147483648 rows (even 2147483649) the 
> output of the join is a single row.
> When the partition has less than 2147483648 rows (event 2147483647) 
> the output is correct.
>
> Our test case:
>
> create a table with 2147483649 rows in a partition with the value : 
> "1" , join this table to another table with a single row,single column 
> with the value "1" on the partition_key.
> later delete 2 rows and run the same join.
> 1st : only a single row is created
> 2nd : 2147483647 rows
>
> the query we run for test the case is:
>
>
>
> create table output_rows_over as
>
> select a.s1
>
> from  max_sint_rows a join small_table b
>
> on (a.p1=b.p1);
>
>
>
> on more than 2^31 rows we got the following on reducer log:
>
> 2013-05-27 21:51:14,186 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
>
> On less than 2^31 rows we got the following reducer log:
>
> 2013-05-27 23:43:14,681 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator:
> TABLE_ID_1_ROWCOUNT:2147483647
>
>
>
>
>
> Anyone faced this issue?
>
> Does hive has workaround for that?
>
> I have huge partitions I need to work on and I cannot use hive for that..
>
>
>
> Thanks,
>
>
>
>
>
> Gabi Kazav
>
> Infrastructure Team Leader, Pursway.com
>
>
>
>

 
 
************************************************************************************
This footnote confirms that this email message has been scanned by PineApp Mail-SeCure for the presence of malicious code, vandals & computer viruses.
************************************************************************************




Re: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

Posted by John Meagher <jo...@gmail.com>.
What is the data type of the p1 column?  I've used hive with
partitions containing far above 2 billion rows without having any
problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav <Ga...@pursway.com> wrote:
> Hi,
>
>
>
> We are working on hive DB with our Hadoop cluster.
>
> We now facing an issue about joining a big partition with more than 2^31
> rows.
>
> When the partition has more than 2147483648 rows (even 2147483649) the
> output of the join is a single row.
> When the partition has less than 2147483648 rows (event 2147483647) the
> output is correct.
>
> Our test case:
>
> create a table with 2147483649 rows in a partition with the value : "1" ,
> join this table to another table with a single row,single column with the
> value "1" on the partition_key.
> later delete 2 rows and run the same join.
> 1st : only a single row is created
> 2nd : 2147483647 rows
>
> the query we run for test the case is:
>
>
>
> create table output_rows_over as
>
> select a.s1
>
> from  max_sint_rows a join small_table b
>
> on (a.p1=b.p1);
>
>
>
> on more than 2^31 rows we got the following on reducer log:
>
> 2013-05-27 21:51:14,186 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
>
> On less than 2^31 rows we got the following reducer log:
>
> 2013-05-27 23:43:14,681 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator:
> TABLE_ID_1_ROWCOUNT:2147483647
>
>
>
>
>
> Anyone faced this issue?
>
> Does hive has workaround for that?
>
> I have huge partitions I need to work on and I cannot use hive for that..
>
>
>
> Thanks,
>
>
>
>
>
> Gabi Kazav
>
> Infrastructure Team Leader, Pursway.com
>
>
>
>