Posted to user@hive.apache.org by javateck javateck <ja...@gmail.com> on 2009/04/15 22:00:15 UTC

hive performance

Hi,
  I want to check: if hive data in a table grows huge (for example, to
200GB), does anybody see the mapreduce performance degrade a lot? I have not
isolated the factors yet, but wanted to check here first.

  thanks,

Re: hive performance

Posted by Raghu Murthy <rm...@facebook.com>.
You don't need to specify the partition predicate in the sum if you are
running one query per partition -- using it in the where clause is just
fine. Can you provide a complete sample query for one partition? Also, it
would be good to know about your hadoop installation, like number of nodes
and amount of memory per node.
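
For example (a sketch reusing the column names from the query below, with a
placeholder table name), the partition predicate in the WHERE clause alone is
enough for hive to prune the scan down to that one partition:

-- <tbl_name> is a placeholder; only the dt='2009-04-14-02' partition is read
SELECT sum(if(col1 = 12 AND col2 = 11, 1, 0)),
       sum(if(col3 = 33 AND col4 = 22, 1, 0))
FROM <tbl_name>
WHERE dt = '2009-04-14-02';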

On 4/15/09 4:31 PM, "javateck javateck" <ja...@gmail.com> wrote:

> my query is running on one partition at a time, which could have up to 250
> 65MB files; my hadoop block size is 65MB, so not that bad, I guess. I use
> quite a lot of group by, join, sort by, etc.
> 
> I'm looking into my queries. One question, though: when I do a sum query,
> do I need to put the partition in the sum also? One query is like:
> 
> SELECT sum(if(col1=12 AND col2=11, 1, 0)), sum(if(col3=33 AND col4=22, 1, 0))
> FROM <tbl_name> WHERE dt='2009-04-14-02'
> 
> I think putting the partition in the where clause should be sufficient; I'm
> not sure if I need to put the partition in sum to prevent a whole-table scan.
> 
> thanks


Re: hive performance

Posted by javateck javateck <ja...@gmail.com>.
my query is running on one partition at a time, which could have up to 250
65MB files; my hadoop block size is 65MB, so not that bad, I guess. I use
quite a lot of group by, join, sort by, etc.

I'm looking into my queries. One question, though: when I do a sum query, do
I need to put the partition in the sum also? One query is like:

SELECT sum(if(col1=12 AND col2=11, 1, 0)), sum(if(col3=33 AND col4=22, 1, 0))
FROM <tbl_name> WHERE dt='2009-04-14-02'

(<tbl_name> stands in for my table name.) I think putting the partition in
the where clause should be sufficient; I'm not sure if I need to put the
partition in sum to prevent a whole-table scan.
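
One way to double-check (a sketch; depending on the hive version, EXPLAIN or
EXPLAIN EXTENDED prints the plan, which should show only the files under the
dt='2009-04-14-02' partition as inputs):

-- <tbl_name> is still a placeholder
EXPLAIN
SELECT sum(if(col1=12 AND col2=11, 1, 0)), sum(if(col3=33 AND col4=22, 1, 0))
FROM <tbl_name> WHERE dt='2009-04-14-02';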

thanks

On Wed, Apr 15, 2009 at 3:19 PM, Ashish Thusoo <at...@facebook.com> wrote:

> We know that map/reduce slows down when there are a lot of small files. How
> many partitions are you running this query on? Each partition is holding just
> 65 MB of data, which may be smaller than the hadoop block size (we use 128MB),
> and you will have a lot of internal fragmentation. Also, what is the nature of
> your query? Are you doing a join, a group by, or just a filter?
>
> Ashish

RE: hive performance

Posted by Ashish Thusoo <at...@facebook.com>.
We know that map/reduce slows down when there are a lot of small files. How many partitions are you running this query on? Each partition is holding just 65 MB of data, which may be smaller than the hadoop block size (we use 128MB), and you will have a lot of internal fragmentation. Also, what is the nature of your query? Are you doing a join, a group by, or just a filter?

Ashish


Re: hive performance

Posted by javateck javateck <ja...@gmail.com>.
really appreciate Zheng and Prasad's explanation. One important variation I
forgot to mention: since my data is random, it can produce intermediate data
of different sizes, and I think the mapreduce speed varies accordingly. In my
case, running my 25 queries over 10GB (about 55 million records) takes
anywhere from 2 to 4 hours; most of my queries use quite a few joins, group
bys, and sort bys, and some subqueries as well.

On Wed, Apr 15, 2009 at 3:06 PM, Prasad Chakka <pc...@facebook.com> wrote:

>  I highly doubt this is the case, but if there are too many partitions in
> the metadata db, the db query that gets all the partitions for a table slows
> down. You might want to check whether this is causing the issue by running
> `show partitions <tbl_name>`. Of course, this won't take more than a few
> seconds in any case, so it may not matter if the data in a partition is very
> large.

Re: hive performance

Posted by Prasad Chakka <pc...@facebook.com>.
I highly doubt this is the case, but if there are too many partitions in the
metadata db, the db query that gets all the partitions for a table slows
down. You might want to check whether this is causing the issue by running
`show partitions <tbl_name>`. Of course, this won't take more than a few
seconds in any case, so it may not matter if the data in a partition is very
large.

Re: hive performance

Posted by Zheng Shao <zs...@gmail.com>.
Hi Javateck,

In the compilation stage, Hive does fetch all partition names and tests
whether each one passes the WHERE condition.
That part of the time is linear in the number of partitions, although each
individual test should take very little time.

Otherwise, there is no difference (as you expected, Hive just submits the
files in the matching partitions to hadoop).
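
For example (a sketch with made-up names): with a table partitioned by the
hourly dt string, only the files under the matching partition directory are
submitted to the map/reduce job:

CREATE TABLE raw_logs (col1 INT, col2 INT, col3 INT, col4 INT)
PARTITIONED BY (dt STRING);

-- only files under the dt='2009-04-15-09' directory become map inputs
SELECT count(1) FROM raw_logs WHERE dt = '2009-04-15-09';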

Zheng

On Wed, Apr 15, 2009 at 2:01 PM, javateck javateck <ja...@gmail.com> wrote:

> in my design, it's like the following:
> 1. for every hour, I generate an hourly data set and load it into a hive
> table, around 3GB per hour at peak time, so I break it into 65MB chunks; my
> partitions look like 2009-04-15-09 and so on, 24 partitions per day, so 8760
> partitions for one year
> 2. we need to keep up to 1 year of raw data, and I'll run around 25 queries
> on an hourly basis; of course, a run could take more than one hour depending
> on the data size, but I hope it can catch up during off-peak time
> 3. currently I'm using jdbc to connect to the hive standalone server; it
> doesn't support multi-threading yet, but that should be there soon (a few
> weeks, from what I read on the forum), so I need to run the queries
> sequentially for now and will switch to multi-threading later on
>
> I'm doing stress testing now, and sometimes it runs faster and sometimes
> slower. Right now I have around 15 partitions, and it runs much slower than
> with just a few partitions. I have made some code changes in between, but
> they should not affect this, since loading data into hadoop and running hive
> queries are separate. I need to look into why it's getting slower.
>
> thanks,

-- 
Yours,
Zheng

Re: hive performance

Posted by javateck javateck <ja...@gmail.com>.
in my case, I have many input files, and each record has a timestamp
associated with it; the timestamps in one input file can belong to different
days and hours, so I need to look at each timestamp and put the record into
the appropriate partition (for example, 2009-04-15-09 means the partition
for hour 09 on 4/15/2009). We need to aggregate the hourly data, which is a
business requirement.

The way I'm writing to hadoop is to write files locally first, and when a
file reaches 65MB I flush it to hadoop in one shot; I don't want to use
files that are too big, because writes to hadoop fail more often with bigger
files.

And I need to kick off the 25 queries on the hourly data. Currently I think
the good approach is to put the records into hourly buckets, since I need to
run an hourly job: the hourly data is used by all 25 queries, and a backlog
will trigger the hourly mapreduce on the older data sets as well.
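
Each of the 25 hourly queries then targets a single partition; a minimal
sketch (made-up column names, placeholder table name) of one such hourly
aggregation:

-- one hourly rollup over a single partition
SELECT col1, count(1)
FROM <tbl_name>
WHERE dt = '2009-04-15-09'
GROUP BY col1;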

On Wed, Apr 15, 2009 at 2:33 PM, Stephen Corona <sc...@adknowledge.com> wrote:

> Why do you break the hourly log files up into chunks? HDFS already does
> this for you.
>
> Steve

RE: hive performance

Posted by Stephen Corona <sc...@adknowledge.com>.
Why do you break the hourly log files up into chunks? HDFS already does this for you. 

Steve

Re: hive performance

Posted by javateck javateck <ja...@gmail.com>.
in my design, it's like the following:
1. for every hour, I generate an hourly data set and load it into a hive
table, around 3GB per hour at peak time, so I break it into 65MB chunks; my
partitions look like 2009-04-15-09 and so on, 24 partitions per day, so 8760
partitions for one year (the load step is sketched below)
2. we need to keep up to 1 year of raw data, and I'll run around 25 queries
on an hourly basis; of course, a run could take more than one hour depending
on the data size, but I hope it can catch up during off-peak time
3. currently I'm using jdbc to connect to the hive standalone server; it
doesn't support multi-threading yet, but that should be there soon (a few
weeks, from what I read on the forum), so I need to run the queries
sequentially for now and will switch to multi-threading later on
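
A minimal sketch of the hourly load in step 1 (the staging path is made up,
and <tbl_name> is a placeholder; all of the hour's 65MB chunk files sit
under one staging directory):

-- move the hour's chunk files into the matching hourly partition
LOAD DATA INPATH '/staging/2009-04-15-09'
INTO TABLE <tbl_name> PARTITION (dt = '2009-04-15-09');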

I'm doing stress testing now, and sometimes it runs faster and sometimes
slower. Right now I have around 15 partitions, and it runs much slower than
with just a few partitions. I have made some code changes in between, but
they should not affect this, since loading data into hadoop and running hive
queries are separate. I need to look into why it's getting slower.

thanks,

On Wed, Apr 15, 2009 at 1:42 PM, Stephen Corona <sc...@adknowledge.com> wrote:

> I am using Hive.
>
> How many partitions do you have? In my setup, I am using partitions as
> well. Each partition has 24 files, about 500MB each (so ~12GB per partition)
>
> Steve

RE: hive performance

Posted by Stephen Corona <sc...@adknowledge.com>.
I am using Hive.

How many partitions do you have? In my setup, I am using partitions as well. Each partition has 24 files, about 500MB each (so ~12GB per partition).

Steve

Re: hive performance

Posted by javateck javateck <ja...@gmail.com>.
thanks, Stephen, are you directly using hadoop or using hive?

I did not make the question clear in my last email: I have hive partitions,
and each partition has around 100 files of 65MB each. When I query, I query
a specific partition. Previously, with fewer partitions, it ran faster, but
as the number of partitions grows, queries are taking longer. I don't think
partitioning itself is playing a role here, since for the mapreduce job I
guess hive just submits the specific partition to hadoop. I still need to
look into other possible areas, but I wanted to ask the forum quickly to see
if anyone else has seen a similar situation and could shed some light on it.

On Wed, Apr 15, 2009 at 1:09 PM, Stephen Corona <sc...@adknowledge.com> wrote:

> Hi,
>
> I'm not sure what kind of performance numbers you are looking for, but I
> figured I would toss in a data point:
>
> On a 10-node large EC2 cluster w/ EBS volumes, it takes me about 10 minutes
> to crunch through 300GB of CSV data  (120 million records).
>
> DFS Replication = 1
> Block Size = 128MB
> Max Mappers = 60

RE: hive performance

Posted by Stephen Corona <sc...@adknowledge.com>.
Hi,

I'm not sure what kind of performance numbers you are looking for, but I figured I would toss in a data point:

On a 10-node large EC2 cluster w/ EBS volumes, it takes me about 10 minutes to crunch through 300GB of CSV data (120 million records).

DFS Replication = 1
Block Size = 128MB
Max Mappers = 60