You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by "Avdeev V. M." <ls...@list.ru> on 2012/05/28 07:17:07 UTC

table design and performance questions

Question from novice.

Where I can read table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't find out the way to get optimal design of both kind of tables. Is there performance tuning guides or performance FAQ?

Specifically
1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using these statements make sense?
2) DDL language manual says 'This can improve performance on certain kinds of queries.' about CLUSTERED BY statement. What kind of queries can be improved?
3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of performance? What aspects should be taken into account when choosing a file format?
4) Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, known to give better performance that uncompressed storage;' and again - What is these cases? 

Thanks!
Vyacheslav

RE: Re[2]: table design and performance questions

Posted by ra...@accenture.com.

Hi Avdeev,

  in response to question 2. Bucketing is a second level of data division on the table. You can tell Hive that your data should be distributed over several files, inside your table of partition, using some fields as distribution key. So you can have partitions, what means your files are inside a directory structure, and buckets that split your data in distinct files.

See an example. You have the following data:


   |id     |country
   | 1     | US
   | 2     | ES
   | 3     | ES
   | 4     | ES
   | 5     | US
   | 6     | US

You create the table as:

CREATE TABLE data (id int)
PARTITIONED BY (country string)
CLUSTERED BY (id) INTO 2 BUCKETS;

Then the structure of data will be (bucket filenames don't have any meaning just a sample):

   data/country=ES/bucket.0
   data/country=ES/bucket.1
   data/country=US/bucket.0
   data/country=US/bucket.1

Then bucket.0 should contain registers with id 1,3,5... while bucket.1 should contain 2,4,6... This can be helpful for sampling (you don't what to read all registers but just 50% of them in this sample). In that scenario you only need to read bucket.0 on each partition. Also can be useful for joins. If both tables have the same number of buckets (or an exact multiple) then the join can be done in a bucket by bucket fashion instead of a full tables join. Looking for a given id also can be optimized in this situation because you know which buckets to discard.

There is another optional clause that tells a bucket should by ordered by cluster key: CLUSTER BY (id) SORTED BY (id ASC) INTO 2 BUCKETS. This is an extra optimization for joins because now not only can be done in a bucket by bucket fashion. It also can be implemented as a merge sort.

Hope this was useful,
   Ramón Pin


-----Original Message-----
From: Avdeev V. M. [mailto:lsoft@list.ru]
Sent: sábado, 02 de junio de 2012 18:43
To: ruben.devries@hyves.nl
Cc: user@hive.apache.org
Subject: Re[2]: table design and performance questions

Thank for the information Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642
does it mean that MAPJOIN hint is obsolete since 2010 and I can avoid this hint absolutely?

2. sorry for stupid questions, but I can't understand bucketing still. partitioning is ok, it is hdfs folders and I able to understand how it improve query execution. but what is bucketing in terms of storing data?

3. I embarrassed to ask such stupid questions, but is there 'how hive works' manual or something like?

And again  - sorry for bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 от Ruben de Vries <ru...@hyves.nl>:
> Partitioning can greatly increase performance for WHERE clauses since hive can omit parsing the data in the partitions which do no meet the requirement.
>
> For example if you partition by date (I do it by INT dateint, in which case I set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it won't even have to touch any of the data from before 2012-01-01 and in my case that means I don't parse the last 2 years of data, reducing the time the query takes by about 70% :-)
>
>
>
> Buckets are the second awesome way of getting a big optimization in, specifically for joins! If you have 2 tables you're joining onto each other then if they're both bucketed on their join column it will also greatly increase speed.
>
> Another good join optimization is MAPJOIN, if one of the tables you're joining is rather small (below 30mb) then you can force it to MAPJOIN or you can enable automatic mapjoin, I personally prefere explicit behavory instead of automagic so use a hint:
>
> SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, etc.
>
> Sorted by is for sorting within buckets, only relevant if you're doing a lot of ordering I think.
>
>
>
> I'm assuming sequencefiles are faster, but I wouldn't really know :( need someone else to tell us more about that ;)
>
>
>
>
>
> -----Original Message-----
>
> From: Avdeev V. M. [mailto:lsoft@list.ru]
>
> Sent: Monday, May 28, 2012 7:17 AM
>
> To: user@hive.apache.org
>
> Subject: table design and performance questions
>
>
>
> Question from novice.
>
>
>
> Where I can read table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't find out the way to get optimal design of both kind of tables. Is there performance tuning guides or performance FAQ?
>
>
>
> Specifically
>
> 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using these statements make sense?
>
> 2) DDL language manual says 'This can improve performance on certain kinds of queries.' about CLUSTERED BY statement. What kind of queries can be improved?
>
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of performance? What aspects should be taken into account when choosing a file format?
>
> 4) Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, known to give better performance that uncompressed storage;' and again - What is these cases?
>
>
>
> Thanks!
>
> Vyacheslav
>
>

________________________________
Subject to local law, communications with Accenture and its affiliates including telephone calls and emails (including content), may be monitored by our systems for the purposes of security and the assessment of internal compliance with Accenture policy.
______________________________________________________________________________________

www.accenture.com

Re[2]: table design and performance questions

Posted by "Avdeev V. M." <ls...@list.ru>.

Thank for the information Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642
does it mean that MAPJOIN hint is obsolete since 2010 and I can avoid this hint absolutely?

2. sorry for stupid questions, but I can't understand bucketing still. partitioning is ok, it is hdfs folders and I able to understand how it improve query execution. but what is bucketing in terms of storing data?

3. I embarrassed to ask such stupid questions, but is there 'how hive works' manual or something like?

And again  - sorry for bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 от Ruben de Vries <ru...@hyves.nl>:
> Partitioning can greatly increase performance for WHERE clauses since hive can omit parsing the data in the partitions which do no meet the requirement.
> 
> For example if you partition by date (I do it by INT dateint, in which case I set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it won't even have to touch any of the data from before 2012-01-01 and in my case that means I don't parse the last 2 years of data, reducing the time the query takes by about 70% :-)
> 
> 
> 
> Buckets are the second awesome way of getting a big optimization in, specifically for joins! If you have 2 tables you're joining onto each other then if they're both bucketed on their join column it will also greatly increase speed.
> 
> Another good join optimization is MAPJOIN, if one of the tables you're joining is rather small (below 30mb) then you can force it to MAPJOIN or you can enable automatic mapjoin, I personally prefere explicit behavory instead of automagic so use a hint:
> 
> SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, etc.
> 
> Sorted by is for sorting within buckets, only relevant if you're doing a lot of ordering I think.
> 
> 
> 
> I'm assuming sequencefiles are faster, but I wouldn't really know :( need someone else to tell us more about that ;)
> 
> 
> 
> 
> 
> -----Original Message-----
> 
> From: Avdeev V. M. [mailto:lsoft@list.ru] 
> 
> Sent: Monday, May 28, 2012 7:17 AM
> 
> To: user@hive.apache.org
> 
> Subject: table design and performance questions
> 
> 
> 
> Question from novice.
> 
> 
> 
> Where I can read table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't find out the way to get optimal design of both kind of tables. Is there performance tuning guides or performance FAQ?
> 
> 
> 
> Specifically
> 
> 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using these statements make sense?
> 
> 2) DDL language manual says 'This can improve performance on certain kinds of queries.' about CLUSTERED BY statement. What kind of queries can be improved?
> 
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of performance? What aspects should be taken into account when choosing a file format?
> 
> 4) Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, known to give better performance that uncompressed storage;' and again - What is these cases? 
> 
> 
> 
> Thanks!
> 
> Vyacheslav
> 
>

RE: table design and performance questions

Posted by Ruben de Vries <ru...@hyves.nl>.

Partitioning can greatly increase performance for WHERE clauses since hive can omit parsing the data in the partitions which do no meet the requirement.
For example if you partition by date (I do it by INT dateint, in which case I set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it won't even have to touch any of the data from before 2012-01-01 and in my case that means I don't parse the last 2 years of data, reducing the time the query takes by about 70% :-)

Buckets are the second awesome way of getting a big optimization in, specifically for joins! If you have 2 tables you're joining onto each other then if they're both bucketed on their join column it will also greatly increase speed.
Another good join optimization is MAPJOIN, if one of the tables you're joining is rather small (below 30mb) then you can force it to MAPJOIN or you can enable automatic mapjoin, I personally prefere explicit behavory instead of automagic so use a hint:
SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, etc.
Sorted by is for sorting within buckets, only relevant if you're doing a lot of ordering I think.

I'm assuming sequencefiles are faster, but I wouldn't really know :( need someone else to tell us more about that ;)


-----Original Message-----
From: Avdeev V. M. [mailto:lsoft@list.ru] 
Sent: Monday, May 28, 2012 7:17 AM
To: user@hive.apache.org
Subject: table design and performance questions

Question from novice.

Where I can read table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't find out the way to get optimal design of both kind of tables. Is there performance tuning guides or performance FAQ?

Specifically
1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using these statements make sense?
2) DDL language manual says 'This can improve performance on certain kinds of queries.' about CLUSTERED BY statement. What kind of queries can be improved?
3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of performance? What aspects should be taken into account when choosing a file format?
4) Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, known to give better performance that uncompressed storage;' and again - What is these cases? 

Thanks!
Vyacheslav