Posted to user@hive.apache.org by Ashish Thusoo <at...@facebook.com> on 2009/04/01 15:55:55 UTC

RE: Hive for RRA Data

Man..

I think you have too little data. With this approach you will end up with lots of small files, and the fragmentation itself (considering that you will not be able to fill even one block per partition) will kill performance on the cluster.

There are 2 things that you could do to alleviate this problem to some extent:

1. You could just slap an external table on a directory that contains the data in HDFS, and you could use Hadoop appends (appends have been there since 0.18, I think) to update a file in this directory that holds all of the data. You can then use Hive as usual on this data.
2. You could change your loading process to run a Hive UNION ALL that creates a file with the merged data, and then drop and recreate the original table. We could also add a swap command to Hive (similar to what Oracle does with swap partitions) to make this better too... (rough sketches of both options are below)
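
For example, the two options might look roughly like this in HiveQL (the table, column, and path names are made up purely for illustration):

    -- Option 1: an external table over a directory that is appended to directly in HDFS
    CREATE EXTERNAL TABLE rra_metrics (
      row_time BIGINT,
      data_template_data_id INT,
      hdd_used DOUBLE,
      hdd_free DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
    LOCATION '/user/hive/external/rra_metrics';

    -- Option 2: merge the existing data with the latest small load into a fresh table,
    -- then drop the fragmented table and swap the merged one in
    INSERT OVERWRITE TABLE rra_metrics_merged
    SELECT * FROM (
      SELECT * FROM rra_metrics_current
      UNION ALL
      SELECT * FROM rra_metrics_daily
    ) merged;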

But the key is to keep your blocks full; otherwise, fragmentation will really slow down your queries.

Ashish
________________________________________
From: Edward Capriolo [edlinuxguru@gmail.com]
Sent: Tuesday, March 31, 2009 9:06 AM
To: hive-user@hadoop.apache.org
Subject: Re: Hive for RRA Data

I made some progress with this. Rather than going with the
anti-pattern of storing a column name as a column, I decided to create
a table for each 'cacti data template' type. After extracting the
columns from the RRD file I do:


    hql.append("CREATE TABLE IF NOT EXISTS "+schema.getTableName() +" (" );
    hql.append(" row_time BIGINT, ");
    Iterator i = schema.getColumns().iterator();
    while  (i.hasNext()){
      hql.append(" "+i.next()+" DOUBLE ");
      if (i.hasNext()){
        hql.append(" , ");
      }
    }
    hql.append(" ) ");
    hql.append( " PARTITIONED BY ( day STRING, data_template_data_id INT) ");
    hql.append( " ROW FORMAT DELIMITED " );
    hql.append( " FIELDS TERMINATED BY '\\054' " );
    hql.append( " LINES TERMINATED BY '\\012' ");

My data looks like: (without the column header)

file: server1_harddrive.rrd

time, hdd_used, hdd_free
1234840, 56,90
1235840, 54,92

I write the data to Hadoop and then load it into the table with 'load data inpath' (a sketch of the load step is below).
The Good news:
I am partitioned by day and data_template_data_id (the data id). This
makes a query for specific data fast, and it is easy to count and group.
The Bad news:
Each partition/RRD file is about 4 KB-8 KB. I have about 400 devices
and 9000 data sources. My block size is 128 MB, so the data-per-block ratio is poor.
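
A minimal sketch of that load step, with a made-up staging path, date, and id:

    LOAD DATA INPATH '/staging/harddrive/2009-03-31/server1_harddrive.csv'
    OVERWRITE INTO TABLE harddrive
    PARTITION (day='2009-03-31', data_template_data_id=6);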

I was thinking of rolling this data into months, but after doing some
high-level math, even a year's worth of data would only be about 2 MB.

I considered making 'data_template_data_id' a column rather than a
partition. That has the same result, really. For my deployment, each
of the 300 devices has one or two hard drives. One day of data for all
three hundred hard drives still does not fill a block. (If you
had 3,000 or 30,000 devices you might be able to fill a block.)

I am considering no partitions. Then the lack of append in Hadoop becomes an
issue. My process is going to be kicked off nightly, resulting in
daily files, which is more or less the same as partitioning by day. Can Hive
merge those files automatically?

Any ideas?

Re: Hive for RRA Data

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Apr 1, 2009 at 9:55 AM, Ashish Thusoo <at...@facebook.com> wrote:
> Man..
>
> I think you have too little data. With this approach you will end up with lots of small files, and the fragmentation itself (considering that you will not be able to fill even one block per partition) will kill performance on the cluster.
>
> There are 2 things that you could do to alleviate this problem to some extent:
>
> 1. You could just slap an external table on a directory that contains the data in HDFS, and you could use Hadoop appends (appends have been there since 0.18, I think) to update a file in this directory that holds all of the data. You can then use Hive as usual on this data.
> 2. You could change your loading process to run a Hive UNION ALL that creates a file with the merged data, and then drop and recreate the original table. We could also add a swap command to Hive (similar to what Oracle does with swap partitions) to make this better too...
>
> But the key is to keep your blocks full; otherwise, fragmentation will really slow down your queries.
>
> Ashish

Ashish,

You are right, there is too little data. The original partition was
(day='2009-07-04', data_template_data_id=6). I have decided not to use
data_template_data_id as a partition and to move it back into the data.
I am also going to partition by year. That way the tables will not
grow indefinitely, but I can get the blocks up to a decent size.

I figure I can replace the partitions with 'cluster by' and 'sort by'.
There are going to be two main types of queries to optimize for.
1) queries by time  - give me all the information for this day.
2) queries by data_template_data_id - give me all the information for
this RRA file.

If I bucket the table on data_template_data_id, that should make #2
efficient, and if I sort by time, that should make #1 efficient.
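
A rough sketch of what such a table could look like (column names follow the earlier example; the staging table and bucket count are made up):

    CREATE TABLE harddrive (
      row_time BIGINT,
      data_template_data_id INT,
      hdd_used DOUBLE,
      hdd_free DOUBLE
    )
    PARTITIONED BY (year STRING)
    CLUSTERED BY (data_template_data_id) SORTED BY (row_time) INTO 32 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

    -- One way to populate it so rows really land in the declared buckets:
    -- make the reducer count match the bucket count and distribute on the bucket key.
    SET mapred.reduce.tasks=32;
    INSERT OVERWRITE TABLE harddrive PARTITION (year='2009')
    SELECT row_time, data_template_data_id, hdd_used, hdd_free
    FROM harddrive_staging
    DISTRIBUTE BY data_template_data_id
    SORT BY row_time;

As far as I can tell, bucketing mainly helps TABLESAMPLE and bucketed map joins rather than ordinary filters, so it is worth testing whether it buys much over the sort order alone.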

How does the query optimizer use 'sort by' columns?

I am also looking at a union all or a merge process. The complexity
of the union/merge process depends on whether I go with 'cluster by' and
'sort by'. I am inclined to avoid this complexity, but since the goal is a
capacity-management application, maybe I should be optimizing for
3,000-10,000 machines rather than 300.