Posted to common-user@hadoop.apache.org by "Richard K. Turner" <rk...@petersontechnology.com> on 2008/03/10 19:01:38 UTC

File Per Column in Hadoop

I have found that storing each column in its own gzip file can really speed up processing time on arbitrary subsets of columns.  For example, suppose I have two CSV files called csv_file1.gz and csv_file2.gz.  I can create a file for each column as follows:

   csv_file1/col1.gz
   csv_file1/col2.gz
   csv_file1/col3.gz
     .
     .
     .
   csv_file1/colN.gz
   csv_file2/col1.gz
   csv_file2/col2.gz
   csv_file2/col3.gz
     .
     .
     .
   csv_file2/colN.gz


I would like to use this approach when writing map reduce jobs in Hadoop.  In order to do this, I think I would need to write an input format, which I can look into.  However, I want to avoid the situation where a map task reads column files from different nodes.  To avoid this situation, all column files derived from the same CSV file must be co-located on the same node (or nodes if replication is enabled).  So for my example I would like to ask HDFS to keep all files in dir csv_file1 together on the same node(s).  I would also do the same for dir csv_file2.  Does anyone know how to do this in Hadoop?

Thanks,

Keith
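
A minimal sketch of the splitting step described above, using only the standard Java library (the class name, the output naming, and the assumption of simple comma-separated rows with no quoted fields are illustrative choices, not part of the original setup):

import java.io.*;
import java.util.zip.GZIPOutputStream;

// Split a simple CSV file into one gzip file per column:
//   csv_file1.csv  ->  csv_file1/col1.gz, csv_file1/col2.gz, ...
// Assumes no quoted commas and the same number of columns on every row.
public class ColumnSplitter {
    public static void main(String[] args) throws IOException {
        File input = new File(args[0]);
        String base = input.getName().replaceFirst("\\.(csv|txt)$", "");
        File outDir = new File(input.getAbsoluteFile().getParentFile(), base);
        outDir.mkdirs();

        BufferedReader reader = new BufferedReader(new FileReader(input));
        Writer[] columns = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",", -1);
            if (columns == null) {                    // open one gzip writer per column
                columns = new Writer[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    OutputStream gz = new GZIPOutputStream(
                        new FileOutputStream(new File(outDir, "col" + (i + 1) + ".gz")));
                    columns[i] = new BufferedWriter(new OutputStreamWriter(gz, "UTF-8"));
                }
            }
            for (int i = 0; i < fields.length; i++) {
                columns[i].write(fields[i]);          // one value per line per column file
                columns[i].write('\n');
            }
        }
        reader.close();
        if (columns != null) {
            for (Writer w : columns) {
                w.close();                            // finishes each gzip stream
            }
        }
    }
}

Each colN.gz then holds the values of a single column, one per line, which is what makes reading an arbitrary subset of columns cheap.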

RE: File Per Column in Hadoop

Posted by "Richard K. Turner" <rk...@petersontechnology.com>.
I have not seen that paper.  I will have to take a look at it.  I have read a few papers on this subject.  For anyone interested, I think the first two pages of "Column-Stores For Wide and Sparse Data" by Daniel J. Abadi give an excellent overview of the pros and cons of column-oriented datasets.

http://cs-www.cs.yale.edu/homes/dna/papers/abadicidr07.pdf

RE: File Per Column in Hadoop

Posted by Ashish Thusoo <at...@facebook.com>.
This is very interesting and very useful.

There was some work done in the database community on block organizations
that boost cache and I/O performance, and the scheme proposed there is
similar to what you are talking about (although at the database block
level).

The link is below in case you have not already seen it; it may give some
good insights while you are working on this.

http://citeseer.ist.psu.edu/ailamaki02data.html

Ashish

RE: File Per Column in Hadoop

Posted by "Richard K. Turner" <rk...@petersontechnology.com>.
One other thing to add to the list:

3. overhead of parsing the row to extract the needed fields

It is good to know that the filesystem may read all of the data even if I seek.  I did not know that.  I knew HDFS had CRCs, but I did not think through the implications.

RE: File Per Column in Hadoop

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Pretty cool - I think this would be a great contrib.

On columnar access over row-organized data, breaking the downsides down more clearly, there is:
1. overhead of reading more data from disk into memory
2. overhead of decompressing extra data

With a sub-block per column, #2 is avoided. #1 is hard or impossible to avoid (the underlying file system would probably read sequentially anyway - and besides, checksum verification in DFS itself requires the entire data to be read).

But so far, from what I have seen, there is usually an excess of serial read bandwidth in a Hadoop cluster. To the extent that extra data reads cause hidden CPU cost (if they cause memory bandwidth to max out), this could be a concern. But given other inefficiencies in memory usage, including the language and the endless copies of data being made in the I/O path, I would speculate that (right now) #1 is not that big a concern for Hadoop.
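
As a rough sketch of that #1/#2 distinction (the record layout, class name, and parameters below are invented for illustration, not anything in Hadoop itself): with a small header of per-column compressed lengths, a reader still pulls the whole record off disk, but only runs the decompressor over the columns it actually wants.

import java.io.*;
import java.util.Set;
import java.util.zip.GZIPInputStream;

// Hypothetical record layout: [numCols][len_1 .. len_N][gzip(col_1)] ... [gzip(col_N)]
// All the bytes are read (#1 still happens), but only the wanted columns are
// decompressed (#2 is avoided for the rest).
public class ColumnarRecordReader {
    public static String[] readRecord(DataInputStream in, Set<Integer> wanted)
            throws IOException {
        int numCols = in.readInt();
        int[] lengths = new int[numCols];
        for (int i = 0; i < numCols; i++) {
            lengths[i] = in.readInt();
        }
        String[] values = new String[numCols];        // stays null for skipped columns
        for (int i = 0; i < numCols; i++) {
            byte[] compressed = new byte[lengths[i]];
            in.readFully(compressed);                 // read either way
            if (wanted.contains(i)) {                 // decompress only on demand
                GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                values[i] = out.toString("UTF-8");
            }
        }
        return values;
    }
}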


RE: File Per Column in Hadoop

Posted by "Richard K. Turner" <rk...@petersontechnology.com>.
This is somewhat similar to what I am doing.  I create a set of compressed columns per input CSV file.  You are saying to take a fixed number of rows and create compressed column blocks.  As long as you do this with a large enough subset of rows, you will get a lot of the benefit of compressing similar data.

In addition to compression, another benefit of a file per column is drastically reduced I/O.  If I have 100 columns and I want to analyze 5, then I do not even have to read, decompress, and throw away the other 95 columns.  This can drastically reduce I/O and CPU utilization and increase cache (CPU and disk) utilization.  To get this benefit you would need metadata that indicates where each column block starts within a record.  This metadata would allow seeking to the beginning of the columns of interest.

I will look into creating another file format, like SequenceFile, that supports this structure, and an input format to go along with it.  The first order of business will be to see whether the input format can support seeking.

Keith
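
A small sketch of what the seeking piece could look like against HDFS (FileSystem.open, seek, and readFully are real Hadoop calls; the per-record offset table, the parameters, and the class name are assumptions made up for this example):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical layout: a record begins with one long per column giving the absolute
// start offset of that column block, so a reader can jump straight to what it needs.
public class ColumnSeekExample {
    public static byte[] readColumn(Path file, long recordStart, int wantedCol,
                                    int columnLength, Configuration conf) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        try {
            in.seek(recordStart + 8L * wantedCol);   // entry in the offset table
            long columnStart = in.readLong();        // metadata: where this column block begins
            in.seek(columnStart);                    // skip over the columns we do not need
            byte[] column = new byte[columnLength];  // length would also come from metadata
            in.readFully(column);
            return column;
        } finally {
            in.close();
        }
    }
}

Whether skipping like this actually saves disk reads is a separate question, as discussed above: checksum verification and sequential read-ahead may still touch the skipped bytes.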

RE: File Per Column in Hadoop

Posted by Joydeep Sen Sarma <js...@facebook.com>.
It would be interesting to integrate knowledge of columnar structure with compression. I wouldn't approach it as an InputFormat problem (because of the near impossibility of co-locating all these files) - but perhaps extend the compression libraries in Hadoop, so that the library understood the structured nature of the underlying dataset.

One would still store all the columns together in a single row. But each block of a compressed SequenceFile would actually be stored as a set of compressed sub-blocks (each sub-block representing a column). This would give most of the benefits of columnar compression (not all, because one would only be compressing a block at a time) while still being transparent to MapReduce.

So, doable I would think, and very sexy - but I don't know how complex (the compression code seems hairy, but that's probably just ignorance). We would also love to get to this stage (we already have the metadata with each file), but I think it would take us many months before we got there.

Joydeep
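
A rough sketch of the shape of such a block, in plain Java rather than inside the Hadoop compression libraries (the layout and the names are invented; it just buffers a block of rows, gzips each column separately, and writes the sub-block lengths up front so a reader can skip columns):

import java.io.*;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// One block of rows written as: [numCols][compressedLen_1 .. len_N][gzip(col_1)] ... [gzip(col_N)]
// Compressing each column of the block on its own keeps similar values together.
public class ColumnarBlockWriter {
    public static void writeBlock(DataOutputStream out, List<String[]> rows)
            throws IOException {
        int numCols = rows.get(0).length;             // assumes a non-empty block
        byte[][] compressed = new byte[numCols][];
        for (int c = 0; c < numCols; c++) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Writer gz = new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8");
            for (String[] row : rows) {               // gather this column across the block
                gz.write(row[c]);
                gz.write('\n');
            }
            gz.close();                               // finishes the gzip stream
            compressed[c] = buf.toByteArray();
        }
        out.writeInt(numCols);
        for (int c = 0; c < numCols; c++) {
            out.writeInt(compressed[c].length);       // lengths let a reader skip columns
        }
        for (int c = 0; c < numCols; c++) {
            out.write(compressed[c]);
        }
    }
}

The reader side of this layout would simply reverse the steps: read the lengths, then decompress only the wanted sub-blocks.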



Re: File Per Column in Hadoop

Posted by stack <st...@duboce.net>.
Richard K. Turner wrote:
> To get the data in separate files, I would need to put each column in its own column family, using a row id as the key.  Each column family will end up as a separate file.  HBase will sort each column family independently, so if I had 100 columns I would be doing 100 times more sorting than I need to.  I believe all of this sorting would make insert rates really low.
>
> HBase supports an arbitrary number of columns per row in a column family.  To do this, each row value has <col name>=<col value> pairs.  For my case this is unnecessary overhead, as I would only have one column name per column family.

For sure there is a cost to keeping columns in a manner that facilitates 
row-based accesses -- fatter keys and sort/compactions -- and that 
allows on-the-fly cell-level updates.  If your access pattern is purely 
columnar and your data is static, there is little sense in paying that 
overhead.

> It seems that when a map reduce job is run against HBase, it reads its input through the HBase server.  I suspect reading gzip files off local disk is much faster, but I am not sure.
Yes.  We're doing our best to minimize the tax you pay for accessing via 
HBase, but we have some ways to go yet.  You can get some sense of it 
from the table at the end of this page, 
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation, where we 
compare accesses that go directly against (non-compressed) MapFiles 
with the same accesses via HBase.

Yours,
St.Ack


RE: File Per Column in Hadoop

Posted by "Richard K. Turner" <rk...@petersontechnology.com>.
Yeah, I have looked at HBase.  I do not think it meets my needs in this particular case.  I want to efficiently do ad hoc analysis on arbitrary subsets of columns using map reduce.  Below are some of the reasons I feel HBase does not fit my needs.

To get the data in separate files, I would need to put each column in its own column family, using a row id as the key.  Each column family will end up as a separate file.  HBase will sort each column family independently, so if I had 100 columns I would be doing 100 times more sorting than I need to.  I believe all of this sorting would make insert rates really low.

HBase supports an arbitrary number of columns per row in a column family.  To do this, each row value has <col name>=<col value> pairs.  For my case this is unnecessary overhead, as I would only have one column name per column family.

It seems that when a map reduce job is run against HBase, it reads its input through the HBase server.  I suspect reading gzip files off local disk is much faster, but I am not sure.
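
A back-of-the-envelope comparison of the per-cell overhead mentioned above (the row key plus <col name>=<col value> pairs versus a bare column file); every number below is a made-up assumption purely for illustration:

// Rough size estimate: row-key + column-name per cell versus plain per-column files.
// All sizes here are invented assumptions, not measurements.
public class CellOverheadEstimate {
    public static void main(String[] args) {
        long rows = 100 * 1000 * 1000L;  // assumed row count
        int columns = 100;               // assumed column count
        int rowKeyBytes = 16;            // assumed row id size repeated per cell
        int colNameBytes = 8;            // assumed column name size repeated per cell
        int valueBytes = 10;             // assumed average value size

        long keyedStore  = rows * columns * (long) (rowKeyBytes + colNameBytes + valueBytes);
        long columnFiles = rows * columns * (long) valueBytes;  // values only, before compression

        System.out.printf("per-cell keyed layout : %d GB%n", keyedStore >> 30);
        System.out.printf("plain column files    : %d GB%n", columnFiles >> 30);
    }
}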

Re: File Per Column in Hadoop

Posted by Ted Dunning <td...@veoh.com>.
Have you looked at HBase?  It looks like you are trying to reimplement a
bunch of it.


On 3/10/08 11:01 AM, "Richard K. Turner" <rk...@petersontechnology.com> wrote:

> ... [storing data in columns is nice] ... I would also do the same for dir
> csv_file2.  Does anyone know how to do this in Hadoop?