Posted to user@hbase.apache.org by abhishek1015 <ab...@gmail.com> on 2014/06/13 02:58:48 UTC

HFile V2 vs HFile V3

Hello

I am interested in doing a comparative study of HFile v2 and HFile v3 under
different workload conditions. I have the following questions in this regard:

1) Is HFile v3 available in HBase 0.98?
2) Where can I find the physical design of HFile v3? I can only find the
physical design details of HFile v2 in some blogs.
3) Has anyone done a similar study? If yes, where can I find the results?

Thanks
Abhishek




--
View this message in context: http://apache-hbase.679495.n3.nabble.com/HFile-V2-vs-HFile-V3-tp4060405.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: HFile V2 vs HFile V3

Posted by abhishek1015 <ab...@gmail.com>.
Dremel is designed to store nested structures of arbitrary depth. It uses
repetition and definition levels to reconstruct the nested structure.
However, a Bigtable-like system such as HBase or Cassandra is a
multi-dimensional sorted map, which maps (rowkey, column-family, columnkey,
time-stamp) to a value. Therefore, repetition and definition levels are not
required to reconstruct a row. This could be the reason that Cassandra uses
a Dremel-inspired format rather than implementing Dremel itself.
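
To make the data-model point concrete, here is a minimal sketch in Python. It
assumes a toy in-memory model (this is not HBase or Cassandra code, and the
keys, values, and function names are made up for illustration). It shows that
a row can be rebuilt just by sorting the cells and grouping on rowkey, with no
repetition or definition levels:

```python
from itertools import groupby

# Toy cells: (rowkey, column_family, column_key, timestamp) -> value.
# All keys and values are hypothetical.
cells = {
    ("row2", "cf", "name", 5): "bob",
    ("row1", "cf", "name", 3): "alice",
    ("row1", "cf", "name", 7): "alice2",
    ("row1", "cf", "order_17", 4): "shipped",
}


def sort_key(item):
    (row, family, column, ts), _ = item
    # Newest timestamp first within a column.
    return (row, family, column, -ts)


def rows(cells):
    """Rebuild each row by grouping the sorted cells on rowkey alone."""
    ordered = sorted(cells.items(), key=sort_key)
    result = {}
    for row, group in groupby(ordered, key=lambda item: item[0][0]):
        latest = {}
        for (_, family, column, ts), value in group:
            # The first cell seen per column is the newest, thanks to the sort.
            latest.setdefault(f"{family}:{column}", (ts, value))
        result[row] = latest
    return result


table = rows(cells)
```

Here `table["row1"]` holds the newest version of each column of row1. The flat
key already carries everything needed to place a cell in its row.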

We can also visualize this sorted map as a table whose columns are "rowkey"
and "column-family:columnkey" and whose cell values are "time-stamp,value".
HFile is designed on the assumption that an HBase table is very sparse. This
assumption holds in many cases where the columnkey itself is used to store
information (e.g. an order_id). However, it does not hold for all tables. In
many cases, the columnkey is used as a traditional column name.

Therefore, it would be good to have two file formats. Based on the sparsity
of a table, the user could choose between the traditional HFile and a
columnar format. As many companies use HBase, I am wondering whether any
company would be interested in sharing an anonymized production trace so
that I can estimate the sparsity of their tables and validate my argument.
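
As a rough idea of the estimate I have in mind, here is a small Python sketch.
The definition of sparsity used here (the fraction of empty slots in the dense
rowkey x column view of the table) is my own assumption, and the `sparsity`
function and sample data are hypothetical:

```python
def sparsity(cells):
    """Fraction of empty (rowkey, column) slots in the dense table view.

    cells: iterable of (rowkey, column) pairs, one per stored cell.
    Returns 0.0 for a fully populated table and approaches 1.0 as the
    table gets sparser. An empty input is treated as 0.0.
    """
    pairs = set(cells)
    if not pairs:
        return 0.0
    row_keys = {row for row, _ in pairs}
    columns = {column for _, column in pairs}
    return 1.0 - len(pairs) / (len(row_keys) * len(columns))


# columnkey used as a traditional column name: every row has every column.
dense = [("r1", "cf:name"), ("r1", "cf:age"), ("r2", "cf:name"), ("r2", "cf:age")]

# columnkey used to store data (e.g. an order_id): columns rarely repeat.
sparse = [("r1", "cf:order_1"), ("r2", "cf:order_2"), ("r3", "cf:order_3")]

dense_score = sparsity(dense)
sparse_score = sparsity(sparse)
```

On a real trace only the (rowkey, column) pairs are needed, so the values
themselves could stay private.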

Thanks
Abhishek  






Re: HFile V2 vs HFile V3

Posted by Andrew Purtell <ap...@apache.org>.
If you read down through that JIRA, you'll have the answer to that
question: The results were inconclusive and the changes in that patch broke
thread safety.

I also suggest returning to the bottom of the Cassandra wiki page you
mentioned and following the link to the JIRA. Cassandra appears not to have
actually tested a Dremel-style storage format, but rather to have modified
their existing file format, inspired in limited ways by concepts from the
Dremel paper.

In the Apache ecosystem we have Parquet, which I would say is a faithful
implementation of the ideas in the Dremel paper; see http://parquet.io/

I encourage you to look into the details of HFile and Parquet, and to learn
more about the inner workings of HBase, to see why using a Dremel-style
columnar storage format with HBase might not be an easy undertaking.
Abstractly speaking, it would be interesting to consider. It could be nice
to provide support for bulk ingest of Parquet files, for immutable data
perhaps. The next question is who would volunteer to do that.



On Fri, Jun 13, 2014 at 9:15 AM, abhishek1015 <ab...@gmail.com>
wrote:

> Thanks ted for providing the link to HBase-5313. Apparently, no one seems
> to work on this which is strange.
>
> Abhishek



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: HFile V2 vs HFile V3

Posted by abhishek1015 <ab...@gmail.com>.
Thanks, Ted, for providing the link to HBASE-5313. Apparently no one is
working on this, which is strange.

Abhishek 




Re: HFile V2 vs HFile V3

Posted by Ted Yu <yu...@gmail.com>.
w.r.t. columnar format, there were discussions in the past:

https://issues.apache.org/jira/browse/HBASE-5313?focusedCommentId=13203324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13203324

HBASE-5521 Move compression/decompression to an encoder specific encoding
context

FYI


On Thu, Jun 12, 2014 at 10:04 PM, abhishek1015 <ab...@gmail.com>
wrote:

> Thank you Ram and Ted.
>
> I am wondering why a file format similar to Dremel is not tested for HBase
> while it appears to improve the performance. I think that it is tested in
> cassandra: http://wiki.apache.org/cassandra/FileFormatDesignDoc
>
> Dremel: http://research.google.com/pubs/pub36632.html
>
> Can anyone share their insight on this.
>
> Thanks
> Abhishek

Re: HFile V2 vs HFile V3

Posted by abhishek1015 <ab...@gmail.com>.
Thank you Ram and Ted.

I am wondering why a file format similar to Dremel has not been tested for
HBase, given that it appears to improve performance. I think it has been
tested in Cassandra: http://wiki.apache.org/cassandra/FileFormatDesignDoc

Dremel: http://research.google.com/pubs/pub36632.html

Can anyone share their insight on this?

Thanks
Abhishek




Re: HFile V2 vs HFile V3

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Hi

HFile V3's layout, in terms of the block indexes, the blooms, and the
HFileBlock layout, remains the same as in V2. The only difference is that,
because the KeyValue format now supports Tags (arbitrary metadata that can
be attached to a KeyValue), the Tags must be persisted in the HFiles as
well.
V3 basically allows Tags in the HFiles, plus some additional FileInfo
entries indicating the presence of Tags. So if your data does not have Tags,
you are effectively using HFiles the V2 way, and there should not be any
difference in behavior or performance.
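
A small Python sketch may help picture this. It is only an illustration: the
field widths and the `encode_tag` / `encode_cell` helpers below are simplified
assumptions, not the HBase code, and the exact byte layout should be checked
against the HBase source. The point is that the tagged form is the untagged
form plus a trailing tags section:

```python
import struct


def encode_tag(tag_type, payload):
    # Assumed layout: 2-byte length (type byte + payload), 1-byte type, payload.
    return struct.pack(">HB", 1 + len(payload), tag_type) + payload


def encode_cell(key, value, tags=(), with_tags=False):
    # V2-style part: 4-byte key length, 4-byte value length, key, value.
    out = struct.pack(">II", len(key), len(value)) + key + value
    if with_tags:
        # V3-style addition: 2-byte total tags length, then the tags.
        blob = b"".join(encode_tag(t, p) for t, p in tags)
        out += struct.pack(">H", len(blob)) + blob
    return out


# Hypothetical key/value bytes; a real key has its own internal structure.
v2_cell = encode_cell(b"row1/cf:q/7", b"val")
v3_untagged = encode_cell(b"row1/cf:q/7", b"val", with_tags=True)
v3_tagged = encode_cell(b"row1/cf:q/7", b"val", tags=[(1, b"acl")], with_tags=True)
```

In this sketch a cell without Tags differs from the V2-style encoding only by
an empty tags-length field, and everything before the tags section is
byte-for-byte identical.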

Regards
Ram


On Fri, Jun 13, 2014 at 6:47 AM, Ted Yu <yu...@gmail.com> wrote:

> For #1: yes
>
> See related JIRAs:
> HBASE-8496 Implement tags and the internals of how a tag should look like
> HBASE-10855 Enable hfilev3 by default
> HBASE-10451 Enable back Tag compression on HFiles
>
> Cheers

Re: HFile V2 vs HFile V3

Posted by Ted Yu <yu...@gmail.com>.
For #1: yes

See related JIRAs:
HBASE-8496 Implement tags and the internals of how a tag should look like
HBASE-10855 Enable hfilev3 by default
HBASE-10451 Enable back Tag compression on HFiles

Cheers

On Thu, Jun 12, 2014 at 5:58 PM, abhishek1015 <ab...@gmail.com>
wrote:

> Hello
>
> I am interested in doing a comparative study of Hfile v2 and Hfile v3 under
> different workload condition. I have following questions in this regard:
>
> 1) Is HFile v3 available in HBase .98?
> 2) Where can I find the physical design of HFile V3? I can only see the
> physical design details of HFile V2 in some blogs.
> 3) Has anyone done similar study? If yes, where can I find the results?
>
> Thanks
> Abhishek