Posted to user@hbase.apache.org by Krishna <re...@gmail.com> on 2016/01/21 23:43:27 UTC

HFile vs Parquet for very wide table

We are evaluating Parquet and HBase for storing a dense and very, very wide
matrix (it can have more than 600K columns).

I have the following questions:

   - Is there a limit on the number of columns in Parquet or HFile? We
   expect to query 10-100 columns at a time using Spark (see the sketch
   below) - what are the performance implications in this scenario?
   - HBase can support millions of columns - does anyone have prior
   experience comparing Parquet vs. HFile performance for wide structured
   tables?
   - We want a schema-less solution, since the matrix can get wider over
   time.
   - Is there a way to generate wide, structured, schema-less Parquet files
   using map-reduce (the input files are in a custom binary format)?

What solutions other than Parquet and HBase would be useful for this
use case?
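
For concreteness, a minimal Spark (Scala) sketch of the 10-100 column access
pattern against a Parquet copy of the matrix. The path, the column names, and
the layout (one Parquet column per matrix column plus a row id) are
assumptions for illustration, not something established in this thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("wide-matrix-read").getOrCreate()

    // Assumed layout: one Parquet column per matrix column (c0 ... c599999)
    // plus a row identifier; the path is hypothetical.
    val df = spark.read.parquet("hdfs:///data/matrix.parquet")

    // Parquet is columnar, so only the selected columns are read from disk;
    // selecting 10-100 of the ~600K columns leaves the rest untouched.
    val subset = df.select("row_id", "c42", "c1001", "c58213")
    subset.show(5)

One caveat worth measuring: the Parquet footer carries the schema and
per-column-chunk metadata, so a 600K-column schema makes footers and query
planning noticeably heavier even when only a few columns are selected.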

Re: HFile vs Parquet for very wide table

Posted by Jerry He <je...@gmail.com>.
Parquet may be more efficient in your use case, coupled with an upper-layer
query engine.
But Parquet has a schema. The schema can evolve, though, e.g. by adding
columns in new Parquet files.
HBase would be able to do the job too, and it is schema-less -- you can add
columns freely.

Jerry
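
To make the schema-evolution point concrete, a small Spark (Scala) sketch,
assuming the matrix is written as Parquet files under one hypothetical
directory and newer files add columns; Spark can merge the per-file schemas
at read time, with missing values coming back as null:

    // "spark" is an existing SparkSession (e.g. the one from spark-shell).
    // mergeSchema reconciles the schemas of all Parquet files under the
    // path, so columns added by newer files appear in the unified schema
    // and read as null for rows written before those columns existed.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("hdfs:///data/matrix/")
    merged.printSchema()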


Re: HFile vs Parquet for very wide table

Posted by Krishna <re...@gmail.com>.
Thanks Ted, Jerry.

Computing pairwise similarity is the primary purpose of the matrix. This is
done by extracting all rows for a set of columns at each iteration.
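
For illustration, one way to express that per-iteration pattern in Spark
(Scala): read just the current batch of columns from a Parquet copy of the
matrix and compute pairwise cosine similarities between those columns with
MLlib's RowMatrix.columnSimilarities(). The path and column names are
hypothetical, and the sketch assumes the values are non-null doubles:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical batch of columns for one iteration; "spark" is an
    // existing SparkSession.
    val cols = Seq("c42", "c1001", "c58213")

    // Read only this batch (Parquet prunes the other columns) and turn
    // each row into a dense vector of the selected values.
    val rows = spark.read.parquet("hdfs:///data/matrix.parquet")
      .select(cols.head, cols.tail: _*)
      .rdd
      .map(r => Vectors.dense(cols.indices.map(i => r.getDouble(i)).toArray))

    // columnSimilarities() returns the upper triangle of the pairwise
    // cosine similarities between the selected columns.
    val sims = new RowMatrix(rows).columnSimilarities()
    sims.entries.take(10).foreach(println)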


Re: HFile vs Parquet for very wide table

Posted by Jerry He <je...@gmail.com>.
What do you want to do with your matrix data?  How do you want to use it?
Do you need random read/write or point queries?  Do you need to get a
row/record, or many, many columns at a time?
If yes, HBase is a good choice for you.
Parquet is good as a storage format for large scans and aggregations over a
limited number of specific columns - analytical types of work.

Jerry
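
To make the point-query case concrete, a small sketch against the standard
HBase client API (written in Scala, with the hbase-client jar on the
classpath). The table name, column family, qualifiers, row key, and the
assumption that cell values are 8-byte doubles are all hypothetical:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes

    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("matrix"))

    // Fetch a handful of the 600K+ columns from a single row; only the
    // requested family:qualifier cells are returned.
    val wanted = Seq("c17", "c1001", "c58213")
    val get = new Get(Bytes.toBytes("row-00042"))
    wanted.foreach(q => get.addColumn(Bytes.toBytes("m"), Bytes.toBytes(q)))

    val result = table.get(get)
    wanted.foreach { q =>
      val cell = result.getValue(Bytes.toBytes("m"), Bytes.toBytes(q))
      // Assumes values were written with Bytes.toBytes(double).
      println(q + " -> " + (if (cell == null) "missing" else Bytes.toDouble(cell)))
    }

    table.close()
    conn.close()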





Re: HFile vs Parquet for very wide table

Posted by Ted Yu <yu...@gmail.com>.
I have very limited knowledge of Parquet, so I can only answer from the
HBase point of view.

Please see the recent thread on the number of columns in a row in HBase:

http://search-hadoop.com/m/YGbb3NN3v1jeL1f

There are a few Spark HBase connectors.
See this thread:

http://search-hadoop.com/m/q3RTt4cp9Z4p37s

Sorry, I cannot answer the performance comparison question.

Cheers
