Posted to user@hbase.apache.org by Kamal Bahadur <ma...@gmail.com> on 2013/12/24 00:01:39 UTC

Schema Design Newbie Question

Hello,

I am just starting to use HBase and I am coming from the Cassandra world. Here
is some quick background on my data:

My system will be storing data that belongs to a certain category.
Currently I have around 1000 categories. Also note that some categories
produce a lot more data than others. To be precise, 10% of the categories
provide more than 65% of the total data in the system.

Data access queries always contain the category. I have listed 2 options
for designing the schema:

1. Add category as the first component of the row key [category + timestamp]
so that my data is sorted by category for fast retrieval.
2. Add category as a column family so that I can just use the timestamp as
the row key. This option will, however, create more HFiles since I have so
many categories.

I am leaning towards option 2. I like the idea that HBase separates data for
each CF into its own HFiles. However, I am still worried about the number of
HFiles that will be created on the server. Will it cause any other side
effects? I would like to hear from the user community which option would be
best in my case.

Kamal
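
The skew described above can be made concrete with a little arithmetic. The sketch below is plain Python and uses only the figures from this message, treating "more than 65%" as exactly 65% (an assumption made for the example). It compares the average share of a heavy category with that of a light one.

```python
from fractions import Fraction

total_categories = 1000
hot_fraction = Fraction(10, 100)     # 10% of the categories...
hot_data_share = Fraction(65, 100)   # ...provide 65% of the data (assumed exact)

hot_categories = int(total_categories * hot_fraction)
cold_categories = total_categories - hot_categories

# Average share of all data held by one category in each group.
avg_hot_share = hot_data_share / hot_categories
avg_cold_share = (1 - hot_data_share) / cold_categories

ratio = avg_hot_share / avg_cold_share
print(f"hot categories: {hot_categories}, cold categories: {cold_categories}")
print(f"a hot category averages about {float(ratio):.1f}x a cold one")
```

Under these assumptions a heavy category carries roughly 16.7 times the data of a light one on average, so with option 1 the key ranges of heavy categories would hold correspondingly more rows.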

Re: Schema Design Newbie Question

Posted by Kamal Bahadur <ma...@gmail.com>.

I am now convinced that option 1 will be the best option for my data.
Thanks Lars!

Kamal


On Mon, Dec 23, 2013 at 4:12 PM, lars hofhansl <la...@apache.org> wrote:

> The HDFS NameNode will have to deal with lots of small files (currently
> HBase cannot flush column families independently, so if one is flushed all
> of them are).
> The other reason is that scanning will be slow (if your scan involves
> many column families, due to the merge sort HBase needs to perform).
>
> Option #1 should be better. HBase will be smart and scan only the HFiles
> necessary for the key range you provide (Category + Timestamp).
>
> -- Lars

Re: Schema Design Newbie Question

Posted by lars hofhansl <la...@apache.org>.

The HDFS NameNode will have to deal with lots of small files (currently HBase cannot flush column families independently, so if one is flushed, all of them are).
The other reason is that scanning will be slow (if your scan involves many column families, due to the merge sort HBase needs to perform).

Option #1 should be better. HBase will be smart and scan only the HFiles necessary for the key range you provide (Category + Timestamp).

-- Lars
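
As a rough illustration of the key-range point above, the sketch below builds a composite row key and derives the start and stop rows for one category and time window. It is plain Python with no HBase client involved. The 4-byte numeric category id, the 8-byte big-endian millisecond timestamp, and the helper names row_key and scan_range are all assumptions made for the example; nothing in this thread specifies an encoding.

```python
import struct


def row_key(category_id: int, ts_millis: int) -> bytes:
    """Composite key: 4-byte category id followed by an 8-byte timestamp.

    Big-endian encoding makes byte-wise (lexicographic) order match numeric
    order for non-negative values, and HBase sorts rows byte-wise.
    """
    return struct.pack(">IQ", category_id, ts_millis)


def scan_range(category_id: int, start_millis: int, stop_millis: int):
    """Start row (inclusive) and stop row (exclusive) for one category."""
    return row_key(category_id, start_millis), row_key(category_id, stop_millis)


# Hypothetical keys, sorted the way HBase would store them.
keys = sorted([
    row_key(7, 2000),
    row_key(3, 5000),
    row_key(7, 1000),
    row_key(3, 1000),
])

# Only keys inside the range for category 7 are selected.
start, stop = scan_range(7, 1000, 2000)
selected = [k for k in keys if start <= k < stop]
print(len(selected))
```

Because all rows of a category are contiguous in this layout, a scan bounded by such a start and stop row touches only that slice of the key space.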



________________________________
 From: Kamal Bahadur <ma...@gmail.com>
To: user <us...@hbase.apache.org>; Dhaval Shah <pr...@yahoo.co.in> 
Sent: Monday, December 23, 2013 3:47 PM
Subject: Re: Schema Design Newbie Question
 

Hi Dhaval,

Thanks for the quick response!

Why do you think having more files is not a good idea? Is it because of OS
restrictions?

I get around 50 million records a day and each record contains ~25
columns. Values for each column are ~30 characters.

Kamal




Re: Schema Design Newbie Question

Posted by Kamal Bahadur <ma...@gmail.com>.

Hi Dhaval,

Thanks for the quick response!

Why do you think having more files is not a good idea? Is it because of OS
restrictions?

I get around 50 million records a day and each record contains ~25
columns. Values for each column are ~30 characters.

Kamal
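
For a rough sense of scale, the figures above can be multiplied out. This is a back-of-the-envelope sketch in Python that counts only the raw value characters and assumes one byte per character. HBase also stores the row key, column family, qualifier, and timestamp with every cell, so the size on disk before compression would be larger.

```python
RECORDS_PER_DAY = 50_000_000   # "around 50 million records a day"
COLUMNS_PER_RECORD = 25        # "~25 columns"
BYTES_PER_VALUE = 30           # "~30 characters", assuming 1 byte each

cells_per_day = RECORDS_PER_DAY * COLUMNS_PER_RECORD
value_bytes_per_day = cells_per_day * BYTES_PER_VALUE

print(f"{cells_per_day:,} cells/day")
print(f"{value_bytes_per_day / 1e9:.1f} GB/day of raw values")
```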


On Mon, Dec 23, 2013 at 3:35 PM, Dhaval Shah <pr...@yahoo.co.in> wrote:

> Having 1000 CFs with HBase does not sound like a good idea.
>
> category + timestamp sounds like the better of the 2 options you have
> thought of.
>
> Can you tell us a little more about your data?
>
> Regards,
>
> Dhaval

Re: Schema Design Newbie Question

Posted by Dhaval Shah <pr...@yahoo.co.in>.

Having 1000 CFs with HBase does not sound like a good idea.

category + timestamp sounds like the better of the 2 options you have thought of. 

Can you tell us a little more about your data? 
 
Regards,

Dhaval

