Posted to user@hbase.apache.org by Cameron Gandevia <cg...@gmail.com> on 2013/04/26 21:49:01 UTC

Schema Design Question

Hi

I am new to HBase. I have been trying to POC an application and have a
design question.

Currently we have a single table with the following key design

jobId_batchId_bundleId_uniquefileId

This is an offline processing system, so data would be bulk loaded into
HBase via map/reduce jobs. We only need to support report generation
queries using map/reduce over a batch (and possibly a single column filter),
with the batchId as the start/end scan key. Once we have finished
processing a job we are free to remove the data from HBase.
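
For concreteness, the per-batch scan I have in mind would look roughly like
this (a sketch against the HBase Java client; the table name, column family
and qualifier are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "job_data");              // placeholder table name
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("job42_batch7_"));         // jobId_batchId_ prefix (example values)
// Stop row = the same prefix with the trailing '_' bumped to the next byte ('`'),
// so the scan covers exactly this batch and nothing after it.
scan.setStopRow(Bytes.toBytes("job42_batch7`"));
scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("status"));  // optional single-column restriction
scan.setCaching(1000);                                     // bigger RPC batches for a full-batch scan
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result r : scanner) {
    // hand each row to the report generation logic
  }
} finally {
  scanner.close();
  table.close();
}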

We have varied workloads, so a job could be made up of 10 rows, 100,000 rows,
or 1 billion rows, with the average falling somewhere around 10 million rows.

My question is related to pre-splitting. If we have a billion rows all with
the same batchId (our map/reduce scan key), my understanding is we should
perform pre-splitting to create buckets hosted by different regions. If a
job's workload can be so varied, would it make sense to have a single table
containing all jobs? Or should we create one table per job and pre-split the
table for the given workload? If we had separate tables we could drop them
when no longer needed.

If we didn't have a separate table per job, how should we perform splitting?
Should we choose our largest possible workload and split for that, even
though 90% of our jobs would fall at the lower bound in terms of row count?
Would we experience any issues purging jobs of varying sizes if everything
was in a single table?
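
For reference, the kind of pre-split table creation I am picturing is
roughly the following (a sketch; the table name, column family and split
points are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("job_data");   // placeholder table name
desc.addFamily(new HColumnDescriptor("meta"));              // placeholder column family
// Split keys at expected batch boundaries so a big job's rows are spread over
// several regions instead of piling into one. The boundaries are made up here;
// in practice they would come from the expected jobId/batchId distribution.
byte[][] splits = new byte[][] {
    Bytes.toBytes("job42_batch250_"),
    Bytes.toBytes("job42_batch500_"),
    Bytes.toBytes("job42_batch750_"),
};
admin.createTable(desc, splits);
admin.close();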

Any advice would be greatly appreciated.

Thanks

Re: Schema Design Question

Posted by Cameron Gandevia <cg...@gmail.com>.
Thanks for all the replies. Sorry, I should have provided more context in my
original question. Our system performs document
conversion/analysis/de-duplication of files stored in HDFS. We wanted to
store metadata in HBase during each stage of the process. We would then run
map/reduce jobs to generate reports from this metadata. We wouldn't run
HBase queries over the entire job, only over batches (and a stage column
filter) within the job. In the worst case a job consists of a single batch,
but it is more likely a job will consist of 10s or 100s of batches, so we
would generate reports for, say, 1 million out of 1 billion records.
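
To make that concrete, the per-batch query with a stage filter would be
roughly this (a sketch; the column family, qualifier and stage value are
placeholders):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Scan a single batch (start/stop rows as in the original question) and keep
// only rows whose "stage" column matches the stage being reported on.
Scan scan = new Scan(Bytes.toBytes("job42_batch7_"), Bytes.toBytes("job42_batch7`"));
SingleColumnValueFilter stageFilter = new SingleColumnValueFilter(
    Bytes.toBytes("meta"), Bytes.toBytes("stage"),
    CompareOp.EQUAL, Bytes.toBytes("conversion"));
stageFilter.setFilterIfMissing(true);   // drop rows that have no stage column at all
scan.setFilter(stageFilter);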

From your responses it does sound like HBase might not be the best solution
to our problem. We will look at possibly using Hive.

Thanks again for the responses

Re: Schema Design Question

Posted by lars hofhansl <la...@apache.org>.
Same here.
HBase is generally good at homing in on a small (maybe 10-100M rows), contiguous subset of an essentially unlimited dataset.
If all you ever do is scan _everything_ and then throw it away, a straight scan (using Impala for example) or direct M/R on file(s) in HDFS is far better.

-- Lars

Re: Schema Design Question

Posted by Michel Segel <mi...@hotmail.com>.
I would have to agree. 
The use case doesn't make much sense for HBase and sounds a bit more like a problem for Hive.

The OP indicated that the data was disposable after a round of processing. 
IMHO Hive is a better fit.


Sent from a remote device. Please excuse any typos...

Mike Segel

Re: Schema Design Question

Posted by Asaf Mesika <as...@gmail.com>.
I actually don't see the benefit of saving the data into HBase if all you
do is read it per job id and purge it. Why not accumulate into HDFS per job
id and then dump the file? The way I see it, HBase is good for querying
parts of your data, even if it is only 10 rows. In your case that can run to
1 billion rows, so streaming it from HDFS seems faster.
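
Something like the following, for example (the HDFS layout here is made up);
purging a finished job then becomes a single recursive delete:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// One directory per job: the report M/R jobs read it directly, and removing a
// finished job is just a recursive delete of its directory.
Path jobDir = new Path("/data/jobs/job42");   // made-up layout
fs.delete(jobDir, true);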

Re: Schema Design Question

Posted by Enis Söztutar <en...@gmail.com>.
Hi,

Interesting use case. I think it depends on how many jobIds you expect to
have. If it is on the order of thousands, I would caution against going the
one-table-per-jobId approach, since for every table there is some master
overhead, as well as file structures in HDFS. If the number of jobIds is
manageable, going with separate tables makes sense if you want to
efficiently delete all the data related to a job.

Also, pre-splitting will depend on the expected number of jobIds / batchIds
and their ranges versus the desired number of regions. You would want to
keep the number of regions hosted by a single region server in the low
tens, so your splits can be across jobs or within jobs depending on
cardinality. Can you share some more?
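
As a very rough sizing sketch (all the numbers and the zero-padded key
format below are assumptions, just to illustrate the arithmetic):

import org.apache.hadoop.hbase.util.Bytes;

int regionServers = 10;                         // assumed cluster size
int regionsPerServer = 20;                      // "low tens" per server
int totalRegions = regionServers * regionsPerServer;
long maxJobId = 100000;                         // assumed jobId range
byte[][] splits = new byte[totalRegions - 1][];
for (int i = 1; i < totalRegions; i++) {
  // Assumes zero-padded numeric jobIds so lexicographic order matches numeric order.
  long boundary = i * maxJobId / totalRegions;
  splits[i - 1] = Bytes.toBytes(String.format("%08d_", boundary));
}
// splits would then be passed to HBaseAdmin.createTable(tableDescriptor, splits)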

Enis

Re: Schema Design Question

Posted by Ted Yu <yu...@gmail.com>.
My understanding of your use case is that data for different jobIds would
be continuously loaded into the underlying table(s).

Looks like you can have one table per job. This way you can drop the table
after map reduce is complete. With the single-table approach, you would
delete many rows in the table, which is not as fast as dropping a separate
table.
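
Dropping a per-job table once its map reduce run is done would be roughly
(a sketch; the table name is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
String table = "job_42";            // hypothetical per-job table name
if (admin.tableExists(table)) {
  admin.disableTable(table);        // a table must be disabled before it can be deleted
  admin.deleteTable(table);
}
admin.close();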

Cheers
