Posted to user@hbase.apache.org by mete <ef...@gmail.com> on 2012/02/05 09:37:27 UTC

storing logs in hbase

Hello,

I am thinking about using HBase to store web log data. I like the idea
of having HDFS underneath, so I won't have to worry much about failure cases,
and I can benefit from all the cool HBase features.

The thing I could not figure out is how to effectively store and query the
data. I am planning to split each kind of log record into 10-20 columns and
then use MR jobs to query the table with full scans.
(I guess I could use Hive or Pig for this as well, but I am not familiar with
those yet.)
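
To make it concrete, here is a minimal sketch of the kind of full-scan MR job
I have in mind; the table name "weblogs", the column family "log", and the
"status" column are placeholders, not a real schema:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogScanJob {

  // Emits (status code, 1) for every row; one example report out of many.
  static class LogMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] status = value.getValue(Bytes.toBytes("log"), Bytes.toBytes("status"));
      if (status != null) {
        context.write(new Text(Bytes.toString(status)), ONE);
      }
    }
  }

  // Sums the counts per status code.
  static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "weblog-full-scan");
    job.setJarByClass(LogScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches per RPC
    scan.setCacheBlocks(false);  // recommended for full-scan MR jobs

    TableMapReduceUtil.initTableMapperJob(
        "weblogs", scan, LogMapper.class, Text.class, LongWritable.class, job);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
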
I find this approach simple and easy to implement, but on the other hand it
is an offline process, and it could take a lot of time to get a single
report. And of course a business user would be very disappointed to
see that he/she has to wait another 40 minutes for the results of a query.

So what I am trying to achieve is to keep this query time as small as
possible. For this I can sacrifice write speed as well: I don't really
have to integrate new logs on the fly, and a job that runs overnight is also
fine.
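
Since the writes can happen in a batch, I imagine the overnight job doing
something like the following sketch (untested; again the table, family, and
column names and the key layout are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OvernightLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "weblogs");
    table.setAutoFlush(false);                   // batch puts client-side
    table.setWriteBufferSize(8 * 1024 * 1024);   // larger write buffer for bulk writes

    // One example record; the real job would loop over the parsed log files.
    long ts = System.currentTimeMillis();
    // Prefixing the key with the log type plus a reversed timestamp keeps
    // records of one type together, with the newest records sorting first.
    byte[] rowKey = Bytes.add(Bytes.toBytes("access|"),
        Bytes.toBytes(Long.MAX_VALUE - ts));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("log"), Bytes.toBytes("status"), Bytes.toBytes("200"));
    put.add(Bytes.toBytes("log"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
    table.put(put);

    table.flushCommits();
    table.close();
  }
}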

So for this kind of situation, do you find HBase useful?

I read about star-schema design for making queries more efficient, but that
makes the developer's job a lot harder, because I would need to design
different schemas for different log types; adding a new log type would
require time to gather requirements, develop, etc.

I thought about creating a very simple HBase schema, just a key and the
raw content for each record, and then indexing this content with Lucene. But
then it sounded like I did not need HBase in the first place, because I would
not really be benefiting from it except as storage. Also, I could not be sure
how big my Lucene indexes would get, and whether Lucene could cope with data
at this scale. What do you think about Lucene indexes on HBase?
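
What I picture is roughly this sketch (Lucene 3.x API; the index path and
field names are placeholders): Lucene indexes the content but stores only the
HBase row key, so the matching records themselves are fetched from HBase.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LogIndexer {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/data/log-index")); // placeholder path
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
    IndexWriter writer = new IndexWriter(dir, cfg);

    // One record: store only the HBase row key, index (but don't store) the
    // content, so Lucene answers "which rows match" and HBase serves the rows.
    Document doc = new Document();
    doc.add(new Field("rowkey", "access|1328431047000",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "GET /index.html 200 ...",
        Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();
  }
}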

I read about how Rackspace does things. As far as I understood, they generate
Lucene indexes while parsing the logs in Hadoop, and then merge each new
index into a system that serves the previous indexes (
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data).

Does anyone use a similar approach or have any ideas about this?

Do you think any of these approaches is suitable, or should I try a different
way?

Thanks in advance
Mete

Re: storing logs in hbase

Posted by Eric <er...@gmail.com>.
It sounds to me like you are better off using Hive. HBase is suited to
real-time access to specific records. If you want to do batch processing
(MapReduce) on your data, as you said yourself, then Hive removes all
the HBase overhead and gives you a powerful query language to search
through your data. You can also use Pig and, for example, Wonderdog to index
your data in ElasticSearch.
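
As a sketch of what that buys you: the status-count report becomes a single
query. This assumes a Hive server on the default port 10000 and an external
"weblogs" table over the log files (both assumptions, not something you have
set up yet); the driver class is the one shipped with the Hive 0.x JDBC driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReport {
  public static void main(String[] args) throws Exception {
    // Driver class from the Hive 0.x JDBC driver.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // Hive compiles this into MapReduce jobs underneath.
    ResultSet rs = stmt.executeQuery(
        "SELECT status, COUNT(1) FROM weblogs GROUP BY status");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}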


Re: storing logs in hbase

Posted by Doug Meil <do...@explorysmedical.com>.
... but it depends on what you want to do.  If you want full-text
searching, then yes, you probably want to look at Lucene.  If you want
activity analysis, summaries are probably better.








Re: storing logs in hbase

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-

You probably want to check out these chapters of the Hbase ref guide:

http://hbase.apache.org/book.html#datamodel
http://hbase.apache.org/book.html#schema
http://hbase.apache.org/book.html#mapreduce

... and with respect to the "40 minutes per report", a common pattern is
to create summary tables/files as appropriate, so that reports read
precomputed aggregates instead of scanning the raw data.
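
As an illustration (not a prescription): a nightly MR job could write per-day
aggregates into a small summary table, which an application then reads with
single Gets. The table name and row-key layout here are invented for the
example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SummaryLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable summary = new HTable(conf, "weblog_summary"); // written nightly by MR
    // One row per day and status code; a Get like this returns in
    // milliseconds, which is the real-time access HBase is good at.
    Get get = new Get(Bytes.toBytes("2012-02-05|200"));
    Result result = summary.get(get);
    byte[] count = result.getValue(Bytes.toBytes("s"), Bytes.toBytes("count"));
    System.out.println("hits: " + (count == null ? 0 : Bytes.toLong(count)));
    summary.close();
  }
}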



