Posted to user@hbase.apache.org by imbmay <br...@media6degrees.com> on 2008/07/18 22:41:47 UTC

Table Updates with Map/Reduce

I want to use HBase to maintain a very large dataset that needs to be
updated pretty much continuously.  I'm creating a record for each entity,
with a creation timestamp column as well as between 10 and 1000 additional
columns named for the distinct events related to that entity.  Being new to
HBase, the approach I've taken is to create a map/reduce app that, for each
input record:

1. Does a lookup in the table using HTable get(row, column) on the timestamp
   column to determine whether there is an existing row for the entity.
2. If there is no existing record for the entity, adds the entity's full
   event history to the table, one column per unique event ID.
3. If there is an existing record for the entity, adds just the most recent
   event to the table.
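
For concreteness, here is a rough sketch of that flow. It uses the Get/Put
style of a later HBase client API rather than the get(row, column) call
above, and the table name, family and qualifier names, and record fields are
placeholders for illustration, not my real schema:

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityUpserter {
  // Family and qualifier names are assumptions for illustration.
  private static final byte[] CREATED_AT = Bytes.toBytes("createdAt");
  private static final byte[] TS = Bytes.toBytes("ts");
  private static final byte[] EVENT = Bytes.toBytes("event");

  private final HTable table;

  public EntityUpserter(Configuration conf) throws IOException {
    this.table = new HTable(conf, "entities"); // table name assumed
  }

  /** Handles one input record: id, first-seen time, latest event, history. */
  public void process(String entityId, long firstSeen, String latestEvent,
                      long latestEventTime, Map<String, Long> history)
      throws IOException {
    byte[] row = Bytes.toBytes(entityId);

    // Step 1: read the creation-timestamp cell to see if the row exists.
    Get get = new Get(row);
    get.addColumn(CREATED_AT, TS);
    Result existing = table.get(get);

    Put put = new Put(row);
    if (existing.isEmpty()) {
      // Step 2: new entity -- write the timestamp and the full history,
      // one column per unique event id.
      put.add(CREATED_AT, TS, Bytes.toBytes(firstSeen));
      for (Map.Entry<String, Long> e : history.entrySet()) {
        put.add(EVENT, Bytes.toBytes(e.getKey()), Bytes.toBytes(e.getValue()));
      }
    } else {
      // Step 3: known entity -- add only the most recent event.
      put.add(EVENT, Bytes.toBytes(latestEvent), Bytes.toBytes(latestEventTime));
    }
    table.put(put);
  }
}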

I'd like feedback on whether this is a reasonable approach in terms of
general performance and reliability, whether there is a different pattern
better suited to HBase with map/reduce, or whether I should even be using
map/reduce for this.

Thanks in advance. 


Re: Table Updates with Map/Reduce

Posted by imbmay <br...@media6degrees.com>.
Thank you; that supports what I was thinking.


Re: Table Updates with Map/Reduce

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
OK, now I have a good picture of your situation (it took me a moment).

I guess that even if it's concurrent it will not be that much of a problem.
Keeping the maximum number of versions at 1 will ensure that even if 3
mappers insert the history of one entity, the data that overlaps will still
be inserted into your "event:" family and the rest will be discarded. Your
biggest concern will be the efficiency of reading data from HBase, so your
mappers should have a local cache.
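
A minimal sketch of that table setup, using the admin API of a later HBase
client release (the table name is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateEntityTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("entities"); // name assumed

    // One cell per entity for the creation timestamp.
    HColumnDescriptor createdAt = new HColumnDescriptor("createdAt");
    createdAt.setMaxVersions(1);
    desc.addFamily(createdAt);

    // One column per event id; with max versions = 1, concurrent mappers
    // re-inserting the same event:<id> cell leave a single value behind.
    HColumnDescriptor event = new HColumnDescriptor("event");
    event.setMaxVersions(1);
    desc.addFamily(event);

    admin.createTable(desc);
  }
}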

Hope this helps,

J-D


Re: Table Updates with Map/Reduce

Posted by imbmay <br...@media6degrees.com>.
The table was created with two column families: createdAt and event. The
former holds the timestamp, so there is one entry per entity, and the latter
is a collection of events.  In the latter, entries take the form event:1524,
event:1207, etc., and for the time being I'm storing only the event time.
The input is a set of text files generated at a rate of about 600 an hour,
with up to 50,000 entries per file.  Each line in a file contains a unique
entity ID, a timestamp of the first time the entity was seen, an event code,
and a history of the last 100 event codes.  In cases where I haven't seen an
entity before I want to add everything in the history; when the entity has
been seen previously I just want to add the last event.  I'm keeping the
table design simple to start with while I'm getting familiar with HBase.
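
For reference, parsing one such line might look roughly like this; the tab
delimiter, comma-separated history, and field order are guesses for
illustration, not the real file format:

/** Parsed form of one input line: id, first-seen ts, event, history. */
public class EventRecord {
  public final String entityId;
  public final long firstSeen;
  public final String eventCode;
  public final String[] history; // up to the last 100 event codes

  private EventRecord(String entityId, long firstSeen,
                      String eventCode, String[] history) {
    this.entityId = entityId;
    this.firstSeen = firstSeen;
    this.eventCode = eventCode;
    this.history = history;
  }

  /** Assumes tab-separated fields; the real layout may differ. */
  public static EventRecord parse(String line) {
    String[] f = line.split("\t", 4);
    String[] hist = f.length > 3 ? f[3].split(",") : new String[0];
    return new EventRecord(f[0], Long.parseLong(f[1]), f[2], hist);
  }
}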

The principal area of concern I have is the reading of data from the HBase
table during the map/reduce process to determine whether an entity already
exists.  If I'm running the map/reduce on a single machine then it's pretty
easy to keep track of previously unknown entities; but if I'm running in a
cluster, a new entity may show up in the inputs to several concurrent
mappers.
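
One way to cut down those repeated reads is a per-mapper cache of entity IDs
already known to exist, along the lines of this sketch (the class and method
names are made up; it only helps within a single task, so several concurrent
mappers can still both treat the same new entity as unseen):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

/** Per-mapper cache of entity ids already known to exist in the table. */
public class EntityExistenceCache {
  private static final byte[] CREATED_AT = Bytes.toBytes("createdAt");

  private final HTable table;
  private final Set<String> known = new HashSet<String>();

  public EntityExistenceCache(HTable table) {
    this.table = table;
  }

  /** Returns true if the entity already has a row; caches positive hits. */
  public boolean exists(String entityId) throws IOException {
    if (known.contains(entityId)) {
      return true; // confirmed earlier in this task; skip the HBase read
    }
    Get get = new Get(Bytes.toBytes(entityId));
    get.addFamily(CREATED_AT);
    if (!table.get(get).isEmpty()) {
      known.add(entityId);
      return true;
    }
    return false;
  }

  /** Call after inserting a new entity so later records skip the Get. */
  public void markInserted(String entityId) {
    known.add(entityId);
  }
}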


Re: Table Updates with Map/Reduce

Posted by Jean-Daniel Cryans <jd...@gmail.com>.
Brian (guessing it's your name from your email address),

Please be more specific about your table design. For example, a "column" in
HBase is a vague word since it may refer to a column family or to a column
key inside a column family. Also, what kind of load do you expect to have?

Maybe answering this will also help you understand HBase.

Thx,

J-D
