Posted to user@hbase.apache.org by Claudio Martella <cl...@tis.bz.it> on 2010/11/29 16:12:32 UTC

incremental counters and a global String->Long Dictionary

Hello list,

I'm fairly new to HBase, so I'm posting this as a request for comments.
Very briefly: I do a lot of text processing with MapReduce, so it's very
useful for me to convert strings to longs to make my computations
faster.

My corpus keeps growing, and I want this String->Long mapping to be
persistent and dynamic (I want to add new mappings when I find new words).
At the moment I'm tackling the problem this way (pseudo-code):

longvalue = convert(word) # gets from hbase
if longvalue == -1:
    longvalue = insert(word) # puts in hbase

longvalue now contains the new mapped value. This approach requires a
global counter that stores the latest mapped long and is incremented on
every insert. I can easily do this in two ways: a special row in HBase,
"_counter", that I increment through incrementColumnValue, or a sequential
non-ephemeral znode in ZooKeeper whose version I use as my counter. The
first one is of course faster. So the solution would be:

insert(word):
    longvalue = hbase.incrementColumnValue("_counter", "v")
    hbase.put(word, longvalue)
    return longvalue

The problem is that between the time I realize there's no mapping for my
word and the time I insert the new longvalue, somebody else might have
done the same, so I end up with a corrupted dictionary (the same word
gets handed two different longs).
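
A minimal Python sketch of that interleaving, replayed by hand so it is deterministic. The dict `table` and the `counter` cell are in-memory stand-ins (assumptions for illustration, not HBase API calls); `convert` and `insert` mirror the pseudo-code above.

```python
# In-memory stand-ins for the HBase table and the "_counter" row (assumptions).
table = {}
counter = {"v": 0}

def convert(word):
    """Return the long mapped to word, or -1 if there is no mapping yet."""
    return table.get(word, -1)

def insert(word):
    """Unsafe insert: take the next counter value, then blindly put it."""
    counter["v"] += 1            # stands in for incrementColumnValue
    table[word] = counter["v"]   # stands in for put
    return counter["v"]

# Clients A and B both look up "hbase" before either has inserted it.
a_seen = convert("hbase")
b_seen = convert("hbase")

# Both saw -1, so both insert; B's put overwrites A's.
a_id = insert("hbase") if a_seen == -1 else a_seen
b_id = insert("hbase") if b_seen == -1 else b_seen

print(a_id, b_id, convert("hbase"))  # 1 2 2
```

Client A keeps using 1 for "hbase" while the table now says 2, which is the corruption described above.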

One possible solution would be to acquire a lock on the "_counter" row,
recheck for the presence of the mapping and then insert my new value:

safe_insert(word):
    lock("_counter")
    longvalue = convert(word)
    if longvalue == -1: #nobody inserted the mapping in the meantime
        longvalue = insert(word)
    unlock("_counter")
    return longvalue
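
The global-lock variant can be sketched in Python as follows. This is a single-process model under stated assumptions: `threading.Lock` stands in for the row lock on "_counter", and the dict `table` stands in for the HBase table, so it shows the locking discipline rather than real HBase calls.

```python
import threading

table = {}                        # stand-in for the dictionary table
counter = {"v": 0}                # stand-in for the "_counter" row
counter_lock = threading.Lock()   # stand-in for the row lock on "_counter"

def convert(word):
    """Return the mapped long, or -1 if the word has no mapping."""
    return table.get(word, -1)

def insert(word):
    """Allocate the next id and store it; only called with the lock held."""
    counter["v"] += 1
    table[word] = counter["v"]
    return counter["v"]

def safe_insert(word):
    # "with" releases the lock even if convert() or insert() raises,
    # which a bare lock()/unlock() pair would not.
    with counter_lock:
        longvalue = convert(word)
        if longvalue == -1:       # nobody inserted the mapping in the meantime
            longvalue = insert(word)
        return longvalue

def worker(words, seen):
    for word in words:
        seen.append((word, safe_insert(word)))

# Eight threads race to map the same three words.
words = ["apple", "banana", "cherry"] * 50
results = [[] for _ in range(8)]
threads = [threading.Thread(target=worker, args=(words, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread ends up with the same id per word and no counter value is wasted, at the cost of serializing all lookups that go through `safe_insert`.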

This way the counter row, with its lock, would behave as a global lock.
This would solve my problem but would create a bottleneck (although
over time my inserts become very rare as the dictionary grows). A
way around the bottleneck would be per-word locks in ZooKeeper:

ZKsafe_insert(word):
    ZKlock("/words/"+ word)
    longvalue = convert(word)
    if longvalue == -1: #nobody inserted the mapping in the meantime
        longvalue = insert(word)
    ZKunlock("/words/"+word)
    return longvalue
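
To illustrate the difference in lock granularity, here is a Python sketch with one lock per word. It does not use ZooKeeper: the `word_locks` dict is a hypothetical in-process stand-in for lock znodes under /words/, and `counter_lock` only models the fact that the counter increment itself is atomic.

```python
import threading
from collections import defaultdict

table = {}                                 # stand-in for the dictionary table
counter = {"v": 0}                         # stand-in for the "_counter" row
counter_lock = threading.Lock()            # models the atomic increment
registry_lock = threading.Lock()           # guards creation of per-word locks
word_locks = defaultdict(threading.Lock)   # stand-in for znodes under /words/

def lock_for(word):
    """Return the lock dedicated to this word, creating it on first use."""
    with registry_lock:
        return word_locks[word]

def convert(word):
    return table.get(word, -1)

def insert(word):
    with counter_lock:
        counter["v"] += 1
        longvalue = counter["v"]
    table[word] = longvalue
    return longvalue

def zk_safe_insert(word):
    # Only clients working on the same word wait for each other.
    with lock_for(word):
        longvalue = convert(word)
        if longvalue == -1:    # nobody inserted the mapping in the meantime
            longvalue = insert(word)
        return longvalue

def worker(words, seen):
    for word in words:
        seen.append((word, zk_safe_insert(word)))

words = ["apple", "banana", "cherry"] * 50
results = [[] for _ in range(8)]
threads = [threading.Thread(target=worker, args=(words, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Inserts of different words no longer block each other; the ids are still unique, but their order across words depends on thread scheduling.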

This of course would allow me to have more fine-grained locks and better
scalability, but I'd rely on a system with higher latency (ZK).

Does anybody have a better solution with HBase? I guess using
hbase_transactional would also be a possibility, but again, what about
speed and the current issues with the package (like recovering in the
face of HRegion failure)?


Thank you,

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.martella@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to privacy@tis.bz.it in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it.



Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
Hi Todd,

You're right, there's no need to be a purist in this case.

Thanks


On 12/1/10 9:24 AM, Todd Lipcon wrote:
> On Tue, Nov 30, 2010 at 6:02 AM, Claudio Martella <
> claudio.martella@tis.bz.it> wrote:
>
>> Lars,
>>
>> yes, that's exactly the problem. I also considered checkAndPut(), but
>> that wouldn't work for two reasons:
>>
>> 1) I wouldn't atomically both insert the row AND increment the counter
>>
> To me, that seems fine - if you "waste" a counter value, is that really a
> problem? The race is rare, so it's only occasional that you will grab a
> counter value with the ICV and then end up losing the checkAndPut. When that
> happens, yes, you skip an identifier, but the goal here is just to compress
> the strings, and therefore there's no real serious benefit to keeping the
> range of identifiers completely contiguous.
>
>
>> 2) from what I can tell, checkAndPut() doesn't work like existsOrPut(),
>> right? That would be a nice addition to the API, I think. What do you guys
>> think?
>>
> As Ryan mentioned, you can do this with null and the current API.
>
> -Todd



Re: incremental counters and a global String->Long Dictionary

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Nov 30, 2010 at 6:02 AM, Claudio Martella <
claudio.martella@tis.bz.it> wrote:

> Lars,
>
> yes, that's exactly the problem. I also considered checkAndPut(), but
> that wouldn't work for two reasons:
>
> 1) I wouldn't atomically both insert the row AND increment the counter
>

To me, that seems fine - if you "waste" a counter value, is that really a
problem? The race is rare, so it's only occasional that you will grab a
counter value with the ICV and then end up losing the checkAndPut. When that
happens, yes, you skip an identifier, but the goal here is just to compress
the strings, and therefore there's no real serious benefit to keeping the
range of identifiers completely contiguous.


> 2) from what I can tell, checkAndPut() doesn't work like existsOrPut(),
> right? That would be a nice addition to the API, I think. What do you guys
> think?
>

As Ryan mentioned, you can do this with null and the current API.
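
A rough Python sketch of the flow discussed here: grab a counter value, then store it only if the word has no mapping yet. The `DictionaryTable` class is a hypothetical in-memory model, not the HBase client; its `check_and_put` treats `expected=None` as "the cell must not exist", which is how the thread describes passing null to checkAndPut().

```python
import threading

class DictionaryTable:
    """In-memory model of the two atomic operations the thread relies on."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counter = 0      # models the "_counter" row
        self._words = {}       # models word -> long cells

    def increment_counter(self):
        """Models incrementColumnValue: atomic increment, returns new value."""
        with self._lock:
            self._counter += 1
            return self._counter

    def check_and_put(self, word, expected, value):
        """Models checkAndPut: store value only if the current cell equals
        expected. expected=None means the cell must not exist yet."""
        with self._lock:
            if self._words.get(word) != expected:
                return False
            self._words[word] = value
            return True

    def get(self, word):
        with self._lock:
            return self._words.get(word)


def get_or_create(table, word):
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment_counter()
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: candidate is wasted, use the winner's id instead.
    return table.get(word)


table = DictionaryTable()
first = get_or_create(table, "hbase")               # takes id 1

# Replay a lost race by hand: we draw id 2, another client draws id 3
# and wins the put, so our check_and_put fails and id 2 is never used.
ours = table.increment_counter()
theirs = table.increment_counter()
they_won = table.check_and_put("zookeeper", None, theirs)
we_won = table.check_and_put("zookeeper", None, ours)
```

After the replay, "zookeeper" maps to 3 and id 2 is simply skipped; the mapping stays consistent without any lock held across the lookup and the put.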

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
Lars,

yes, that's exactly the problem. I also considered checkAndPut(), but
that wouldn't work for two reasons:

1) I wouldn't atomically both insert the row AND increment the counter
2) from what I can tell, checkAndPut() doesn't work like existsOrPut(),
right? That would be a nice addition to the API, I think. What do you guys think?

At this point I think that, if I decide not to implement Dave's idea, I'll
have to go for the big lock. It's slow at the beginning, but the bottleneck
is negligible after a while.

Thanks for your support.


On 11/30/10 2:18 AM, Lars George wrote:
> I like that idea Dave.
>
> As for the checkAndPut(), this will not work as Claudio intended? He
> wanted the counter and the put to run together, so the former is only
> half the deal? Just wondering.
>
> Lars



Re: incremental counters and a global String->Long Dictionary

Posted by Lars George <la...@gmail.com>.
I like that idea Dave.

As for the checkAndPut(), this will not work as Claudio intended? He
wanted the counter and the put to run together, so the former is only
half the deal? Just wondering.

Lars

On Tue, Nov 30, 2010 at 1:43 AM, Buttler, David <bu...@llnl.gov> wrote:
> A while back I had a strange idea to bypass this problem: create a 64-bit hash code for the word.  Your word space should be significantly smaller than 64 bits, so a good hash algorithm (the top 64 bits of SHA-1, say) should make collisions extremely rare.  And you can always check your dictionary later for collisions if this feels wrong.
> This should be a good deal simpler than trying to keep around an order-dependent integer mapping for your dictionary.  And it is somewhat recoverable if you ever lose your dictionary for some reason.
>
> Dave
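
Dave's hashing idea, quoted above, can be sketched in Python as follows. The function names are illustrative assumptions; the value is interpreted as a signed 64-bit integer so that it fits a Java long.

```python
import hashlib

def word_to_long(word):
    """Top 64 bits of SHA-1(word), as a signed 64-bit value like a Java long."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

def find_collisions(words):
    """Return {hash: [words]} for hashes shared by more than one distinct word."""
    seen = {}
    for word in set(words):
        seen.setdefault(word_to_long(word), []).append(word)
    return {h: ws for h, ws in seen.items() if len(ws) > 1}

print(word_to_long("hbase") == word_to_long("hbase"))  # True: no shared state needed
print(find_collisions(["hbase", "zookeeper", "hbase"]))  # {}
```

No counter or lock is involved, since every client computes the same long independently. As a rough guide, the birthday bound puts the chance of any collision among n distinct words at about n^2 / 2^65, and `find_collisions` is the "check your dictionary later" step.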
>
> -----Original Message-----
> From: Claudio Martella [mailto:claudio.martella@tis.bz.it]
> Sent: Monday, November 29, 2010 7:13 AM
> To: user@hbase.apache.org
> Subject: incremental counters and a global String->Long Dictionary
>
> Hello list,
>
> I'm kind of new to HBase, so I'll post this email with a request for
> comment.
> Very briefly, I do a lot of text processing with mapreduce, so it's very
> useful for me to convert string to longs, so i can make my computations
> faster.
>
> My corpus keeps on growing and I want this String->Long mapping to be
> persistent and dynamical (i want to add new mappings when i find new words).
> At the moment i'm tackling the problem this way (pseudo-code):
>
> longvalue = convert(word) # gets from hbase
> if longvalue == -1:
>    longvalue = insert(word) # puts in hbase
>
> longvalue now contains the new mapped value. This approach requires a
> global counter that saves the latest mapped long and increments at every
> insert. I can easily do this two ways. A special row in hbase "_counter"
> that I increment through IncrementColumnValue, or creating a sequential
> non-ephemeral znode in zookeeper and use the version as my counter. The
> first one is of course faster. So the solution would be:
>
> insert(word):
>    longvalue = hbase.incrementColumnValue("_counter", "v")
>    hbase.put(word, longvalue)
>    return longvalue
>
> The problem is that between the time i realize there's no mapping for my
> word and the time i insert the new longvalue, somebody else might have
> done the same for me, so I have a corrupted dictionary.
>
> One possible solution would be to acquire a lock on the "_counter" row,
> recheck for the presence of the mapping and then insert my new value:
>
> safe_insert(word):
>    lock("_counter")
>    longvalue = convert(word)
>    if longvalue == -1: #nobody inserted the mapping in the meantime
>        longvalue = insert(word)
>    unlock("_counter")
>    return longvalue
>
> This way the counter row, with its lock, would behave as a global lock.
> This would solve my problems but would create a bottleneck (although
> with time my inserts tend to get very rare as the dictionary grows). A
> solution to this problem would be to have locks on zookeeper based on words.
>
> ZKsafe_insert(word):
>    ZKlock("/words/"+ word)
>    longvalue = convert(word)
>    if longvalue == -1: #nobody inserted the mapping in the meantime
>        longvalue = insert(word)
>    ZKunlock("/words/"+word)
>    return longvalue
>
> This of course would allow me to have more finegrained locks and better
> scalability, but I'd relay on a system with higher latency (ZK).
>
> Does anybody have a better solution with hbase? I guess using
> hbase_transational would also be a possibility, but again, what about
> speed and the actual issues with the package (like recovering in the
> face of hregion failure).
>
>
> Thank you,
>
> Claudio
>

Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
OK, I had read it as checking for the non-existence of the column, not of the whole key. My bad.




On 12/2/10 10:43 PM, Stack wrote:
> I think it is documented already, Claudio:
>
> http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
>
> St.Ack
>





Re: incremental counters and a global String->Long Dictionary

Posted by Stack <st...@duboce.net>.
I think it is documented already, Claudio:

http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)

St.Ack

On Thu, Dec 2, 2010 at 7:42 AM, Claudio Martella
<cl...@tis.bz.it> wrote:
> Hi Ryan,
>
> Yes, that would help for sure. Shouldn't this feature be documented?
>
> Thanks
>
>

Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
Hi Ryan,

Yes, that would help for sure. Shouldn't this feature be documented?

Thanks


On 12/1/10 4:03 AM, Ryan Rawson wrote:
> checkAndPut interprets a null expected-value argument as a check for
> non-existence.  That is, if you set the expected value to null, the put
> will only succeed if the value does not exist.
>
> Would that help?
>
> -ryan
>





Re: incremental counters and a global String->Long Dictionary

Posted by Ryan Rawson <ry...@gmail.com>.
checkAndPut interprets a null expected-value argument as a check for
non-existence.  That is, if you set the expected value to null, the put
will only succeed if the value does not exist.

Would that help?

-ryan
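
A minimal sketch of how this could be applied to the dictionary insert from
the original post. It is plain Python against a toy in-memory table, not the
HBase client: InMemoryTable, check_and_put, increment and get_or_insert are
hypothetical names that only model the semantics of checkAndPut (with a null
expected value) and incrementColumnValue as described in this thread.

```python
import threading


class InMemoryTable:
    """Toy stand-in for an HBase table; NOT the real client API."""

    def __init__(self):
        self._cells = {}
        # One lock models the server-side atomicity of each single operation.
        self._lock = threading.Lock()

    def get(self, row):
        with self._lock:
            return self._cells.get(row)

    def increment(self, row, amount=1):
        # Models incrementColumnValue: atomic add that returns the new value.
        with self._lock:
            self._cells[row] = self._cells.get(row, 0) + amount
            return self._cells[row]

    def check_and_put(self, row, expected, value):
        # Models checkAndPut: write only if the current value equals
        # `expected`; expected=None means "only if the cell is absent".
        with self._lock:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = value
            return True


def get_or_insert(table, word):
    """Return the long mapped to `word`, creating the mapping if needed."""
    existing = table.get(word)
    if existing is not None:
        return existing
    # Assumption: the counter lives in a reserved row named "_counter",
    # so "_counter" itself cannot be used as a word in this sketch.
    candidate = table.increment("_counter")
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: another writer mapped the word first. Use its value;
    # the candidate is wasted and leaves a gap in the sequence.
    return table.get(word)
```

If two writers race on the same word, only one check_and_put succeeds; the
loser discards its counter value and reads back the winner's id, so the
mapping stays consistent without any explicit lock, at the cost of gaps in
the sequence.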

On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella
<cl...@tis.bz.it> wrote:
> Hi Dave,
>
> thanks for your idea. I also considered this possibility. Although the
> probability of a collision is very small, what scares me is that I don't
> think the corruption can be corrected.
> I can certainly detect it afterwards in O(N log N) time by scanning the
> table, but correcting my long-based corpus is impossible. Once the
> database is converted, the information is lost.
>
>

Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
Hi Dave,

thanks for your idea. I also considered this possibility. Although the
probability of a collision is very small, what scares me is that I
don't think the corruption can be corrected.
I can certainly detect it afterwards in O(N log N) time by scanning the
table, but correcting my long-based corpus is impossible: once the
database is converted, two colliding words share the same long and the
original information is lost.
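
The after-the-fact check mentioned above could look roughly like the sketch below. It assumes the dictionary has already been read out of the table as (word, long) pairs; the function name and the way the pairs are obtained are illustrative, not part of any HBase API.

```python
from itertools import groupby
from operator import itemgetter


def find_collisions(mappings):
    """Return {long_id: [words...]} for every id shared by more than one word.

    `mappings` is an iterable of (word, long_id) pairs, for example the
    result of a full table scan (the scan itself is left out here).
    """
    # Sorting by id costs O(N log N); afterwards equal ids are adjacent.
    by_id = sorted(set(mappings), key=itemgetter(1))
    collisions = {}
    for long_id, group in groupby(by_id, key=itemgetter(1)):
        words = sorted(word for word, _ in group)
        if len(words) > 1:
            collisions[long_id] = words
    return collisions
```

This only reports which words ended up sharing a long. It does not help with a corpus that has already been converted, which is exactly the concern raised above.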


On 11/30/10 1:43 AM, Buttler, David wrote:
> A while back I had a strange idea to bypass this problem: create a 64-bit hash code for the word.  Your word space should be significantly smaller than 2^64 entries, so a good hash algorithm (the top 64 bits of SHA-1, say) should make collisions extremely rare.  And you can always check your dictionary later for collisions if this feels wrong.
> This should be a good deal simpler than trying to keep around an order-dependent integer mapping for your dictionary.  And it is somewhat recoverable if you ever lose your dictionary for some reason.
>
> Dave
>



RE: incremental counters and a global String->Long Dictionary

Posted by "Buttler, David" <bu...@llnl.gov>.
A while back I had a strange idea to bypass this problem: create a 64-bit hash code for the word.  Your word space should be significantly smaller than 2^64 entries, so a good hash algorithm (the top 64 bits of SHA-1, say) should make collisions extremely rare.  And you can always check your dictionary later for collisions if this feels wrong.
This should be a good deal simpler than trying to keep around an order-dependent integer mapping for your dictionary.  And it is somewhat recoverable if you ever lose your dictionary for some reason.
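
A minimal sketch of this hashing idea, assuming words are encoded as UTF-8 and that the result should fit a signed Java long. The function names are illustrative, and the probability helper is the usual birthday-bound approximation, which is only meaningful while the result is much smaller than 1.

```python
import hashlib


def word_to_long(word):
    """Map a word to a signed 64-bit integer using the top 64 bits of SHA-1."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    # The first 8 bytes of the 20-byte digest are its top 64 bits.
    # signed=True keeps the result within the range of a Java long.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def collision_probability(n_words, bits=64):
    """Approximate P(at least one collision) as n*(n-1) / 2^(bits+1)."""
    return n_words * (n_words - 1) / 2 ** (bits + 1)
```

Because the long is computed from the word alone, no counter and no lock are needed, and the dictionary can be rebuilt from the word list. For a vocabulary of 10 million words the approximation gives a collision probability of roughly 3 in a million.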

Dave




Re: incremental counters and a global String->Long Dictionary

Posted by Stack <st...@duboce.net>.
You might try http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
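
To show how a check-and-put style call could remove the race without any explicit lock, here is a rough sketch against an in-memory stand-in. FakeTable and its methods are hypothetical and are not the HBase client API; the sketch assumes the real call can be asked to write only if the cell does not exist yet, which should be verified against the javadoc for the HBase version in use.

```python
import threading


class FakeTable:
    """In-memory stand-in for an HBase table (hypothetical, for illustration)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for the server-side row lock

    def get(self, row):
        with self._lock:
            return self._rows.get(row)

    def increment(self, row):
        # Mirrors the atomic counter increment from the original post.
        with self._lock:
            self._rows[row] = self._rows.get(row, 0) + 1
            return self._rows[row]

    def check_and_put(self, row, expected, value):
        # Atomically write `value` only if the current value equals `expected`
        # (None means "the row must not exist yet").
        with self._lock:
            if self._rows.get(row) != expected:
                return False
            self._rows[row] = value
            return True


def get_or_insert(table, word):
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment("_counter")
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: someone else mapped the word first. Our counter value
    # is simply wasted, so ids stay unique but are not gap-free.
    return table.get(word)
```

The design trade-off is that a lost race burns one counter value, so the longs are unique but not contiguous.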

St.Ack

On Mon, Nov 29, 2010 at 10:03 AM, Claudio Martella
<cl...@tis.bz.it> wrote:
> Hi Lars,
>
> thanks for your answer. Yes, I read the Percolator paper, but I'd like to
> solve my problem with existing software, and I like HBase.
> The ephemeral node is, I think, the last solution I proposed, the one I
> called ZKsafe_insert(). Is that right?
>

Re: incremental counters and a global String->Long Dictionary

Posted by Claudio Martella <cl...@tis.bz.it>.
Hi Lars,

thanks for your answer. Yes, I read the Percolator paper, but I'd like to
solve my problem with existing software, and I like HBase.
The ephemeral node is, I think, the last solution I proposed, the one I
called ZKsafe_insert(). Is that right?

On 11/29/10 6:35 PM, Lars George wrote:
> Hi Claudio,
>
> Did you have a look at Google's Percolator paper? I think a mechanism like this may work. Another option often used to implement distributed transactions is ZooKeeper: you could create an ephemeral node for the new word, and the host that succeeds in creating it adds the mapping and then releases the lock. Or some such.
>
> Lars 
>



Re: incremental counters and a global String->Long Dictionary

Posted by Lars George <la...@gmail.com>.
Hi Claudio,

Did you have a look at Google's Percolator paper? I think a mechanism like this may work. Another option often used to implement distributed transactions is ZooKeeper: you could create an ephemeral node for the new word, and the host that succeeds in creating it adds the mapping and then releases the lock. Or some such.
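
The ephemeral-node idea could be sketched as follows. FakeZooKeeper is a hypothetical in-memory stand-in, not a real ZooKeeper client; `dictionary` stands in for the HBase table and `next_id` for the atomic counter increment from the original post.

```python
import threading


class NodeExists(Exception):
    """Raised when the node is already there (hypothetical stand-in error)."""


class FakeZooKeeper:
    """In-memory stand-in for the create/delete calls the lock needs."""

    def __init__(self):
        self._nodes = set()
        self._guard = threading.Lock()

    def create_ephemeral(self, path):
        # Creation is atomic: exactly one caller succeeds for a given path.
        with self._guard:
            if path in self._nodes:
                raise NodeExists(path)
            self._nodes.add(path)

    def delete(self, path):
        with self._guard:
            self._nodes.discard(path)


def try_insert(zk, dictionary, next_id, word):
    """Insert `word` if we win the lock; return its id, or None to retry later."""
    path = "/words/" + word
    try:
        zk.create_ephemeral(path)
    except NodeExists:
        return None  # someone else is inserting this word right now
    try:
        if word not in dictionary:  # recheck while holding the lock
            dictionary[word] = next_id()
        return dictionary[word]
    finally:
        zk.delete(path)  # release the lock
```

In real ZooKeeper an ephemeral node disappears when the session of the client that created it ends, so a crashed host does not hold the per-word lock forever.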

Lars 
