Posted to common-user@hadoop.apache.org by Delip Rao <de...@gmail.com> on 2009/01/15 03:47:15 UTC
Indexed Hashtables
Hi,
I need to look up a large number of key/value pairs in my map(). Is
there any indexed hashtable available as part of the Hadoop I/O API?
I find HBase overkill for my application; something along the lines of
HashStore (www.cellspark.com/hashstore.html) should be fine.
Thanks,
Delip
Re: Indexed Hashtables
Posted by Renaud Delbru <re...@deri.org>.
Tokyo Cabinet?
http://tokyocabinet.sourceforge.net/index.html
--
Renaud Delbru
Delip Rao wrote:
>
> Hi,
>
> I need to lookup a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as a part of Hadoop I/O API?
> I find Hbase an overkill for my application; something on the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip
>
Re: Indexed Hashtables
Posted by Delip Rao <de...@gmail.com>.
Thanks everyone for the suggestions! I tried all options so far except
Voldemort (Steve) and here's my evaluation:
memcached (Sean) -- works very fast. Good option if used along with an
existing slow index.
MapFile (Peter) -- an excellent option that is part of Hadoop, but
very slow for a large number of key/value pairs. This was the problem
with HashStore too.
We initially started with HBase but found it very hard to set up, and
when we did, it wasn't kind to our modest academic cluster with
limited memory. But HBase is a great option otherwise. Our
requirements were very simple -- we have a few million key/value pairs
(both strings) that need to be looked up frequently. The solution I
ended up with was a simple trie-based hash over the keys, storing the
index of each corresponding value, which is kept on disk.
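The approach described above -- an in-memory key index pointing at values stored on disk -- can be sketched roughly as follows. This is an illustration, not the actual code from the thread: a sorted map stands in for the trie, and all class and method names are invented.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

// Sketch: keep keys (plus a file offset/length) in memory, values on disk.
// A TreeMap stands in for the trie mentioned above; the lookup idea is the same.
public class DiskBackedStore {
    private final TreeMap<String, long[]> index = new TreeMap<>(); // key -> {offset, length}
    private final RandomAccessFile data;

    public DiskBackedStore(File file) throws IOException {
        this.data = new RandomAccessFile(file, "rw");
    }

    public void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes("UTF-8");
        long offset = data.length();       // append the value at the end of the file
        data.seek(offset);
        data.write(bytes);
        index.put(key, new long[] { offset, bytes.length });
    }

    public String get(String key) throws IOException {
        long[] entry = index.get(key);
        if (entry == null) return null;    // key not present
        byte[] bytes = new byte[(int) entry[1]];
        data.seek(entry[0]);               // one seek + one read per lookup
        data.readFully(bytes);
        return new String(bytes, "UTF-8");
    }
}
```

The memory cost is then proportional to the keys only, which is what makes a few million string pairs manageable on a modest machine.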
Cheers,
Delip
On Thu, Jan 15, 2009 at 4:14 PM, Jim Twensky <ji...@gmail.com> wrote:
> Delip,
>
> Why do you think Hbase will be an overkill? I do something similar to what
> you're trying to do with Hbase and I haven't encountered any significant
> problems so far. Can you give some more info on the size of the data you
> have?
>
> Jim
>
> On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao <de...@gmail.com> wrote:
>
>> Hi,
>>
>> I need to lookup a large number of key/value pairs in my map(). Is
>> there any indexed hashtable available as a part of Hadoop I/O API?
>> I find Hbase an overkill for my application; something on the lines of
>> HashStore (www.cellspark.com/hashstore.html) should be fine.
>>
>> Thanks,
>> Delip
>>
>
Re: Indexed Hashtables
Posted by Jim Twensky <ji...@gmail.com>.
Delip,
Why do you think HBase will be overkill? I do something similar to what
you're trying to do with HBase and I haven't encountered any significant
problems so far. Can you give some more info on the size of the data you
have?
Jim
On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao <de...@gmail.com> wrote:
> Hi,
>
> I need to lookup a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as a part of Hadoop I/O API?
> I find Hbase an overkill for my application; something on the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip
>
Re: Indexed Hashtables
Posted by pr...@optivo.de.
Delip,
what about Hadoop MapFile?
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html
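For reference, usage looks roughly like this against the Hadoop API of that era (the path and keys are illustrative; note that MapFile requires keys to be appended in sorted order):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
    // Writes a small MapFile and reads one key back.
    static String roundTrip() throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        // Write: keys MUST be appended in sorted order.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "lookup.map", Text.class, Text.class);
        writer.append(new Text("apple"), new Text("fruit"));
        writer.append(new Text("bolt"), new Text("hardware"));
        writer.close();

        // Read: binary-searches the in-memory key index, then seeks in the data file.
        MapFile.Reader reader = new MapFile.Reader(fs, "lookup.map", conf);
        Text value = new Text();
        try {
            if (reader.get(new Text("bolt"), value) == null) {
                return null; // key not found
            }
            return value.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

Requires the Hadoop core jar on the classpath; the writer produces a directory containing a `data` file and a sparse `index` file.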
Regards,
Peter
Re: Indexed Hashtables
Posted by Steve Loughran <st...@apache.org>.
Sean Shanny wrote:
> Delip,
>
> So far we have had pretty good luck with memcached. We are building a
> hadoop based solution for data warehouse ETL on XML based log files that
> represent click stream data on steroids.
>
> We process about 34 million records or about 70 GB data a day. We have
> to process dimensional data in our warehouse and then load the surrogate
> <key><value> pairs in memcached so we can traverse the XML files once
> again to perform the substitutions. We are using the memcached solution
> because it scales out just like Hadoop. We will have code that allows
> us to fall back to the DB if the memcached lookup fails, but that should
> not happen too often.
>
LinkedIn have just opened up something they run internally, Project
Voldemort:
http://highscalability.com/product-project-voldemort-distributed-database
http://project-voldemort.com/
It's a Java-based DHT. I haven't played with it yet, but it looks like
a good addition to the portfolio.
Re: Indexed Hashtables
Posted by Sean Shanny <ss...@tripadvisor.com>.
Delip,
So far we have had pretty good luck with memcached. We are building a
Hadoop-based solution for data warehouse ETL on XML-based log files
that represent click stream data on steroids.
We process about 34 million records, or about 70 GB of data, a day. We
have to process dimensional data in our warehouse and then load the
surrogate <key><value> pairs into memcached so we can traverse the XML
files once again to perform the substitutions. We are using the
memcached solution because it scales out just like Hadoop. We will
have code that allows us to fall back to the DB if the memcached
lookup fails, but that should not happen too often.
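The fall-back pattern described here (try the cache first, hit the database on a miss, backfill the cache) can be sketched without a real memcached client. In this illustration the in-memory `Map` and the `database` function are stand-ins, not real APIs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside lookup: consult the fast store first, fall back to the
// authoritative (slow) store on a miss, and backfill the cache.
public class FallbackLookup {
    private final Map<String, String> cache = new HashMap<>();   // stand-in for memcached
    private final Function<String, String> database;             // stand-in for the DB query

    public FallbackLookup(Function<String, String> database) {
        this.database = database;
    }

    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {                 // cache miss: should be rare
            value = database.apply(key);     // slow path to the DB
            if (value != null) {
                cache.put(key, value);       // backfill so the next lookup is fast
            }
        }
        return value;
    }
}
```

The design keeps the cache purely an optimization: correctness never depends on a key being cached, so cache evictions or restarts only cost latency.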
Thanks.
--sean
Sean Shanny
sshanny@tripadvisor.com
On Jan 14, 2009, at 9:47 PM, Delip Rao wrote:
> Hi,
>
> I need to lookup a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as a part of Hadoop I/O API?
> I find Hbase an overkill for my application; something on the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip