Posted to common-user@hadoop.apache.org by Delip Rao <de...@gmail.com> on 2009/01/15 03:47:15 UTC

Indexed Hashtables

Hi,

I need to look up a large number of key/value pairs in my map(). Is
there any indexed hashtable available as part of the Hadoop I/O API?
I find Hbase overkill for my application; something along the lines of
HashStore (www.cellspark.com/hashstore.html) should be fine.

Thanks,
Delip

Re: Indexed Hashtables

Posted by Renaud Delbru <re...@deri.org>.
Tokyo Cabinet ?

http://tokyocabinet.sourceforge.net/index.html
-- 
Renaud Delbru

Delip Rao wrote:
>
> Hi,
>
> I need to look up a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as part of the Hadoop I/O API?
> I find Hbase overkill for my application; something along the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip
>

Re: Indexed Hashtables

Posted by Delip Rao <de...@gmail.com>.
Thanks everyone for the suggestions! I tried all the options so far except
Voldemort (Steve); here's my evaluation:

memcached (Sean) -- works very fast. A good option if used alongside an
existing slow index.
MapFile (Peter) -- an excellent option that is part of Hadoop, but it works
very slowly for large numbers of key/value pairs. This was the problem
with HashStore too.

We initially started with Hbase but found it very hard to set up, and
when we did, it wasn't kind to our modest academic cluster with
limited memory. But Hbase is a great option otherwise. Our
requirements were very simple -- we have a few million key/value pairs
(both strings) that need to be looked up frequently. The solution I
ended up with was a simple trie-based hash over the keys, storing the
index of the corresponding values, which are kept on disk.
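
For readers curious what such a structure might look like: below is a
hypothetical plain-Java sketch of a trie mapping string keys to long
offsets into an on-disk value file. This is an illustration, not Delip's
actual code; the on-disk half is elided, and lookups simply return the
stored offset (-1 if the key is absent).

```java
import java.util.HashMap;
import java.util.Map;

// Trie over string keys; each terminal node holds a long "disk offset".
// In the real setup the offset would index into a flat value file that
// is read on demand; here we only model the in-memory key index.
class KeyTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        long offset = -1;               // -1 means "no value ends here"
    }

    private final Node root = new Node();

    public void put(String key, long offset) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.offset = offset;
    }

    public long get(String key) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return -1;   // key not present
        }
        return n.offset;
    }

    public static void main(String[] args) {
        KeyTrie t = new KeyTrie();
        t.put("hadoop", 0L);
        t.put("hbase", 512L);
        System.out.println(t.get("hbase"));  // prints 512
        System.out.println(t.get("hive"));   // prints -1
    }
}
```

Because the trie shares prefixes between keys, a few million string keys
can fit in modest memory while the (larger) values stay on disk.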

Cheers,
Delip


On Thu, Jan 15, 2009 at 4:14 PM, Jim Twensky <ji...@gmail.com> wrote:
> Delip,
>
> Why do you think Hbase will be overkill? I do something similar to what
> you're trying to do with Hbase and I haven't encountered any significant
> problems so far. Can you give some more info on the size of the data you
> have?
>
> Jim
>
> On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao <de...@gmail.com> wrote:
>
>> Hi,
>>
>> I need to look up a large number of key/value pairs in my map(). Is
>> there any indexed hashtable available as part of the Hadoop I/O API?
>> I find Hbase overkill for my application; something along the lines of
>> HashStore (www.cellspark.com/hashstore.html) should be fine.
>>
>> Thanks,
>> Delip
>>
>

Re: Indexed Hashtables

Posted by Jim Twensky <ji...@gmail.com>.
Delip,

Why do you think Hbase will be overkill? I do something similar to what
you're trying to do with Hbase and I haven't encountered any significant
problems so far. Can you give some more info on the size of the data you
have?

Jim

On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao <de...@gmail.com> wrote:

> Hi,
>
> I need to look up a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as part of the Hadoop I/O API?
> I find Hbase overkill for my application; something along the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip
>

Re: Indexed Hashtables

Posted by pr...@optivo.de.
Delip,

what about Hadoop MapFile?

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/MapFile.html
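
For context: a MapFile is essentially a sorted SequenceFile of key/value
pairs plus a sparse in-memory index of every N-th key (128 by default).
The lookup mechanism can be sketched in plain Java with no Hadoop
dependency; arrays stand in for the on-disk files here, so this is an
illustration of the idea, not the Hadoop API.

```java
import java.util.Arrays;

// Illustration of the MapFile lookup idea: keys are stored sorted, a
// sparse index keeps every INTERVAL-th key in memory, and a lookup
// binary-searches the sparse index, then scans only one small block.
class SparseIndexLookup {
    static final int INTERVAL = 128;        // MapFile's default index interval

    private final String[] sortedKeys;      // stands in for the sorted data file
    private final String[] values;
    private final String[] index;           // every INTERVAL-th key, in memory

    SparseIndexLookup(String[] sortedKeys, String[] values) {
        this.sortedKeys = sortedKeys;
        this.values = values;
        int n = (sortedKeys.length + INTERVAL - 1) / INTERVAL;
        index = new String[n];
        for (int i = 0; i < n; i++) index[i] = sortedKeys[i * INTERVAL];
    }

    String get(String key) {
        // Locate the block whose first key is <= the search key.
        int pos = Arrays.binarySearch(index, key);
        int block = pos >= 0 ? pos : -pos - 2;
        if (block < 0) return null;         // key sorts before all stored keys
        int start = block * INTERVAL;
        int end = Math.min(start + INTERVAL, sortedKeys.length);
        for (int i = start; i < end; i++) { // scan the one candidate block
            if (sortedKeys[i].equals(key)) return values[i];
        }
        return null;
    }
}
```

The trade-off is the same one Delip observed: the sparse index keeps
memory small, but each miss still costs a block scan on disk, which adds
up for very large numbers of lookups.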

Regards,

Peter


Re: Indexed Hashtables

Posted by Steve Loughran <st...@apache.org>.
Sean Shanny wrote:
> Delip,
> 
> So far we have had pretty good luck with memcached.  We are building a 
> Hadoop-based solution for data warehouse ETL on XML-based log files that 
> represent clickstream data on steroids.
> 
> We process about 34 million records, or about 70 GB of data, a day.  We have 
> to process dimensional data in our warehouse and then load the surrogate 
> <key><value> pairs into memcached so we can traverse the XML files once 
> again to perform the substitutions.  We are using the memcached solution 
> because it scales out just like Hadoop.  We will have code that allows 
> us to fall back to the DB if the memcached lookup fails, but that should 
> not happen too often.
> 

LinkedIn have just opened up something they run internally, Project 
Voldemort:

http://highscalability.com/product-project-voldemort-distributed-database
http://project-voldemort.com/

It's a Java-based DHT. I haven't played with it yet, but it looks like 
a good addition to the portfolio.


Re: Indexed Hashtables

Posted by Sean Shanny <ss...@tripadvisor.com>.
Delip,

So far we have had pretty good luck with memcached.  We are building a
Hadoop-based solution for data warehouse ETL on XML-based log files
that represent clickstream data on steroids.

We process about 34 million records, or about 70 GB of data, a day.  We
have to process dimensional data in our warehouse and then load the
surrogate <key><value> pairs into memcached so we can traverse the XML
files once again to perform the substitutions.  We are using the
memcached solution because it scales out just like Hadoop.  We will
have code that allows us to fall back to the DB if the memcached
lookup fails, but that should not happen too often.
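
The fallback Sean describes is the classic cache-aside lookup pattern. A
minimal sketch, with a ConcurrentHashMap standing in for memcached and a
Function standing in for the database query (both are placeholders, not
the actual ETL code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside lookup: try the cache first, fall back to the (slower)
// authoritative store on a miss, and repopulate the cache on the way out.
class CacheAsideLookup {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> db;   // fallback DB lookup

    CacheAsideLookup(Function<String, String> db) {
        this.db = db;
    }

    String lookup(String key) {
        String v = cache.get(key);
        if (v != null) return v;                 // cache hit: fast path
        v = db.apply(key);                       // cache miss: go to the DB
        if (v != null) cache.put(key, v);        // repopulate for next time
        return v;
    }
}
```

As long as misses are rare, the DB only sees the occasional stray lookup,
which is why the pattern scales out alongside the rest of the cluster.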

Thanks.

--sean

Sean Shanny
sshanny@tripadvisor.com




On Jan 14, 2009, at 9:47 PM, Delip Rao wrote:

> Hi,
>
> I need to look up a large number of key/value pairs in my map(). Is
> there any indexed hashtable available as part of the Hadoop I/O API?
> I find Hbase overkill for my application; something along the lines of
> HashStore (www.cellspark.com/hashstore.html) should be fine.
>
> Thanks,
> Delip