Posted to common-user@hadoop.apache.org by Chris Dyer <re...@umd.edu> on 2008/09/18 07:05:03 UTC

Serving contents of large MapFiles/SequenceFiles from memory across many machines

Hi all-
One more question.

I'm looking for a lightweight way to serve data stored as key-value
pairs in a series of MapFiles or SequenceFiles.  HBase/Hypertable
offer a very robust, powerful solution to this problem with a bunch of
extra features like updates and column types, etc., that I don't need
at all.  But, I'm wondering if there might be something
ultra-lightweight that someone has come up with for a very restricted
(but important!) set of use cases.  Basically, I'd like to be able to
load the entire contents of a key-value map file in DFS into
memory across many machines in my cluster so that I can access any of
it with ultra-low latencies.  I don't need updates--I just need
ultra-fast queries into a very large hash map (actually, just an array
would be sufficient).  This would correspond, approximately, to the
"sstable" functionality that BigTable is implemented on top of, but
which is also useful for many, many things directly (refer to the
BigTable paper or
http://www.techworld.com/storage/features/index.cfm?featureid=3183).
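
(For concreteness, the per-node loading step I have in mind is roughly
the following -- just a sketch assuming a Text-to-Text SequenceFile;
the hard part is doing this once per machine and keeping it resident,
rather than once per task:)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Slurps a Text-to-Text SequenceFile from DFS into a plain in-memory map. */
public class InMemoryTable {

  public static Map<String, String> load(Configuration conf, Path file)
      throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    Map<String, String> table = new HashMap<String, String>();
    try {
      Text key = new Text();
      Text value = new Text();
      // next(key, value) advances the reader and returns false at EOF.
      while (reader.next(key, value)) {
        table.put(key.toString(), value.toString());
      }
    } finally {
      reader.close();
    }
    return table;
  }
}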

This question may be better targeted to the HBase community; if so,
please let me know.  Has anyone else tried to deal with this?

Thanks--
Chris

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Andrzej Bialecki <ab...@getopt.org>.
Miles Osborne wrote:
> the problem here is that you don't want each mapper/reducer to have a
> copy of the data.  you want that data --which can be very large--
> stored in a distributed manner over your cluster, with random
> access to it during computation.
> 
> (this is what HBase etc do)

I had a somewhat similar situation to the one that the original poster 
described. In my case the trick proved to be to avoid actually accessing 
the data when possible ... The key space was very large, but sparsely 
populated - namely, consisting of phrases up to N words long. The 
MapFile-s that contained a reference dictionary were large (on the order 
of a hundred million records) and it was inconvenient to copy them to 
local machines.

I managed to get a decent performance by combining two approaches:

* using a fail-fast version of MapFile-s (see HADOOP-3063) - my map() 
implementation generated a lot of keys, which had to be tested against 
the dictionaries, and in most cases the phrases wouldn't exist in the 
dictionary, so they didn't actually have to be retrieved. Result - no 
I/O to check for missing keys. The BloomFilter in the BloomMapFile is 
loaded completely in memory, so the speed of lookup was fantastic.

* and in cases where I really had to load a few records from the 
dictionaries, I kept a local LRU cache. I went with a simple 
LinkedHashMap, but you could be more sophisticated and use JCS or 
something like that. A rough sketch of the combination is below.
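
(A minimal sketch, assuming a Text-to-Text dictionary; the
BloomMapFile.Reader API is the one added in HADOOP-3063, and
DictionaryLookup / CACHE_SIZE are just illustrative names:)

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

/** Fail-fast dictionary lookup: Bloom filter first, then a small LRU
 *  cache, and only then an actual MapFile read. */
public class DictionaryLookup {

  private static final int CACHE_SIZE = 10000;

  private final BloomMapFile.Reader reader;

  // LinkedHashMap in access-order mode makes a simple LRU cache.
  private final Map<Text, Text> cache =
      new LinkedHashMap<Text, Text>(CACHE_SIZE, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Text, Text> eldest) {
          return size() > CACHE_SIZE;
        }
      };

  public DictionaryLookup(Configuration conf, String dirName)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    reader = new BloomMapFile.Reader(fs, dirName, conf);
  }

  /** Returns null if the phrase is (almost certainly) not in the dictionary. */
  public Text lookup(Text phrase) throws IOException {
    // The Bloom filter lives entirely in memory, so misses cost no I/O.
    if (!reader.probablyHasKey(phrase)) {
      return null;
    }
    Text cached = cache.get(phrase);
    if (cached != null) {
      return cached;
    }
    Text value = new Text();
    if (reader.get(phrase, value) != null) {
      // Copy the key, since callers typically reuse their Text objects.
      cache.put(new Text(phrase), value);
      return value;
    }
    return null; // Bloom filter false positive
  }
}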

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
the problem here is that you don't want each mapper/reducer to have a
copy of the data.  you want that data --which can be very large--
stored in a distributed manner over your cluster and allow random
access to it during computation.

(this is what HBase etc do)

Miles

2008/9/19 Stuart Sierra <ma...@stuartsierra.com>:
> On Thu, Sep 18, 2008 at 1:05 AM, Chris Dyer <re...@umd.edu> wrote:
>> Basically, I'd like to be able to
>> load the entire contents of a key-value map file in DFS into
>> memory across many machines in my cluster so that I can access any of
>> it with ultra-low latencies.
>
> I think the simplest way, which I've used, is to put your key-value
> file into DistributedCache, then load it into a HashMap or ArrayList
> in the configure method of each Map/Reduce task.
>
> -Stuart
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Stuart Sierra <ma...@stuartsierra.com>.
On Thu, Sep 18, 2008 at 1:05 AM, Chris Dyer <re...@umd.edu> wrote:
> Basically, I'd like to be able to
> load the entire contents of a key-value map file in DFS into
> memory across many machines in my cluster so that I can access any of
> it with ultra-low latencies.

I think the simplest way, which I've used, is to put your key-value
file into DistributedCache, then load it into a HashMap or ArrayList
in the configure method of each Map/Reduce task.
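
Roughly like this, for example (a sketch against the old
org.apache.hadoop.mapred API, assuming the cached file is a
tab-separated text dump of the key-value pairs; LookupMapper and the
exact map() logic are just placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> dictionary = new HashMap<String, String>();

  @Override
  public void configure(JobConf conf) {
    try {
      // Files registered at submission time show up here as local
      // paths on every task node.
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      if (cached != null) {
        for (Path p : cached) {
          BufferedReader in = new BufferedReader(new FileReader(p.toString()));
          String line;
          while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t", 2);  // assumed tab-separated key/value
            if (kv.length == 2) {
              dictionary.put(kv[0], kv[1]);
            }
          }
          in.close();
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Failed to load cached dictionary", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String hit = dictionary.get(value.toString());
    if (hit != null) {
      output.collect(value, new Text(hit));
    }
  }
}

You register the file when submitting the job with
DistributedCache.addCacheFile(uri, conf), and each task node gets a
local copy.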

-Stuart

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Chris Dyer <re...@umd.edu>.
Memcached looks like it would be a reasonable solution for my problem,
although it's not optimal since it doesn't support an easy way of
initializing itself at startup; I can work around that, though.  This
may be wishful thinking, but does anyone have any experience using the
Hadoop job/task framework to launch supporting tasks such as memcached
processes?  Is there anyone else thinking about issues of scheduling
other kinds of tasks (other than mappers and reducers) in Hadoop?

Thanks--
Chris

On Fri, Sep 19, 2008 at 2:53 PM, Alex Feinberg <al...@socialmedia.com> wrote:
> Do any of CouchDB/Cassandra/other frameworks specifically do in-memory serving?
> I haven't found any that do this explicitly. For now I've been using
> memcached for that functionality (with the usual memcached caveats).
>
> Ehcache may be another memcached-like solution
> (http://ehcache.sourceforge.net/), but it also provides on-disk storage
> in addition to in-memory (thus avoiding the "if a machine goes down,
> data is lost" issue of memcached).
>
> On Fri, Sep 19, 2008 at 10:54 AM, James Moore <ja...@gmail.com> wrote:
>> On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer <re...@umd.edu> wrote:
>>> I'm looking for a lightweight way to serve data stored as key-value
>>> pairs in a series of MapFiles or SequenceFiles.
>>
>> Might be worth taking a look at CouchDB as well.  Haven't used it
>> myself, so can't comment on how it might work for what you're
>> describing.
>>
>> --
>> James Moore | james@restphone.com
>> Ruby and Ruby on Rails consulting
>> blog.restphone.com
>>
>
>
>
> --
> Alex Feinberg
> Platform Engineer, SocialMedia Networks
>

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Alex Feinberg <al...@socialmedia.com>.
Do any of CouchDB/Cassandra/other frameworks specifically do in-memory serving?
I haven't found any that do this explicitly. For now I've been using
memcached for that functionality (with the usual memcached caveats).

Ehcache may be another memcached-like solution
(http://ehcache.sourceforge.net/), but it also provides on-disk storage
in addition to in-memory (thus avoiding the "if a machine goes down,
data is lost" issue of memcached).

On Fri, Sep 19, 2008 at 10:54 AM, James Moore <ja...@gmail.com> wrote:
> On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer <re...@umd.edu> wrote:
>> I'm looking for a lightweight way to serve data stored as key-value
>> pairs in a series of MapFiles or SequenceFiles.
>
> Might be worth taking a look at CouchDB as well.  Haven't used it
> myself, so can't comment on how it might work for what you're
> describing.
>
> --
> James Moore | james@restphone.com
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>



-- 
Alex Feinberg
Platform Engineer, SocialMedia Networks

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by James Moore <ja...@gmail.com>.
On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer <re...@umd.edu> wrote:
> I'm looking for a lightweight way to serve data stored as key-value
> pairs in a series of MapFiles or SequenceFiles.

Might be worth taking a look at CouchDB as well.  Haven't used it
myself, so can't comment on how it might work for what you're
describing.

-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
hello Chris!

(if you are talking about serving language models and/or phrase tables)

i had a student look at using HBase for LMs this summer.  i don't
think it is sufficiently quick to deal with millions of queries per
second, but that may be due to blunders on our part.

it may be that Hypertable would work for you ... i've never tried it myself.

(i'm also interested in this problem, so we could talk offline about it)

Miles

2008/9/18 Chris Dyer <re...@umd.edu>:
> Hi all-
> One more question.
>
> I'm looking for a lightweight way to serve data stored as key-value
> pairs in a series of MapFiles or SequenceFiles.  HBase/Hypertable
> offer a very robust, powerful solution to this problem with a bunch of
> extra features like updates and column types, etc., that I don't need
> at all.  But, I'm wondering if there might be something
> ultra-lightweight that someone has come up with for a very restricted
> (but important!) set of use cases.  Basically, I'd like to be able to
> load the entire contents of a key-value map file in DFS into
> memory across many machines in my cluster so that I can access any of
> it with ultra-low latencies.  I don't need updates--I just need
> ultra-fast queries into a very large hash map (actually, just an array
> would be sufficient).  This would correspond, approximately, to the
> "sstable" functionality that BigTable is implemented on top of, but
> which is also useful for many, many things directly (refer to the
> BigTable paper or
> http://www.techworld.com/storage/features/index.cfm?featureid=3183).
>
> This question may be better targeted to the HBase community; if so,
> please let me know.  Has anyone else tried to deal with this?
>
> Thanks--
> Chris
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.