Posted to dev@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2012/03/22 18:42:27 UTC

Proposal - a high performance Key-Value store based on Lucene APIs/concepts

I've been spending quite a bit of time recently benchmarking various Key-Value stores for a demanding project and have been largely disappointed with the results.
However, I have developed a promising implementation based on these concepts: http://www.slideshare.net/MarkHarwood/lucene-kvstore

The code needs some packaging before I can release it but the slide deck should give a good overview of the design.


Is this likely to be of interest as a contrib module here?
I appreciate this is a departure from the regular search focus, but it builds on some common ground in Lucene core and may have some applications here.

Cheers,
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by "J. Delgado" <jo...@gmail.com>.
Hi Mark,

I'm interested in what you have done, in a somewhat peculiar way:

Currently, we use fields and terms in Lucene as the basis for the inverted
index. However, as you can read in this paper on indexing Boolean
expressions, http://theory.stanford.edu/~sergei/papers/vldb09-indexing.pdf,
the authors create a posting list for every attribute name and value pair
(also called a key) that appears among the conjunctions. A posting list head
contains the key (A, v). The keys of the posting lists are stored in a hash
table, which is used to look up posting lists given the keys of an assignment.
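
Concretely, the structure from the paper could be sketched along these
lines. This is only my own minimal Java illustration of the idea; the
class and method names are made up and are not taken from the paper or
from Lucene.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the paper's scheme: one posting list per (attribute, value)
// key, with the keys held in a hash table for direct lookup.
public class BooleanExpressionIndex {

    // The hash table of posting lists. The map key stands in for the (A, v) pair;
    // a composite string is used here purely to keep the sketch short.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    private static String key(String attribute, String value) {
        return attribute + "=" + value;
    }

    // Record that the given conjunction references the (attribute, value) key.
    public void add(String attribute, String value, int conjunctionId) {
        List<Integer> list = postings.get(key(attribute, value));
        if (list == null) {
            list = new ArrayList<Integer>();
            postings.put(key(attribute, value), list);
        }
        list.add(conjunctionId);
    }

    // Given one (attribute, value) pair from an assignment, return the ids of the
    // conjunctions whose posting list is headed by that key (empty if none).
    public List<Integer> lookup(String attribute, String value) {
        List<Integer> list = postings.get(key(attribute, value));
        return list != null ? list : new ArrayList<Integer>();
    }
}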

So perhaps the ability to mix the two different forms of indexes by
building a posting list for each entry in your KVStore may help me design a
Lucene-based solution for this problem.

-- J


On Sat, Mar 24, 2012 at 4:50 PM, Lance Norskog <go...@gmail.com> wrote:

> Cool!
>
> On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <ma...@yahoo.co.uk>
> wrote:
> > OK I have some code and benchmarks for this solution up on a Google Code
> project here: http://code.google.com/p/graphdb-load-tester/
> >
> > The project exists to address the performance challenges I have
> encountered when dealing with large graphs. It  uses all of the Wikipedia
> links as a test dataset and a choice of graph databases (most of which use
> Lucene BTW).
> > The test data is essentially 130 million edges representing links
> between pages e.g.  Communism->Russia.
> > To load the data all of the graph databases have to translate
> user-defined keys like "Russia" into an internally-generated node ID using
> a service that looks like this:
> >        interface KeyService
> >        {
> >                //Returns the existing node id or -1 if it is not already in the store
> >                public long getGraphNodeId(String udk);
> >
> >                //Adds a new record - assumption is client has checked
> user defined key (udk) is not stored already using getGraphNodeId
> >                public void put(String udk, long graphNodeId);
> >        }
> >
> > This is a challenge on a dataset of this size. I tried using a
> Lucene-based implementation for this service with the following
> optimisations:
> > 1) a Bloomfilter to quickly "know what we don't know"
> > 2) an LRUCache to hold on to commonly referenced vertices e.g. the
> Wikipedia article for "United States"
> > 3) a hashmap representing the unflushed state of Lucene's IndexWriter to
> avoid the need for excessive flushing with NRT reader etc
> >
> > The search/write performance showed the familiar saw-toothing as the
> Lucene index grew in size and merge operations kicked in.
> >
> > The KVStore implementation I wrote attempts to tackle this problem using
> a fundamentally different form of index. The results from the KVStore runs
> show it was twice as fast as this Lucene solution and maintained constant
> performance without the saw-toothing effect.
> >
> > Benchmark figures are here: http://goo.gl/VQ027
> > The KVStore source code is here: http://goo.gl/ovkop and the Lucene
> implementation I compare against is also in the project.
> >
> > Cheers
> > Mark
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Lance Norskog <go...@gmail.com>.
Cool!

On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <ma...@yahoo.co.uk> wrote:
> OK I have some code and benchmarks for this solution up on a Google Code project here: http://code.google.com/p/graphdb-load-tester/
>
> The project exists to address the performance challenges I have encountered when dealing with large graphs. It  uses all of the Wikipedia links as a test dataset and a choice of graph databases (most of which use Lucene BTW).
> The test data is essentially 130 million edges representing links between pages e.g.  Communism->Russia.
> To load the data all of the graph databases have to translate user-defined keys like "Russia" into an internally-generated node ID using a service that looks like this:
>        interface KeyService
>        {
>                //Returns the existing node id or -1 if it is not already in the store
>                public long getGraphNodeId(String udk);
>
>                //Adds a new record - assumption is client has checked user defined key (udk) is not stored already using getGraphNodeId
>                public void put(String udk, long graphNodeId);
>        }
>
> This is a challenge on a dataset of this size. I tried using a Lucene-based implementation for this service with the following optimisations:
> 1) a Bloomfilter to quickly "know what we don't know"
> 2) an LRUCache to hold on to commonly referenced vertices e.g. the Wikipedia article for "United States"
> 3) a hashmap representing the unflushed state of Lucene's IndexWriter to avoid the need for excessive flushing with NRT reader etc
>
> The search/write performance showed the familiar saw-toothing as the Lucene index grew in size and merge operations kicked in.
>
> The KVStore implementation I wrote attempts to tackle this problem using a fundamentally different form of index. The results from the KVStore runs show it was twice as fast as this Lucene solution and maintained constant performance without the saw-toothing effect.
>
> Benchmark figures are here: http://goo.gl/VQ027
> The KVStore source code is here: http://goo.gl/ovkop and the Lucene implementation I compare against is also in the project.
>
> Cheers
> Mark
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>



-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Mark Harwood <ma...@yahoo.co.uk>.
OK I have some code and benchmarks for this solution up on a Google Code project here: http://code.google.com/p/graphdb-load-tester/

The project exists to address the performance challenges I have encountered when dealing with large graphs. It uses all of the Wikipedia links as a test dataset and a choice of graph databases (most of which use Lucene, BTW).
The test data is essentially 130 million edges representing links between pages, e.g. Communism->Russia.
To load the data, all of the graph databases have to translate user-defined keys like "Russia" into an internally-generated node ID using a service that looks like this:
	interface KeyService
	{
		//Returns the existing node id, or -1 if the key is not already in the store
		public long getGraphNodeId(String udk);

		//Adds a new record - the assumption is that the client has already checked,
		//via getGraphNodeId, that the user-defined key (udk) is not yet stored
		public void put(String udk, long graphNodeId);
	}

This is a challenge on a dataset of this size. I tried using a Lucene-based implementation for this service with the following optimisations (a rough sketch of how they fit together follows the list):
1) a Bloomfilter to quickly "know what we don't know"
2) an LRUCache to hold on to commonly referenced vertices, e.g. the Wikipedia article for "United States"
3) a hashmap representing the unflushed state of Lucene's IndexWriter, to avoid the need for excessive flushing with an NRT reader etc.
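
For reference, here is a rough sketch of how those three layers sit in front of the index lookup. It is a simplified illustration rather than the actual project code: the class name, the cache size and the two small placeholder interfaces (for the Bloom filter and for the underlying Lucene search) are all mine.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the layered lookup described above: a Bloom filter to short-circuit
// misses, an LRU cache for hot keys, and a map of not-yet-flushed puts, all sitting
// in front of the (slower) Lucene index lookup.
public class LayeredKeyService implements KeyService {

	public interface BloomFilter {          // placeholder for any Bloom filter implementation
		void add(String key);
		boolean mightContain(String key);
	}

	public interface IndexLookup {          // placeholder for the Lucene-based lookup
		long findGraphNodeId(String udk);   // returns -1 if the key is not indexed
	}

	private static final int CACHE_SIZE = 10000;    // illustrative size only

	private final BloomFilter seenKeys;
	private final IndexLookup index;
	private final Map<String, Long> unflushed = new HashMap<String, Long>();
	private final Map<String, Long> lruCache =
		new LinkedHashMap<String, Long>(CACHE_SIZE, 0.75f, true) {
			protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
				return size() > CACHE_SIZE;     // evict the least-recently-used entry
			}
		};

	public LayeredKeyService(BloomFilter seenKeys, IndexLookup index) {
		this.seenKeys = seenKeys;
		this.index = index;
	}

	public long getGraphNodeId(String udk) {
		if (!seenKeys.mightContain(udk)) {
			return -1;                          // definitely never stored
		}
		Long cached = lruCache.get(udk);
		if (cached != null) {
			return cached.longValue();          // hot key, e.g. "United States"
		}
		Long pending = unflushed.get(udk);
		if (pending != null) {
			return pending.longValue();         // written but not yet flushed to the index
		}
		long id = index.findGraphNodeId(udk);   // fall through to the Lucene index
		if (id != -1) {
			lruCache.put(udk, Long.valueOf(id));
		}
		return id;
	}

	public void put(String udk, long graphNodeId) {
		seenKeys.add(udk);
		unflushed.put(udk, Long.valueOf(graphNodeId));   // flushed to Lucene in batches elsewhere
	}
}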

The search/write performance showed the familiar saw-toothing as the Lucene index grew in size and merge operations kicked in.

The KVStore implementation I wrote attempts to tackle this problem using a fundamentally different form of index. The results from the KVStore runs show it was twice as fast as this Lucene solution and maintained constant performance without the saw-toothing effect.

Benchmark figures are here: http://goo.gl/VQ027
The KVStore source code is here: http://goo.gl/ovkop and the Lucene implementation I compare against is also in the project.

Cheers
Mark





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Mark Harwood <ma...@yahoo.co.uk>.
> 
> Random question: Do you basically end up with something very similar to LevelDB that many people were talking about a few weeks ago?
> 


Haven't looked at LevelDB because I was concentrating on Java implementations.

Riak's Bitcask is the most similar in principle, but I didn't like the idea of holding full keys in RAM. See http://downloads.basho.com/papers/bitcask-intro.pdf





---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Thomas Matthijs <li...@selckin.be>.
On Thu, Mar 22, 2012 at 7:29 PM, Mark Harwood <ma...@yahoo.co.uk> wrote:

>
>
> Mark, can you share more about which K-V (NoSQL) stores you've been
> benchmarking and what the results have been?
>
>
> Mongo, Cassandra, Krati, Bdb, a Java version of BitCask, Lucene, MySQL
>
> I was interested in benchmarking the single-server stores rather than a
> distributed setup because your choice of store could be plugged into the
> likes of Voldemort for scale out.
>
> The design is similar to the Bitcask paper but keeps only hashes of keys
> in RAM, not the full keys.
>
> My implementation was the only store that didn't degrade noticeably as you
> get into 10s of millions of keys in the store.
>
>
>
Random question: Do you basically end up with something very similar to
LevelDB, which many people were talking about a few weeks ago?

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Mark Harwood <ma...@yahoo.co.uk>.

> Mark, can you share more about which K-V (NoSQL) stores you've been benchmarking and what the results have been?
> 

Mongo, Cassandra, Krati, Bdb, a Java version of BitCask, Lucene, MySQL

I was interested in benchmarking the single-server stores rather than a distributed setup because your choice of store could be plugged into the likes of Voldemort for scale out. 

The design is similar to the Bitcask paper but keeps only hashes of keys in RAM, not the full keys.
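
To make that concrete, the general shape is something like the minimal sketch below. This is only my own illustration of the principle, not the actual KVStore code: a small hash of each key is held in memory, pointing at the record's offset in an append-only log, and the stored key is re-read on lookup to detect collisions.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a Bitcask-style store that keeps only key *hashes* in RAM.
// Each in-memory entry maps hashCode(key) to the offset of a record in an
// append-only log file; the full key is read back from disk to confirm a match.
// Hash collisions are simply treated as misses here, which a real store would
// have to handle properly.
public class HashedKeyDirectory {

	private final RandomAccessFile log;
	private final Map<Integer, Long> offsetByKeyHash = new HashMap<Integer, Long>();

	public HashedKeyDirectory(RandomAccessFile log) {
		this.log = log;
	}

	public void put(String key, byte[] value) throws IOException {
		long offset = log.length();
		log.seek(offset);
		byte[] keyBytes = key.getBytes("UTF-8");
		log.writeInt(keyBytes.length);          // record layout: keyLen, key, valueLen, value
		log.write(keyBytes);
		log.writeInt(value.length);
		log.write(value);
		offsetByKeyHash.put(Integer.valueOf(key.hashCode()), Long.valueOf(offset));
	}

	public byte[] get(String key) throws IOException {
		Long offset = offsetByKeyHash.get(Integer.valueOf(key.hashCode()));
		if (offset == null) {
			return null;                        // hash never seen, so the key is not stored
		}
		log.seek(offset.longValue());
		byte[] storedKey = new byte[log.readInt()];
		log.readFully(storedKey);
		if (!key.equals(new String(storedKey, "UTF-8"))) {
			return null;                        // collision: a real store would chain or probe
		}
		byte[] value = new byte[log.readInt()];
		log.readFully(value);
		return value;
	}
}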

My implementation was the only store that didn't degrade noticeably as you get into 10s of millions of keys in the store. 





> Did you try all the well known ones?
> http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
> 
> -- J
> 
> On Thu, Mar 22, 2012 at 10:42 AM, mark harwood <ma...@yahoo.co.uk> wrote:
> I've been spending quite a bit of time recently benchmarking various Key-Value stores for a demanding project and have been largely disappointed with the results.
> However, I have developed a promising implementation based on these concepts:  http://www.slideshare.net/MarkHarwood/lucene-kvstore
> 
> The code needs some packaging before I can release it but the slide deck should give a good overview of the design.
> 
> 
> Is this likely to be of interest as a contrib module here?
> I appreciate this is a departure from the regular search focus but it builds on some common ground in Lucene core and may have some applications here.
> 
> Cheers,
> Mark
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by "J. Delgado" <jo...@gmail.com>.
Mark, can you share more about which K-V (NoSQL) stores you've been
benchmarking and what the results have been?

Did you try all the well known ones?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

-- J

On Thu, Mar 22, 2012 at 10:42 AM, mark harwood <ma...@yahoo.co.uk> wrote:

> I've been spending quite a bit of time recently benchmarking various
> Key-Value stores for a demanding project and have been largely disappointed
> with the results.
> However, I have developed a promising implementation based on these
> concepts:  http://www.slideshare.net/MarkHarwood/lucene-kvstore
>
> The code needs some packaging before I can release it but the slide deck
> should give a good overview of the design.
>
>
> Is this likely to be of interest as a contrib module here?
> I appreciate this is a departure from the regular search focus but it
> builds on some common ground in Lucene core and may have some applications
> here.
>
> Cheers,
> Mark
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

Posted by Ryan McKinley <ry...@gmail.com>.
+1

The one potential problem is the use of Trove for primitives


On Thu, Mar 22, 2012 at 10:42 AM, mark harwood <ma...@yahoo.co.uk> wrote:
> I've been spending quite a bit of time recently benchmarking various Key-Value stores for a demanding project and have been largely disappointed with the results.
> However, I have developed a promising implementation based on these concepts:  http://www.slideshare.net/MarkHarwood/lucene-kvstore
>
> The code needs some packaging before I can release it but the slide deck should give a good overview of the design.
>
>
> Is this likely to be of interest as a contrib module here?
> I appreciate this is a departure from the regular search focus but it builds on some common ground in Lucene core and may have some applications here.
>
> Cheers,
> Mark
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org