Posted to dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2008/07/11 04:16:11 UTC

Hadoop RPC for distributed Lucene

Has anyone taken a look at using Hadoop RPC for enabling distributed
Lucene?  I am thinking it would implement the Searchable interface and use
serialization to be compatible with the current RMI version.  Using Java
serialization somewhat defeats the purpose of Hadoop RPC; however, Hadoop
RPC scales far beyond what RMI can at the networking level.  RMI uses a
thread per socket and reportedly has latency issues, while Hadoop RPC uses
NIO and is proven to scale to thousands of servers.  Serialization
unfortunately must be used with Lucene because of the Weight, Query, and
Filter classes.  There could be an extended version of Searchable that
allows passing Weight, Query, and Filter classes that implement Hadoop's
Writable interface if a user wants to bypass serialization.
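
To make this concrete, here is a rough sketch of what a Searchable-style
protocol over Hadoop RPC might look like, written from memory against the
0.17-era org.apache.hadoop.ipc API.  RemoteSearchable and SearchNode are
hypothetical names, and a real implementation would pass Writable query
and result objects rather than plain strings:

import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical wire contract; Hadoop RPC marshals Writable arguments,
// so every parameter and return type must implement Writable.
interface RemoteSearchable extends VersionedProtocol {
  long VERSION = 1L;

  // Run a query against the node's local index and return the top hits.
  Text search(Text queryString, IntWritable numHits) throws IOException;
}

public class SearchNode implements RemoteSearchable {
  public long getProtocolVersion(String protocol, long clientVersion) {
    return VERSION;
  }

  public Text search(Text queryString, IntWritable numHits) {
    // A real node would parse the query and search a local IndexSearcher;
    // elided here.
    return new Text("hits for: " + queryString);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Server side: NIO-based, a shared thread pool rather than RMI's
    // thread per socket.
    Server server = RPC.getServer(new SearchNode(), "0.0.0.0", 9090, conf);
    server.start();

    // Client side: a dynamic proxy speaking Hadoop's IPC wire format.
    RemoteSearchable remote = (RemoteSearchable) RPC.getProxy(
        RemoteSearchable.class, RemoteSearchable.VERSION,
        new InetSocketAddress("localhost", 9090), conf);
    System.out.println(remote.search(new Text("title:lucene"),
        new IntWritable(10)));
  }
}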

Re: Hadoop RPC for distributed Lucene

Posted by Jason Rutherglen <ja...@gmail.com>.
In terms of a grid system for Lucene, one model that might be compelling
is using Hadoop RPC plus serialization to implement both the Searchable
side and an indexing side.

I personally think it would be interesting to implement code mobility like
Jini and RMI for Lucene.  This would have several benefits for a Lucene
grid computing system, the main one being simplicity in a distributed
environment.  One would only need to start a Java process and let it run;
there would be no need to deploy new jar files for new Query, Filter, or
Analyzer classes (assuming Analyzer is Serializable).  This is because the
class implementations for these types would be dynamically downloaded from
the client.  Doing so requires class-loading machinery that has been well
researched for Jini (see
http://research.sun.com/techrep/2006/smli_tr-2006-149.pdf).  The system I
am thinking of would be far simpler than what Sun has implemented, due to
the less complex requirements of Lucene.

Because the Lucene objects (Query, Filter, or Analyzer) that would be sent
over the network implement standard abstract classes, it is relatively
straightforward to have the underlying class implementations change on a
client and be sent over to a server, even on a per-method basis (an
extreme case).  The serialVersionUID would be used to ensure class
conflicts do not occur, and the ObjectInputStream.resolveClass method
would be used to ensure the proper class is used for a particular call to,
say, IndexWriter.addDocument(Document, Analyzer), where the Analyzer
implementation is subject to change.
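
A minimal sketch of that hook, using only standard java.io APIs; the
client-supplied ClassLoader (how the bytecode actually travels from client
to server) is the part that would need the Jini-style machinery and is
simply assumed here:

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

// Deserializes Query/Filter/Analyzer objects through a ClassLoader that
// can fetch bytecode shipped from the client, so the server never needs
// the client's jars deployed locally.
class MobileCodeInputStream extends ObjectInputStream {
  private final ClassLoader clientLoader;  // e.g. a URLClassLoader per client

  MobileCodeInputStream(InputStream in, ClassLoader clientLoader)
      throws IOException {
    super(in);
    this.clientLoader = clientLoader;
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    try {
      // Prefer the client's version of the class; a serialVersionUID
      // mismatch still fails fast during deserialization.
      return Class.forName(desc.getName(), false, clientLoader);
    } catch (ClassNotFoundException e) {
      // Fall back to whatever is on the server's own classpath.
      return super.resolveClass(desc);
    }
  }
}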


The indexing side would be an interface to IndexWriter, akin to
Searchable, offering many of the standard IndexWriter methods.  There
would probably need to be a bulk addDocument method; see the sketch below.
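
Something like the following hypothetical interface is what I have in
mind; none of these names exist in Lucene today, and the exact method set
is open for debate:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

// Hypothetical remote counterpart to IndexWriter, mirroring how
// Searchable exposes remote search.
interface RemoteIndexWriter {
  void addDocument(Document doc, Analyzer analyzer) throws IOException;

  // Bulk variant to amortize per-call RPC overhead.
  void addDocuments(Document[] docs, Analyzer analyzer) throws IOException;

  void deleteDocuments(Term term) throws IOException;

  void optimize() throws IOException;
}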

This type of system would enable grid computing with minimal setup and
maintenance cost.  One could remotely search and index thousands of Lucene
indexes as if they were on the local machine.  Failover and error handling
could be built on top.


On Fri, Jul 11, 2008 at 10:27 AM, Ken Krugler <kk...@transpac.com>
wrote:

>  I believe Hadoop RPC was originally built for distributed search for
> Nutch.  Here's some core code I think Nutch still uses:
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup
>
>
> Hadoop RPC is used for distributed search, but at a layer above Lucene -
> search requests are sent via RPC to remote "searchers", which are Java
> processes running on multiple boxes. These in turn make Lucene queries and
> send back results.
>
> You might want to look at the Katta project (
> http://katta.wiki.sourceforge.net/), which uses Hadoop to handle
> distributed Lucene indexes.
>
> -- Ken

Re: Hadoop RPC for distributed Lucene

Posted by Ken Krugler <kk...@transpac.com>.
>I believe Hadoop RPC was originally built for distributed search for
>Nutch.  Here's some core code I think Nutch still uses:
>http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup

Hadoop RPC is used for distributed search, but at a layer above
Lucene - search requests are sent via RPC to remote "searchers", 
which are Java processes running on multiple boxes. These in turn 
make Lucene queries and send back results.

You might want to look at the Katta project 
(http://katta.wiki.sourceforge.net/), which uses Hadoop to handle 
distributed Lucene indexes.

-- Ken



-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Hadoop RPC for distributed Lucene

Posted by Jason Rutherglen <ja...@gmail.com>.
I believe Hadoop RPC was originally built for distributed search for Nutch.
Here's some core code I think Nutch still uses:
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java?revision=619648&view=markup

One thing I wanted to add to the original email: if some of the core query
and filter classes implemented java.io.Externalizable, serialization would
speed up to roughly the level of Writable.  It would also remain backwards
compatible with, and enhance, the existing RMI-based distributed search.
Classes that do not implement Externalizable would simply fall back to the
default reflection-based serialization.
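
For illustration, here is roughly what that looks like on a made-up query
class (a real one would extend org.apache.lucene.search.Query); the fields
are invented:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class ExampleTermQuery implements Externalizable {
  private String field;
  private String text;
  private float boost;

  public ExampleTermQuery() {}  // Externalizable requires a public no-arg ctor

  public void writeExternal(ObjectOutput out) throws IOException {
    // Hand-written field encoding, comparable in spirit to Writable.write():
    // no reflective class metadata, just the raw values.
    out.writeUTF(field);
    out.writeUTF(text);
    out.writeFloat(boost);
  }

  public void readExternal(ObjectInput in)
      throws IOException, ClassNotFoundException {
    field = in.readUTF();
    text = in.readUTF();
    boost = in.readFloat();
  }
}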

On Fri, Jul 11, 2008 at 9:13 AM, Grant Ingersoll <gs...@apache.org>
wrote:

> I believe there is a subproject over at Hadoop for doing distributed stuff
> w/ Lucene, but I am not sure if they are doing search side, only indexing.
>  I was always under the impression that it was too slow for search side, as
> I don't think Nutch even uses it for the search side of the equation, but I
> don't know if that is still the case.

Re: Hadoop RPC for distributed Lucene

Posted by Grant Ingersoll <gs...@apache.org>.
I believe there is a subproject over at Hadoop for doing distributed  
stuff w/ Lucene, but I am not sure if they are doing search side, only  
indexing.  I was always under the impression that it was too slow for  
search side, as I don't think Nutch even uses it for the search side  
of the equation, but I don't know if that is still the case.




