Posted to dev@lucenenet.apache.org by Zachary Gramana <zg...@gmail.com> on 2012/08/18 01:47:06 UTC

Reviving DistributedSearch

All:

I spent quite a bit of time researching the demise of Lucene.Net.Distributed (now Contrib.Distributed), primarily looking at SVN tags and the mailing list. I gather that it was only ever a reluctant feature, which had few users, went unmaintained, and thus was quietly dropped during the transition from 2.4.0 -> 2.9.1.

I noticed that much of that code remains in src/contribs, so I attempted to restore it to its former state. To make a long story short, I have reconstructed solution files, project files, tests, etc., and have migrated them all forward to use 3.0.3 trunk. It now builds and (nominally) runs. It does not pass the tests, since the SoapFormatter does not support generics.

I did not make any modifications beyond those needed for reconstruction and migration. I am considering migrating this to ServiceStack, or WCF, if there is interest. However, some may be interested in playing with it as-is. I'd be happy to submit a patch against trunk, but I'm not sure of the protocol for that (e.g. send it to dev list, private email). The patch currently weighs in at 344 KB zipped/~3 MB unzipped.

Best Regards,
Zack



Re: Reviving DistributedSearch

Posted by Troy Howard <th...@gmail.com>.
Zack,

I'm really pleased to see you expressing interest in this. Some points
that I'd like to make regarding the topic:

- Distributed Search isn't just hard, it's really hard.

- Lucene, and thus Lucene.Net, is a library, and anything which is
contributed to it should remain within that scope. If it's application
code (e.g. Solr), it should probably be done as a separate project.

- That said, many of the components of such a system would be perfect for
the Contrib library and could facilitate building different (and custom)
implementations of distributed search based on .NET. If you contribute the
reusable bits, you can maintain your design focus and whatever choices you
make about dependencies/stack in your application, and still help others to
get started if they want to do a different implementation.

- Just to state the obvious, the main value of a distributed search
application built around Lucene.Net (vs using an existing one based around
Java Lucene) is the ability to use custom Analyzers, Queries, Scorers, etc,
written in .NET vs Java. This can be a big deal and support for this should
be baked into the application. Consider a (secure?) administrative API for
pushing custom libs across the wire and loading them in an isolated
AppDomain, which interacts with the API service.

- To further state the obvious, I think you're correct about the
abstraction of IndexReader being too chatty, and that moving to a higher
level makes sense. A DistributedQuery and DistributedCollector
implementation is probably the right place to start.

- Please don't assume a .NET client on the other end! This could be a cool
product to use in polyglot environments.  HTTP APIs are your friend. :)

- ZMQ makes a pretty good communication layer if HTTP doesn't work for you.

- Developing a language-agnostic peer-to-peer API for the search nodes
would enable others to, say, implement a version in Java, or some other
language, which can fulfill the API. One could even create a hybrid engine
with Lucene-based search being only one of the node types.

Thanks,
Troy

On Mon, Aug 20, 2012 at 9:37 AM, Zachary Gramana <zg...@gmail.com> wrote:

> [...]

Re: Reviving DistributedSearch

Posted by Zachary Gramana <zg...@gmail.com>.
Nick,

Thanks for the feedback. You may not have been looking for this long and complex of a reply, but I wanted to share my thinking and validate some assumptions with the group before I get too much further down the road.

Let me walk you through where my thinking is at, and see what you think.

First, some observations:

* MultiSearcher and RemoteSearchable are deprecated in Java Lucene starting with 3.1 (http://mail-archives.apache.org/mod_mbox/lucene-java-user/201106.mbox/%3C007001cc2c35$359afba0$a0d0f2e0$@thetaphi.de%3E), and for good reason. Not only do they have some bugs related to scoring, etc., i

* IndexReader, as the service interface, results in excessive network chatter. Query, in my mind, sounds like the right abstraction. Parse an incoming query request once, distribute the query objects to core instances, then merge the results. IndexSearcher in 3.3 implements a merge TopDocs method, so this approach seems promising. This would also enable each core to use a request queue to handle concurrent requests. Query, Filter, etc., have been marked serializable for a long time.

* I like Solr's separated Web/Core approach. The remoting-based approaches buy into a few of the 8 fallacies of distributed computing. The web/core approach, not so much.

* Java-Lucene has recently delegated distributed search to Solr (and ElasticSearch, Katta, IndexTank, etc.) in v3.1 and later. This says that distributed search (a) is hard, and (b) requires solving problems that are beyond the scope of Lucene. Unfortunately, this highlights the lack of a .NET Solr analog.
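
To make the query-level approach from the second observation concrete, here is a minimal scatter/gather sketch. It is written in Python purely for brevity and is not Lucene or Lucene.Net code: `ScoreDoc`, `merge_top_docs`, `search_all`, and the idea of a core as a plain callable are all hypothetical names chosen for illustration. It also assumes that scores from different cores are directly comparable, which a real system would have to arrange for.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[int, float]                 # (core-local doc id, score)
Core = Callable[[str, int], List[Hit]]  # hypothetical stand-in for a remote core


@dataclass(frozen=True)
class ScoreDoc:
    shard: int    # index of the core that produced the hit
    doc: int      # document id, local to that core
    score: float


def merge_top_docs(per_shard: Sequence[Sequence[Hit]], n: int) -> List[ScoreDoc]:
    """Merge per-core hit lists into a single global top-n list."""
    all_hits = [
        ScoreDoc(shard, doc, score)
        for shard, hits in enumerate(per_shard)
        for doc, score in hits
    ]
    # Highest score first; ties are broken by shard, then doc id, so the
    # result is deterministic.
    return heapq.nsmallest(n, all_hits, key=lambda h: (-h.score, h.shard, h.doc))


def search_all(cores: Sequence[Core], query: str, n: int) -> List[ScoreDoc]:
    """Send one query to every core in parallel, then merge the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(cores))) as pool:
        # Each core only needs to return its own top n hits.
        per_shard = list(pool.map(lambda core: core(query, n), cores))
    return merge_top_docs(per_shard, n)
```

The point of the sketch is the shape of the traffic: one request and one small response per core for each query, instead of the many fine-grained calls that a remote IndexReader implies.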

These observations lead me to the following questions:

1. Jeez, it would be nice if we had a .NET Solr-ish project. Kidding, kidding. Kind of.
2. Should distributed search live in Contribs, or in another project altogether?
3. Is there value in an in-between solution for #2? Perhaps something like a Solr Core only implementation, or a reference implementation that tackles a limited set of requirements?

I should disclose here that my interest in this code is part of a broader project that I'm running at my place of employment. This project will be released as open source once it hits minimum viable product (it's not proprietary, just early in development). This project is tightly integrated with ServiceStack. It is also currently self-hosted, with an IIS host coming shortly.

That said, Web API is very ServiceStack-like, though ServiceStack has some additional benefits: .NET 3.5 and Mono support, out-of-the-box protocol buffers integration (and around another two dozen serialization formats, including a very fast JSON serializer), nice cache and auth interfaces, and a simple plugin architecture. It's also based on the request/response pattern using strongly-typed DTOs, which I am a big proponent of. My project leverages these features quite a bit.
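
For readers unfamiliar with the request/response pattern using strongly-typed DTOs, the following is a rough, framework-neutral sketch in Python. It does not use or imitate the ServiceStack API; `SearchRequest`, `SearchHit`, `SearchResponse`, and the helper functions are hypothetical names, and JSON is used only because it is the simplest format to show.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SearchRequest:
    query: str
    max_hits: int = 10


@dataclass
class SearchHit:
    doc: int
    score: float


@dataclass
class SearchResponse:
    total_hits: int
    hits: List[SearchHit] = field(default_factory=list)


def encode(dto) -> str:
    """Serialize any of the DTOs above to a JSON string."""
    return json.dumps(asdict(dto), sort_keys=True)


def decode_request(payload: str) -> SearchRequest:
    """Rebuild a typed request from its wire form."""
    return SearchRequest(**json.loads(payload))


def decode_response(payload: str) -> SearchResponse:
    """Rebuild a typed response, including its nested hits."""
    data = json.loads(payload)
    return SearchResponse(
        total_hits=data["total_hits"],
        hits=[SearchHit(**hit) for hit in data["hits"]],
    )
```

Because the service contract is just a pair of plain data types, the wire format can be swapped (JSON, protocol buffers, and so on) without touching the code that builds or consumes the messages.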

I anticipate following a model similar to Solr Web/Core. The biggest questions I'm currently wrestling with are #3 and #2. Should the core be able to stand alone in a limited capacity? If so, does it make sense for it to live in Contribs? I would naturally prefer to use ServiceStack to build it, consistent with the rest of my project. I would also take advantage of its protocol buffers support to improve performance, since this would be a peer-to-peer API and not a client-server API. However, if a standalone core were to live in Contribs, I would want to make sure most people have a comfort level with that dependency.

When I think of all of the features that need to be implemented in a core, like configuration and authentication, I start heading back towards distributed search living outside of Contribs.

- Zack

On Aug 17, 2012, at 8:43 PM, Nicholas Paldino [.NET/C# MVP] <ca...@caspershouse.com> wrote:

> [...]


Re: Reviving DistributedSearch

Posted by "Nicholas Paldino [.NET/C# MVP]" <ca...@caspershouse.com>.
Zach,

Just a suggestion: maybe go the Web API route and self-host (which allows for something more RESTful, with good bindings for JSON, XML, et al.):

http://code.msdn.microsoft.com/ASPNET-Web-API-Self-Host-30abca12

http://www.asp.net/web-api/overview/hosting-aspnet-web-api/self-host-a-web-api

- Nick




On Aug 17, 2012, at 7:47 PM, "Zachary Gramana" <zg...@gmail.com> wrote:

> [...]