Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/09/25 21:24:49 UTC

Solr contrib

I am working on a RequestHandler that incorporates Aperture
(http://aperture.sourceforge.net) into Solr.  Aperture is a crawling and
extraction framework based on RDF that provides a common interface to  
disparate Open Source libraries like PDFBox, POI, OpenOffice, as well  
as data stores like IMAP and also does crawling of HTTP, File  
systems, etc.  It has similar goals to Tika (a Lucene TLP
sub-project) but is much further along in my opinion (although I do
notice that Tika has picked up the pace lately).  Tika could easily  
be dropped in as a replacement at any point in the future (or other  
extraction libraries, too).  I also have a client-side version using  
SolrJ and Aperture.  This would be related to
https://issues.apache.org/jira/browse/SOLR-284 but I haven't looked for
synergies between Eric's idea and mine.  I will do that.

I know I could put this in the core as a ReqHandler just like all the  
others, but it doesn't really seem like it fits there, especially due  
to having a fair number of dependencies (Aperture, PDFBox, POI, etc.).

I would like to suggest we start a contrib package for Solr modeled  
after the Lucene Java contrib package.  One thing that comes to mind,  
is do we just want to mirror the processes of Lucene Java or do we  
think there are improvements to be made?  One thing that I dislike  
about the current Lucene Java way is the dependency management.  Some  
of the contrib modules have the same copy of libraries checked in or  
they rely on non-ASF compatible code.  Maven or Ivy easily solve this  
problem, with my preference being Maven (but I am not trying to start  
a Maven war here, either, so please don't take it that way).

Anyone have thoughts on this?  I will submit a patch at some point in  
the near future.

-Grant

Re: Solr contrib

Posted by Eric Pugh <ep...@opensourceconnections.com>.

I like the idea of providing a home for all these other non-core
projects as well.  I think my approach in SOLR-284 could be used as
a starting point, or for ideas, but it was aimed fairly specifically  
at scratching my itch.

It does seem like parsing rich documents is a popular request!

Eric


-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |  
http://www.opensourceconnections.com

Re: Solr contrib

Posted by Chris Hostetter <ho...@fucit.org>.

: I would like to suggest we start a contrib package for Solr modeled after the
: Lucene Java contrib package.  One thing that comes to mind, is do we just want
: to mirror the processes of Lucene Java or do we think there are improvements
: to be made?  One thing that I dislike about the current Lucene Java way is the
: dependency management.  Some of the contrib modules have the same copy of
: libraries checked in or they rely on non-ASF compatible code.  Maven or Ivy
: easily solve this problem, with my preference being Maven (but I am not trying
: to start a Maven war here, either, so please don't take it that way).

On one hand, my familiarity with the Lucene contrib system leads me to 
feel that we should really have something like that in Solr -- but then i 
think about the motivations lucene had for adding the sandbox/contrib 
setup: it's a way to keep the core library simple and small so that tiny 
micro apps don't have to load in a lot of code they don't plan on using; 
and it's a way to manage code such that people can be given commit access
just to contribs.  

The second motivation doesn't seem to really be a concern for Solr at the 
moment -- if we're going to make someone a committer, let's just make them
a committer.  The first motivation, while applicable, isn't nearly as
crucial.  As a web based application, our principal user base is much
different than Lucene-Java's.  Saving a little space isn't nearly as
important as keeping Solr easy to use out of the box.  Request handlers and
field types and analysis components and response writers that are common 
enough to be committed into the Solr repositories are probably going to be
the kind of things that a lot of people could want to take advantage of -- 
and we need to make it as easy as possible for those people to use all of 
those cool features (ie: one big self contained war).  While there may 
certainly be plenty of people who think "I don't need this functionality,
i don't want a bloated war", those people are probably going to be
comfortable repacking the war themselves to strip out the classes they
don't need.

All of which leads me to think that the complexity of having a framework 
for contribs probably isn't necessary at this point -- i would love to
see Solr get to the point where we have so many cool bells and whistles 
and add ons that the war becomes ridiculously and prohibitively huge if
you try to use all of them, but we can always refactor things into 
contribs at that point (much the way a lot of lucene code was refactored
into contribs after the 1.4.3 release).

*IF* we do decide that a contrib framework is important, then switching
to something like maven would probably make a lot of sense ... BUT ... a
bigger concern i have than contribs designed to be loaded by solr as
plugins, is having a cohesive method for building/testing/packaging all of
the client code that is starting to get added to the repository ... and
i'm not sure that maven can really help us with that ... if we're going to 
have to roll our own solution for generically building the
ruby/python/java/perl/lua client code modules, perhaps we should reuse 
that same framework for building contribs (instead of making people 
understand both our custom method for building clients, and the maven
method for building contribs).



-Hoss