Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/09/25 21:24:49 UTC
Solr contrib
I am working on a RequestHandler that incorporates Aperture
(http://aperture.sourceforge.net) into Solr. Aperture is a crawling and
extraction framework based on RDF that provides a common interface to
disparate Open Source libraries like PDFBox, POI, and OpenOffice, as well
as data stores like IMAP; it also crawls HTTP, file systems, etc. It has
similar goals to Tika (a Lucene TLP sub-project) but is much further
along in my opinion (although I do notice that Tika has picked up the
pace lately). Tika could easily be dropped in as a replacement at any
point in the future (or other extraction libraries, too). I also have a
client-side version using SolrJ and Aperture. This would be related to
https://issues.apache.org/jira/browse/SOLR-284 but I haven't looked for
synergies between Eric's idea and mine. I will do that.
I know I could put this in the core as a RequestHandler just like all the
others, but it doesn't really seem like it fits there, especially due
to having a fair number of dependencies (Aperture, PDFBox, POI, etc.).
I would like to suggest we start a contrib package for Solr modeled
after the Lucene Java contrib package. One question that comes to mind:
do we just want to mirror the processes of Lucene Java, or do we
think there are improvements to be made? One thing that I dislike
about the current Lucene Java way is the dependency management. Some
of the contrib modules have the same copy of libraries checked in, or
they rely on non-ASF-compatible code. Maven or Ivy would easily solve
this problem, with my preference being Maven (but I am not trying to
start a Maven war here, either, so please don't take it that way).
Anyone have thoughts on this? I will submit a patch at some point in
the near future.
-Grant
Re: Solr contrib
Posted by Eric Pugh <ep...@opensourceconnections.com>.
I like the idea of providing a home for all these other non-core
projects as well. I think my approach in SOLR-284 could be used as
a starting point, or for ideas, but it was aimed fairly specifically
at scratching my own itch.
It does seem like parsing rich documents is a popular request!
Eric
-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Re: Solr contrib
Posted by Chris Hostetter <ho...@fucit.org>.
: I would like to suggest we start a contrib package for Solr modeled after the
: Lucene Java contrib package. One thing that comes to mind, is do we just want
: to mirror the processes of Lucene Java or do we think there are improvements
: to be made? One thing that I dislike about the current Lucene Java way is the
: dependency management. Some of the contrib modules have the same copy of
: libraries checked in or they rely on non-ASF compatible code. Maven or Ivy
: easily solve this problem, with my preference being Maven (but I am not trying
: to start a Maven war here, either, so please don't take it that way).
On one hand, my familiarity with the Lucene contrib system leads me to
feel that we should really have something like that in Solr -- but then I
think about the motivations Lucene had for adding the sandbox/contrib
setup: it's a way to keep the core library simple and small so that tiny
micro apps don't have to load in a lot of code they don't plan on using;
and it's a way to manage code such that people can be given commit access
just to contribs.
The second motivation doesn't seem to really be a concern for Solr at the
moment -- if we're going to make someone a committer, let's just make them
a committer. The first motivation, while applicable, isn't nearly as
crucial. As a web-based application, our principal user base is much
different than Lucene-Java's. Saving a little space isn't nearly as
important as keeping Solr easy to use out of the box. Request handlers and
field types and analysis components and response writers that are common
enough to be committed into the Solr repositories are probably going to be
the kind of things that a lot of people could want to take advantage of --
and we need to make it as easy as possible for those people to use all of
those cool features (ie: one big self-contained war). While there may
certainly be plenty of people who think "I don't need this functionality,
I don't want a bloated war", those people are probably going to be
comfortable repacking the war themselves to strip out the classes they
don't need.
All of which leads me to think that the complexity of having a framework
for contribs probably isn't necessary at this point -- I would love to
see Solr get to the point where we have so many cool bells and whistles
and add-ons that the war becomes ridiculously and prohibitively huge if
you try to use all of them, but we can always refactor things into
contribs at that point (the way a lot of Lucene code was refactored
into contribs after the 1.4.3 release).
*IF* we do decide that a contrib framework is important, then switching
to something like Maven would probably make a lot of sense ... BUT ... a
bigger concern I have than contribs designed to be loaded by Solr as
plugins is having a cohesive method for building/testing/packaging all of
the client code that is starting to get added to the repository ... and
I'm not sure that Maven can really help us with that ... if we're going to
have to roll our own solution for generically building the
ruby/python/java/perl/lua client code modules, perhaps we should reuse
that same framework for building contribs (instead of making people
understand both our custom method for building clients and the Maven
method for building contribs).
-Hoss