Posted to dev@lucene.apache.org by Roland Villemoes <rv...@alpha-solutions.dk> on 2011/04/18 12:37:04 UTC

PipeLine for Solr

Hi All,

I know this question may have been asked before - but I really did not find any usable answers browsing the archives. So I have to try the developer list here.

We at Alpha Solutions often need a Pipeline for handling crawling, analyzing and routing before we hit the UpdateRequestHandler in Solr. I know we could actually use the UpdateRequestHandler for this - but often we like to perform all these tasks before hitting Solr.
We have been using OpenPipeline, which also offers a GUI that makes it rather nice to administer (if you tweak the GUI a bit!). It does seem, though, that OpenPipeline is not really going anywhere. Nothing happens, and there is not really any community around it - and it doesn't seem that the people behind it will ever move it further.
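
To make the requirement concrete, here is a minimal sketch of the kind of pre-Solr pipeline meant above: analysis stages applied in order, then a routing step that groups documents by target. Everything in it is an assumption for illustration: the stage functions, the field names and the core names "documents" and "webpages" are hypothetical, and actually sending the payloads to Solr is left out.

```python
import json
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def normalize_title(doc: Document) -> Document:
    """Analysis stage: trim and collapse whitespace in the title field."""
    cleaned = dict(doc)
    cleaned["title"] = " ".join(doc.get("title", "").split())
    return cleaned


def detect_type(doc: Document) -> Document:
    """Analysis stage: derive a coarse content type from the URL suffix."""
    tagged = dict(doc)
    is_pdf = doc.get("url", "").lower().endswith(".pdf")
    tagged["type"] = "pdf" if is_pdf else "html"
    return tagged


def run_stages(docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Apply every stage, in order, to every document."""
    out = []
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        out.append(doc)
    return out


def route(docs: Iterable[Document]) -> Dict[str, List[Document]]:
    """Routing step: group documents by a (hypothetical) target core name."""
    batches: Dict[str, List[Document]] = {}
    for doc in docs:
        core = "documents" if doc["type"] == "pdf" else "webpages"
        batches.setdefault(core, []).append(doc)
    return batches


# Stand-in for crawler output.
crawled = [
    {"id": "1", "url": "http://example.com/a.PDF", "title": "  Annual   report "},
    {"id": "2", "url": "http://example.com/index.html", "title": "Home"},
]
batches = route(run_stages(crawled, [normalize_title, detect_type]))

# One JSON payload per target core; posting it to Solr is not shown here.
payloads = {core: json.dumps(docs) for core, docs in batches.items()}
```

The point of keeping each stage a plain function is that stages stay reusable and the routing decision stays outside Solr, which is what we are after.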

So we are looking around for other "pipeline" projects that can work well with Solr.

So - do any of you have any ideas on this? Any recommendations? Or are there any plans for this in Solr?

Thanks a lot
Med venlig hilsen / Best regards

Roland Villemoes
Tel: (+45) 22 69 59 62
E-mail: rv@alpha-solutions.dk

Alpha Solutions A/S
Borgergade 2, 3.sal, DK-1300 Copenhagen K
Tel: (+45) 70 20 65 38
Web: www.alpha-solutions.dk




Re: PipeLine for Solr

Posted by "Smiley, David W." <ds...@mitre.org>.
Great discussion here, although I do believe it belongs on the Solr user list because we're not talking about development of Solr itself. I'm very tempted to cross-post, but I believe that's discouraged, so I won't.

> Still wondering where the Solr Community will bring this in the future?

I strongly believe that Solr should focus on what it does best (being a search engine) and not on pipelines / data acquisition, which is really a separate concern that is useful without Solr -- other apps could use such pipelines. This is my chief concern with the DataImportHandler (DIH).

By the way, I've used Endeca (a commercial, long-established faceted search vendor), which has its own pipeline called "Forge". I used it on a project in which the pipelines were extremely extensive, getting data from a dozen-plus sources of varying flavors and manipulating the data in various ways. It addresses a key need, but the implementation is poor IMO. The interesting parts of it pertained to how it supports joins from sub-pipelines (i.e. chains of steps). I've not yet been in the same situation with Solr. I've gotten by with some basic stuff thrown together (shell scripts w/ XSLT) or simple DIH uses.
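
For readers who haven't seen the idea of joining sub-pipelines, here is a rough sketch of what such a join amounts to: records coming out of two chains of steps are merged on a shared key. This is not how Forge implements it; the function name, the "sku" key and the sample records are all hypothetical.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def left_join(primary: Iterable[Record], secondary: Iterable[Record], key: str) -> List[Record]:
    """Merge each primary record with the secondary record sharing its key.

    Primary records without a match pass through unchanged; if both sides
    carry the same field name, the primary value wins.
    """
    lookup = {rec[key]: rec for rec in secondary}
    return [{**lookup.get(rec[key], {}), **rec} for rec in primary]


# Output of two hypothetical sub-pipelines: product data and price data.
products = [{"sku": "A1", "name": "Lamp"}, {"sku": "B2", "name": "Desk"}]
prices = [{"sku": "A1", "price": "19.90"}]

joined = left_join(products, prices, key="sku")
```

With a dozen sources the hard part is less the merge itself than deciding which side wins on conflicts and what to do with unmatched records, which is why the sketch spells out both choices.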

I've been maintaining a list of software that could be used for a data pipeline for getting data into Solr.  Here it is:
* Calabash (XProc)
* OpenPipe
* ManifoldCF
* ESBs (various options; includes Spring-Integration Framework)

I don't have UIMA on this list since I think it's more focused on extracting data from unstructured text than on being a solid pipeline first & foremost.

Roland, if your assessment on OpenPipeline going nowhere is true, then that's disappointing news.

It's not clear to me that a data pipeline needs to be different than what ESBs do.  Some pieces are missing but 80% of what's needed is there.  When I next have a project getting data from many places I'll be able to think through this more.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/





Re: PipeLine for Solr

Posted by Gérard Dupont <ge...@gmail.com>.
On 18 April 2011 13:30, Roland Villemoes <rv...@alpha-solutions.dk> wrote:

> Hi Gérard
>
>
>
> Thank you for your reply.
>

You're welcome ;-)


> I will absolutely look into it :)
>
>
>
> Still wondering where the Solr Community will bring this in the future?
>

Like Lucene, Solr is mainly focused on indexing and not preprocessing.
I'm convinced that it's good for an open source project to stick to its core
added value. I've seen too many projects die because they tried to do
everything. But perhaps this could be done in Nutch, which covers almost every
part of a search engine. UIMA (as was suggested) may also be a good
solution.


> Looking at commercial products (we use these a lot here at Alpha Solutions),
> products like Exalead and FAST really do have impressive content (and
> search) pipelines, and most of all impressive tools included. And as the
> future of FAST is extremely uncertain now, FAST customers moving to Solr
> will lack the pipelines and the tools.
>

But I wouldn't suggest that open source projects follow any of these commercial
product roadmaps ;-) I much prefer modular, self-contained and efficient
open source projects. Then for integration, we need another layer like UIMA
or WebLab.


> Well, as consultants we can provide functionality by developing the missing
> pieces - but tools are still missing. And where customers could (almost)
> administer and work on pipelines themselves - they now need developers.
>

That's the tricky part. Hopefully these projects will mature to a level
where administration of high-level orchestration is easy. But to be frank,
it's not really easy in many ways, and if the end users want to administer
this part themselves, they will still need some basic understanding and
training.


> Thanks for the input - looking forward to seeing more :)
>

Good luck, keep me informed.

gd





-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company

Document & Learning team - LITIS Laboratory

RE: PipeLine for Solr

Posted by Roland Villemoes <rv...@alpha-solutions.dk>.
Hi Gérard

Thank you for your reply. I will absolutely look into it :)

Still wondering where the Solr Community will bring this in the future?

Looking at commercial products (we use these a lot here at Alpha Solutions), products like Exalead and FAST really do have impressive content (and search) pipelines, and most of all impressive tools included. And as the future of FAST is extremely uncertain now, FAST customers moving to Solr will lack the pipelines and the tools. Well, as consultants we can provide functionality by developing the missing pieces - but tools are still missing. And where customers could (almost) administer and work on pipelines themselves - they now need developers.

Thanks for the input - looking forward to seeing more :)

Roland Villemoes


Re: PipeLine for Solr

Posted by Gérard Dupont <ge...@gmail.com>.
Hi Roland,

We are proposing exactly this kind of integration facility with our open
source WebLab project (see weblab-project.org). The tutorials are not
perfect, but we are a team of about 15 engineers on the project, which has
more than 4 years of history and is currently used in our projects. Our goal is
to rely as much as possible on standards, and thus each processing step
(SourceReader, Normaliser, Analyser...) is defined as a web service. Then the
global orchestration is done in BPEL. On the plus side we have a Solr
indexer, but I'm quite sure it's not very optimised ;-).

If you are interested I'll be happy to support you (I'm paid for that
already ;-).

cheers




-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company

Document & Learning team - LITIS Laboratory

Re: PipeLine for Solr

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi,

We've been thinking about the same for some time, and I put together a wiki page discussing the need for such a pipeline. Feel free to edit, extend and discuss inside the wiki.
http://wiki.apache.org/solr/DocumentProcessing

It need not be a part of Solr as such, but it is crucial that there is one preferred pipeline where the community can gather their effort in writing re-usable processing stages.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: PipeLine for Solr

Posted by Tommaso Teofili <to...@gmail.com>.
Hello Roland,

I think a nice option would be using UIMA [1] which supports a pipeline
architecture to analyze unstructured information.
With that you can use CollectionReaders to get documents from various
sources, Annotators to optionally extract metadata from documents [2] and
then a Solr CAS Consumer to write everything to Solr [3].
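
As a rough illustration of that reader / annotator / consumer shape (and only the shape: this is plain Python, not the UIMA API, and every name in it is hypothetical), the three roles can be sketched as a chain that streams documents one at a time:

```python
from typing import Dict, Iterable, Iterator, List

Doc = Dict[str, object]


def collection_reader(texts: List[str]) -> Iterator[Doc]:
    """Reader role: turn raw inputs into documents, one at a time."""
    for i, text in enumerate(texts):
        yield {"id": str(i), "text": text}


def year_annotator(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Annotator role: add metadata found in the text (four-digit numbers)."""
    for doc in docs:
        words = str(doc["text"]).split()
        years = [w for w in words if len(w) == 4 and w.isdigit()]
        yield {**doc, "years": years}


class ListConsumer:
    """Consumer role: collects documents where a real one would write to Solr."""

    def __init__(self) -> None:
        self.indexed: List[Doc] = []

    def consume(self, docs: Iterable[Doc]) -> None:
        for doc in docs:
            self.indexed.append(doc)


consumer = ListConsumer()
source = collection_reader(["a report from 2009", "no dates here"])
consumer.consume(year_annotator(source))
```

Because each role only sees an iterable of documents, more annotators can be chained between reader and consumer without changing either end.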

You could also exploit the UIMA integration already committed under a
dedicated Solr contrib module [4][5] which uses a custom UpdateRequestProcessor.

Hope this helps,
Tommaso

[1] : http://uima.apache.org
[2] : http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.conceptual.graduating_to_collection_processing
[3] : http://uima.apache.org/sandbox.html#solrcas.consumer
[4] : http://wiki.apache.org/solr/SolrUIMA
[5] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
