Posted to solr-dev@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/12/07 21:51:44 UTC

Solr Cell revamped as an UpdateProcessor?

As someone with very little knowledge of Solr Cell and/or Tika, I find
myself wondering if ExtractingRequestHandler would make more sense as an
extractingUpdateProcessor -- where it could be configured to take
either binary fields (or string fields containing URLs) out of the
Documents, parse them with Tika, and add the various XPath matching hunks
of text back into the document as new fields.

Then ExtractingRequestHandler just becomes a handler that slurps up its
ContentStreams and adds them as binary data fields and adds the other
literal params as fields.

Wouldn't that make things like SOLR-1358, and using Tika with 
URLs/filepaths in XML and CSV based updates fairly trivial?
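
A minimal sketch of the split described above, written in Python only to keep it self-contained. It does not use the Solr or Tika APIs: `fake_extract`, `handler_build_document`, `extracting_processor`, and the field name `raw_content` are all hypothetical names invented for illustration, and letting literal params win over extracted values is an assumption, not something the proposal specifies.

```python
from typing import Any, Callable, Dict

Document = Dict[str, Any]


def fake_extract(raw: bytes) -> Dict[str, str]:
    """Hypothetical stand-in for a Tika parse: treats the bytes as UTF-8 text."""
    text = raw.decode("utf-8", errors="replace")
    first_line = text.splitlines()[0] if text else ""
    return {"title": first_line, "text": text}


def handler_build_document(stream: bytes, literals: Dict[str, str]) -> Document:
    """The slimmed-down handler: store the stream as a binary field, copy literal params."""
    doc: Document = dict(literals)
    doc["raw_content"] = stream
    return doc


def extracting_processor(
    doc: Document,
    source_field: str = "raw_content",
    extract: Callable[[bytes], Dict[str, str]] = fake_extract,
) -> Document:
    """Take the binary field out of the document and add the extracted fields back."""
    out = dict(doc)
    raw = out.pop(source_field, None)
    if raw is None:
        return out  # nothing to extract, pass the document through unchanged
    for name, value in extract(raw).items():
        out.setdefault(name, value)  # assumption: explicit literal fields are not overwritten
    return out


doc = handler_build_document(b"Annual Report\nRevenue grew.", {"id": "doc-1"})
print(extracting_processor(doc))
```

Because the processor only looks at a field of the document, the same step would apply to documents that arrive by any other route (XML, CSV) as long as they carry the binary field.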



-Hoss


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Re: Solr Cell revamped as an UpdateProcessor?
: 
: Hi, I'm developing a directory monitor to add to a Solr implementation.

Hmmm ... Is this really related to the Solr Cell thread you replied to? 

Please start a new thread if you want to discuss a new topic...

http://people.apache.org/~hossman/#threadhijack


-Hoss


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 5, 2010, at 1:53 PM, Zacarias wrote:

> I attached a file to the previous mail. Is there a filter for PDF files,
> or some other reason it was dropped?

The mailer strips attachments, although you might be able to get a zip through.  Perhaps send a pointer to somewhere else or just describe it here.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Zacarias <za...@linebee.com>.
I attached a file to the previous mail. Is there a filter for PDF files,
or some other reason it was dropped?


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Zacarias <za...@linebee.com>.
Here is my proposal

Regards




Re: Solr Cell revamped as an UpdateProcessor?

Posted by Zacarias <za...@linebee.com>.
Hi, I'm developing a directory monitor to add to a Solr implementation.
If it would be of interest to you, we will be glad to share it with the
community. I would also like your opinion on the proposal: whether it looks
OK to you, and any changes or questions you have are very welcome.

Regards
Zacarias
www.linebee.com



Re: Solr Cell revamped as an UpdateProcessor?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
I was referring to SOLR-1358. Anyway, Solr Cell as an UpdateProcessor
is a good idea.

On Tue, Dec 8, 2009 at 4:47 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Dec 8, 2009, at 12:22 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> Integrating Extraction w/ DIH is a better option. DIH makes it easier
>> to do the mapping of fields etc.
>
> Which comment is this directed at?  I'm lacking context here.
>



-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com

Re: Solr Cell revamped as an UpdateProcessor?

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 8, 2009, at 12:22 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> Integrating Extraction w/ DIH is a better option. DIH makes it easier
> to do the mapping of fields etc.

Which comment is this directed at?  I'm lacking context here.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
Integrating Extraction w/ DIH is a better option. DIH makes it easier
to do the mapping of fields etc.





-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com

Re: Solr Cell revamped as an UpdateProcessor?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
I created an issue for this improvement idea to make sure it doesn't just die away:
https://issues.apache.org/jira/browse/SOLR-1763

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com



Re: Solr Cell revamped as an UpdateProcessor?

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
On 8. des. 2009, at 00.29, Grant Ingersoll wrote:
> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>> As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath matching hunks of text back into the document as new fields.
>> 
>> Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams and adds them as binary data fields and adds the other literal params as fields.
>> 
>> Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML and CSV based updates fairly trivial?
> 
> It probably could, but I am not sure how it would work in a processor chain.  Then again, I'm not sure I understand processor chains all that well either.  I also plan on adding, BTW, a SolrJ client for Tika that does the extraction on the client.  In many cases, the ExtrReqHandler is really only designed for lighter-weight extraction cases, as one would simply not want to send that much rich content over the wire.

Good match. UpdateProcessors are the way to go for functionality that modifies documents prior to indexing.
With this, we can mix and match any type of content source with other processing needs.

I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you don't always have that choice: if your source is a crawler without built-in Tika, some base64-encoded field in an XML document, or some other random source, you want to do the extraction at an arbitrary place in the chain.

Examples:
  Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
  XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
  DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
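
The three example chains can be modeled roughly as follows. This is a sketch in Python, not Solr code: the function names mirror the processor names in the examples, but their bodies are hypothetical stand-ins (the URL fetch is faked so nothing touches the network, and the "Tika" step just decodes bytes as UTF-8). Only the geohash step is a real computation, using the standard bit-interleaving scheme with longitude first.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Processor = Callable[[Document], Document]

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def run_chain(doc: Document, chain: List[Processor]) -> Document:
    """Pass the document through each processor in order."""
    for processor in chain:
        doc = processor(doc)
    return doc


def get_url_processor(doc: Document) -> Document:
    """Stand-in for GetUrlProcessor: fakes the download so the sketch needs no network."""
    out = dict(doc)
    if "pdfurl" in out:
        out["pdfbin"] = ("body of " + out["pdfurl"]).encode("utf-8")
    return out


def tika_update_processor(doc: Document) -> Document:
    """Stand-in for TikaUpdateProcessor: decodes bytes as UTF-8 instead of parsing."""
    out = dict(doc)
    for field in ("binarybody", "pdfbin"):
        if field in out:
            out["text"] = out.pop(field).decode("utf-8", errors="replace")
            out["meta"] = {"extracted_from": field}
            break
    return out


def latlon_to_geohash_processor(doc: Document, precision: int = 3) -> Document:
    """Add a geohash field computed from the lat and lon fields, if both are present."""
    out = dict(doc)
    if "lat" not in out or "lon" not in out:
        return out
    ranges = {"lat": [-90.0, 90.0], "lon": [-180.0, 180.0]}
    bits: List[int] = []
    for i in range(precision * 5):
        axis = "lon" if i % 2 == 0 else "lat"  # bits alternate, longitude first
        mid = (ranges[axis][0] + ranges[axis][1]) / 2
        if out[axis] >= mid:
            bits.append(1)
            ranges[axis][0] = mid
        else:
            bits.append(0)
            ranges[axis][1] = mid
    chars = []
    for i in range(0, len(bits), 5):
        chars.append(_BASE32[int("".join(map(str, bits[i:i + 5])), 2)])
    out["geohash"] = "".join(chars)
    return out


xml_doc = run_chain(
    {"title": "Report", "pdfurl": "http://example.com/a.pdf"},
    [get_url_processor, tika_update_processor],
)
print(xml_doc)
```

Each processor only reads and adds fields, so any source can be combined with any subset of the steps, which is the "mix and match" point made above.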

I propose to model the document processor chain more after FAST ESP's flexible processing chain, which must be seen as an industry best practice. I'm thinking of starting a Wiki page to model what direction we should go.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com


Re: Solr Cell revamped as an UpdateProcessor?

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:

> 
> As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath matching hunks of text back into the document as new fields.
> 
> Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams and adds them as binary data fields and adds the other literal params as fields.
> 
> Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML and CSV based updates fairly trivial?

It probably could, but I am not sure how it would work in a processor chain.  Then again, I'm not sure I understand processor chains all that well either.  I also plan on adding, BTW, a SolrJ client for Tika that does the extraction on the client.  In many cases, the ExtrReqHandler is really only designed for lighter-weight extraction cases, as one would simply not want to send that much rich content over the wire.
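
To make the over-the-wire concern concrete, here is a rough Python sketch comparing the two payloads. Everything in it is an assumption for illustration: `fake_extract_text` is a hypothetical stand-in for Tika, the field names are invented, the JSON shape is not a Solr update format, and the "rich" file is made-up padding around a short sentence.

```python
import base64
import json


def fake_extract_text(raw: bytes) -> str:
    """Hypothetical stand-in for Tika: keep only printable ASCII characters."""
    return "".join(chr(b) for b in raw if 32 <= b < 127)


def server_side_payload(doc_id: str, raw: bytes) -> str:
    """Ship the whole file; extraction would happen on the server."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"id": doc_id, "raw_content": encoded})


def client_side_payload(doc_id: str, raw: bytes) -> str:
    """Extract on the client and ship only the text."""
    return json.dumps({"id": doc_id, "text": fake_extract_text(raw)})


# A made-up "rich" file: a little text wrapped in a lot of binary padding.
rich_file = b"\x00" * 5000 + b"Quarterly results were strong." + b"\xff" * 5000

big = server_side_payload("doc-1", rich_file)
small = client_side_payload("doc-1", rich_file)
print(len(big), len(small))
```

How large the difference is in practice depends entirely on the file format and on how much of the file is text.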