Posted to solr-dev@lucene.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2010/02/08 21:41:28 UTC

[jira] Created: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

Integrate Solr Cell/Tika as an UpdateRequestProcessor
-----------------------------------------------------

                 Key: SOLR-1763
                 URL: https://issues.apache.org/jira/browse/SOLR-1763
             Project: Solr
          Issue Type: New Feature
          Components: update
            Reporter: Jan Høydahl


From Chris Hostetter's original post in solr-dev:

As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor -- where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath-matching hunks of text back into the document as new fields.

Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams and adds them as binary data fields and adds the other literal params as fields.

Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML- and CSV-based updates, fairly trivial?

-Hoss

I couldn't agree more, so I decided to add it as an issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843217#action_12843217 ] 

Jan Høydahl commented on SOLR-1763:
-----------------------------------

I may need this functionality in an upcoming project. Can anyone who knows the code estimate the effort?


[jira] Commented: (SOLR-1763) Integrate Solr Cell/Tika as an UpdateRequestProcessor

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831108#action_12831108 ] 

Jan Høydahl commented on SOLR-1763:
-----------------------------------

Re-posting my comment from solr-dev in this ticket:
Good match. UpdateProcessors are the way to go for functionality that modifies documents prior to indexing.
With this, we can mix and match any type of content source with other processing needs.

I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you don't always have that choice: if your source is a crawler without built-in Tika, a base64-encoded field in an XML document, or some other arbitrary source, you want to do the extraction at an arbitrary place in the chain.

Examples:
 Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
 XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
 DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
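To make the chaining idea concrete, here is a minimal, self-contained Java sketch of the second example (XML -> GetUrlProcessor -> TikaUpdateProcessor). Everything in it is hypothetical and for illustration only: Doc, Processor, Chain, GET_URL and TIKA are stand-ins, not Solr's UpdateRequestProcessor API, and the "extraction" step merely decodes bytes as text where a real processor would call Tika. Dropping the binary field after extraction is also just an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProcessorChainSketch {

    /** Hypothetical stand-in for a document: field name -> value. */
    static class Doc {
        final Map<String, Object> fields = new LinkedHashMap<>();

        Doc set(String name, Object value) {
            fields.put(name, value);
            return this;
        }

        Object get(String name) {
            return fields.get(name);
        }
    }

    /** Hypothetical stage interface: each stage may add or remove fields. */
    interface Processor {
        void process(Doc doc);
    }

    /** Runs the stages in the order they were added. */
    static class Chain {
        private final List<Processor> stages = new ArrayList<>();

        Chain add(Processor p) {
            stages.add(p);
            return this;
        }

        Doc run(Doc doc) {
            for (Processor p : stages) {
                p.process(doc);
            }
            return doc;
        }
    }

    // Placeholder for "GetUrlProcessor": a real one would fetch the URL.
    // Here we only fabricate some bytes so the sketch runs offline.
    static final Processor GET_URL = doc -> {
        Object url = doc.get("pdfurl");
        if (url != null) {
            doc.set("pdfbin", ("content of " + url).getBytes(StandardCharsets.UTF_8));
        }
    };

    // Placeholder for "TikaUpdateProcessor": a real one would hand the bytes
    // to Tika. Here we decode them as UTF-8 and drop the binary field.
    static final Processor TIKA = doc -> {
        Object bin = doc.get("pdfbin");
        if (bin instanceof byte[]) {
            doc.set("text", new String((byte[]) bin, StandardCharsets.UTF_8));
            doc.fields.remove("pdfbin");
        }
    };

    /** XML (title, pdfurl) -> GetUrlProcessor -> TikaUpdateProcessor. */
    static Chain xmlChain() {
        return new Chain().add(GET_URL).add(TIKA);
    }

    public static void main(String[] args) {
        Doc doc = new Doc()
                .set("title", "Some report")
                .set("pdfurl", "http://example.com/report.pdf");
        xmlChain().run(doc);
        System.out.println(doc.fields.keySet()); // prints [title, pdfurl, text]
    }
}
```

The point of the sketch is that each stage only looks at the fields it cares about, so the same extraction stage can sit behind a crawler, an XML update or any other source without knowing where the binary came from.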

I propose to model the document processor chain more after FAST ESP's flexible processing chain, which must be seen as an industry best practice. I'm thinking of starting a Wiki page to model what direction we should go.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

