Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2009/06/27 16:10:47 UTC

[jira] Commented: (SOLR-284) Parsing Rich Document Types

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724855#action_12724855 ] 

Yonik Seeley commented on SOLR-284:
-----------------------------------

Not sure if I should open a new issue or keep improvements here.
I think we need to improve the out-of-the-box (OOTB) experience with this...
http://search.lucidimagination.com/search/document/302440b8a2451908/solr_cell

Ideas for improvement:
- auto-map names of the form Last-Modified to a more Solr-style field name like last_modified
- drop "ext." from parameter names, and revisit the naming to try to unify it with other update handlers like CSV
  note: in the future, generic functionality like boosting fields, setting field value defaults, etc. could be handled by a generic component or update processor, which is all the more reason to drop the ext prefix.
- I imagine that metadata is normally useful, so we should
  1. predefine commonly used metadata fields in the example schema (there's really no cost to this)
  2. use mappings to normalize any metadata names (if such normalization isn't already done in Tika)
  3. ignore or drop fields that have little use
  4. provide a way to handle new attributes without dropping them or throwing an error
- enable the handler by default, loaded lazily to avoid a dependency on having all the Tika libs available
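
To make the name auto-mapping and metadata handling ideas above concrete, here is a rough sketch in Python (not the handler's actual code). Everything named in it is hypothetical and only for illustration: the explicit mapping entry, the ignored name, the set of predefined schema fields, and the "attr_" prefix used as a catch-all for unknown attributes.

```python
import re

# All of the following are illustrative assumptions, not Solr or Tika defaults.
EXPLICIT_MAP = {"Last-Save-Date": "last_modified"}  # hypothetical normalization mapping
IGNORED = {"X-Parsed-By"}                           # hypothetical low-value metadata name
KNOWN_FIELDS = {"last_modified", "content_type", "title", "author"}  # hypothetical schema fields
UNKNOWN_PREFIX = "attr_"                            # hypothetical catch-all prefix


def solr_field_name(name):
    """Lowercase the name and turn each run of non-alphanumerics into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def map_metadata(metadata):
    """Map raw metadata names to field names, collecting values as lists."""
    doc = {}
    for raw_name, value in metadata.items():
        if raw_name in IGNORED:
            continue  # drop fields that have little use
        # Prefer an explicit mapping; otherwise derive the name automatically.
        field = EXPLICIT_MAP.get(raw_name, solr_field_name(raw_name))
        if field not in KNOWN_FIELDS:
            # Keep new attributes under a prefix instead of dropping them
            # or raising an error.
            field = UNKNOWN_PREFIX + field
        doc.setdefault(field, []).append(value)
    return doc
```

With these assumptions, solr_field_name("Last-Modified") gives last_modified, and an attribute that is not predefined, such as Page-Count, ends up as attr_page_count rather than being lost. A prefix like this would pair naturally with a dynamic field in the schema.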


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.