Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2007/07/02 23:14:05 UTC

[jira] Commented: (SOLR-284) Parsing Rich Document Types

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676 ] 

Ryan McKinley commented on SOLR-284:
------------------------------------

I haven't run this patch, but have a few questions...

What is the *general* approach to extracting a Lucene document (list of fields) from a PDF? Word? PowerPoint?

Is this just access to a few common fields like author, keywords, text, etc.? Is this something that would realistically need to be customized for each case?

Perhaps it makes sense to add a contrib section for this sort of thing. It seems weird to add 10 library dependencies to the core distribution. How does Nutch handle this?
 


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>  
