Posted to solr-dev@lucene.apache.org by "Ryan McKinley (JIRA)" <ji...@apache.org> on 2007/07/02 23:14:05 UTC
[jira] Commented: (SOLR-284) Parsing Rich Document Types
[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509676 ]
Ryan McKinley commented on SOLR-284:
------------------------------------
I haven't run this patch, but have a few questions...
What is the *general* approach to extracting a Lucene document (a list of fields) from a PDF? From Word? From PowerPoint?
Is this just access to a few common fields like author, keywords, text, etc.? Or is this something that realistically would need to be customized for each case?
Perhaps it makes sense to add a contrib section for this sort of stuff. It seems weird to add 10 library dependencies to the core distribution. How does Nutch handle this?
> Parsing Rich Document Types
> ---------------------------
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
> Issue Type: New Feature
> Components: update
> Affects Versions: 1.3
> Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> I am attaching a patch file with the code changes, and if this looks good, will add a page similar to http://wiki.apache.org/solr/UpdateCSV.
>