Posted to droids-dev@incubator.apache.org by "Richard Frovarp (JIRA)" <ji...@apache.org> on 2010/02/16 19:43:28 UTC
[jira] Created: (DROIDS-81) Create a document parser that doesn't HTMLify the results.
Create a document parser that doesn't HTMLify the results.
----------------------------------------------------------
Key: DROIDS-81
URL: https://issues.apache.org/jira/browse/DROIDS-81
Project: Droids
Issue Type: Bug
Components: tika
Affects Versions: 0.01
Reporter: Richard Frovarp
Priority: Minor
While the TikaHTMLParser can parse PDFs, DOCs, etc., it returns them in an HTMLified format. Solr blows up on that format, and the HTMLifying step isn't always necessary anyway.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.
Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Frovarp updated DROIDS-81:
----------------------------------
Attachment: (was: TikaDocumentParser.java)
[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.
Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Frovarp updated DROIDS-81:
----------------------------------
Attachment: TikaDocumentParser.java
Initial implementation of a Tika document parser that does not HTMLify its results and does not look for links.
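For readers without the attachment, here is a rough sketch of the general idea only, not the attached TikaDocumentParser.java. Tika parsers report document content as SAX events to a ContentHandler, so one way to avoid HTMLified output is to use a handler that keeps the character data and ignores the element events (which also means links are never inspected). To stay self-contained, the sketch below feeds XHTML through the JDK's own SAX parser as a stand-in for Tika; the names PlainTextSketch, TextOnlyHandler and toPlainText are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Droids or Tika.
final class PlainTextSketch {

    /** Collects character data only; element events (the markup) are ignored. */
    static final class TextOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep the text content; startElement/endElement are left as no-ops,
            // so tags and attributes (including href links) never reach the output.
            text.append(ch, start, length);
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Assumption: the JDK SAX parser stands in for a Tika parser emitting XHTML events. */
    static String toPlainText(String xhtml) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello <a href=\"x.html\">world</a></p></body></html>";
        System.out.println(toPlainText(xhtml)); // prints "Hello world"
    }
}
```

In a real Tika setup the same effect would presumably come from passing a text-collecting handler to the parser instead of an XHTML-serializing one; the plain string that comes out is the kind of value that can be handed to an indexer such as Solr without markup in it.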
[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.
Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Frovarp updated DROIDS-81:
----------------------------------
Attachment: tika-document-parser.patch
Initial implementation of a Tika document parser that does not HTMLify its results and does not look for links. This time in patch format with dependency added.
[jira] Resolved: (DROIDS-81) Create a document parser that doesn't HTMLify the results.
Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thorsten Scherler resolved DROIDS-81.
-------------------------------------
Resolution: Fixed
Committed revision 939642.