Posted to droids-dev@incubator.apache.org by "Richard Frovarp (JIRA)" <ji...@apache.org> on 2010/02/16 19:43:28 UTC

[jira] Created: (DROIDS-81) Create a document parser that doesn't HTMLify the results.

Create a document parser that doesn't HTMLify the results.
----------------------------------------------------------

                 Key: DROIDS-81
                 URL: https://issues.apache.org/jira/browse/DROIDS-81
             Project: Droids
          Issue Type: Bug
          Components: tika
    Affects Versions: 0.01
            Reporter: Richard Frovarp
            Priority: Minor


While the TikaHTMLParser can parse PDFs, DOC files, etc., it returns them in an HTMLified format. Solr blows up on that format, and this step isn't always necessary anyway.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.

Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-81:
----------------------------------

    Attachment:     (was: TikaDocumentParser.java)

> Create a document parser that doesn't HTMLify the results.
> ----------------------------------------------------------
>
>                 Key: DROIDS-81
>                 URL: https://issues.apache.org/jira/browse/DROIDS-81
>             Project: Droids
>          Issue Type: Bug
>          Components: tika
>    Affects Versions: 0.01
>            Reporter: Richard Frovarp
>            Priority: Minor
>         Attachments: tika-document-parser.patch
>
>
> While the TikaHTMLParser can parse pdfs, docs, etc, it returns them in an HTMLified format. Solr blows up on that format, and it isn't always necessary to do this step anyway. 

[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.

Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-81:
----------------------------------

    Attachment: TikaDocumentParser.java

Initial implementation of a Tika document parser that does not HTMLify its results and does not look for links.
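To make the idea concrete: Tika parsers report document content as XHTML SAX events to a ContentHandler, so a handler that keeps only character data and ignores the element tags yields plain text instead of HTMLified output. The sketch below is not the attached TikaDocumentParser.java. The class and method names (PlainTextSketch, TextOnlyHandler, toPlainText) are hypothetical, and the JDK's own SAX parser stands in for Tika so that the example runs without extra dependencies.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: names are hypothetical, not taken from the Droids patch.
public class PlainTextSketch {

    /** Keeps character data and drops all markup. */
    static class TextOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only the text between tags is collected.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Break lines after a few block-level elements so words do not run together.
            if ("p".equals(qName) || "title".equals(qName)) {
                text.append('\n');
            }
        }

        String getText() {
            return text.toString().trim();
        }
    }

    /** Feeds XHTML (standing in for a parser's HTMLified output) through the handler. */
    public static String toPlainText(String xhtml) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String htmlified = "<html><head><title>Report</title></head>"
                + "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>";
        // Prints the three text fragments on separate lines, with no tags.
        System.out.println(toPlainText(htmlified));
    }
}
```

With Tika itself, the same effect would presumably come from handing the parser a text-collecting handler (Tika ships org.apache.tika.sax.BodyContentHandler for this) rather than one that serializes the XHTML. No link extraction happens here because the handler never inspects anchor elements.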

> Create a document parser that doesn't HTMLify the results.
> ----------------------------------------------------------
>
>                 Key: DROIDS-81
>                 URL: https://issues.apache.org/jira/browse/DROIDS-81
>             Project: Droids
>          Issue Type: Bug
>          Components: tika
>    Affects Versions: 0.01
>            Reporter: Richard Frovarp
>            Priority: Minor
>         Attachments: TikaDocumentParser.java
>
>
> While the TikaHTMLParser can parse pdfs, docs, etc, it returns them in an HTMLified format. Solr blows up on that format, and it isn't always necessary to do this step anyway. 

[jira] Updated: (DROIDS-81) Create a document parser that doesn't HTMLify the results.

Posted by "Richard Frovarp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Frovarp updated DROIDS-81:
----------------------------------

    Attachment: tika-document-parser.patch

Initial implementation of a Tika document parser that does not HTMLify its results and does not look for links. This time in patch format with dependency added.

> Create a document parser that doesn't HTMLify the results.
> ----------------------------------------------------------
>
>                 Key: DROIDS-81
>                 URL: https://issues.apache.org/jira/browse/DROIDS-81
>             Project: Droids
>          Issue Type: Bug
>          Components: tika
>    Affects Versions: 0.01
>            Reporter: Richard Frovarp
>            Priority: Minor
>         Attachments: tika-document-parser.patch
>
>
> While the TikaHTMLParser can parse pdfs, docs, etc, it returns them in an HTMLified format. Solr blows up on that format, and it isn't always necessary to do this step anyway. 

[jira] Resolved: (DROIDS-81) Create a document parser that doesn't HTMLify the results.

Posted by "Thorsten Scherler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/DROIDS-81?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thorsten Scherler resolved DROIDS-81.
-------------------------------------

    Resolution: Fixed

Committed revision 939642.

> Create a document parser that doesn't HTMLify the results.
> ----------------------------------------------------------
>
>                 Key: DROIDS-81
>                 URL: https://issues.apache.org/jira/browse/DROIDS-81
>             Project: Droids
>          Issue Type: Bug
>          Components: tika
>    Affects Versions: 0.01
>            Reporter: Richard Frovarp
>            Priority: Minor
>         Attachments: tika-document-parser.patch
>
>
> While the TikaHTMLParser can parse pdfs, docs, etc, it returns them in an HTMLified format. Solr blows up on that format, and it isn't always necessary to do this step anyway. 
