Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2011/04/01 18:15:06 UTC

[jira] [Created] (TIKA-632) Rtf parsing ignores links

Rtf parsing ignores links
-------------------------

                 Key: TIKA-632
                 URL: https://issues.apache.org/jira/browse/TIKA-632
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Nick Burch
         Attachments: test.rtf

I spotted this while working on TIKA-631: when an RTF file contains links, each link is skipped over, and neither the link text nor the link href is output.

In the attached sample file (which is the RTF contents of /test-documents/test-outlook2003.msg), we should see things like:

<a href="http://r.office.microsoft.com/r/rlidOutlookWelcomeMail1?clid=1033">Streamlined Mail Experience</a> - Outlook

Instead, all we get is " - Outlook"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (TIKA-632) Rtf parsing ignores links

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-632:
---------------------------------------

    Assignee: Michael McCandless


[jira] [Updated] (TIKA-632) Rtf parsing ignores links

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-632:
----------------------------

    Attachment: test.rtf


[jira] [Commented] (TIKA-632) Rtf parsing ignores links

Posted by "Cristian Vat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080449#comment-13080449 ] 

Cristian Vat commented on TIKA-632:
-----------------------------------

Tika uses RTFEditorKit from javax.swing.text.rtf for the actual RTF parsing, and that doesn't seem to support links.

In the example you provided, links are actually marked using two methods:
- \htmlrtf tags, which are "Control Words Introduced by Specific/Other Microsoft Products"
- \field instances of type hyperlink, which seem to be the normal RTF way of adding links

However, the RTF parser in Swing ignores a lot of "unknown" control words, and it ignores \field completely.
For reference, there is a bug that was opened in 1999 and closed as "Will Not Fix" asking to enhance RTF parsing ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4261277 )

To quote Jukka from another issue: "there's little we can do about this as long as we're stuck with the Swing RTF parser".
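
For readers who have not used it, here is a minimal sketch of pulling plain text out of RTF through Swing's RTFEditorKit. The class name SwingRtfTextSketch and the helper extractText are made up for illustration, and this is only an assumption about roughly how such a parser can be driven, not a copy of Tika's code. Running it against an RTF snippet that contains a \field group is one way to check the behaviour described above on a given JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper, for illustration only (not Tika's implementation).
public class SwingRtfTextSketch {

    // Parses the RTF source with Swing's RTFEditorKit and returns the
    // plain text that ends up in the resulting Document.
    static String extractText(String rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            // RTF is a 7-bit format, so ASCII bytes are enough for this sketch.
            byte[] bytes = rtf.getBytes(StandardCharsets.US_ASCII);
            kit.read(new ByteArrayInputStream(bytes), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("RTF could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Tiny hand-written RTF document, not taken from test.rtf.
        System.out.println(extractText("{\\rtf1\\ansi Hello}"));
    }
}
```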


[jira] [Updated] (TIKA-632) Rtf parsing ignores links

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-632:
------------------------------------

    Attachment: TIKA-632.patch

Patch, adding hyperlink extraction to the RTF parser, and enabling the OutlookParserTest case (it passes).

I think it's ready to commit... I'll wait until after 0.10 is out.


[jira] [Commented] (TIKA-632) Rtf parsing ignores links

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113761#comment-13113761 ] 

Nick Burch commented on TIKA-632:
---------------------------------

Now that we have our own RTF parser, it may be possible to add this. As an example, the RTF from /test-documents/test-outlook2003.msg for a part containing a hyperlink is the delightful:

-----------
{\*\htmltag84 <I>}\htmlrtf {\i \htmlrtf0 If you want to let us know what you think about Outlook 2003, reply to this message. We're always looking for feedback from the people who use Outlook every day! If you would like to keep up with the latest information about Outlook, sign up for a free subscription to the 

{\*\htmltag84 <A HREF="http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033">}\htmlrtf {\field{\*\fldinst{HYPERLINK "http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033"}}{\fldrslt\cf1\ul \htmlrtf0 Inside Office Newsletter\htmlrtf }\htmlrtf0 \htmlrtf }\htmlrtf0 

{\*\htmltag92 </A>}. The newsletter will be sent to you by e-mail on a regular basis.
-----------
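
To make the structure of that snippet concrete, here is a rough sketch of recovering the href and the link text from a \field group of the shape shown above: {\*\fldinst{HYPERLINK "..."}} followed by {\fldrslt ...}. It uses a regular expression purely for illustration. The class RtfHyperlinkSketch and its Link type are hypothetical names, this is not the approach taken in TIKA-632.patch, and a real parser has to track RTF groups properly (nested groups inside \fldrslt, escaped characters, other field types) rather than pattern-match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; real RTF needs a group-aware parser.
public class RtfHyperlinkSketch {

    // Simple holder for one extracted hyperlink.
    static final class Link {
        final String href;
        final String text;

        Link(String href, String text) {
            this.href = href;
            this.text = text;
        }
    }

    // Matches \fldinst{HYPERLINK "url"}}{\fldrslt ...} where the result
    // group contains no nested braces. Group 1 is the URL, group 2 is the
    // raw content of the \fldrslt group.
    private static final Pattern FIELD = Pattern.compile(
        "\\\\fldinst\\s*\\{\\s*HYPERLINK\\s+\"([^\"]*)\"\\s*\\}\\s*\\}"
        + "\\s*\\{\\s*\\\\fldrslt([^{}]*)\\}");

    // Matches a control word such as \cf1, \ul or \htmlrtf0, together with
    // the single space that may terminate it.
    private static final Pattern CONTROL_WORD =
        Pattern.compile("\\\\[a-zA-Z]+-?\\d* ?");

    static List<Link> extractLinks(String rtf) {
        List<Link> links = new ArrayList<>();
        Matcher m = FIELD.matcher(rtf);
        while (m.find()) {
            // Drop the formatting control words, keep the visible text.
            String text = CONTROL_WORD.matcher(m.group(2)).replaceAll("").trim();
            links.add(new Link(m.group(1), text));
        }
        return links;
    }

    public static void main(String[] args) {
        String rtf = "{\\field{\\*\\fldinst{HYPERLINK "
            + "\"http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033\"}}"
            + "{\\fldrslt\\cf1\\ul \\htmlrtf0 Inside Office Newsletter\\htmlrtf }"
            + "\\htmlrtf0 \\htmlrtf }";
        for (Link link : extractLinks(rtf)) {
            System.out.println(link.text + " -> " + link.href);
        }
    }
}
```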




[jira] [Resolved] (TIKA-632) Rtf parsing ignores links

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-632.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
    