Posted to dev@tika.apache.org by "Arjohn Kampman (Created) (JIRA)" <ji...@apache.org> on 2011/11/11 21:10:51 UTC

[jira] [Created] (TIKA-782) Add support for parsing binary data in RTF files

Add support for parsing binary data in RTF files
------------------------------------------------

                 Key: TIKA-782
                 URL: https://issues.apache.org/jira/browse/TIKA-782
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Arjohn Kampman
         Attachments: bin.patch

The current RTF parser doesn't process \bin control words yet. These control words are followed by a specific amount of binary data. Because of this, the RTF parser trips over some of these bytes in a number of (classified) documents.

I've implemented processing of the \bin control word, but it required changes to the core parsing algorithm. IMHO, it also improved readability of the code. I hope you will accept this patch. Please let me know if the patch requires modifications.

Apart from the \bin control word, this patch also makes the parser stop after reading the document-closing '}' character. In a number of files (again, classified), the parser would include non-readable characters that appeared after this closing brace.
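
To illustrate the "stop after the document-closing '}'" behaviour described above, here is a minimal sketch that finds where the outermost RTF group ends by tracking brace depth. It is not the code from the patch: the class name RtfDocumentEnd and the method endOfDocument are hypothetical, and the sketch deliberately ignores \bin payloads (which may contain raw brace bytes) to stay short.

```java
import java.nio.charset.StandardCharsets;

final class RtfDocumentEnd {
    /**
     * Returns the index just past the '}' that closes the outermost group,
     * or -1 if that group is never closed. Hypothetical helper, not Tika code.
     */
    static int endOfDocument(byte[] data) {
        int depth = 0;
        for (int i = 0; i < data.length; i++) {
            byte b = data[i];
            if (b == '\\') {
                i++; // skip the character after a backslash, e.g. \{ or \}
            } else if (b == '{') {
                depth++;
            } else if (b == '}') {
                depth--;
                if (depth == 0) {
                    return i + 1; // everything from here on is trailing junk
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "{\\rtf1 Hello}trailing".getBytes(StandardCharsets.US_ASCII);
        int end = endOfDocument(data);
        System.out.println(new String(data, 0, end, StandardCharsets.US_ASCII));
    }
}
```

A parser that stops at the returned offset never sees the bytes after the closing brace, which is the effect the description asks for.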

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-782:
---------------------------------------

    Assignee: Michael McCandless
    

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150719#comment-13150719 ] 

Arjohn Kampman commented on TIKA-782:
-------------------------------------

I've attached an improved patch that actually reads the binary data into an array. Apparently, InputStream.skip(long) can skip past the end of the file and still report that it skipped the specified number of bytes.
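
The approach of reading the data instead of skipping it can be sketched as follows. This is an illustration, not the patch itself: the names ReadExactly and readExactly are hypothetical. The point is that read() returns -1 at end of stream, so a truncated \bin payload is detected, whereas the FileInputStream documentation notes that skip() may skip more bytes than remain in the file without raising an exception.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ReadExactly {
    /**
     * Reads exactly count bytes into a new array, or throws EOFException
     * if the stream ends first. Hypothetical helper, not Tika code.
     */
    static byte[] readExactly(InputStream in, int count) throws IOException {
        if (count < 0) {
            throw new IllegalArgumentException("negative count: " + count);
        }
        byte[] data = new byte[count];
        int offset = 0;
        while (offset < count) {
            // read() may return fewer bytes than requested, so loop
            int n = in.read(data, offset, count - offset);
            if (n < 0) {
                throw new EOFException("expected " + count + " bytes, got " + offset);
            }
            offset += n;
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        System.out.println(readExactly(in, 3).length);
    }
}
```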
                

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152246#comment-13152246 ] 

Michael McCandless commented on TIKA-782:
-----------------------------------------

OK looks great Arjohn!  Do you have an example RTF doc with \bin that we can use as a test case...?
                

        

[jira] [Issue Comment Edited] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152076#comment-13152076 ] 

Arjohn Kampman edited comment on TIKA-782 at 11/17/11 2:36 PM:
---------------------------------------------------------------

I'll make the necessary changes.

Would you mind if I changed the <long> parameter of processControlWord() to an <int>? There's a comment above that method that says:

{quote}
// Param is long because spec says max value is 1+ Integer.MAX_VALUE!
{quote}

However, Microsoft's RTF 1.9.1 spec says:

{quote}
The range of the values for the number is nominally -32768 through 32767, i.e., a signed 16-bit integer. A small number of control words take values in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer).
{quote}

This is exactly the range of Java's int. I'm not sure which spec the comment refers to, though.
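
A small sketch of what the range quoted from the 1.9.1 spec means in Java terms. The class ControlWordParam and its parseParam method are hypothetical and only illustrate the bounds; they are not the signature of processControlWord().

```java
final class ControlWordParam {
    /**
     * Parses a control-word parameter and rejects anything outside the
     * 32-bit signed range. Hypothetical helper, not Tika code.
     */
    static int parseParam(String digits) {
        long value = Long.parseLong(digits); // throws NumberFormatException on garbage
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new NumberFormatException("parameter out of 32-bit range: " + digits);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(parseParam("32767"));
        System.out.println(parseParam("-2147483648") == Integer.MIN_VALUE);
    }
}
```

Under the 1.9.1 wording both endpoints fit in an int; the value 2,147,483,648 (1 + Integer.MAX_VALUE) would be rejected by this sketch.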

                
                  

       

[jira] [Resolved] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-782.
-------------------------------------

    Resolution: Fixed

I made minor edits (fixing up whitespace; removing unused param), and whittled back the test case to a smaller size while still showing fail + pass.

Thanks Arjohn!
                

        

[jira] [Updated] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-782:
--------------------------------

    Attachment: logo.zip

Bingo, found one in the published Enron data. It's an RTF with the Enron logo. No text though.
                

        

[jira] [Updated] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-782:
--------------------------------

    Attachment: bin2.patch

improved patch
                

        

[jira] [Updated] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-782:
--------------------------------

    Attachment: bin.patch

Patch adding \bin support.
                

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152252#comment-13152252 ] 

Arjohn Kampman commented on TIKA-782:
-------------------------------------

Unfortunately, the one that I used is confidential. I've been searching for another file but haven't found one yet. Apparently, most images are encoded as hex values. I'll let you know if I manage to find one.

As an alternative, I can extend an existing RTF file with a fake binary section. Would that work for you?
                

        

[jira] [Updated] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-782:
--------------------------------

    Attachment: bin3.patch

New patch with the requested changes.
                

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152119#comment-13152119 ] 

Michael McCandless commented on TIKA-782:
-----------------------------------------


bq. I'll make the necessary changes.

Thanks!

bq. Do you mind if I changed the <long> parameter of processControlWord() to an <int>.

Hrm, the 1.8 spec says this:

{noformat}
The parameter can be a positive or negative number. The range of the
values for the number is generally –32767 through 32767. However, Word
tends to restrict the range to –31680 through 31680 and also allows
values in the range –2,147,483,648 to 2,147,483,648 for a small number
of keywords (specifically \bin, \revdttm, and some picture
properties). An RTF parser must allow an arbitrary string of digits as
a legal value for a keyword (providing it does not exceed value ranges
noted earlier). 
{noformat}

But you're right the 1.9 spec changed the max to
Integer.MAX_VALUE... I guess they changed it for 1.9, and this was a
"bug" in the 1.8 spec ;)

So I think int is safe!

                

       

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152288#comment-13152288 ] 

Michael McCandless commented on TIKA-782:
-----------------------------------------

That works for me: pre-patch we extract this binary content as if it were text; post patch we extract nothing.  I'll add a test case based on this and commit....

Thanks Arjohn!
                

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152041#comment-13152041 ] 

Michael McCandless commented on TIKA-782:
-----------------------------------------

These changes look great!

Cutting over to PushbackInputStream, and then using it to look ahead
for '*' after '{' to know we should ignore the group state, is
great; much simpler than before.
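
As a rough sketch of this kind of lookahead (not the code in the patch), the following peeks at the bytes after an opening brace and pushes them back if they are not the \* marker that RTF uses for ignorable destinations ({\*\keyword ...}). The class GroupLookahead and the method startsIgnorableDestination are hypothetical names; the stream needs a pushback buffer of at least two bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class GroupLookahead {
    /**
     * Call right after '{' has been consumed. Returns true and consumes
     * the two bytes "\*" if the group starts with that marker; otherwise
     * pushes back whatever was read and returns false.
     * Hypothetical helper, not Tika code.
     */
    static boolean startsIgnorableDestination(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first != '\\') {
            if (first != -1) {
                in.unread(first);
            }
            return false;
        }
        int second = in.read();
        if (second == '*') {
            return true;
        }
        // Unread in reverse order so the next read() returns '\\' again.
        if (second != -1) {
            in.unread(second);
        }
        in.unread('\\');
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] group = "\\*\\generator x".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(group), 2);
        System.out.println(startsIgnorableDestination(in));
    }
}
```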

I like the new broken out methods for parsing control token/word, hex
char.

Since addOutputByte now takes int instead of byte, could you add an
assert that the int is "in bounds" for byte?

It makes me nervous creating a new StringBuilder and String for every
control word; can we go back to our own reused char[] buffer/ equals
method?
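
One way such a reused buffer with its own equals-style check might look is sketched below. The class ControlWordBuffer and its methods are hypothetical, and the capacity of 32 chars is an assumption made for the illustration; the idea is only that comparing the buffer against a constant needs no new String or StringBuilder per control word.

```java
final class ControlWordBuffer {
    // Assumed capacity for this sketch; allocated once and reused.
    private final char[] buf = new char[32];
    private int len;

    /** Starts a new control word without allocating. */
    void reset() {
        len = 0;
    }

    void append(char c) {
        if (len == buf.length) {
            throw new IllegalStateException("control word too long");
        }
        buf[len++] = c;
    }

    /** Compares the buffered chars with a constant, char by char. */
    boolean matches(String word) {
        if (word.length() != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (buf[i] != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ControlWordBuffer word = new ControlWordBuffer();
        for (char c : new char[] {'b', 'i', 'n'}) {
            word.append(c);
        }
        System.out.println(word.matches("bin"));
    }
}
```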

Instead of allocating a full byte[] for the "bin" control word, can we,
say, allocate (up to) a fixed buffer size and read in chunks until
we've skipped that many bytes?  This way if an RTF doc has massive
embedded binary data we don't use lots of RAM when skipping it.
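
The chunked skipping suggested here could be sketched like this. The names BinSkipper and skipBinary, and the 4096-byte cap, are assumptions for the illustration rather than what was committed. Memory use is bounded by the cap regardless of the \bin length, and reading (rather than calling skip) still detects a truncated file.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class BinSkipper {
    // Arbitrary cap chosen for this sketch.
    private static final int MAX_CHUNK = 4096;

    /**
     * Discards count bytes by reading them in chunks of at most MAX_CHUNK.
     * Throws EOFException if the stream ends early.
     * Hypothetical helper, not Tika code.
     */
    static void skipBinary(InputStream in, int count) throws IOException {
        if (count <= 0) {
            return;
        }
        byte[] buffer = new byte[Math.min(count, MAX_CHUNK)];
        int remaining = count;
        while (remaining > 0) {
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("stream ended with " + remaining
                        + " of " + count + " bytes left to skip");
            }
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10001];
        data[10000] = 42; // first byte after the "binary" payload
        InputStream in = new ByteArrayInputStream(data);
        skipBinary(in, 10000);
        System.out.println(in.read());
    }
}
```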

                

        

[jira] [Commented] (TIKA-782) Add support for parsing binary data in RTF files

Posted by "Arjohn Kampman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152076#comment-13152076 ] 

Arjohn Kampman commented on TIKA-782:
-------------------------------------

I'll make the necessary changes.

Would you mind if I changed the <long> parameter of processControlWord() to an <int>? There's a comment above that method that says:

    // Param is long because spec says max value is 1+ Integer.MAX_VALUE!

However, Microsoft's RTF 1.9.1 spec says:

    The range of the values for the number is nominally -32768 through 32767,
    i.e., a signed 16-bit integer. A small number of control words take values
    in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer).

This is exactly the range of Java's int. I'm not sure which spec the comment refers to, though.

                