Posted to dev@tika.apache.org by "Arjohn Kampman (Created) (JIRA)" <ji...@apache.org> on 2011/11/08 17:47:51 UTC

[jira] [Created] (TIKA-777) RTF parser incorrectly applies fonts to complete group

RTF parser incorrectly applies fonts to complete group
------------------------------------------------------

                 Key: TIKA-777
                 URL: https://issues.apache.org/jira/browse/TIKA-777
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
            Reporter: Arjohn Kampman


Tika's RTF parser processes the following RTF fragment incorrectly, applying the wrong character encoding to the parsed characters:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}

This document contains Russian characters (\f1), but Tika decodes them as Latin because of the \f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes buffer before processing directives such as these.
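
The effect described above can be reproduced with a small toy model. This is only a sketch and not Tika's actual code: the function decode_run, its event list, and the font-to-codec table are hypothetical names made up for illustration. It assumes that \fcharset0 maps to windows-1252 and \fcharset204 to windows-1251, and that the parser collects \'xx bytes in a pending buffer which it decodes with whatever font is current at decode time.

```python
# Toy model of the reported behaviour; NOT Tika's implementation.
# Assumption: font number -> codec, taken from the \fonttbl above
# (\f0 is \fcharset0 = windows-1252, \f1 is \fcharset204 = windows-1251).
FONT_CODECS = {0: "cp1252", 1: "cp1251"}

# The bytes spelled out by the \'xx escapes, the literal space and the "!".
RUN = bytes.fromhex("d3e2e0e6e0e5ecfbe920eaebe8e5edf221")

# Hypothetical event stream for the group {\f1\fs20 ... \f0}.
EVENTS = [("font", 1), ("bytes", RUN), ("font", 0)]


def decode_run(events, flush_on_font_change):
    """Decode buffered bytes, optionally flushing before each font switch."""
    font = 0                # \deff0: the default font is f0
    pending = bytearray()   # plays the role of the pendingBytes buffer
    out = []
    for kind, value in events:
        if kind == "font":
            if flush_on_font_change and pending:
                # Decode what was collected under the OLD font first.
                out.append(pending.decode(FONT_CODECS[font]))
                pending.clear()
            font = value
        else:
            pending.extend(value)
    if pending:
        # End of group: decode whatever is left with the current font.
        out.append(pending.decode(FONT_CODECS[font]))
    return "".join(out)


if __name__ == "__main__":
    print(decode_run(EVENTS, flush_on_font_change=False))  # buggy order
    print(decode_run(EVENTS, flush_on_font_change=True))   # flush first
```

Without the flush, the bytes are still pending when \f0 arrives and are decoded as windows-1252. With the flush, they are decoded as windows-1251 before the font changes.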

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (TIKA-777) RTF parser incorrectly applies fonts to complete group

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-777.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Thanks Arjohn!
                


        

[jira] [Assigned] (TIKA-777) RTF parser incorrectly applies fonts to complete group

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-777:
---------------------------------------

    Assignee: Michael McCandless
    


        

[jira] [Updated] (TIKA-777) RTF parser incorrectly applies fonts to complete group

Posted by "Arjohn Kampman (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arjohn Kampman updated TIKA-777:
--------------------------------

    Description: 
Tika's RTF parser processes the following RTF document incorrectly, applying the wrong character encoding to the parsed characters:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}

This document contains Russian characters (\f1), but Tika decodes them as Latin because of the \f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes buffer before processing directives such as these.

  was:
Tika's RTF parser processes the following RTF fragment incorrectly, applying the wrong character encoding to the parsed characters:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0
{\fonttbl
{\f0\fswiss\fcharset0 Arial;}
{\f1\fswiss\fcharset204 Arial;}
}
{\f1\fs20 \'d3\'e2\'e0\'e6\'e0\'e5\'ec\'fb\'e9 \'ea\'eb\'e8\'e5\'ed\'f2!\f0}\par
}

This document contains Russian characters (\f1), but Tika decodes them as Latin because of the \f0 directive at the end of the group. The RTF parser should probably flush its pendingBytes buffer before processing directives such as these.

    
