Posted to dev@tika.apache.org by "Aleksandr Dubinsky (JIRA)" <ji...@apache.org> on 2014/05/24 07:19:01 UTC

[jira] [Created] (TIKA-1309) RTF TextExtractor can ignore consecutive linebreaks

Aleksandr Dubinsky created TIKA-1309:
----------------------------------------

             Summary: RTF TextExtractor can ignore consecutive linebreaks
                 Key: TIKA-1309
                 URL: https://issues.apache.org/jira/browse/TIKA-1309
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.5, 1.6
            Reporter: Aleksandr Dubinsky


Some RTF files encode consecutive line breaks simply as consecutive \par control words. However, org.apache.tika.parser.rtf.TextExtractor ignores the second \par, so the blank line is missing from the extracted text.

The solution is to replace the following code at line 1158 of TextExtractor:

        } else if (equals("par")) {
            if (!ignored) {
                endParagraph(true);
            }
        }

with:


        } else if (equals("par")) {
            if (!ignored) {
                lazyStartParagraph();
                endParagraph(true);
            }
        }
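
To illustrate why calling lazyStartParagraph() first should help, here is a minimal sketch of the suspected mechanism. It is a toy model, not Tika's actual TextExtractor: the ParSketch class, its inParagraph flag, and the <p>/</p> string output are all hypothetical, and it assumes that endParagraph() does nothing when no paragraph is open, which would explain why the second \par is dropped.

```java
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Tika code): tracks only whether a paragraph
// is open and appends <p>/</p> markers to a StringBuilder.
final class ParSketch {
    private final StringBuilder out = new StringBuilder();
    private final boolean startBeforeEnd; // true = behaviour with the proposed fix
    private boolean inParagraph = false;

    ParSketch(boolean startBeforeEnd) {
        this.startBeforeEnd = startBeforeEnd;
    }

    // Opens a paragraph only if none is open yet.
    private void lazyStartParagraph() {
        if (!inParagraph) {
            out.append("<p>");
            inParagraph = true;
        }
    }

    // Assumption: closing is a no-op when no paragraph is open.
    private void endParagraph() {
        if (inParagraph) {
            out.append("</p>");
            inParagraph = false;
        }
    }

    private void text(String s) {
        lazyStartParagraph();
        out.append(s);
    }

    // Handles a \par control word, with or without the proposed fix.
    private void par() {
        if (startBeforeEnd) {
            lazyStartParagraph(); // guarantees there is a paragraph to close
        }
        endParagraph();
    }

    static String extract(boolean startBeforeEnd, List<String> tokens) {
        ParSketch sketch = new ParSketch(startBeforeEnd);
        for (String token : tokens) {
            if (token.equals("\\par")) {
                sketch.par();
            } else {
                sketch.text(token);
            }
        }
        sketch.endParagraph(); // close a trailing paragraph, if any
        return sketch.out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("one", "\\par", "\\par", "two", "\\par");
        System.out.println(extract(false, tokens)); // <p>one</p><p>two</p>
        System.out.println(extract(true, tokens));  // <p>one</p><p></p><p>two</p>
    }
}
```

In this model, the second \par arrives when no paragraph is open, so without the extra call nothing is emitted for it. With the call, an empty paragraph is opened and immediately closed, which preserves the blank line.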



--
This message was sent by Atlassian JIRA
(v6.2#6252)