Posted to dev@tika.apache.org by "Steven Van Ingelgem (JIRA)" <ji...@apache.org> on 2019/03/25 05:19:00 UTC

[jira] [Created] (TIKA-2843) Question about strange characters in the output

Steven Van Ingelgem created TIKA-2843:
-----------------------------------------

             Summary: Question about strange characters in the output
                 Key: TIKA-2843
                 URL: https://issues.apache.org/jira/browse/TIKA-2843
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.20
            Reporter: Steven Van Ingelgem


I started my server like this: "java -jar tika-server-1.20.jar -server"
 
I was extracting information from a RAR file and noticed that a lot of garbage was included in the output.
The archive contains native binaries (.so), .class and .java files, as well as some additional resources such as PNG images, HTML files and certificates.
 
I can consistently reproduce the problem with that RAR file, but I cannot share it.
So I tried to reproduce it another way, by sending the Tika server jar itself to the server:
curl -T tika-server-1.20.jar localhost:9998/tika --header "Accept: text/plain" > out.txt
 
This shows the same problem, although not as badly as with the RAR file:
{code}
schemaorg_apache_xmlbeans/system/sD023D6490046BA0250A839A9AD24C443/agautoformatattributegroup.xsb
Úzº¾�����������9[http://schemas.openxmlformats.org/spreadsheetml/2006/main]�^MAG_AutoFormat��unqualified�8<xsd:attributeGroup name="AG_AutoFormat" xmlns="[http://schemas.openxmlformats.org/spreadsheetml/2006/main]" xmlns:xsd="[http://www.w3.org/2001/XMLSchema]">
    <xsd:attribute name="autoFormatId" type="xsd:unsignedInt">
      <xsd:annotation>
{code} 

My first question is: why is this binary data included in the text output?
 
Some tests with the RAR file (not the Tika jar) showed that each file, when submitted separately, is extracted properly (meaning: I get proper text).
In addition, when I delete files from the RAR archive, some files that were not extracted properly before are then extracted properly.
 
Furthermore, I noticed a distinctive byte pattern in the output: EF BF BD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character.
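
To see where the pattern occurs and how long the runs are, something along these lines could be run against the saved output. It is only a sketch: the function name find_replacement_runs is made up for illustration, and it assumes the output was saved as raw bytes (as the curl redirect above does).

```python
# UTF-8 encoding of U+FFFD, the Unicode replacement character.
REPLACEMENT_UTF8 = b"\xef\xbf\xbd"


def find_replacement_runs(data: bytes):
    """Return (byte_offset, run_length) for each run of consecutive EF BF BD sequences."""
    runs = []
    pos = 0
    while True:
        start = data.find(REPLACEMENT_UTF8, pos)
        if start == -1:
            break
        end = start
        # Extend the run while another EF BF BD follows immediately.
        while data.startswith(REPLACEMENT_UTF8, end):
            end += len(REPLACEMENT_UTF8)
        runs.append((start, (end - start) // len(REPLACEMENT_UTF8)))
        pos = end
    return runs
```

For example, find_replacement_runs(open("out.txt", "rb").read()) lists every offset at which a run starts, which makes it easier to compare the affected positions with the file names printed just before them.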
 
It does not happen with every RAR file, though. As a test I downloaded the "br" dump from Wikimedia, packed it into a RAR archive and ran "/rmeta/text" on it, and that was extracted properly.
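
A per-entry breakdown of the "/rmeta/text" response might help narrow down which embedded files are affected. The sketch below makes several assumptions: the server listens on localhost:9998 as above, the response is a JSON list with one metadata object per entry, and the keys "X-TIKA:content", "X-TIKA:embedded_resource_path" and "resourceName" are what I believe Tika 1.x emits (please check them against an actual response). The function names are hypothetical.

```python
import json
import urllib.request


def fetch_rmeta_text(path, url="http://localhost:9998/rmeta/text"):
    """PUT a file to the (assumed) /rmeta/text endpoint and return the parsed JSON list."""
    with open(path, "rb") as f:
        request = urllib.request.Request(url, data=f.read(), method="PUT")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def replacement_counts(entries):
    """Return (entry_name, number_of_U+FFFD) pairs, worst entries first."""
    counts = []
    for entry in entries:
        # Key names are assumptions; the container itself has no embedded path.
        name = (entry.get("X-TIKA:embedded_resource_path")
                or entry.get("resourceName")
                or "<container>")
        content = entry.get("X-TIKA:content") or ""
        counts.append((str(name), content.count("\ufffd")))
    return sorted(counts, key=lambda item: item[1], reverse=True)
```

Calling replacement_counts(fetch_rmeta_text("archive.rar")), with a placeholder file name, would then show whether the replacement characters cluster in particular entries or are spread over all of them.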
 
So my guess is that some kind of buffer is overflowing into the next text extraction?
 
 
What could I do to debug this in more depth, or what additional information could I provide to help you tackle this bug?


