Posted to dev@tika.apache.org by "Steven Van Ingelgem (JIRA)" <ji...@apache.org> on 2019/03/25 05:19:00 UTC
[jira] [Created] (TIKA-2843) Question about strange characters in the output
Steven Van Ingelgem created TIKA-2843:
-----------------------------------------
Summary: Question about strange characters in the output
Key: TIKA-2843
URL: https://issues.apache.org/jira/browse/TIKA-2843
Project: Tika
Issue Type: Bug
Affects Versions: 1.20
Reporter: Steven Van Ingelgem
I started my server like this: "java -jar tika-server-1.20.jar -server"
I was extracting text from a RAR file and noticed that a LOT of weird output was included.
The archive contained binaries (.so), .class and .java files, as well as some additional resources such as PNG, HTML and certificates.
The problem is that I can consistently reproduce it with that RAR file, but I cannot share it.
So I tried to reproduce it this way:
"curl -T tika-server-1.20.jar localhost:9998/tika --header "Accept: text/plain" > out.txt"
This showed the same problem, just not as badly as with the RAR :(
{code}
schemaorg_apache_xmlbeans/system/sD023D6490046BA0250A839A9AD24C443/agautoformatattributegroup.xsb
Úzº¾�����������9[http://schemas.openxmlformats.org/spreadsheetml/2006/main]�^MAG_AutoFormat��unqualified�8<xsd:attributeGroup name="AG_AutoFormat" xmlns="[http://schemas.openxmlformats.org/spreadsheetml/2006/main]" xmlns:xsd="[http://www.w3.org/2001/XMLSchema]">
<xsd:attribute name="autoFormatId" type="xsd:unsignedInt">
<xsd:annotation>
{code}
My first question is: why is this output at all?
Some tests with the RAR file (not the Tika jar) showed me that each file, when processed separately, is extracted properly (meaning: I get proper text).
In addition, when I delete files from the RAR, some files that were not extracted properly before are extracted properly.
Furthermore, I noticed a distinctive byte pattern: EF BF BD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character.
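To quantify how often that pattern shows up, the saved output could be scanned for the byte sequence. This is only a minimal sketch, assuming Python 3 and that the output was saved as raw bytes (for example the out.txt from the curl call above); the function names are hypothetical and not part of Tika.

```python
# UTF-8 encoding of U+FFFD (the Unicode replacement character): EF BF BD
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")


def count_replacement_chars(data: bytes) -> int:
    """Count non-overlapping occurrences of EF BF BD in raw output bytes."""
    return data.count(REPLACEMENT_BYTES)


def replacement_ratio(data: bytes) -> float:
    """Fraction of the output bytes that belong to replacement characters."""
    if not data:
        return 0.0
    return count_replacement_chars(data) * len(REPLACEMENT_BYTES) / len(data)
```

Reading out.txt in binary mode (open("out.txt", "rb").read()) and passing the bytes to count_replacement_chars gives a rough measure of how much of the output was undecodable input.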
But it does not happen with every RAR: as a test I downloaded the "br" dump of Wikimedia, packed it into a RAR, and ran "/rmeta/text" on it, and that extracted properly.
So I'm guessing some kind of buffer is overflowing into the next text extraction?
What could I do to debug this more in-depth and/or provide you with some more info so you could tackle this bug?
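One way to narrow this down might be to look at the "/rmeta/text" response per embedded file and list which entries contain replacement characters. The sketch below assumes Python 3 and that the response is a JSON array of metadata objects in which the extracted text is stored under "X-TIKA:content" and the path inside the archive under "X-TIKA:embedded_resource_path"; those key names are my assumption about the output format, and the function name is hypothetical.

```python
import json

# Assumed metadata keys in the /rmeta/text JSON output (not verified here).
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def suspicious_entries(rmeta_json: str, threshold: int = 1):
    """Return (path, hits) pairs for entries whose text contains U+FFFD.

    Entries are sorted by the number of replacement characters, highest first.
    """
    entries = json.loads(rmeta_json)
    report = []
    for index, entry in enumerate(entries):
        content = entry.get(CONTENT_KEY) or ""
        hits = content.count("\ufffd")
        if hits >= threshold:
            # The container itself usually has no embedded path, so label it.
            fallback = "<container>" if index == 0 else f"<entry {index}>"
            report.append((entry.get(PATH_KEY, fallback), hits))
    return sorted(report, key=lambda item: -item[1])
```

Running this on the saved JSON response for the problematic RAR would show whether the strange characters cluster in the binary entries (.so, .class, .xsb) or also leak into entries that extract cleanly on their own.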
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)