Posted to user@tika.apache.org by Steven Van Ingelgem <st...@vaningelgem.be> on 2019/03/23 14:33:31 UTC

Question about strange characters in the output

Hello all,


I started my server like this: "java -jar tika-server-1.20.jar -server"

I was extracting information from a RAR file and noticed that a LOT of
weird output was included.
The archive contained binaries (.so) as well as .class and .java files.

The problem is that I can consistently reproduce it with that rar file, but
I cannot share it.
So I tried:
    curl -T tika-server-1.20.jar localhost:9998/tika --header "Accept: text/plain" > out.txt

This showed the same problem, though less severely than with the RAR file:
schemaorg_apache_xmlbeans/system/sD023D6490046BA0250A839A9AD24C443/agautoformatattributegroup.xsb
Úzº¾�����������9http://schemas.openxmlformats.org/spreadsheetml/2006/main�^MAG_AutoFormat��unqualified�8<xsd:attributeGroup
name="AG_AutoFormat" xmlns="
http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:xsd="
http://www.w3.org/2001/XMLSchema">
    <xsd:attribute name="autoFormatId" type="xsd:unsignedInt">
      <xsd:annotation>

My question is: why is this binary data included in the output?

Some tests with the RAR file (not the Tika jar) showed me that each file is
extracted properly when processed separately (meaning: I get text).
In addition, when I delete files from the RAR, some files that were not
extracted properly before are now extracted properly.

Furthermore, I noticed a distinctive byte pattern: EF BF BD, which is the
UTF-8 encoding of U+FFFD, the Unicode replacement character.
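
To see where these sequences cluster in the saved output, something like the following could help. It is only a minimal sketch, assuming the output was saved to a file such as out.txt by the curl command; the function names are made up for illustration and are not part of Tika.

```python
# UTF-8 encoding of U+FFFD, the Unicode replacement character.
REPLACEMENT = b"\xef\xbf\xbd"


def find_replacement_runs(data: bytes):
    """Return (byte_offset, count) for each run of consecutive U+FFFD sequences."""
    runs = []
    step = len(REPLACEMENT)
    pos = 0
    while True:
        start = data.find(REPLACEMENT, pos)
        if start == -1:
            break
        end = start
        # Extend the run while the next three bytes are again EF BF BD.
        while data.startswith(REPLACEMENT, end):
            end += step
        runs.append((start, (end - start) // step))
        pos = end
    return runs


def summarize(path):
    """Read a saved output file (hypothetical path) and return (total, runs)."""
    with open(path, "rb") as fh:
        data = fh.read()
    runs = find_replacement_runs(data)
    return sum(count for _, count in runs), runs
```

Calling summarize("out.txt") would give the total number of replacement characters and the byte offsets of each run, which can then be matched against the file names printed nearby in the output.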

But it does not happen with every RAR. As a test, I downloaded the "br" dump
of Wikimedia, packed it into a RAR, and ran "/rmeta/text" on it; that was
extracted properly.
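
To narrow down which embedded files produce the garbage, the /rmeta/text response could be scanned per entry. This is a rough sketch under assumptions: it assumes the response is a JSON array of metadata objects, that the extracted text sits under the key "X-TIKA:content", and that embedded files carry "X-TIKA:embedded_resource_path". Those key names should be checked against the actual response, and the function name is hypothetical.

```python
import json

# Assumed metadata keys in the /rmeta output; verify against a real response.
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def suspicious_entries(rmeta_json: str, threshold: int = 1):
    """Return (path, count) for entries whose text has >= threshold U+FFFD chars."""
    entries = json.loads(rmeta_json)
    hits = []
    for index, meta in enumerate(entries):
        text = meta.get(CONTENT_KEY) or ""
        count = text.count("\ufffd")
        if count >= threshold:
            # The container itself may have no embedded path, so fall back to its index.
            hits.append((meta.get(PATH_KEY, f"<entry {index}>"), count))
    return hits
```

Feeding it the saved response body (for example the contents of a file written by curl) would list the embedded paths whose extracted text contains replacement characters.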

So I'm guessing some kind of buffer is overflowing into the next
text extraction?


What could I do to debug this more in-depth and/or provide the devs with
some more info so they could tackle it for me?


Thanks!