Posted to user@tika.apache.org by "Taylor, Wade" <wt...@ptfs.com> on 2012/04/17 15:07:54 UTC

Problem detecting XML

Hi, I'm having trouble detecting a file as "application/xml". When I detect
a URL containing XML the detection works and I get "application/xml" as the
media type.

The output of running:

new Tika().detect(this.getClass().getResourceAsStream("/xml/sample.xml"))

is "text/plain".

And the sample.xml file is:

<?xml version="1.0" encoding="UTF-8"?>

<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#">
    <release id="02232360-337e-4a3f-ad20-6cdd4c34288c" type="Album Official">
        <title>Little Earthquakes</title>
        <text-representation language="ENG" script="Latn"/>
        <asin>B000002IT2</asin>
        <artist id="c0b2500e-0cef-4130-869d-732b23ed9df5" type="Person">
            <name>Tori Amos</name>
            <sort-name>Amos, Tori</sort-name>
            <life-span begin="1963-08-22"/>
        </artist>
        <release-event-list>
            <event date="1992-01-13" country="GB"/>
            <event date="1992-01-17" country="DE"/>
            <event date="1992-02-25" country="US"/>
        </release-event-list>
        <disc-list>
            <disc id="ILKp3.bZmvoMO7wSrq1cw7WatfA-"/>
            <disc id="ejdrdtX1ZyvCb0g6vfJejVaLIK8-"/>
            <disc id="Y96eDQZbF4Z26Y5.Sxdbh3wGypo-"/>
        </disc-list>
        <track-list count="12"/>
    </release>
</metadata>

Is there something else I'm missing in order to detect a file as XML?

Thanks,
Wade

Re: Problem detecting XML

Posted by "Taylor, Wade" <wt...@ptfs.com>.
Hi, thanks for all the help. I really appreciate it.

I tried your test and it worked for me too, so I started examining the
maven dependency tree for my project looking for conflicts, etc. My project
has a dependency on jaxen-1.1.1 which in turn has a dependency on
xercesImpl-2.6.2. I found that excluding the xercesImpl dependency fixed my
problem with detecting XML.

For example, in my pom.xml I added:

<dependency>
    <groupId>jaxen</groupId>
    <artifactId>jaxen</artifactId>
    <version>1.1.1</version>
    <exclusions>
        <exclusion>
            <groupId>xerces</groupId>
            <artifactId>xercesImpl</artifactId>
        </exclusion>
    </exclusions>
</dependency>

I guess the services in xercesImpl were overriding the built-in (I'm using
Java 6) XML APIs?
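
One way to check that guess is to ask JAXP which parser implementation it
actually hands out. The sketch below uses only JDK classes; the class name
XmlImplCheck and the helper describe() are made up for illustration. If the
output names org.apache.xerces.jaxp.* loaded from a xercesImpl jar rather than
the JDK's own com.sun.org.apache.xerces.internal.jaxp.*, a third-party parser
has been picked up through the JAXP factory lookup. That this is what changed
the detection result is an assumption on my part, consistent with the
exclusion fixing it.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic class, not part of Tika.
final class XmlImplCheck {

    // Returns "implementation.class.Name (where it was loaded from)".
    static String describe(Object factory) {
        Class<?> impl = factory.getClass();
        CodeSource src = impl.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no code source.
        String where = (src == null || src.getLocation() == null)
                ? "no code source, most likely the JDK itself"
                : src.getLocation().toString();
        return impl.getName() + " (" + where + ")";
    }

    public static void main(String[] args) {
        System.out.println("SAX: " + describe(SAXParserFactory.newInstance()));
        System.out.println("DOM: " + describe(DocumentBuilderFactory.newInstance()));
    }
}
```

Running it once with and once without the xercesImpl exclusion should show
whether the factory implementation changes between the two builds.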

Regards,
Wade



On Tue, Apr 17, 2012 at 3:27 PM, Nick Burch <ni...@alfresco.com> wrote:

> On Tue, 17 Apr 2012, Taylor, Wade wrote:
>
>> Since I couldn't get that to work I went back to basics and tried a
>> simple XML string:
>>
>> new Tika().detect(new ByteArrayInputStream(
>>     "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()));
>>
>> but this gets detected as "text/plain" too and I can't figure out why it's
>> not coming back as "application/xml".
>>
>
> I've just tried this with a very simple test class:
>
> import org.apache.tika.*;
> import java.io.*;
> public class Test {
>   public static void main(String[] a) throws Exception {
>      System.out.println(
>        new Tika().detect(new ByteArrayInputStream(
>          "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()))
>      );
>   }
> }
>
> When I run it, it works fine:
>   java -classpath tika-core-1.2-SNAPSHOT.jar:. Test
>   application/xml
>
> Looks to me like you've managed to miss some key parts of Tika out when
> you added it to your application. I'm not sure which bits you missed, and
> how it hasn't blown up complaining, but it does seem to me that it's your
> environment that's stuffed...
>
> Nick
>

Re: Problem detecting XML

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 17 Apr 2012, Taylor, Wade wrote:
> Since I couldn't get that to work I went back to basics and tried a 
> simple XML string:
>
> new Tika().detect(new ByteArrayInputStream(
>     "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()));
>
> but this gets detected as "text/plain" too and I can't figure out why it's
> not coming back as "application/xml".

I've just tried this with a very simple test class:

import org.apache.tika.*;
import java.io.*;
public class Test {
    public static void main(String[] a) throws Exception {
       System.out.println(
         new Tika().detect(new ByteArrayInputStream(
           "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()))
       );
    }
}

When I run it, it works fine:
    java -classpath tika-core-1.2-SNAPSHOT.jar:. Test
    application/xml

Looks to me like you've managed to miss some key parts of Tika out when 
you added it to your application. I'm not sure which bits you missed, and 
how it hasn't blown up complaining, but it does seem to me that it's your 
environment that's stuffed...

Nick

Re: Problem detecting XML

Posted by "Taylor, Wade" <wt...@ptfs.com>.
Hi, I still can't get Tika to detect the right mime-type. When I use
Tika-app it returns the correct mime-type, so I dug into the source and I
can't see what's different. Since I couldn't get that to work I went back
to basics and tried a simple XML string:

new Tika().detect(new ByteArrayInputStream(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><child>text</child></root>".getBytes()));

but this gets detected as "text/plain" too and I can't figure out why it's
not coming back as "application/xml".


Regards,
Wade




On Tue, Apr 17, 2012 at 12:33 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Tue, Apr 17, 2012 at 6:06 PM, Taylor, Wade <wt...@ptfs.com> wrote:
> > Hi, thanks for the tips. I opened the XML file with a hex editor and did
> > find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF.
>
> That's the UTF-8 byte order mark. I guess Tika should be able to deal
> with that, but AFAICT it currently doesn't. Would you mind filing a
> bug report about this?
>
> > Then I went back to my code and ran it against the fixed XML file:
> >
> > new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
> >
> > but it still detects it as "text/plain".
>
> Hmm, can you verify that the returned input stream actually contains
> what you expect it to?
>
> Also, you can check the difference of how Tika detects full files
> (with the extra file name hint) and plain byte streams by comparing
> the output of the following two commands:
>
>    java -jar tika-app-1.1.jar --detect sample_fixed.wde
>    java -jar tika-app-1.1.jar --detect < sample_fixed.wde
>
> BR,
>
> Jukka Zitting
>

Re: Problem detecting XML

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Apr 17, 2012 at 6:06 PM, Taylor, Wade <wt...@ptfs.com> wrote:
> Hi, thanks for the tips. I opened the XML file with a hex editor and did
> find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF.

That's the UTF-8 byte order mark. I guess Tika should be able to deal
with that, but AFAICT it currently doesn't. Would you mind filing a
bug report about this?
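
Until that is handled inside Tika, one possible workaround is to drop the BOM
before the stream reaches detect(). The following is a minimal sketch using
only java.io; BomSkipper and skipUtf8Bom are illustrative names, not Tika API.
It peeks at the first three bytes and pushes them back unless they are exactly
0xEF 0xBB 0xBF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Tika.
final class BomSkipper {

    // Wraps the stream and consumes a leading UTF-8 BOM if one is present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // not a BOM: put the bytes back untouched
        }
        return pb;
    }
}
```

The returned stream can then be passed on in place of the original one, for
example to the same detect(InputStream) call used earlier in this thread.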

> Then I went back to my code and ran it against the fixed XML file:
>
> new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))
>
> but it still detects it as "text/plain".

Hmm, can you verify that the returned input stream actually contains
what you expect it to?
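
A quick way to do that check is to dump the first few bytes of the classpath
resource in hex. This is only a sketch: HeadDump and hexHead are made-up
names, the resource path is the one from the message above, and
try-with-resources assumes Java 7 or later. A file that starts directly with
the XML declaration should begin with 3C 3F 78 6D 6C ("<?xml"); a leading
EF BB BF means the BOM is still there.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika.
final class HeadDump {

    // Returns up to max leading bytes of the stream as space-separated hex.
    static String hexHead(InputStream in, int max) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < max; i++) {
            int b = in.read();
            if (b < 0) {
                break; // end of stream
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = HeadDump.class.getResourceAsStream("/xml/sample_fixed.wde")) {
            if (in == null) {
                System.out.println("resource not found on the classpath");
                return;
            }
            System.out.println(hexHead(in, 16));
        }
    }
}
```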

Also, you can check the difference of how Tika detects full files
(with the extra file name hint) and plain byte streams by comparing
the output of the following two commands:

    java -jar tika-app-1.1.jar --detect sample_fixed.wde
    java -jar tika-app-1.1.jar --detect < sample_fixed.wde

BR,

Jukka Zitting

Re: Problem detecting XML

Posted by "Taylor, Wade" <wt...@ptfs.com>.
Hi, thanks for the tips. I opened the XML file with a hex editor and did
find 3 control characters at the beginning: 0xEF, 0xBB, 0xBF. When I remove
them and run:

java -jar tika-app-1.1.jar --detect sample_fixed.xml

it outputs "application/xml".

I then changed the file name to sample_fixed.wde to ensure that the byte
stream was being used in the detection. Running:

java -jar tika-app-1.1.jar --detect sample_fixed.wde

also outputs "application/xml".

Then I went back to my code and ran it against the fixed XML file:

new Tika().detect(this.getClass().getResourceAsStream("/xml/sample_fixed.wde"))

but it still detects it as "text/plain".


Any idea why using Tika.detect() reports a different type than Tika-app?


Regards,
Wade


On Tue, Apr 17, 2012 at 9:50 AM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Tue, Apr 17, 2012 at 3:32 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> > I think the problem is that the detection does not see the filename. If you
> > pass an InputStream to the detection method, you should also pass metadata
> > (including the file name).
>
> Tika should have no trouble detecting XML also from just the byte stream.
>
> A typical reason why an XML document is detected as text/plain is if
> it's actually not valid XML, either because of some well-formedness
> issue (unclosed tags) or because of some extra characters like
> suggested by Nick.
>
> BR,
>
> Jukka Zitting
>

Re: Problem detecting XML

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Apr 17, 2012 at 3:32 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> I think the problem is that the detection does not see the filename. If you
> pass an InputStream to the detection method, you should also pass metadata
> (including the file name).

Tika should have no trouble detecting XML also from just the byte stream.

A typical reason why an XML document is detected as text/plain is if
it's actually not valid XML, either because of some well-formedness
issue (unclosed tags) or because of some extra characters like
suggested by Nick.
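
To rule out the well-formedness case independently of Tika, the document can
be run through the JDK's own SAX parser. This is a rough sketch:
WellFormedCheck and isWellFormed are illustrative names, and the check uses
whatever parser JAXP resolves on the classpath, which is not necessarily the
one Tika ends up using.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
final class WellFormedCheck {

    // True if the stream parses as XML without a fatal error.
    static boolean isWellFormed(InputStream in)
            throws IOException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            // DefaultHandler ignores all content and rethrows fatal errors.
            factory.newSAXParser().parse(in, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // e.g. an unclosed tag
        }
    }
}
```

Feeding it the same bytes that go to detect() shows whether the parser itself
rejects the input, for instance because of an unclosed tag.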

BR,

Jukka Zitting

RE: Problem detecting XML

Posted by Uwe Schindler <uw...@thetaphi.de>.
I think the problem is that the detection does not see the filename. If you
pass an InputStream to the detection method, you should also pass metadata
(including the file name).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Nick Burch [mailto:nick.burch@alfresco.com]
> Sent: Tuesday, April 17, 2012 3:30 PM
> To: user@tika.apache.org
> Subject: Re: Problem detecting XML
> 
> On Tue, 17 Apr 2012, Taylor, Wade wrote:
> > Hi, I'm having trouble detecting a file as "application/xml". When I
> > detect a URL containing XML the detection works and I get
> > "application/xml" as the media type.
> 
> Hmm, that's odd. I've taken your sample xml, popped it in a new file, and run
> "java -jar tika-app-1.2-SNAPSHOT.jar --detect sample.xml" and it came back
> with "application/xml"
> 
> I'd suggest you double check with the Tika App, and also ensure there's no
> whitespace or control characters at the start of your file before the xml
> declaration
> 
> Nick


Re: Problem detecting XML

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 17 Apr 2012, Taylor, Wade wrote:
> Hi, I'm having trouble detecting a file as "application/xml". When I 
> detect a URL containing XML the detection works and I get 
> "application/xml" as the media type.

Hmm, that's odd. I've taken your sample xml, popped it in a new file, and 
run "java -jar tika-app-1.2-SNAPSHOT.jar --detect sample.xml" and it came 
back with "application/xml"

I'd suggest you double check with the Tika App, and also ensure there's no 
whitespace or control characters at the start of your file before the xml 
declaration

Nick