Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/06/24 23:21:27 UTC
package parser ignoring tika-config.xml
I created my own ContentHandler and an XmlParser that echoes out the
DOM tree of the XML file being parsed. I modified tika-config.xml so
that AutoDetectParser will call this parser for XML files:
<parser name="parse-xml" class="XmlParser">
<mime>application/xml</mime>
</parser>
If Tika parses an XML file directly, the right thing happens:
resourceName: 1001281.xml
ComplexIndexerTaskThread()
XmlParser Begins
SCH: start document
SCH: start element nitf
SCH: a: change.date=June 10, 2005
SCH: a: change.time=19:30
SCH: a: version=-//IPTC//DTD NITF 3.3//EN
SCH: start element head
SCH: start element title
Apprentices Sample Life Of Doctors In Villages
SCH: end element title
SCH: start element meta
SCH: a: content=Y11DOC$01
SCH: a: name=slug
and so on for the fragment:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
<head>
<title>Apprentices Sample Life Of Doctors In Villages</title>
<meta content="Y11DOC$01" name="slug"/>
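For anyone who wants to reproduce log output of this shape, here is a
minimal sketch of an echoing SAX handler. It is not my actual handler:
the class name EchoContentHandler and the echo() helper are made up
for illustration, it uses only the JDK's SAX parser rather than Tika,
and the sample input omits the DOCTYPE so that no DTD is fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an echoing handler; names are illustrative only.
class EchoContentHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startDocument() {
        out.append("SCH: start document\n");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append("SCH: start element ").append(qName).append('\n');
        // Echo each attribute as name=value, one per line.
        for (int i = 0; i < atts.getLength(); i++) {
            out.append("SCH: a: ").append(atts.getQName(i))
               .append('=').append(atts.getValue(i)).append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only runs so indentation does not clutter the log.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            out.append(text).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("SCH: end element ").append(qName).append('\n');
    }

    String log() {
        return out.toString();
    }

    // Parses an XML string with the JDK's default SAX parser and returns the echoed log.
    static String echo(String xml) throws Exception {
        EchoContentHandler handler = new EchoContentHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.log();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(echo("<nitf version=\"3.3\"><head><title>Sample</title></head></nitf>"));
    }
}
```

The handler only records SAX events, so the same class could in
principle be handed to any parser that emits them.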
Now, if I put this XML file inside a gzipped tar file, my XmlParser
isn't called. Instead, the file is somehow converted to plain text,
which is not correct. Example output:
fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
resourceName: aaa.tar.gz
ComplexIndexerTaskThread()
SCH: start document
SCH: start element html
SCH: start element head
SCH: start element title
SCH: end element title
SCH: end element head
SCH: start element body
SCH: start element div
SCH: a: class=package-entry
SCH: subfile 1 detected!
SCH: start element h1
aaa.tar
SCH: subfile 1's name is aaa.tar
SCH: end element h1
SCH: start element div
SCH: a: class=package-entry
SCH: subfile 2 detected!
SCH: start element h1
1001281.xml
SCH: subfile 2's name is 1001281.xml
SCH: end element h1
SCH: start element p
Apprentices Sample Life Of Doctors In Villages
and so on.
Why is PackageParser ignoring the configuration in tika-config.xml?
This shouldn't be the defined behavior. If a user has configured Tika
to handle certain MIME types specially, then files matching those MIME
types should be handled that way wherever they are found. I suspect
the problem lies in how MIME types are detected.
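To make the behavior I'm asking for concrete, here is a rough model of
it. This is not Tika's implementation and none of these names are Tika
APIs: DispatchSketch, EntryParser, detect() and zipOf() are all
hypothetical. It also uses a ZIP archive as a stand-in for the gzipped
tar, because the JDK ships no tar reader, and it detects types by file
extension only. The point is just that the package parser sends each
member back through the same configured lookup used for top-level
files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical model of config-driven dispatch; none of these names are Tika APIs.
class DispatchSketch {
    interface EntryParser {
        void parse(String name, byte[] data, List<String> log) throws IOException;
    }

    // Stands in for the MIME-type-to-parser mappings declared in tika-config.xml.
    private final Map<String, EntryParser> parsersByMime = new HashMap<>();

    DispatchSketch() {
        parsersByMime.put("application/xml", (name, data, log) -> log.add("XmlParser: " + name));
        parsersByMime.put("application/zip", this::parsePackage);
    }

    // Toy detection by file extension only (an assumption made for brevity).
    static String detect(String name) {
        if (name.endsWith(".xml")) return "application/xml";
        if (name.endsWith(".zip")) return "application/zip";
        return "text/plain";
    }

    // Single entry point, used for top-level files and for archive members alike.
    void parse(String name, byte[] data, List<String> log) throws IOException {
        EntryParser parser = parsersByMime.getOrDefault(detect(name),
                (n, d, l) -> l.add("plain text: " + n));
        parser.parse(name, data, log);
    }

    // Hands each member back to parse(), so the configured mapping applies inside archives.
    private void parsePackage(String name, byte[] data, List<String> log) throws IOException {
        log.add("package: " + name);
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                // readAllBytes() (Java 9+) reads up to the end of the current entry.
                parse(e.getName(), zip.readAllBytes(), log);
            }
        }
    }

    // Builds a one-entry ZIP archive in memory for the demo.
    static byte[] zipOf(String entryName, String content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> log = new ArrayList<>();
        new DispatchSketch().parse("aaa.zip", zipOf("1001281.xml", "<nitf/>"), log);
        log.forEach(System.out::println);
    }
}
```

In this model the XML member reaches the configured XML parser rather
than a plain-text fallback, which is what I expected to see for
1001281.xml inside aaa.tar.gz.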
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/