Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/06/24 23:21:27 UTC

package parser ignoring tika-config.xml

I created my own ContentHandler and XmlParser that echo out the DOM tree
of the XML file being parsed.  I modified tika-config.xml so that
AutoDetectParser will call this parser for XML files:

         <parser name="parse-xml" class="XmlParser">
                 <mime>application/xml</mime>
         </parser>

If Tika parses an XML file directly, the right thing is done:

	resourceName: 1001281.xml
ComplexIndexerTaskThread()
	XmlParser Begins
	SCH: start document
	SCH: start element nitf
	SCH: a: change.date=June 10, 2005
	SCH: a: change.time=19:30
	SCH: a: version=-//IPTC//DTD NITF 3.3//EN
	SCH: start element head
	SCH: start element title
	Apprentices Sample Life Of Doctors In Villages
	SCH: end element title
	SCH: start element meta
	SCH: a: content=Y11DOC$01
	SCH: a: name=slug

and so on for the fragment:

	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
	<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
	<head>
	<title>Apprentices Sample Life Of Doctors In Villages</title>
	<meta content="Y11DOC$01" name="slug"/>
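
For readers who want to reproduce output of this shape, here is a minimal sketch of an echoing SAX handler. It uses only the JDK's own SAX parser, not Tika. The class names EchoHandler and EchoSaxDemo are hypothetical and the "SCH:" prefix is copied from the log above; the real handler behind that log is not shown in this message, so this is a guess at its general shape. The sample input omits the DOCTYPE so that the parser does not try to fetch the external DTD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records one line per SAX event, in the style of the "SCH:" log.
class EchoHandler extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startDocument() {
        lines.add("SCH: start document");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        lines.add("SCH: start element " + qName);
        // One line per attribute, as name=value.
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; each non-blank chunk is recorded.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            lines.add(text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lines.add("SCH: end element " + qName);
    }
}

class EchoSaxDemo {
    // Parses the given XML string and returns the recorded event lines.
    static List<String> echo(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nitf change.time=\"19:30\"><head>"
                + "<title>Apprentices Sample Life Of Doctors In Villages</title>"
                + "</head></nitf>";
        for (String line : echo(xml)) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines in a list rather than printing them directly makes the handler easy to compare across the two cases described in this message (direct parse versus parse from inside an archive).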


Now, if I put this XML file inside a gzipped tar file, my XmlParser
isn't called.  Instead, the file is somehow converted to plain text,
which is not correct.  Example output:

	fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
	resourceName: aaa.tar.gz
	ComplexIndexerTaskThread()
	SCH: start document
	SCH: start element html
	SCH: start element head
	SCH: start element title

	SCH: end element title

	SCH: end element head
	SCH: start element body
	SCH: start element div
	SCH: a: class=package-entry
	SCH: subfile 1 detected!
	SCH: start element h1
	aaa.tar
	SCH: subfile 1's name is aaa.tar

	SCH: end element h1
	SCH: start element div
	SCH: a: class=package-entry
	SCH: subfile 2 detected!
	SCH: start element h1
	1001281.xml
	SCH: subfile 2's name is 1001281.xml

	SCH: end element h1
	SCH: start element p


     Apprentices Sample Life Of Doctors In Villages


and so on.

Why is PackageParser ignoring the configuration in tika-config.xml?
This shouldn't be the defined behavior.  If a user has configured Tika
to handle certain mimetypes specially, then files matching those
mimetypes should be handled specially wherever they are found.  I
suspect that the problem lies in how mimetypes are detected.
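
To make the expected behavior concrete, here is a small sketch of what "handled specially wherever the file is found" could look like: every entry inside a package goes through the same mimetype-to-parser lookup as a top-level file. This is not Tika's API or its actual implementation. The class PackageDispatchSketch, the PARSERS map, and the detect() helper are all hypothetical, detection is a crude file-name check, and a ZIP archive stands in for the gzipped tar because the JDK has no tar reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PackageDispatchSketch {
    // Hypothetical registry mirroring the <parser>/<mime> entry in the configuration.
    static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("application/xml", "XmlParser");
    }

    // Crude stand-in for real mimetype detection: looks at the entry name only.
    static String detect(String name) {
        return name.endsWith(".xml") ? "application/xml" : "text/plain";
    }

    // Walks a ZIP archive and records which configured parser each entry would get.
    static List<String> dispatch(byte[] zipBytes) throws IOException {
        List<String> chosen = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String parser = PARSERS.getOrDefault(detect(e.getName()), "plain-text fallback");
                chosen.add(e.getName() + " -> " + parser);
            }
        }
        return chosen;
    }

    // Builds a two-entry archive in memory so the sketch needs no files on disk.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("1001281.xml"));
            zip.write("<nitf/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("notes.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String line : dispatch(sampleZip())) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is only that the lookup for an archive entry is the same lookup used for a standalone file, so a configured XmlParser would be chosen for 1001281.xml in both cases.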


--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/