Posted to dev@tika.apache.org by "Jonathan Koren (JIRA)" <ji...@apache.org> on 2009/06/24 23:25:07 UTC

[jira] Updated: (TIKA-251) package parser ignoring tika-config.xml

     [ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-251:
--------------------------------

    Priority: Minor  (was: Major)

> package parser ignoring tika-config.xml 
> ----------------------------------------
>
>                 Key: TIKA-251
>                 URL: https://issues.apache.org/jira/browse/TIKA-251
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Jonathan Koren
>            Priority: Minor
>
> I created my own ContentHandler and a parser, XmlParser, that echo the DOM tree of the XML file being parsed.  I modified tika-config.xml so that AutoDetectParser calls this parser for XML files:
>        <parser name="parse-xml" class="XmlParser">
>                <mime>application/xml</mime>
>        </parser>
> If Tika parses an XML file directly, the right thing happens:
> 	resourceName: 1001281.xml
> ComplexIndexerTaskThread()
> 	XmlParser Begins
> 	SCH: start document
> 	SCH: start element nitf
> 	SCH: a: change.date=June 10, 2005
> 	SCH: a: change.time=19:30
> 	SCH: a: version=-//IPTC//DTD NITF 3.3//EN
> 	SCH: start element head
> 	SCH: start element title
> 	Apprentices Sample Life Of Doctors In Villages
> 	SCH: end element title
> 	SCH: start element meta
> 	SCH: a: content=Y11DOC$01
> 	SCH: a: name=slug
> and so on for the fragment:
> 	<?xml version="1.0" encoding="UTF-8"?>
> 	<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
> 	<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
> 	<head>
> 	<title>Apprentices Sample Life Of Doctors In Villages</title>
> 	<meta content="Y11DOC$01" name="slug"/>
> Now, if I put this XML file inside a gzipped tar file, my XmlParser isn't called.  Instead, the file is somehow converted to plain text, which is not correct.  Example output:
> 	fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
> 	resourceName: aaa.tar.gz
> 	ComplexIndexerTaskThread()
> 	SCH: start document
> 	SCH: start element html
> 	SCH: start element head
> 	SCH: start element title
> 	SCH: end element title
> 	SCH: end element head
> 	SCH: start element body
> 	SCH: start element div
> 	SCH: a: class=package-entry
> 	SCH: subfile 1 detected!
> 	SCH: start element h1
> 	aaa.tar
> 	SCH: subfile 1's name is aaa.tar
> 	SCH: end element h1
> 	SCH: start element div
> 	SCH: a: class=package-entry
> 	SCH: subfile 2 detected!
> 	SCH: start element h1
> 	1001281.xml
> 	SCH: subfile 2's name is 1001281.xml
> 	SCH: end element h1
> 	SCH: start element p
>    Apprentices Sample Life Of Doctors In Villages
> and so on.
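
For readers trying to reproduce the "SCH:" traces quoted above, an echoing handler of this kind can be sketched with the JDK's SAX API alone. This is not the reporter's code, which the issue does not include: the class name EchoHandler, the helper echo(), and the sample XML string are assumptions made for illustration, and the sketch drives the handler through the JDK's SAXParser rather than through Tika's AutoDetectParser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the handler described in the report;
// the reporter's actual class is not shown in the issue.
class EchoHandler extends DefaultHandler {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void startDocument() {
        lines.add("SCH: start document");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        lines.add("SCH: start element " + qName);
        // One "a:" line per attribute, mirroring the trace format in the report.
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only chunks so indentation does not clutter the trace.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            lines.add(text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lines.add("SCH: end element " + qName);
    }

    List<String> lines() {
        return lines;
    }

    // Parses an XML string with the JDK's default SAX parser and returns the trace.
    static List<String> echo(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, loosely shaped like the NITF fragment in the report.
        String xml = "<nitf version=\"3.3\"><head><title>Sample</title></head></nitf>";
        for (String line : echo(xml)) {
            System.out.println(line);
        }
    }
}
```

Seen through a handler like this, the two traces differ in the root element: the direct parse starts with nitf and its attributes, while the tar.gz parse starts with html and delivers the entry's content as text inside p elements.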

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.