Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/06/25 21:03:07 UTC
[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml
[ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724199#action_12724199 ]
Jukka Zitting commented on TIKA-251:
------------------------------------
The package parser might not be picking up your custom configuration. Are you using a recent version from trunk?
See TIKA-238, which should fix the issue of PackageParser always using the default Tika configuration.
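To illustrate the kind of problem described here, the following is a minimal, self-contained Java sketch. It does not use Tika's actual classes or API: ToyConfig, ToyParser, and the two parseEntry methods are hypothetical names invented for illustration. It only models the general pattern of a container parser that looks up entry parsers in a default configuration instead of the one the caller supplied.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigPropagationSketch {

    /** Hypothetical stand-in for a parser: maps an entry name to a label. */
    interface ToyParser {
        String parse(String entryName);
    }

    /** Hypothetical stand-in for a configuration: MIME type -> parser. */
    static class ToyConfig {
        private final Map<String, ToyParser> parsers = new HashMap<>();

        /** A default configuration that treats XML as plain text. */
        static ToyConfig defaults() {
            ToyConfig config = new ToyConfig();
            config.parsers.put("application/xml", name -> "default-text:" + name);
            return config;
        }

        /** Registers (or overrides) the parser for a MIME type. */
        ToyConfig with(String mime, ToyParser parser) {
            parsers.put(mime, parser);
            return this;
        }

        ToyParser parserFor(String mime) {
            return parsers.get(mime);
        }
    }

    /** Models the reported behaviour: the caller's configuration is ignored. */
    static String parseEntryIgnoringConfig(ToyConfig callerConfig, String entry) {
        return ToyConfig.defaults().parserFor("application/xml").parse(entry);
    }

    /** Models the intended behaviour: the caller's configuration is used. */
    static String parseEntryWithConfig(ToyConfig callerConfig, String entry) {
        return callerConfig.parserFor("application/xml").parse(entry);
    }

    public static void main(String[] args) {
        ToyConfig custom = ToyConfig.defaults()
                .with("application/xml", name -> "custom-xml:" + name);

        // prints "default-text:1001281.xml"
        System.out.println(parseEntryIgnoringConfig(custom, "1001281.xml"));
        // prints "custom-xml:1001281.xml"
        System.out.println(parseEntryWithConfig(custom, "1001281.xml"));
    }
}
```

In this toy model, the first call corresponds to an archive entry being handled by the default parser even though a custom one was registered, and the second to the configuration being passed through to nested entries.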
> package parser ignoring tika-config.xml
> ----------------------------------------
>
> Key: TIKA-251
> URL: https://issues.apache.org/jira/browse/TIKA-251
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.4
> Reporter: Jonathan Koren
> Priority: Minor
>
> I created my own ContentHandler and XmlParser, which echo the DOM tree of the XML file being parsed. I modified tika-config.xml so that AutoDetectParser calls this parser for XML files:
> <parser name="parse-xml" class="XmlParser">
> <mime>application/xml</mime>
> </parser>
> If Tika parses an XML file directly, the right thing is done:
> resourceName: 1001281.xml
> ComplexIndexerTaskThread()
> XmlParser Begins
> SCH: start document
> SCH: start element nitf
> SCH: a: change.date=June 10, 2005
> SCH: a: change.time=19:30
> SCH: a: version=-//IPTC//DTD NITF 3.3//EN
> SCH: start element head
> SCH: start element title
> Apprentices Sample Life Of Doctors In Villages
> SCH: end element title
> SCH: start element meta
> SCH: a: content=Y11DOC$01
> SCH: a: name=slug
> and so on for the fragment:
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
> <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
> <head>
> <title>Apprentices Sample Life Of Doctors In Villages</title>
> <meta content="Y11DOC$01" name="slug"/>
> Now, if I put this XML file inside a gzipped tar file, my XmlParser isn't called. Instead, the file is somehow converted to plain text, which is not correct. Example output:
> fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
> resourceName: aaa.tar.gz
> ComplexIndexerTaskThread()
> SCH: start document
> SCH: start element html
> SCH: start element head
> SCH: start element title
> SCH: end element title
> SCH: end element head
> SCH: start element body
> SCH: start element div
> SCH: a: class=package-entry
> SCH: subfile 1 detected!
> SCH: start element h1
> aaa.tar
> SCH: subfile 1's name is aaa.tar
> SCH: end element h1
> SCH: start element div
> SCH: a: class=package-entry
> SCH: subfile 2 detected!
> SCH: start element h1
> 1001281.xml
> SCH: subfile 2's name is 1001281.xml
> SCH: end element h1
> SCH: start element p
> Apprentices Sample Life Of Doctors In Villages
> and so on.