Posted to user@tika.apache.org by ph...@free.fr on 2013/07/08 17:57:01 UTC

Writing a custom Tika parser based on an XML schema

Hello,

I would like to write a custom Tika parser based on an XML Schema (by "based on" I mean a parser that uses the attributes declared in that file).

The files I would like to parse have the following structure:

----------------------------------

<?xml version="1.0" encoding="ISO-8859-1" ?>
<article artName="abc" _Title="Maastricht" _Byline="By abc" _Dateline="abc" _CategoryId="news" _SourceId="Others" _AltSource="newspaper" Summary="An abc businessman ...">
<head>
<clipsHead>New York</clipsHead>
</head>
<body>
<dateline>
<txt it="No" bd="Yes">New York</txt>
</dateline>
<byline>By L Z</byline>
<credit>Newspaper</credit>

........................

----------------------------------

And here is the structure of the XML Schema file:

----------------------------------

<Schema name="GN3" xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes" xmlns:gn3="urn:schemas-teradp-com:gn3">

	<!-- METADATA 2.0 - ATTRIBUTES -->

	<AttributeType
		name="_UID"
		required="no"
		dt:type="string"
		gn3:label="Reference UID:"
		default=""
	/>

	<AttributeType
		name="_Priority"
		required="no"
		dt:type="i4"
		gn3:label="Priority:"
		default="2"
	/>

	<AttributeType
		name="_Byline"
		required="no"
		dt:type="string"
		gn3:style="multiLine"
		gn3:label="Byline:"
		default=""
	/>

......

----------------------------------
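
To make the request more concrete, here is a minimal sketch of the kind of processing meant by "uses the attributes in that file". It relies only on the JDK's DOM API, not on Tika itself. The class and method names (SchemaAttributes, attributeNames, extract) are hypothetical, and it assumes the attributes of interest sit on the root <article> element, as in the sample above.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Tika, just an illustration of the idea.
final class SchemaAttributes {

    // Default namespace declared on the <Schema> element shown above.
    static final String XDR_NS = "urn:schemas-microsoft-com:xml-data";

    // Parses an XML string into a namespace-aware DOM document.
    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Collects the name="..." value of every <AttributeType> in the schema.
    static Set<String> attributeNames(String schemaXml) {
        Set<String> names = new LinkedHashSet<>();
        NodeList types = parse(schemaXml)
                .getElementsByTagNameNS(XDR_NS, "AttributeType");
        for (int i = 0; i < types.getLength(); i++) {
            String name = ((Element) types.item(i)).getAttribute("name");
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Returns the schema-declared attributes present on the article's root element.
    static Map<String, String> extract(String articleXml, Set<String> names) {
        Element root = parse(articleXml).getDocumentElement();
        Map<String, String> found = new LinkedHashMap<>();
        for (String name : names) {
            if (root.hasAttribute(name)) {
                found.put(name, root.getAttribute(name));
            }
        }
        return found;
    }
}
```

In a real Tika parser, something like extract(...) would presumably run inside Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), with each name/value pair copied into the Metadata object via Metadata.set(name, value). How best to wire that up is exactly what I am unsure about.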

Any suggestions would be greatly appreciated.

Philippe