Posted to user@tika.apache.org by ph...@free.fr on 2013/07/08 17:57:01 UTC
Writing a custom Tika parser based on an XML schema
Hello,
I would like to write a custom Tika Parser based on an XML Schema (by "based" I mean "which uses the attributes defined in that file").
The files which I would like to parse have the following structure:
----------------------------------
<?xml version="1.0" encoding="ISO-8859-1" ?>
<article artName="abc" _Title="Maastricht" _Byline="By abc" _Dateline="abc" _CategoryId="news" _SourceId="Others" _AltSource="newspaper" Summary="An abc businessman ...">
<head>
<clipsHead>New York</clipsHead>
</head>
<body>
<dateline>
<txt it="No" bd="Yes">New York</txt>
</dateline>
<byline>By L Z</byline>
<credit>Newspaper</credit>
........................
----------------------------------
and here is the structure of the XML Schema file:
----------------------------------
<Schema name="GN3" xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes" xmlns:gn3="urn:schemas-teradp-com:gn3">
<!-- METADATA 2.0 - ATTRIBUTES -->
<AttributeType
name="_UID"
required="no"
dt:type="string"
gn3:label="Reference UID:"
default=""
/>
<AttributeType
name="_Priority"
required="no"
dt:type="i4"
gn3:label="Priority:"
default="2"
/>
<AttributeType
name="_Byline"
required="no"
dt:type="string"
gn3:style="multiLine"
gn3:label="Byline:"
default=""
/>
......
----------------------------------
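To make the idea more concrete, here is a minimal sketch of the schema-driven step, using only the JDK DOM API and no Tika classes. It assumes that the schema is in the XDR format suggested by its namespace, that every attribute of interest is declared as an AttributeType element, and that the values sit on the root <article> element. The class and method names (SchemaDrivenExtractor, readAttributeNames, extract) are hypothetical and only serve as illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the attribute names declared in the schema,
// then reads the matching attributes from the root element of an article.
class SchemaDrivenExtractor {

    // Default namespace declared on the <Schema> element in the post.
    static final String XDR_NS = "urn:schemas-microsoft-com:xml-data";

    static Document parseXml(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
        return factory.newDocumentBuilder().parse(in);
    }

    // Returns the "name" of every <AttributeType> in document order.
    static Set<String> readAttributeNames(InputStream schema) throws Exception {
        NodeList types = parseXml(schema).getElementsByTagNameNS(XDR_NS, "AttributeType");
        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < types.getLength(); i++) {
            names.add(((Element) types.item(i)).getAttribute("name"));
        }
        return names;
    }

    // Keeps only the root attributes that the schema declares.
    static Map<String, String> extract(InputStream article, Set<String> names) throws Exception {
        Element root = parseXml(article).getDocumentElement();
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : names) {
            if (root.hasAttribute(name)) {
                values.put(name, root.getAttribute(name));
            }
        }
        return values;
    }

    static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-ins for the two files shown above.
        String schema = "<Schema name=\"GN3\" xmlns=\"urn:schemas-microsoft-com:xml-data\""
                + " xmlns:dt=\"urn:schemas-microsoft-com:datatypes\">"
                + "<AttributeType name=\"_Priority\" required=\"no\" dt:type=\"i4\" default=\"2\"/>"
                + "<AttributeType name=\"_Byline\" required=\"no\" dt:type=\"string\" default=\"\"/>"
                + "</Schema>";
        String article = "<article artName=\"abc\" _Title=\"Maastricht\" _Byline=\"By abc\">"
                + "<body><byline>By L Z</byline></body></article>";

        Set<String> names = readAttributeNames(stream(schema));
        System.out.println(extract(stream(article), names));
    }
}
```

Inside a Tika Parser, the resulting name/value pairs could presumably be copied into the Metadata object passed to parse(...); that part is not shown here, and I am not sure it is the recommended approach.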
Any suggestions would be greatly appreciated.
Philippe