Posted to users@cocoon.apache.org by Rajesh <ra...@smartdot.net> on 2002/09/03 15:50:28 UTC

Cocoon : Content Transformation from WORD/PDF/EXCEL to XML

Hi, 
 
I need to convert hundreds of unstructured documents in
Word/PDF/TXT/email formats
into a structured repository that holds XML metadata for each document
along with the documents themselves.
 
I need to parse each of these documents and extract the relevant
information to build an XML metadata
document for each one.
 
The structured XML metadata for each underlying document will contain
fields such as Keywords, Category, Doc Name,
Author, etc.
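
To make the target concrete, here is a minimal sketch of one such
metadata record. The element names, the build_metadata helper and the
sample values are illustrative assumptions only, not an agreed schema,
and Python is used just to keep the sketch short; the same structure
could be produced from Java.

```python
import xml.etree.ElementTree as ET


def build_metadata(doc_name, author, category, keywords):
    """Build one XML metadata record for a single source document.

    The element names used here are hypothetical placeholders.
    """
    root = ET.Element("document")
    ET.SubElement(root, "docName").text = doc_name
    ET.SubElement(root, "author").text = author
    ET.SubElement(root, "category").text = category
    # Keywords are repeated, so group them under one parent element.
    keywords_el = ET.SubElement(root, "keywords")
    for word in keywords:
        ET.SubElement(keywords_el, "keyword").text = word
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(build_metadata("report.doc", "rajesh", "Finance", ["budget", "forecast"]))
```

The values passed in would come from whatever extraction step reads the
Word/PDF/TXT/email source; that step is not shown here.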
 
Is it possible to use Cocoon and/or POI to do this? If so, how would I
use Cocoon to do the extraction?
 
I am new to Cocoon and am still trying to understand the world of
transformers, generators, etc.
 
Also, could I use Lucene to index the XML documents and build a search
engine around them?
 
I would like to know what approaches are possible for this.
 
regards
 
rajesh.