Posted to users@cocoon.apache.org by Anna Afonchenko <an...@ubaccess.com> on 2003/11/03 14:55:19 UTC

Transform PDF to XML/XHTML

Hi all.
I need to transform a PDF file to XML (XHTML) format.
I saw an example in Cocoon of doing the opposite, i.e.
XML->PDF using XSL-FO.

Is there a similar way of making PDF2XML transformation too
or do I need to write my own Transformer (or maybe Generator)?
I am using Cocoon 2.0.4.

Thanks in advance for your help.

Anna

Re: Transform PDF to XML/XHTML

Posted by Andrzej Jan Taramina <an...@chaeron.com>.
Anna:

> I need to transform a PDF file to XML (XHTML) format.
> I saw an example in Cocoon of doing the opposite, i.e.
> XML->PDF using XSL-FO.

There probably is a way to do this....but it's a bit involved.

There is a commercial software package available that will convert a PDF back 
into a Word document.  I don't remember who sells it....ping me privately later 
(when I am back in the office) and I'll tell you where to find it.  It's about 
$50.

You could use this tool to get into Word .doc format, then use Word or 
something similar to convert this .doc into RTF (older Word versions) or XML 
(Office 2003)....then you have clear text that you can process into XHTML.

Ugly...and would take a while to put in place, but doable.
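
As a rough sketch of that last step only (clear text to XHTML), assuming the text has already been pulled out of the PDF by some external tool and that paragraphs are separated by blank lines. The function name text_to_xhtml is made up for illustration and is not part of Cocoon or of any converter; the sketch is in Python rather than as a Cocoon component:

```python
import re
from xml.sax.saxutils import escape

XHTML_NS = "http://www.w3.org/1999/xhtml"


def text_to_xhtml(text, title="Converted document"):
    """Wrap plain text in a minimal XHTML document, one <p> per paragraph.

    Hypothetical helper: assumes paragraphs are separated by blank lines,
    which may not hold for text extracted from every PDF.
    """
    # Split on one or more blank lines (also tolerates \r\n line endings).
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace inside a paragraph.
        collapsed = " ".join(block.split())
        if collapsed:
            # Escape &, < and > so the result stays well-formed XML.
            paragraphs.append("    <p>%s</p>" % escape(collapsed))
    lines = [
        '<html xmlns="%s">' % XHTML_NS,
        "  <head><title>%s</title></head>" % escape(title),
        "  <body>",
    ]
    lines.extend(paragraphs)
    lines.extend(["  </body>", "</html>"])
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "First line\ncontinues here.\n\nA & B < C"
    print(text_to_xhtml(sample, title="Sample"))
```

This recovers paragraphs only; headings, lists and tables would not survive, since that structure is already gone by the time you have clear text.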

....Andrzej

Chaeron Corporation
http://www.chaeron.com


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Transform PDF to XML/XHTML

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Monday, 3 Nov 2003, at 15:22 Europe/Zurich, alex@OWAL.co.uk wrote:

> ...There are some tools which can possibly extract the plain text from
> a PDF file but that has nothing to do with Cocoon...

Note that some tools (recent versions of Acrobat Distiller AFAIK) allow 
"tagged PDF" to be generated, which should allow much more structure to 
be extracted than "plain" PDF.

But as you say, there's nothing in Cocoon today to do this.

-Bertrand




Re: Transform PDF to XML/XHTML

Posted by al...@OWAL.co.uk.
anna@ubaccess.com wrote:
> Hi all.
> I need to transform a PDF file to XML (XHTML) format.
> I saw an example in Cocoon of doing the opposite, i.e.
> XML->PDF using XSL-FO.

That is what XSL-FO is for - generating page descriptions like PDF from XML.

> Is there a similar way of making PDF2XML transformation too

It isn't for that. There is no generic tool which does that - 
not even a PDF->XSLFO converter as far as I know.

There are some tools which can possibly extract the plain text from 
a PDF file but that has nothing to do with Cocoon.

PS - Is this question in the FAQ?

Good luck

Alex



