Posted to general@xml.apache.org by Pa...@wdr.com on 2000/01/27 15:38:13 UTC

PDF to XML

     Does anyone know of any other PDF to XML converters?
     
     I'm currently looking at stuff from ReachCast.
     
     
     thanks in advance
     ======================================================================


This message contains confidential information and is intended only 
for the individual named.  If you are not the named addressee you 
should not disseminate, distribute or copy this e-mail.  Please 
notify the sender immediately by e-mail if you have received this 
e-mail by mistake and delete this e-mail from your system.

E-mail transmission cannot be guaranteed to be secure or error-free 
as information could be intercepted, corrupted, lost, destroyed, 
arrive late or incomplete, or contain viruses.  The sender therefore 
does not accept liability for any errors or omissions in the contents 
of this message which arise as a result of e-mail transmission.  If 
verification is required please request a hard-copy version.  This 
message is provided for informational purposes and should not be 
construed as a solicitation or offer to buy or sell any securities or 
related financial instruments.


Re: PDF to XML - LOL!

Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Dan Morrison wrote:
> 
> What's next?
> 
> GIF -> XML?

Hahahaha :)

> I've seen JPEG -> ASCII but c'mon...

I come from the old BBS world (when modems at 2400 bps were a luxury!
V24bis rulez! and when HTML was just a random association of four
consonants :)... We had such nice stuff to convert images to ANSI :)

> Sorry for my noise but this is just too funny...

It's funny, but it's the way it is... Almost nobody out there knows
exactly what XML is, but it's one of the most powerful TLAs in the
marketing world today...

Just look at the difference between:
	I make HTML pages
and
	I make XML pages
(meaning XHTML :) There's no difference, but if you use the second
phrase, you can charge 50% more :)

	Pier

-- 
--------------------------------------------------------------------
-          P              I              E              R          -
stable structure erected over water to allow the docking of seacraft
<ma...@betaversion.org>    <http://www.betaversion.org/~pier/>
--------------------------------------------------------------------
- ApacheCON Y2K: Come to the official Apache developers conference -
-------------------- <http://www.apachecon.com> --------------------

Re: PDF to XML - LOL!

Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Dan Morrison wrote:
> 
> In my experience PDF (with its eye on a completely different ball)
> tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
> document even more.

It doesn't try to obfuscate anything... It's just a good graphical
representation of a page layout... It's really good at preserving graphic
ideas...

	Pier


Re: PDF to XML - LOL!

Posted by Pa...@wdr.com.
     It has been really interesting looking at these threads on this
     particular item, and it gives me another perspective on PDF -> XML.
     
     My perspective in posting the item was that this system has legacy
     docs in PDF, and from an architectural standpoint, if I can get
     them into XML then I can react to the business a lot quicker.
     
     Really all I want to do is put together a framework where the PDF
     docs can be mixed with associated data from other systems and then
     served to the relevant user service, e.g. WWW, WAP, B2B, PDA,
     eBook (?), other messaging systems, or anything else that comes along.
     
     Whatever I build now should not be a quick fix that gets PDF mixed
     in with some other stuff just to deliver to the WWW.
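
A rough sketch of what such a framework could look like, assuming the PDF is referenced rather than converted. The element names (`document`, `source`, `meta`, `field`), the channel names, and the per-channel output are all made up for illustration and do not come from any existing product:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_envelope(pdf_path, metadata):
    """Wrap a reference to a legacy PDF and its associated data in one XML document."""
    doc = ET.Element("document")
    # Point at the PDF instead of trying to convert its contents.
    ET.SubElement(doc, "source", {"type": "application/pdf", "href": pdf_path})
    meta = ET.SubElement(doc, "meta")
    for name, value in sorted(metadata.items()):
        ET.SubElement(meta, "field", {"name": name}).text = str(value)
    return doc


def render(doc, channel):
    """Produce a channel-specific view from the same envelope."""
    href = doc.find("source").get("href")
    # Fall back to the file path when no title field was supplied.
    title = doc.findtext("meta/field[@name='title']", default=href)
    if channel == "www":
        return "<a href=%s>%s</a>" % (quoteattr(href), escape(title))
    if channel == "wap":
        # Assumption: this channel cannot display PDF, so send the title only.
        return "<card><p>%s</p></card>" % escape(title)
    raise ValueError("unknown channel: %s" % channel)
```

The point of the design is that a new delivery channel only needs a new branch in `render`; the envelope and the legacy PDFs stay untouched.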
     
     To pick up a question in Dan's note: I think I might be able to get
     the source of a few documents, but I would like to point out that we
     are talking about tens of thousands of documents in this particular
     case. :-( Not good.
     
     
     thanks Paul






RE: PDF to XML - LOL!

Posted by Philipp Knirck <ph...@maas.de>.
If you care about AFP --> PDF:
MAAS High Tech has developed an AFP2Web converter which does just that.

www.afp2web.de

Check it out!


Best regards

Philipp Knirck

MAAS High Tech Software GmbH
Siemensweg 4
70794 Filderstadt - Bonlanden
Tel.  0711 - 77 91 7(0) - 39

Mobile 0177 - 34 02 113

Email: mailto:phil@maas.de



Re: PDF to XML - LOL!

Posted by Dan Morrison <dm...@es.co.nz>.
Pierpaolo Fumagalli wrote:
> ... You can't "recontextualize"
> information that was extracted from its context...

Indeed.
I accept that someone may take it upon themselves to inline a
representation of binary or proprietary data (I still think of PDF
as proprietary, in comparison to XML anyway).
I guess you're welcome to introduce a <UUENCODE> block or whatever
suits.
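
For illustration, a minimal sketch of inlining binary data in XML. It uses base64 rather than uuencode, and the `binary` element name and its attributes are invented for the example:

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(data, media_type):
    """Return an XML element carrying `data` as base64 text."""
    elem = ET.Element("binary", {"type": media_type, "encoding": "base64"})
    elem.text = base64.b64encode(data).decode("ascii")
    return elem


def extract_binary(elem):
    """Recover the original bytes from an element built by embed_binary."""
    return base64.b64decode(elem.text)
```

The bytes survive the round trip, but the XML around them says nothing about what the document contains, which is the limitation being discussed.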

The thing is, it's a bit beyond XML translators (at the moment) to look
at a magazine page and break it up into its constituent bits with
meaningful tag names. Heck even translating from Word->HTML is a mess
unless the original has been crafted using style templates 100% of the
time. In my experience PDF (with its eye on a completely different ball)
tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
document even more. 

Honestly, if you really need to proceed in this direction, the best
you're going to achieve is a parcel of <TEXT></TEXT> nodes, similar to a
'save as plain text' function in the DTP packages.
OK, possibly you can tune it to recognise titles & bylines - but only
for your select group of identically structured source docs. There will
be no push-button solution for a while.
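
A small sketch of that "parcel of TEXT nodes" approach, assuming the text has already been extracted from the PDF by some other tool. The `DOCUMENT`, `TITLE` and `BYLINE` tag names and the heuristics are assumptions chosen for the example, and they would only hold for one family of identically structured documents:

```python
import xml.etree.ElementTree as ET


def text_to_xml(plain_text):
    """Wrap each blank-line-separated block of extracted text in a node."""
    root = ET.Element("DOCUMENT")
    blocks = [b.strip() for b in plain_text.split("\n\n") if b.strip()]
    for index, block in enumerate(blocks):
        lines = block.splitlines()
        # Assumed heuristic: a short, single-line first block is the title,
        # and any other single line starting with "By " is the byline.
        if index == 0 and len(lines) == 1 and len(block) < 80:
            tag = "TITLE"
        elif len(lines) == 1 and block.startswith("By "):
            tag = "BYLINE"
        else:
            tag = "TEXT"
        # Re-join hard-wrapped lines into one run of text per node.
        ET.SubElement(root, tag).text = " ".join(line.strip() for line in lines)
    return root
```

Anything the heuristics do not recognise falls through to a plain `TEXT` node, which is why the result is closer to "save as plain text" than to structured XML.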

Seeing as you're looking into this field, have you ever tried to train
HTML-Transit to do its translations? It'd be like that only worse & less
accurate.

Do you have access to the source documents that the PDFs were distilled
from? Get hold of them and you _may_ find a better packaged solution
available.

- trying to be constructive this time -
.dan.

Re: PDF to XML - LOL!

Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
> 
> Dan, this might sound crazy but this could happen. Also, I have
> seen some stuff on CADXML, so there might be a STEP -> XML.
> 
> I think it depends on how SVG (Scalable Vector Graphics) takes off.
> 
> I have noticed that some of the charting I have done in the past for
> an application using GIFs can now be done in SVG.
> 
> So would you want to convert legacy GIFs to XML?

I believe this was not the point...
The power of XML is the ability to give a "context" to the data you
write...
Converting GIF to SVG is like converting BMP to TIF... You don't lose
anything, but you don't gain anything either... You can't "recontextualize"
information that was extracted from its context...

	Pier


Re: PDF to XML - LOL!

Posted by Pa...@wdr.com.
     Dan, this might sound crazy but this could happen. Also, I have
     seen some stuff on CADXML, so there might be a STEP -> XML.
     
     I think it depends on how SVG (Scalable Vector Graphics) takes off.
     
     I have noticed that some of the charting I have done in the past for
     an application using GIFs can now be done in SVG.
     
     So would you want to convert legacy GIFs to XML?
     
     
     Ok Paul
     






Re: PDF to XML - LOL!

Posted by Dan Morrison <dm...@es.co.nz>.
What's next?

GIF -> XML?

I've seen JPEG -> ASCII but c'mon...

Sorry for my noise but this is just too funny...

.dan.

Re: PDF to XML

Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
> 
> Pier, apparently, according to their brochure, ReachCast have a product
> that can take PDF files and convert them to an XML document. I do not
> know how much flexibility you have in the conversion process.
> 
> This conversion is of specific importance to me, as a project I'm
> about to work on has a number of legacy PDF docs that need to be
> converted into XML.

Yep, I've found it... But, as I said, they convert PDF to HTML or XML.
I believe (that's the only thing I can imagine) that when they convert
to XML, what they're really doing is taking the PDF and styling it into
an XHTML+CSS format.
That's the only thing logically possible, but in that case you lose
the power of XML: its ability to give a context to the content...
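
To make the difference concrete, here is a small sketch. Both sample documents are invented for the example (this is not actual ReachCast output), and the `report`, `title`, `author` and `summary` tag names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Display-oriented: what a PDF -> XHTML+CSS conversion might plausibly emit.
DISPLAY = """<div>
  <span style="font-size:24pt">Quarterly Report</span>
  <span style="font-size:10pt">Jane Doe</span>
  <span style="font-size:10pt">Revenue rose in the fourth quarter.</span>
</div>"""

# Content-oriented: the same text with meaningful (hypothetical) tag names.
CONTENT = """<report>
  <title>Quarterly Report</title>
  <author>Jane Doe</author>
  <summary>Revenue rose in the fourth quarter.</summary>
</report>"""


def find_author(xml_text):
    """Return the author if the markup names it, else None."""
    return ET.fromstring(xml_text).findtext("author")
```

Both documents are well-formed XML and carry the same words, but only the second lets a program ask for the author by name; in the first, the author is just another 10pt span.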

I've seen a similar tool used by Mike Pogue... He once showed me a
printer driver for Windows that output HTML to a file. I believe
you can use the same tool to print your PDFs to this "HTML printer" and
display them online. (Or, from HTML, convert them into XHTML, and try
to do something from there.)

Mike, what was the tool you were using?

	Pier


Re: PDF to XML

Posted by Pa...@wdr.com.
     Pier, apparently, according to their brochure, ReachCast have a product
     that can take PDF files and convert them to an XML document. I do not
     know how much flexibility you have in the conversion process.
     
     This conversion is of specific importance to me, as a project I'm 
     about to work on has a number of legacy PDF docs that need to be 
     converted into XML.
     
     
     Ok Paul






Re: PDF to XML

Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
> 
> Does anyone know of any other PDF to XML converters?

PDF -> XML? That's odd... I've always thought that people wanted to
do XML -> PDF, not the other way around...
Since PDF is a "graphic" language (much like HTML, it's display
oriented, not content oriented), I believe that the only translation you
can do is PDF -> XSL-FO or another display metalanguage...

> I'm currently looking at stuff from ReachCast.

I'm browsing their site (www.reachcast.com) but can't find anything
related to PDF -> XML... Any link?

	Pier
