Posted to general@xml.apache.org by Pa...@wdr.com on 2000/01/27 15:38:13 UTC
PDF to XML
Does anyone know of any other PDF to XML converters?
I'm currently looking at stuff from ReachCast.
Thanks in advance.
======================================================================
This message contains confidential information and is intended only
for the individual named. If you are not the named addressee you
should not disseminate, distribute or copy this e-mail. Please
notify the sender immediately by e-mail if you have received this
e-mail by mistake and delete this e-mail from your system.
E-mail transmission cannot be guaranteed to be secure or error-free
as information could be intercepted, corrupted, lost, destroyed,
arrive late or incomplete, or contain viruses. The sender therefore
does not accept liability for any errors or omissions in the contents
of this message which arise as a result of e-mail transmission. If
verification is required please request a hard-copy version. This
message is provided for informational purposes and should not be
construed as a solicitation or offer to buy or sell any securities or
related financial instruments.
Re: PDF to XML - LOL!
Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Dan Morrison wrote:
>
> What's next?
>
> GIF -> XML?
Hahahaha :)
> I've seen JPEG -> ASCII but c'mon...
I come from the old BBS world (when modems at 2400 bps were a luxury!
V24bis rulez! and when HTML was just a random association of four
consonants :)... We had such nice stuff to convert images to ANSI :)
> Sorry for my noise but this is just too funny...
It's funny, but it's the way it is... Almost nobody out there knows
exactly what XML is, but it's one of the most powerful TLAs in the
marketing world today...
Just look at the difference between:
I make HTML pages
and
I make XML pages
(meaning XHTML :) There's no difference, but if you use the second
phrase, you can charge 50% more :)
Pier
--
--------------------------------------------------------------------
- P I E R -
stable structure erected over water to allow the docking of seacraft
<ma...@betaversion.org> <http://www.betaversion.org/~pier/>
--------------------------------------------------------------------
- ApacheCON Y2K: Come to the official Apache developers conference -
-------------------- <http://www.apachecon.com> --------------------
Re: PDF to XML - LOL!
Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Dan Morrison wrote:
>
> In my experience PDF (with its eye on a completely different ball)
> tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
> document even more.
It doesn't try to obfuscate anything... It's just a good graphical
representation of a page layout... It's really good at preserving
graphic ideas...
Pier
Re: PDF to XML - LOL!
Posted by Pa...@wdr.com.
It has been really interesting looking at these threads on this
particular item, and it gives me another perspective on PDF -> XML.
My perspective in posting the item was that this system has legacy
docs in PDF, and that from an architectural standpoint, if I can get
them into XML then I can react to the business a lot quicker.
Really all I want to do is put together a framework where the PDF
docs can be mixed with associated data from other systems and then
served to the relevant user service, e.g. WWW, WAP, B2B, PDA, eBook (?),
other messaging systems, or anything else that comes along.
As I see it, whatever I build now should not be a quick fix to get PDF
mixed in with some other stuff to deliver just to the WWW.
To pick up a question in Dan's note, I think I might be able to get
the source of a few documents, but I would like to point out that we
are talking about tens of thousands of documents in this particular
case. :-( Not good.
Thanks, Paul
RE: PDF to XML - LOL!
Posted by Philipp Knirck <ph...@maas.de>.
If you care about AFP --> PDF,
MAAS High Tech has developed an AFP2Web converter which does just that:
www.afp2web.de
Check it out!
Kind regards,
Philipp Knirck
MAAS High Tech Software GmbH
Siemensweg 4
70794 Filderstadt - Bonlanden
Tel. 0711 - 77 91 7(0) - 39
Mobile 0177 - 34 02 113
Email: mailto:phil@maas.de
Re: PDF to XML - LOL!
Posted by Dan Morrison <dm...@es.co.nz>.
Pierpaolo Fumagalli wrote:
> ... You can't "recontextualize"
> information that was extracted from its context...
Indeed.
I accept that someone may take it upon themselves to inline a
representation of binary or proprietary data (I still think of PDF
as proprietary, in comparison to XML anyway).
I guess you're welcome to introduce a <UUENCODE> block or whatever
suits.
The thing is, it's a bit beyond XML translators (at the moment) to look
at a magazine page and break it up into its constituent bits with
meaningful tag names. Heck even translating from Word->HTML is a mess
unless the original has been crafted using style templates 100% of the
time. In my experience PDF (with its eye on a completely different ball)
tends to obfuscate the STRUCTURE and the CONTENT (yay XML!) of the
document even more.
Honestly, if you really need to proceed in this direction, the best
you're going to achieve is a parcel of <TEXT></TEXT> nodes, similar to a
'save as plain text' function in the DTP packages.
OK, possibly you can tune it to recognise titles & bylines - but only
for your select group of identically structured source docs. There will
be no push-button solution for a while.
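As a rough illustration of the two paragraphs above, here is a minimal
Python sketch. It is not taken from ReachCast or any real converter. It
assumes the text has already been pulled out of the PDF by some other
tool (that step is not shown), wraps each paragraph in a flat node, and
applies two made-up heuristics: the first paragraph is the title, and a
paragraph starting with "By " is a byline. The function name
text_to_xml, the DOCUMENT/TITLE/BYLINE element names and the heuristics
themselves are all assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def text_to_xml(plain_text):
    """Wrap already-extracted plain text in flat XML nodes.

    Hypothetical heuristics, valid only for identically structured docs:
    the first paragraph is the title, a paragraph starting with "By " is
    a byline, and everything else becomes a generic TEXT node.
    """
    root = ET.Element("DOCUMENT")
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in plain_text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        # Collapse line breaks left over from the page layout.
        content = " ".join(paragraph.split())
        if index == 0:
            tag = "TITLE"
        elif content.startswith("By "):
            tag = "BYLINE"
        else:
            tag = "TEXT"
        ET.SubElement(root, tag).text = content
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


sample = "PDF to XML\n\nBy A. Writer\n\nFirst line of body\nsecond line of body."
print(text_to_xml(sample))
```

Anything that does not match the two heuristics falls through to a
plain TEXT node, which is about as much structure as you can expect
without tuning the rules per document family.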
Seeing as you're looking into this field, have you ever tried to train
HTML-Transit to do its translations? It'd be like that only worse & less
accurate.
Do you have access to the source documents that the PDFs were distilled
from? Get hold of them and you _may_ find a better packaged solution
available.
- trying to be constructive this time -
.dan.
Re: PDF to XML - LOL!
Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
>
> Dan, this might sound crazy, but this could happen. Also, I have
> seen some stuff on CADXML, so there might be a STEP -> XML.
>
> I think it depends on how SVG (Scalable Vector Graphics) takes off.
>
> I have noticed that some of the charting I have done in the past for
> an application using GIFs can now be done in SVG.
>
> So would you want to convert legacy GIFs to XML?
I believe this was not the point...
The power of XML is the ability to give a "context" to the data you
write...
Converting GIF to SVG is like converting BMP to TIF... You don't lose
anything, but you don't gain anything either... You can't "recontextualize"
information that was extracted from its context...
Pier
Re: PDF to XML - LOL!
Posted by Pa...@wdr.com.
Dan, this might sound crazy, but this could happen. Also, I have
seen some stuff on CADXML, so there might be a STEP -> XML.
I think it depends on how SVG (Scalable Vector Graphics) takes off.
I have noticed that some of the charting I have done in the past for
an application using GIFs can now be done in SVG.
So would you want to convert legacy GIFs to XML?
Ok Paul
Re: PDF to XML - LOL!
Posted by Dan Morrison <dm...@es.co.nz>.
What's next?
GIF -> XML?
I've seen JPEG -> ASCII but c'mon...
Sorry for my noise but this is just too funny...
.dan.
Re: PDF to XML
Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
>
> Pier, apparently, according to their brochure, ReachCast has a product
> that can take PDF files and convert them to an XML document. I do not
> know how much flexibility you have in the conversion process.
>
> This conversion is of specific importance to me, as a project I'm
> about to work on has a number of legacy PDF docs that need to be
> converted into XML.
Yep, I've found it... But, as I said, they convert PDF to HTML or XML.
I believe (that's the only thing I can imagine) that when they convert
to XML, what they're really doing is taking the PDF and styling it to an
XHTML+CSS format.
That's the only thing logically possible, but, in that case, you lose
the power of XML, its ability to give a context to the content...
I've seen a similar tool used by Mike Pogue... He once showed me a
printer driver for Windows that output HTML to a file. I believe
you can use the same tool to print your PDFs to this "HTML printer" and
display them online. (Or, from HTML, convert them into XHTML, and try
to do something from there.)
Mike, what was the tool you were using?
Pier
Re: PDF to XML
Posted by Pa...@wdr.com.
Pier, apparently, according to their brochure, ReachCast has a product
that can take PDF files and convert them to an XML document. I do not
know how much flexibility you have in the conversion process.
This conversion is of specific importance to me, as a project I'm
about to work on has a number of legacy PDF docs that need to be
converted into XML.
Ok Paul
Re: PDF to XML
Posted by Pierpaolo Fumagalli <pi...@apache.org>.
Paul.Waugh@wdr.com wrote:
>
> Does anyone know of any other PDF to XML converters?
PDF -> XML? That's odd... I've always thought that people wanted to
do XML -> PDF, not the other way around...
Since PDF is a "graphic" language (much like HTML, it's display
oriented, not content oriented), I believe that the only translation you
can do is PDF -> XSL-FO or another display metalanguage...
> I'm currently looking at stuff from ReachCast.
I'm browsing their site (www.reachcast.com) but can't find anything
related to PDF -> XML... Any link?
Pier