Posted to general@xml.apache.org by Dan Morrison <dm...@es.co.nz> on 2000/05/28 15:01:21 UTC
Re: Cocoon the other way???

Samuel Kock wrote:
> 
> Hi
> 
> I am quite new to this list, but have been following the goings on for
> some time. Most things you people talk about are a bit greek to me, but
> I do have one question:

Well, some of it can be a bit techie (where are your classpaths etc) but
now and then higher thoughts are discussed here...

> Is it possible to maybe write an extension to cocoon that would go the
> other way? For example, convert a PDF file to an XML file using (I
> suppose) XSLT? Or an RTF or HTML file, for that matter????

While I can sympathise with this desire, it's just not a happening
thing, except in the most basic sense.

It is possible to design round-trip stylesheets, so that data coming
from XML into $PRESENTATION_FORMAT can be decomposed back into XML, but
in the general case, the answer is currently no.

If you were able to impose the same limitation upon your source input as
XMLers do, with strict DTDs and all, sure! But the general case is
nothing like that.

The task you're looking at is (I assume) migration of legacy documents
into a more versatile medium.

The problem is that you cannot add value to these documents
automatically without some deeper understanding of the context and the
rules the layout follows.

_If_ you were converting vanilla HTML, which used <H1>, <H2> in a
consistent way, you could extract an XML file that could indicate "here
is the beginning of a section, its title is ZZZ". This probably isn't
even enough to deduce where you could put your <SYNOPSIS> and <AUTHOR>
tags, but it would maintain the structure that was already there.
But I'd say the only website I've seen in the last year which tried
to do so was W3C. Think of trying to make sense of the semantic content
of any high-profile site that hasn't been designed with these hooks in
there beforehand. (news sites have hooks, as do many database-driven
sites)
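
As a rough illustration of that heading-based extraction, here is a
minimal Python sketch. It assumes the input really does use <H1> and
<H2> consistently; the class name HeadingSectioner, the function
html_to_sections and the <document>/<section> element names are made up
for the example, not part of Cocoon or any existing tool.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadingSectioner(HTMLParser):
    """Hypothetical helper: nest text under <section> elements by H1/H2."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        # Stack of (heading level, element); the root counts as level 0.
        self.stack = [(0, self.root)]
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives as "h1".
        if tag in ("h1", "h2"):
            level = int(tag[1])
            # Close any open sections at the same or a deeper level.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            section = ET.SubElement(self.stack[-1][1], "section", title="")
            self.stack.append((level, section))
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_heading = False

    def handle_data(self, data):
        current = self.stack[-1][1]
        if self.in_heading:
            # Heading text becomes the section title.
            current.set("title", current.get("title", "") + data.strip())
        elif data.strip():
            # Everything else is kept as plain text of the open section.
            current.text = (current.text or "") + data.strip()


def html_to_sections(html):
    parser = HeadingSectioner()
    parser.feed(html)
    parser.close()
    return parser.root


if __name__ == "__main__":
    page = "<H1>Intro</H1><p>Hello</p><H2>Details</H2><p>More</p>"
    print(ET.tostring(html_to_sections(page), encoding="unicode"))
```

Note that it only preserves the structure the headings already carry.
It has no way of guessing a <SYNOPSIS> or <AUTHOR>, and it falls apart
as soon as a page uses <H2> for decoration rather than structure.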

Theoretically the same could be done with Word files, /assuming/ that
every author had used 'styles' consistently and rigorously. In the real
world that almost never happens. A company could design a perfect
standard template, but the first bozo to use it will delete the date by
accident, then replace it with text that may look exactly the same, but
be 'bold italic' instead of style:date.
So the automated process that is to slurp this doc has much less chance
of popping it into a <DATE> field.
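
To make the style problem concrete, here is a small Python sketch. It
assumes the paragraphs have already been pulled out of the Word file as
(style name, text) pairs by some other means; the STYLE_TO_TAG mapping,
the DATE_LIKE pattern and the paragraphs_to_xml function are all
hypothetical names invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from template style names to XML element names.
STYLE_TO_TAG = {"date": "DATE", "title": "TITLE", "author": "AUTHOR"}

# Illustrative guess at a hand-typed date such as "28 May 2000".
DATE_LIKE = re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$")


def paragraphs_to_xml(paragraphs):
    """Map styled paragraphs to XML; collect date-like text that lost its style."""
    root = ET.Element("document")
    needs_review = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # Unknown style: keep the text, but only as a generic paragraph.
            tag = "PARA"
            if DATE_LIKE.match(text):
                # Looks like a date, but we cannot be sure; flag for a human.
                needs_review.append(text)
        ET.SubElement(root, tag).text = text
    return root, needs_review


if __name__ == "__main__":
    doc = [("title", "Report"), ("bold italic", "28 May 2000")]
    root, review = paragraphs_to_xml(doc)
    print(ET.tostring(root, encoding="unicode"))
    print("check by hand:", review)
```

With the style intact the date lands in <DATE>. Once it has been
retyped as 'bold italic' the best the script can do is flag it for a
human, which is exactly the special-casing that makes this painful.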

Have a play with HTML Transit, and some real world docs from a period of
time, and you'll see that it can be done, but painfully, and with many
special cases and 'training' of the algorithms.

PDF is even further removed. When it comes to extracting semantic
context out of a DTP doc, you can compare it with trying to convince an
OCR scanner to behave. It can be done, but /generally/ on a case-by-case
basis.
Designers don't say "Here is the title, here is the byline", they
position it there, somehow, and leave it to the viewer to deduce from
font size and position what part of the document it is. You can try and
teach a scanning application that font size 18 indicates a title, but
good luck getting it to find the correct context or logical position for
a pullquote.
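
The font-size rule of thumb can be sketched in a few lines of Python.
This assumes some extraction step has already produced (font size,
text) runs from the page; the classify_runs function and the threshold
values are invented for the example and would need tuning per document.

```python
def classify_runs(runs, title_size=18, body_size=10):
    """Label (font_size, text) runs by size alone; thresholds are assumptions."""
    labelled = []
    for size, text in runs:
        if size >= title_size:
            label = "title"
        elif size <= body_size:
            label = "body"
        else:
            # In-between sizes (pullquotes, captions, bylines) stay unresolved.
            label = "unknown"
        labelled.append((label, text))
    return labelled


if __name__ == "__main__":
    page = [(18, "Big News"), (10, "Story text"), (14, "A pulled quote")]
    for label, text in classify_runs(page):
        print(label, "->", text)
```

Size alone gets you the title and the body. The pullquote ends up as
"unknown", and nothing here says where it belongs logically in the
story.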

I can't speak for the Cocoon dev guys, but this worthy field of
endeavour is probably a different ball game from what XSL publishing is
about in the current environment. 
I /am/ predicting the time when round-trip publishing is a reality,
(working on it myself part-time) but as for legacy documents, you're
looking at a tidy subset of artificial recognition to get results.
Or something like HTML Transit. Effective, but not automatic.

<DISCLAIMER>
This is my real-world P.O.V. having been there in many guises.

Try having 28 departments getting faxed (I know, I know) a template
page, then having 127 documents come back containing some sort of Word 6
representation of a 5 x 16 field table full of mixed data.
/Although/ visually and semantically most of them were equivalent,
macroing that into a database was pretty raw.

Just done the same thing again on a smaller scale last week, when
auditing several hundred IP allocations for the university...
"please correct the user details for your department as shown on this
document and send it back" Uuurg. The things they did to that
plaintext...

There may be some academics out there that can prove this whole thing is
perfectly possible in theory, but as usual, I'm just speaking from the
trenches.
</DISCLAIMER>
  

God lick.

.dan.

:=====================:====================:
: Dan Morrison        : The Web Limited    :
:  http://here.is/dan :  http://web.co.nz  :
:  dman@es.co.nz      :  danm@web.co.nz    :
:  04 384 1472        :  04 495 8250       :
:  025 207 1140       :                    :
:.....................:....................:
: If ignorance is bliss, why aren't more people happy?
:.........................................: