Posted to user@forrest.apache.org by Moshe Yudkowsky <ms...@bl.com> on 2005/10/23 05:14:28 UTC

Conversion of "raw" html stops in mid-file w/o error message

I've got a raw html file that is being auto-converted -- decorated -- 
by forrest.

Although the conversion goes well for the initial sections, at one point 
the conversion stops, and the rest of the file does not appear. There 
are no error messages or warnings.

I have validated the document using the W3C validator, and it passes 
whether I use it as 4.01 loose or XHTML strict. (The meta tags have to 
be modified, depending  on the format, but the rest of the document is 
unchanged.)

Problem 1: no conversion of XHTML strict

If the document is XHTML strict, then forrest does not convert any of 
the body text whatsoever!


Problem 2: partial conversion of HTML 4.01 text.

The initial paragraphs convert with no problem. They look like this:

<h2>WORK EXPERIENCE</h2>
<dl>
  <dt>
   DIALOGIC &amp; INTEL CORPORATION / 1996 - 2002<br/>
   1996 - 2002: Speech Technology<br/>
   Mission: Architect and Advocate for Speech Technologies.<br/>
   <em>(Note: Dialogic was acquired by Intel in 1999.)</em><br/>
</dt>
<dd>
  <ul>
   <li>Guide technical development... </li>
  </ul>
</dd>
</dl>

etc.

The paragraphs which do not convert look like this:

<h2>SKILLS</h2>
	<h4>Speech Recognition &amp; Speech Technology</h4>
	 <ul>
	  <li>Cross-industry knowledge...</li>
	 </ul>

The only line that converts is the <h2> line, SKILLS, and the rest of 
the document is missing. I thought the "&amp;" in the <h4> might be 
throwing the system off, but I tried removing it, and that's not the problem.

If anyone has any ideas on how to debug this, please let me know!

Re: better workaround for unadorned HTML?

Posted by David Crossley <cr...@apache.org>.

Moshe Yudkowsky wrote:
> Hmm.
> 
> I notice that in 0.7, all the files in what would be the 
> "raw-content-dir" 0.5.1 directories are actually copied over during 
> processing. For example, take this index.html file, which includes 
> framesets:
> 
> >Copying 
> >/home/moshe/web/disaggregate/site/src/documentation/content/bkauthors/index.html to /home/moshe/web/disaggregate/site/tmp/builds/disaggregate/bkauthors/index.html
> 
> If there's another copy in project.content-dir, e.g., in the correct 
> xdocs subdirectory, then it's adorned and replaces the previous copied 
> file. Unfortunately, framesets aren't processed by forrest, and what I 
> get is a blank file.
> 
> My workaround is to keep this file only in the raw-content-dir 
> directory, and to create an <external> site entry in site.xml:

No need for workarounds. The handling of "raw" files
has changed in 0.7:
http://forrest.apache.org/docs/faq.html

You need a special match in your project sitemap.xmap.

> <bkauthors-webring href="http://www.disaggregate.com/bkauthors/">
> 	<index href="index.html"/>
> 	<welcome href="webrings/welcome.html"/>
> 	<control href="webrings/control.html"/>
> </bkauthors-webring>
> 
> I then reference that site as an ext: when I want to link to it.
> 
> Two questions arise:
> 
> (1) Is it possible to use a relative pathname in the <external> 
> declaration, above? When I try it, I get an error.

Yes. See the example in our Forrest site docs:
forrest/site-author/content/xdocs/site.xml

> (2) Is there some other workaround? What would work best, I expect, is a 
> method to mark particular files as "do not adorn." I believe that it's 
> possible... it'd be nice to have it as a flag in the file itself, but I 
> suspect that it's possible via a filename and sitemap change, such as 
> "index.html.noadorn" becoming "index.html".

See above.

-David

better workaround for unadorned HTML?

Posted by Moshe Yudkowsky <ms...@bl.com>.

Hmm.

I notice that in 0.7, all the files in what would be the 
"raw-content-dir" 0.5.1 directories are actually copied over during 
processing. For example, take this index.html file, which includes 
framesets:

> Copying /home/moshe/web/disaggregate/site/src/documentation/content/bkauthors/index.html to /home/moshe/web/disaggregate/site/tmp/builds/disaggregate/bkauthors/index.html

If there's another copy in project.content-dir, e.g., in the correct 
xdocs subdirectory, then it's adorned and replaces the previous copied 
file. Unfortunately, framesets aren't processed by forrest, and what I 
get is a blank file.

My workaround is to keep this file only in the raw-content-dir 
directory, and to create an <external> site entry in site.xml:

<bkauthors-webring href="http://www.disaggregate.com/bkauthors/">
	<index href="index.html"/>
	<welcome href="webrings/welcome.html"/>
	<control href="webrings/control.html"/>
</bkauthors-webring>

I then reference that site as an ext: when I want to link to it.

Two questions arise:


(1) Is it possible to use a relative pathname in the <external> 
declaration, above? When I try it, I get an error.

(2) Is there some other workaround? What would work best, I expect, is a 
method to mark particular files as "do not adorn." I believe that it's 
possible... it'd be nice to have it as a flag in the file itself, but I 
suspect that it's possible via a filename and sitemap change, such as 
"index.html.noadorn" becoming "index.html".

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by Moshe Yudkowsky <ms...@bl.com>.

Ross notes:

  >> Like we said, if it is a problem for you then feel free to provide a patch
  >> to the html2document.xsl stylesheet.

David comments:

  > It is not that we are deliberately enforcing that.
  > If you can devise a method to handle such tag soup,
  > then we will gladly apply the patch.

Thanks for the information.

I will try to look at this issue, I think.

Actually, a higher-priority item for me would be the silent failure. I
didn't get any error messages; instead, the page was simply not
rendered, and if I hadn't been checking the pages to see what happened
under 0.7 I would never have noticed. I've now checked the rest of the
site and I've found other silent failures!


  >> We don't use html as an internal format because it lacks some needed
  >> structure for other processing. Our internal structure is much closer
  >> to XHTML2 (in fact we will be moving to a subset of XHTML2 in some future
  >> release).


Thanks, I will be on the lookout.

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by David Crossley <cr...@apache.org>.

Moshe Yudkowsky wrote:
> 
> Thanks for the information about how to accomplish this conversion.  My 
> (current) problem is solved: <h2> cannot be followed directly by <h4> in 
> forrest.
> 
> I do have a comment:
> 
> * The W3C HTML validator says that <h2> followed by <h4> is valid HTML 
> and valid XHTML. From the W3 spec 
> <http://www.w3.org/TR/REC-html40/struct/global.html#edef-H2> itself, 
> "Some people consider skipping heading levels to be bad practice. They 
> accept H1 H2 H1 while they do not accept H1 H3 H1 since the heading 
> level H2 is skipped." Forrest is going beyond the spec by enforcing this 
> restriction.

It is not that we are deliberately enforcing that.
If you can devise a method to handle such tag soup,
then we will gladly apply the patch.

-David

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by "Gav...." <br...@brightontown.com.au>.

----- Original Message ----- 
From: "Ross Gardler" <rg...@apache.org>
To: <us...@forrest.apache.org>
Sent: Sunday, October 23, 2005 8:33 PM
Subject: Re: Conversion of "raw" html stops in mid-file w/o error message

| Moshe Yudkowsky wrote:
| > All,
| >
| > Thanks for the information about how to accomplish this conversion.  My
| > (current) problem is solved: <h2> cannot be followed directly by <h4> in
| > forrest.
| >
| > I do have a comment:
| >
| > * The W3C HTML validator says that <h2> followed by <h4> is valid HTML
| > and valid XHTML. From the W3 spec
| > <http://www.w3.org/TR/REC-html40/struct/global.html#edef-H2> itself,
| > "Some people consider skipping heading levels to be bad practice. They
| > accept H1 H2 H1 while they do not accept H1 H3 H1 since the heading
| > level H2 is skipped." Forrest is going beyond the spec by enforcing this
| > restriction.
|
| Like we said, if it is a problem for you then feel free to provide a patch
| to the html2document.xsl stylesheet.
|
| We don't use html as an internal format because it lacks some needed
| structure for other processing. Our internal structure is much closer
| to XHTML2 (in fact we will be moving to a subset of XHTML2 in some future
| release).
|
| In XHTML2 headings do not have levels assigned to them, instead you have:
|
| <section>
|   <title>
|     <section>
|       <title>
|
| Ross

And no more <h1>....<h4>; it's just <h>:

<body>
<h>This is a top level heading</h>
<p>....</p>
<section>
    <p>....</p>
    <h>This is a second-level heading</h>
    <p>....</p>
    <h>This is another second-level heading</h>
    <p>....</p>
</section>
</body>

Gav...

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by Ross Gardler <rg...@apache.org>.

Moshe Yudkowsky wrote:
> All,
> 
> Thanks for the information about how to accomplish this conversion.  My 
> (current) problem is solved: <h2> cannot be followed directly by <h4> in 
> forrest.
> 
> I do have a comment:
> 
> * The W3C HTML validator says that <h2> followed by <h4> is valid HTML 
> and valid XHTML. From the W3 spec 
> <http://www.w3.org/TR/REC-html40/struct/global.html#edef-H2> itself, 
> "Some people consider skipping heading levels to be bad practice. They 
> accept H1 H2 H1 while they do not accept H1 H3 H1 since the heading 
> level H2 is skipped." Forrest is going beyond the spec by enforcing this 
> restriction.

Like we said, if it is a problem for you then feel free to provide a patch 
to the html2document.xsl stylesheet.

We don't use html as an internal format because it lacks some needed 
structure for other processing. Our internal structure is much closer 
to XHTML2 (in fact we will be moving to a subset of XHTML2 in some future 
release).

In XHTML2 headings do not have levels assigned to them, instead you have:

<section>
   <title>
     <section>
       <title>
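
To illustrate the nesting, here is a minimal Python sketch of how a flat run of HTML headings could be folded into nested sections. It is not Forrest's html2document.xsl and makes no claim about how that stylesheet works; Section, nest_headings and to_xml are hypothetical names. It assumes one possible policy for skipped levels: a heading is nested under the nearest earlier heading with a smaller level, so <h2> followed by <h4> just becomes one level of nesting.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One <section> with its <title> and any nested sections."""
    title: str
    children: list = field(default_factory=list)


def nest_headings(headings):
    """Turn [(level, title), ...] into a list of top-level Section objects.

    Assumed policy (hypothetical, not Forrest's): each heading is nested
    under the nearest earlier heading whose level is smaller.
    """
    root = Section(title="")   # synthetic root that holds top-level sections
    stack = [(0, root)]        # (heading level, section) for each open section
    for level, title in headings:
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = Section(title=title)
        stack[-1][1].children.append(section)
        stack.append((level, section))
    return root.children


def to_xml(sections, indent=0):
    """Render sections as indented lines; titles are assumed already escaped."""
    pad = "  " * indent
    lines = []
    for section in sections:
        lines.append(f"{pad}<section>")
        lines.append(f"{pad}  <title>{section.title}</title>")
        lines.extend(to_xml(section.children, indent + 1))
        lines.append(f"{pad}</section>")
    return lines


if __name__ == "__main__":
    flat = [(2, "SKILLS"), (4, "Speech Recognition &amp; Speech Technology")]
    print("\n".join(to_xml(nest_headings(flat))))
```

Whether a skipped level should be flattened like this, or padded with empty intermediate sections, is a design choice that a real patch would have to settle.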

Ross

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by Moshe Yudkowsky <ms...@bl.com>.

All,

Thanks for the information about how to accomplish this conversion.  My 
(current) problem is solved: <h2> cannot be followed directly by <h4> in 
forrest.

I do have a comment:

* The W3C HTML validator says that <h2> followed by <h4> is valid HTML 
and valid XHTML. From the W3 spec 
<http://www.w3.org/TR/REC-html40/struct/global.html#edef-H2> itself, 
"Some people consider skipping heading levels to be bad practice. They 
accept H1 H2 H1 while they do not accept H1 H3 H1 since the heading 
level H2 is skipped." Forrest is going beyond the spec by enforcing this 
restriction.

And now, for some other cleanup work (&#151; to &mdash;, for example). 
With any luck I'll find some time to submit doc patches.

-- 
Moshe Yudkowsky * moshe@pobox.com * www.pobox.com/~moshe
"Don't try to outweird me, three-eyes.  I get stranger things than you free
with my breakfast cereal."
- Zaphod Beeblebrox in "Hitchhiker's Guide to the Galaxy"

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by Ross Gardler <rg...@apache.org>.

Brian M Dube wrote:
> On 10/22/05, Moshe Yudkowsky <ms...@bl.com> wrote:
> 

...

>>If anyone has any ideas on how to debug this, please let me know!
> 
> 
> What if you try making the <h4> line <h3>? I could be off base (I'm
> still getting to know Forrest), but there could be a problem with
> parsing if you skip levels from <h2> to <h4>.

That is correct. It is currently not possible to work with incorrectly 
ordered headings (i.e. skip from h2 to h4).

If it is critical to you and you want to try and change this the 
relevant file is html2document.xsl (in 0.7, html_to_document.xsl in 
0.8-dev).
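
Because the failure is silent, it may help to scan pages for skipped heading levels before they go through Forrest. The following is a minimal Python sketch, not part of Forrest; HeadingSkipChecker and find_heading_skips are hypothetical names, and it relies only on the standard-library html.parser module. It flags any heading that is more than one level deeper than the heading before it.

```python
from html.parser import HTMLParser


class HeadingSkipChecker(HTMLParser):
    """Record places where a heading jumps down more than one level."""

    def __init__(self):
        super().__init__()
        self.previous_level = None  # level of the last heading seen, if any
        self.skips = []             # (line, previous_level, level) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; only h1..h6 are of interest.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            level = int(tag[1])
            line, _column = self.getpos()
            if self.previous_level is not None and level > self.previous_level + 1:
                self.skips.append((line, self.previous_level, level))
            self.previous_level = level


def find_heading_skips(html_text):
    """Return a list of (line, previous_level, level) for each skipped level."""
    checker = HeadingSkipChecker()
    checker.feed(html_text)
    checker.close()
    return checker.skips


if __name__ == "__main__":
    import sys

    # Usage: python check_headings.py page1.html page2.html ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line, previous, level in find_heading_skips(handle.read()):
                print(f"{path}:{line}: <h{previous}> followed by <h{level}>")
```

Running it over the content directory would list each file and line where, for example, <h2> is followed directly by <h4>.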

Ross

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by Brian M Dube <br...@gmail.com>.

On 10/22/05, Moshe Yudkowsky <ms...@bl.com> wrote:
> I've got a raw html file that is being auto-converted -- decorated --
> by forrest.
>
> Although the conversion goes well for the initial sections, at one point
> the conversion stops, and the rest of the file does not appear. There
> are no error messages or warnings.
>
> I have validated the document using the W3C validator, and it passes
> whether I use it as 4.01 loose or XHTML strict. (The meta tags have to
> be modified, depending  on the format, but the rest of the document is
> unchanged.)
>
> Problem 1: no conversion of XHTML strict
>
> If the document is XHTML strict, then forrest does not convert any of
> the body text whatsoever!
>
>
> Problem 2: partial conversion of HTML 4.01 text.
>
> The initial paragraphs convert with no problem. They look like this:
>
> <h2>WORK EXPERIENCE</h2>
> <dl>
>   <dt>
>    DIALOGIC &amp; INTEL CORPORATION / 1996 - 2002<br/>
>    1996 - 2002: Speech Technology<br/>
>    Mission: Architect and Advocate for Speech Technologies.<br/>
>    <em>(Note: Dialogic was acquired by Intel in 1999.)</em><br/>
> </dt>
> <dd>
>   <ul>
>    <li>Guide technical development... </li>
>   </ul>
> </dd>
> </dl>
>
> etc.
>
> The paragraphs which do not convert look like this:
>
> <h2>SKILLS</h2>
>         <h4>Speech Recognition &amp; Speech Technology</h4>
>          <ul>
>           <li>Cross-industry knowledge...</li>
>          </ul>
>
> The only line that converts is the <h2> line, SKILLS, and the rest of
> the document is missing. I thought the "&amp;" in the <h4> might be
> throwing the system off, but I tried removing it, and that's not the problem.
>
> If anyone has any ideas on how to debug this, please let me know!

What if you try making the <h4> line <h3>? I could be off base (I'm
still getting to know Forrest), but there could be a problem with
parsing if you skip levels from <h2> to <h4>.

Brian

Re: Conversion of "raw" html stops in mid-file w/o error message

Posted by David Crossley <cr...@apache.org>.

Moshe Yudkowsky wrote:
> I've got a raw html file that is being auto-converted -- decorated -- 
> by forrest.

Then it is not "raw". Raw files get no decoration.

> Although the conversion goes well for the initial sections, at one point 
> the conversion stops, and the rest of the file does not appear. There 
> are no error messages or warnings.

This is probably the h2, h3, h4 issue that Brian mentioned.

> I have validated the document using the W3C validator, and it passes 
> whether I use it as 4.01 loose or XHTML strict. (The meta tags have to 
> be modified, depending  on the format, but the rest of the document is 
> unchanged.)
> 
> Problem 1: no conversion of XHTML strict
> 
> If the document is XHTML strict, then forrest does not convert any of 
> the body text whatsoever!

Correct, because Forrest is expecting HTML input,
not XHTML.

You need to add a SourceTypeAction for XHTML.
http://forrest.apache.org/docs_0_70/cap.html

-David