
Posted to dev@forrest.apache.org by Paul Bolger <pb...@gmail.com> on 2005/12/17 12:03:56 UTC

howto-custom-html-source

I've been trying to get this to work, and I'm not sure what's going
wrong. I'll explain what I'd like to be able to do: I'd like to point
at a directory, and its subdirectories, processing all html files so
that all content outside a #content div is stripped. This How-To is
very detailed and I've learnt a lot from it, but it'd be good to have

a. an example file of sitemap.xmap with the extra element included (I
can't find the place that it's supposed to go...)

and

 b. an example xsl file.

Thanks
Paul Bolger

Re: howto-custom-html-source

Posted by David Crossley <cr...@apache.org>.

Paul Bolger wrote:
> Thanks David. My apologies for the break in transmission: Xmas etc...
> I've had a go - a few goes actually - at getting this to work, and I'm
> still not getting anywhere.
> I've inserted the following into my sitemap.xmap file:
> 
>  <map:match pattern="**dirtyhtml**.xml">
>  <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
>  <map:transform src="{project:resources.stylesheets}/puck.xsl" />
>  <map:serialize type="xml"/>
> </map:match>

But that is not what we discussed below.

Here is a quick trip down development lane :-)

mkdir /tmp/my-site; cd /tmp/my-site
forrest seed-sample

cp someDirty.html src/documentation/content/xdocs/samples/dirtyhtml/index.html
 (e.g. get the example attachment from FOR-775 [1])

forrest run

browser http://localhost:8888/samples/dirtyhtml/index.html
Forrest will try to render the html, but you want to
extract some special content so make your own sitemap.

Let's build it bit-by-bit to make sure that we have it correct
at each step of the way. Add the following to
src/documentation/sitemap.xmap

<map:match pattern="**/dirtyhtml/**.html">
 <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
 <map:serialize type="xml"/>
</map:match>

That will read the html and serialise it as xml.

Now add our own transformer ...
<map:match pattern="**/dirtyhtml/**.html">
 <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/stripContent-to-html.xsl" />
 <map:serialize type="xml"/>
</map:match>

It will extract only the "div class=content" and transform that to plain
html. Copy the example stylesheet from the attachment to FOR-775 [1]
to src/documentation/resources/stylesheets/stripContent-to-html.xsl

That should now produce only the html fragment that
you are interested in.
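
For readers without the FOR-775 attachment to hand, a stylesheet along
these lines might do the job. This is only a minimal sketch and not the
attached file. It assumes that the wanted content sits in the first div
whose class attribute is exactly "content", and it matches on local
names so that it behaves the same whether or not the generator puts the
elements into the XHTML namespace.

```xml
<?xml version="1.0"?>
<!-- Sketch of stripContent-to-html.xsl (hypothetical, not the FOR-775 file) -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Build a bare html page around the wanted div and nothing else. -->
  <xsl:template match="/">
    <html>
      <head>
        <!-- Keep the page title if the source has one. -->
        <title>
          <xsl:value-of select="(//*[local-name()='title'])[1]"/>
        </title>
      </head>
      <body>
        <!-- Assumption: the first div with class="content" holds the body. -->
        <xsl:apply-templates
            select="(//*[local-name()='div'][@class='content'])[1]/node()"/>
      </body>
    </html>
  </xsl:template>

  <!-- Copy each element under its local name, dropping any namespace,
       and keep its attributes. Text is copied by the built-in rules. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

If the source marks the region with id="content" rather than a class,
the predicate would be [@id='content'] instead.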

Now add the standard html-to-document transformer.

<map:match pattern="**/dirtyhtml/**.html">
 <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/stripContent-to-html.xsl" />
 <map:transform src="{forrest:stylesheets}/html-to-document.xsl"/>
 <map:serialize type="xml"/>
</map:match>

The output will now be in the internal xdoc format.

Now stop matching the .html extension and match .xml instead,
and serialise the result as the forrest internal format,
i.e. add the proper DOCTYPE so that the forrest
internal machinery will deal with it. So this
is the final match for your sitemap ...

<map:match pattern="**/dirtyhtml/**.xml">
 <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/stripContent-to-html.xsl" /> 
 <map:transform src="{forrest:stylesheets}/html-to-document.xsl"/>
 <map:serialize type="xml-document"/>
</map:match>
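
To show where such a match sits in the file, here is a sketch of a
complete src/documentation/sitemap.xmap. It assumes a project with no
other custom pipelines. The point to note is that the map:match goes
inside a map:pipeline element, which in turn goes inside map:pipelines.

```xml
<?xml version="1.0"?>
<!-- Sketch of a whole project sitemap holding the single match above. -->
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:pipelines>
    <map:pipeline>

      <!-- Requests that do not match here fall through to Forrest itself. -->
      <map:match pattern="**/dirtyhtml/**.xml">
        <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html"/>
        <map:transform src="{project:resources.stylesheets}/stripContent-to-html.xsl"/>
        <map:transform src="{forrest:stylesheets}/html-to-document.xsl"/>
        <map:serialize type="xml-document"/>
      </map:match>

    </map:pipeline>
  </map:pipelines>
</map:sitemap>
```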

The above stuff will probably need refinement, e.g.
the XSL could be improved and the sitemap could use
the new locationmap.

[1] http://issues.apache.org/jira/browse/FOR-775

> It's the first entry in the <pipelines> section. Bearing in mind what
> you said about the directory separators I tried a few variations on
> the syntax.

Hmmm, I didn't say anything about directory separators.
Forrest always uses URLs, so even a file:/// local URL
has slashes, not back-slashes.

> I found the result either passed the html page straight
> through, which I assume means that the match isn't being made,

It was probably doing as instructed :-)

> or
> produced the following error:
> 
> test\src\documentation\content\xdocs\dirtyhtml\default.body.html (The
> system cannot find the file specified)
> 
> This happened when I used the code above.
> 
> As a matter of interest, how would one extend the match to include
> files with .htm and .asp extensions?

Have a look at the whiteboard/plugins/org.apache.forrest.plugin.output.php
for an example.
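
One possible shape for the .htm case, as a hedged sketch only and not
taken from that plugin: pick the source file by testing which one
exists. This assumes that the "exists" selector which Forrest declares
for its own sitemaps is also available to the project sitemap, and that
the .htm files are well-formed enough for the generator. Files with .asp
server code in them would probably need cleaning first.

```xml
<map:match pattern="**/dirtyhtml/**.xml">
  <!-- Assumption: a selector named "exists" is declared by Forrest. -->
  <map:select type="exists">
    <map:when test="{project:content.xdocs}{1}/dirtyhtml/{2}.html">
      <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html"/>
    </map:when>
    <map:when test="{project:content.xdocs}{1}/dirtyhtml/{2}.htm">
      <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.htm"/>
    </map:when>
    <!-- If neither file exists the pipeline has no generator and fails. -->
  </map:select>
  <map:transform src="{project:resources.stylesheets}/stripContent-to-html.xsl"/>
  <map:transform src="{forrest:stylesheets}/html-to-document.xsl"/>
  <map:serialize type="xml-document"/>
</map:match>
```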

-David


Re: howto-custom-html-source

Posted by Paul Bolger <pb...@gmail.com>.

Thanks David. My apologies for the break in transmission: Xmas etc...
I've had a go - a few goes actually - at getting this to work, and I'm
still not getting anywhere.
I've inserted the following into my sitemap.xmap file:

 <map:match pattern="**dirtyhtml**.xml">
 <map:generate src="{project:content.xdocs}{1}/dirtyhtml/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/puck.xsl" />
 <map:serialize type="xml"/>
</map:match>

It's the first entry in the <pipelines> section. Bearing in mind what
you said about the directory separators I tried a few variations on
the syntax. I found the result either passed the html page straight
through, which I assume means that the match isn't being made, or
produced the following error:


test\src\documentation\content\xdocs\dirtyhtml\default.body.html (The
system cannot find the file specified)

This happened when I used the code above.

As a matter of interest, how would one extend the match to include
files with .htm and .asp extensions?

paul b






--
Paul Bolger
19 Raggatt St
Alice Springs
NT 0870
08 8953 6780

Re: howto-custom-html-source

Posted by David Crossley <cr...@apache.org>.

David Crossley wrote:
> Paul Bolger wrote:
> > I've been trying to get this to work, and I'm not sure what's going
> > wrong. I'll explain what I'd like to be able to do: I'd like to point
> > at a directory, and its subdirectories, processing all html files so
> > that all content outside a #content div is stripped.
> 
> Ah, that comment indicates a basic misunderstanding
> about how Cocoon operates. It doesn't actually process
> directories [1]. Rather it handles requests. Depending
> on the components of the URL, the sitemap will respond
> by matching certain patterns.
> 
> You need a project sitemap (or plugin if it is common
> functionality) to intercept the specific matches that
> you want to transform. Any matches that remain are handled
> by the guts of forrest.
> 
> Some of our documentation explains how to handle specific
> matches. As usual our docs need attention. This doc
> is close, but you need to wade through the example that
> it points to, because only part of that is relevant.
> http://forrest.apache.org/docs/project-sitemap.html
> 
> Basically you need a project sitemap.xmap like this
> where "this-tree" is the directory tree to which
> you want to apply special processing ...
> 
> <map:match pattern="**/this-tree/**.xml">
>  <map:generate src="{project:content.xdocs}{1}/this-tree/{2}.html" />
>  <map:transform src="{project:resources.stylesheets}/myStripContent-to-document.xsl" />
>  <map:serialize type="xml"/>
> </map:match>

Of course, that should be <map:serialize type="xml-document"/>

Also your "myStripContent" transformer could probably
just remove the bits that you don't want and then follow
it with the forrest html transformer. So ...

<map:match pattern="**/this-tree/**.xml">
 <map:generate src="{project:content.xdocs}{1}/this-tree/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/myStripContent-to-html.xsl" />
 <map:transform src="{forrest:stylesheets}/html2document.xsl"/>
 <map:serialize type="xml-document"/>
</map:match>

> (Caveat: Be careful with those directory separators
> in the match and generate components: The ** will match
> a slash. I just added the above for readability.)
> 
> In other words, presume that the request is
> localhost:8888/some-dir/this-tree/foo/bar.html
> then your sitemap would fire and it would generate
> xml content from xdocs/some-dir/this-tree/foo/bar.html
> and apply your transformer to produce the forrest
> internal document structure.
> 
>                   --oOo--
> 
> [1] Preparing a directory listing, say for a table
> of contents page is another matter. For that you
> would use more complex Cocoon sitemap operations.
> See DirectoryGenerator which traverses the directory
> tree and generates an xml fragment. Apply a Transformer
> to that to turn it into forrest internal xml format.
> 
> You would need to follow Cocoon sitemap docs. Start at
> http://forrest.apache.org/docs/project-sitemap.html
> Understand sitemaps and then see:
> http://cocoon.apache.org/2.1/userdocs/directory-generator.html
> 
> We need to add an example to our seed-sample site.
> 
> > This How-To is
> > very detailed and I've learnt a lot from it, but it'd be good to have
> > 
> > a. an example file of sitemap.xmap with the extra element included (I
> > can't find the place that it's supposed to go...)
> > 
> > and
> > 
> >  b. an example xsl file.
> 
> The stylesheet to strip everything except "div class=content"
> is a simple XSLT operation. Not appropriate for this list.
> The "XSL FAQ" is a fantastic resource http://www.dpawson.co.uk/xsl/
> and get Michael Kay's book.
> 
> -David

Re: howto-custom-html-source

Posted by David Crossley <cr...@apache.org>.

Paul Bolger wrote:
> I've been trying to get this to work, and I'm not sure what's going
> wrong. I'll explain what I'd like to be able to do: I'd like to point
> at a directory, and its subdirectories, processing all html files so
> that all content outside a #content div is stripped.

Ah, that comment indicates a basic misunderstanding
about how Cocoon operates. It doesn't actually process
directories [1]. Rather it handles requests. Depending
on the components of the URL, the sitemap will respond
by matching certain patterns.

You need a project sitemap (or plugin if it is common
functionality) to intercept the specific matches that
you want to transform. Any matches that remain are handled
by the guts of forrest.

Some of our documentation explains how to handle specific
matches. As usual our docs need attention. This doc
is close, but you need to wade through the example that
it points to, because only part of that is relevant.
http://forrest.apache.org/docs/project-sitemap.html

Basically you need a project sitemap.xmap like this
where "this-tree" is the directory tree to which
you want to apply special processing ...

<map:match pattern="**/this-tree/**.xml">
 <map:generate src="{project:content.xdocs}{1}/this-tree/{2}.html" />
 <map:transform src="{project:resources.stylesheets}/myStripContent-to-document.xsl" />
 <map:serialize type="xml"/>
</map:match>

(Caveat: Be careful with those directory separators
in the match and generate components: The ** will match
a slash. I just added the above for readability.)

In other words, presume that the request is
localhost:8888/some-dir/this-tree/foo/bar.html
then your sitemap would fire and it would generate
xml content from xdocs/some-dir/this-tree/foo/bar.html
and apply your transformer to produce the forrest
internal document structure.
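
The substitution for that request can be spelled out as an annotated
fragment. It is a sketch under one assumption, marked in the comment:
that Forrest asks internally for the .xml of the same name when it
builds an .html page, which is why the pattern ends in .xml.

```xml
<!-- Browser request:   some-dir/this-tree/foo/bar.html
     Internal request:  some-dir/this-tree/foo/bar.xml  (assumed, see above)
     Pattern:           **/this-tree/**.xml
       {1} = some-dir   (whatever the first ** matched)
       {2} = foo/bar    (the second ** also matches across slashes)
     A request for this-tree/foo/bar.xml, with nothing in front of
     this-tree, has no leading slash for this pattern to match. -->
<map:match pattern="**/this-tree/**.xml">
  <!-- Resolves to .../content/xdocs/some-dir/this-tree/foo/bar.html -->
  <map:generate src="{project:content.xdocs}{1}/this-tree/{2}.html"/>
  <map:transform src="{project:resources.stylesheets}/myStripContent-to-document.xsl"/>
  <map:serialize type="xml-document"/>
</map:match>
```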

                  --oOo--

[1] Preparing a directory listing, say for a table
of contents page is another matter. For that you
would use more complex Cocoon sitemap operations.
See DirectoryGenerator which traverses the directory
tree and generates an xml fragment. Apply a Transformer
to that to turn it into forrest internal xml format.

You would need to follow Cocoon sitemap docs. Start at
http://forrest.apache.org/docs/project-sitemap.html
Understand sitemaps and then see:
http://cocoon.apache.org/2.1/userdocs/directory-generator.html
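
As a rough sketch of what such a listing pipeline might look like: the
request name this-tree/toc.xml and the stylesheet
directory-to-document.xsl are both hypothetical, and the fragment
assumes that a generator named "directory" (Cocoon's
org.apache.cocoon.generation.DirectoryGenerator) is declared in the
sitemap components.

```xml
<map:match pattern="this-tree/toc.xml">
  <!-- Walk the directory tree and emit an xml listing of it. -->
  <map:generate type="directory" src="{project:content.xdocs}this-tree">
    <!-- How many directory levels to descend. -->
    <map:parameter name="depth" value="3"/>
  </map:generate>
  <!-- Hypothetical stylesheet: turn the listing into a forrest document. -->
  <map:transform src="{project:resources.stylesheets}/directory-to-document.xsl"/>
  <map:serialize type="xml-document"/>
</map:match>
```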

We need to add an example to our seed-sample site.

> This How-To is
> very detailed and I've learnt a lot from it, but it'd be good to have
> 
> a. an example file of sitemap.xmap with the extra element included (I
> can't find the place that it's supposed to go...)
> 
> and
> 
>  b. an example xsl file.

The stylesheet to strip everything except "div class=content"
is a simple XSLT operation. Not appropriate for this list.
The "XSL FAQ" is a fantastic resource http://www.dpawson.co.uk/xsl/
and get Michael Kay's book.

-David