Posted to dev@forrest.apache.org by je...@apache.org on 2003/01/04 17:48:58 UTC

cvs commit: xml-forrest/src/resources/conf sitemap.xmap

jefft       2003/01/04 08:48:58

  Modified:    src/resources/conf sitemap.xmap
  Log:
  Don't let JTidy munge HTML files by default
  
  Revision  Changes    Path
  1.50      +5 -1      xml-forrest/src/resources/conf/sitemap.xmap
  
  Index: sitemap.xmap
  ===================================================================
  RCS file: /home/cvs/xml-forrest/src/resources/conf/sitemap.xmap,v
  retrieving revision 1.49
  retrieving revision 1.50
  diff -u -r1.49 -r1.50
  --- sitemap.xmap	28 Dec 2002 15:23:46 -0000	1.49
  +++ sitemap.xmap	4 Jan 2003 16:48:58 -0000	1.50
  @@ -370,9 +370,13 @@
       <map:act type="resource-exists">
        <map:parameter name="url" value="content/{1}"/>
   
  -      <map:match pattern="**.html">
  +     <map:match pattern="**.html">
  +       <!--
  +        Use these instead if you want JTidy to clean up your HTML
           <map:generate type="html" src="content/{1}.html"/>      
           <map:serialize type="html"/>
  +        -->
  +        <map:read src="content/{0}" mime-type="text/html"/>
         </map:match>
   
         <map:match pattern="**.xml">
  
  
  

Re: Crawling inadequate (Re: cvs commit: xml-forrest/src/resources/conf sitemap.xmap)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Jeff Turner wrote:
> On Sat, Jan 04, 2003 at 10:21:18PM +0100, Nicola Ken Barozzi wrote:
> 
>>I'd prefer that this be reverted, since it breaks link crawling in 
>>HTML files.
> 
> 
> Reverted..
> 
> 
>>An HTML file can be served in three ways:
>>
>> 1 - crawled and tidied
>> 2 - read as-is (and not crawled)
>> 3 - included in the page framing
>>
>>Actually these mix concerns, which in fact are:
>>
>> 1a - crawl
>> 2a - don't crawl
>>
>> 1b - passthrough as-is
>> 2b - tidy
>> 3b - include in the page framing
>>
>>
>>For the crawling, I actually don't know what is better to do, nor whether 
>>it should be a concern at all. Since all links work in the web version, 
>>the same should hold for the CLI version.
> 
> 
> I think we're between a rock and a hard place here.
> 
> Imagine if we have:
> 
> src/documentation/content/a.pdf
> src/documentation/content/b.pdf
> 
> Where a.pdf has an internal link to b.pdf.  That link is traversable for
> webapp users, but won't be copied by the CLI, unless we implement a PDF
> parser.
> 
> We have the same problem with images specified in CSS.
> 
> Conclusion: crawling is inadequate as a means of discovering the full URI
> space.  Users will always be adding new formats for which we don't have a
> parser.  Even known formats like PDF might be password-protected, and
> therefore unparseable.
> 
> The only solution I see is to simply copy
> src/documentation/content/{* - xdocs} across.

Since the rules are "if it's there verbatim, give it as-is" and "user and 
destination URI spaces should match as much as possible", this is correct 
and doable.

> It would immediately solve 3 problems:
> 
>  - Images in CSS would work
>  - HTML wouldn't be munged by JTidy
>  - Javadocs wouldn't be copied one by one

Yes. Crawling is stupid for non-Cocoon-generated resources. Having them 
re-generated by Cocoon just so they can be crawled is neither nice nor 
practical.

> Two ways this could be implemented:
> 
> 1) The Right Way
> 
>  In the CLI, 'invert' the sitemap, discover all non-XML (unparseable)
>  files in src/documentation/content/, and copy them across.
> 
> 2) The Quick Way
> 
>  Since we know that only content/xdocs contains parseable sources, simply
>  copy everything else across with Ant.  Can be implemented with 5 lines
>  in forrest.build.xml
> 
> Any other ways?

Go for 2) as a temporary solution.

Note that this has nothing to do with "site:", as I've tried to explain.

>>For the second set of concerns, it's IMHO something related to the content 
>>of the file, not the link.
>>That is, whether a file should be rendered with 1b, 2b, or 3b can be known 
>>from how the content is created. The hint can be put in the extension:
>>
>> 1b - .html
>> 2b - .xhtml
>> 3b - .ihtml
> 
> 
> Do you mean, use resource-exists to check whether each of these types
> exists, and if so, interpret its contents appropriately?
> 
> That still doesn't solve the problem where the user has non-well-formed
> HTML with a link that we need to traverse, which is an instance of the
> more general problem described above.

Correct, it's another problem: deciding how to show the HTML, whether 
verbatim, included in the page with header and sidebar, or cleaned up.

This info is metadata about the file, just as the DTD is for XML files. But 
since we don't have a metadata system, we can use the extension as 
poor-man's metadata.
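
A rough sketch of how such extension-based dispatch might look in
sitemap.xmap; only the **.html rule mirrors the matcher shown in the diff
above, the .xhtml and .ihtml rules are assumptions for illustration:

   <!-- 1b: .html is passed through verbatim -->
   <map:match pattern="**.html">
     <map:read src="content/{0}" mime-type="text/html"/>
   </map:match>

   <!-- 2b: .xhtml is cleaned up by JTidy -->
   <map:match pattern="**.xhtml">
     <map:generate type="html" src="content/{1}.xhtml"/>
     <map:serialize type="html"/>
   </map:match>

   <!-- 3b: .ihtml would be run through the html generator and then fed
        into the normal page-framing (skinning) pipeline; that wiring
        lives in Forrest's other pipelines and is omitted here. -->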

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Crawling inadequate (Re: cvs commit: xml-forrest/src/resources/conf sitemap.xmap)

Posted by Jeff Turner <je...@apache.org>.
On Sat, Jan 04, 2003 at 10:21:18PM +0100, Nicola Ken Barozzi wrote:
> 
> I'd prefer that this be reverted, since it breaks link crawling in 
> HTML files.

Reverted..

> An HTML file can be served in three ways:
> 
>  1 - crawled and tidied
>  2 - read as-is (and not crawled)
>  3 - included in the page framing
> 
> Actually these mix concerns, which in fact are:
> 
>  1a - crawl
>  2a - don't crawl
> 
>  1b - passthrough as-is
>  2b - tidy
>  3b - include in the page framing
> 
> 
> For the crawling, I actually don't know what is better to do, nor whether 
> it should be a concern at all. Since all links work in the web version, 
> the same should hold for the CLI version.

I think we're between a rock and a hard place here.

Imagine if we have:

src/documentation/content/a.pdf
src/documentation/content/b.pdf

Where a.pdf has an internal link to b.pdf.  That link is traversable for
webapp users, but won't be copied by the CLI, unless we implement a PDF
parser.

We have the same problem with images specified in CSS.

Conclusion: crawling is inadequate as a means of discovering the full URI
space.  Users will always be adding new formats for which we don't have a
parser.  Even known formats like PDF might be password-protected, and
therefore unparseable.

The only solution I see is to simply copy
src/documentation/content/{* - xdocs} across.

It would immediately solve 3 problems:

 - Images in CSS would work
 - HTML wouldn't be munged by JTidy
 - Javadocs wouldn't be copied one by one


Two ways this could be implemented:

1) The Right Way

 In the CLI, 'invert' the sitemap, discover all non-XML (unparseable)
 files in src/documentation/content/, and copy them across.

2) The Quick Way

 Since we know that only content/xdocs contains parseable sources, simply
 copy everything else across with Ant.  This can be done with about 5 lines
 in forrest.build.xml; a sketch follows below.
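
A minimal sketch of what those few lines in forrest.build.xml might look
like (the target name and property names here are assumptions, not the
actual build file):

   <!-- Hypothetical Ant target: copy everything under content/ except
        xdocs/ verbatim to the generated site, so non-parseable resources
        survive without being crawled. Property names are illustrative. -->
   <target name="copy-raw-content">
     <copy todir="${site.dir}">
       <fileset dir="${content.dir}">
         <exclude name="xdocs/**"/>
       </fileset>
     </copy>
   </target>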

Any other ways?

> For the second set of concerns, it's IMHO something related to the content 
> of the file, not the link.
> That is, whether a file should be rendered with 1b, 2b, or 3b can be known 
> from how the content is created. The hint can be put in the extension:
> 
>  1b - .html
>  2b - .xhtml
>  3b - .ihtml

Do you mean, use resource-exists to check whether each of these types
exists, and if so, interpret its contents appropriately?

That still doesn't solve the problem where the user has non-well-formed
HTML with a link that we need to traverse, which is an instance of the
more general problem described above.


--Jeff

Re: cvs commit: xml-forrest/src/resources/conf sitemap.xmap

Posted by Nicola Ken Barozzi <ni...@apache.org>.
I'd prefer that this be reverted, since it breaks link crawling in 
HTML files.

An HTML file can be served in three ways:

  1 - crawled and tidied
  2 - read as-is (and not crawled)
  3 - included in the page framing

Actually these mix concerns, which in fact are:

  1a - crawl
  2a - don't crawl

  1b - passthrough as-is
  2b - tidy
  3b - include in the page framing


For the crawling, I actually don't know what is better to do, nor whether 
it should be a concern at all. Since all links work in the web version, 
the same should hold for the CLI version.

For the second set of concerns, it's IMHO something related to the content 
of the file, not the link.
That is, whether a file should be rendered with 1b, 2b, or 3b can be known 
from how the content is created. The hint can be put in the extension:

  1b - .html
  2b - .xhtml
  3b - .ihtml


jefft@apache.org wrote:
> jefft       2003/01/04 08:48:58
> 
>   Modified:    src/resources/conf sitemap.xmap
>   Log:
>   Don't let JTidy munge HTML files by default
>   
>   Revision  Changes    Path
>   1.50      +5 -1      xml-forrest/src/resources/conf/sitemap.xmap
>   
>   Index: sitemap.xmap
>   ===================================================================
>   RCS file: /home/cvs/xml-forrest/src/resources/conf/sitemap.xmap,v
>   retrieving revision 1.49
>   retrieving revision 1.50
>   diff -u -r1.49 -r1.50
>   --- sitemap.xmap	28 Dec 2002 15:23:46 -0000	1.49
>   +++ sitemap.xmap	4 Jan 2003 16:48:58 -0000	1.50
>   @@ -370,9 +370,13 @@
>        <map:act type="resource-exists">
>         <map:parameter name="url" value="content/{1}"/>
>    
>   -      <map:match pattern="**.html">
>   +     <map:match pattern="**.html">
>   +       <!--
>   +        Use these instead if you want JTidy to clean up your HTML
>            <map:generate type="html" src="content/{1}.html"/>      
>            <map:serialize type="html"/>
>   +        -->
>   +        <map:read src="content/{0}" mime-type="text/html"/>
>          </map:match>
>    
>          <map:match pattern="**.xml">
>   


-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)