Posted to dev@forrest.apache.org by Nicola Ken Barozzi <ni...@apache.org> on 2002/12/07 19:31:05 UTC

Krysalis skin CSS images are not being crawled

I know, I know, we all know ;-)

It's just to tell you all that to see the real results of the skin 
output you will have to copy the skin images manually.

This problem will eventually go away, this is just a notice.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Krysalis skin CSS images are not being crawled

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Marc Portier wrote:
> 
>     A barebones parser, sure.  An entire browser?  To grab CSS elements,
>     what more is needed besides the identification of the following:
> 
>      @import
>      background:
>      background-image:
>      "//" and "/* */"
> 
> 
> I could add one from the 'as-designed' outerthought-site skin:
> <style>
>   li {
>     list-style-image: url(art/bullet_arrow_list.gif)
>   }
> </style>
> 
> 
> 
> the remark on the 'entire browser' pretty much comes from thinking about 
> user defined skins that would exploit image roll-overs in javascript and 
> the like (for which there is rhino, of course)
> 
> so 'entire browser' is more like the 'most-general-and-complete' wording 
> for anything people would like to see happen through the design of their 
> HTML, css, js...
> 
> but I take your argument: it should be narrowed down to only that subset 
> which triggers another HTTP-request from the browser, so we can just add 
> that to the list of links to crawl?
> 
> I guess starting from a 'java browser implementation (without rendering) 
> that allows for some sort of CrawlerListener' would be preferred over 
> assembling that very thing with Sac, rhino,...

The point is that the current crawling mechanism looks for URLs in defined 
attributes in the SAX stream. Everything that is collected by the 
crawler has to be part of the SAX stream.

Now, imagine that we can plug in rules that are triggered on certain 
conditions, like an element name or an attribute name. For example, if 
a <style> element is encountered, its contents are sent to a StyleRule 
that returns all the links in there.

This would take care of all the issues, and be pluggable.
I'm starting to look into the Ant stuff in the Cocoon scratchpad, and it 
looks promising; it could replace part of the current crawler.
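
To make the rule idea concrete, here is a minimal sketch of a link 
collector that checks a set of attributes and also hands the text of 
selected elements to pluggable rules. It is written in Python rather 
than Cocoon's Java purely for brevity. The names RuleDrivenCollector, 
style_rule and collect_links are hypothetical and not part of Cocoon, 
and the url(...) pattern is a simplification, not a full CSS parser.

```python
import re
import xml.sax

# Simplified pattern for url(...) targets; an assumption, not a CSS grammar.
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)")


def style_rule(text):
    """Hypothetical StyleRule: return every url(...) target found in text."""
    return URL_RE.findall(text)


class RuleDrivenCollector(xml.sax.ContentHandler):
    """Collects links from chosen attributes and from rule-handled elements."""

    def __init__(self, attribute_names, element_rules):
        super().__init__()
        self.attribute_names = set(attribute_names)
        self.element_rules = element_rules  # element name -> rule function
        self.links = []
        self._active = None  # element whose text is currently being buffered
        self._buffer = []

    def startElement(self, name, attrs):
        # Existing behaviour: look for URLs in the defined attributes.
        for attr in attrs.getNames():
            if attr in self.attribute_names:
                self.links.append(attrs.getValue(attr))
        # New behaviour: start buffering text when a rule matches the element.
        if self._active is None and name in self.element_rules:
            self._active = name
            self._buffer = []

    def characters(self, content):
        if self._active is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._active:
            rule = self.element_rules[name]
            self.links.extend(rule("".join(self._buffer)))
            self._active = None


def collect_links(xml_text):
    """Parse a document and return the links found by attributes and rules."""
    handler = RuleDrivenCollector(["href", "src"], {"style": style_rule})
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.links
```

Adding support for another construct would then mean registering one 
more entry in the element_rules mapping, without touching the collector.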

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Krysalis skin CSS images are not being crawled

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Steven Noels wrote:
> As far as JS 
> is concerned... oh well.

Oh well. People tend to do unnecessarily complex things, like generating
dynamic URLs with session ids stuffed therein. I don't think there is
a way to solve the problem of crawling URLs hidden in arbitrary JS in
general.

What about another file for "additional URLs/directories to crawl"?
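
As a rough illustration of that idea, the sketch below reads such a 
file and merges its entries into the list of links the crawler found 
on its own. The file format (one entry per line, lines starting with 
"#" treated as comments) and the function names load_extra_urls and 
merge_crawl_targets are assumptions made for this example only; 
nothing like this is defined by Forrest or Cocoon in this thread.

```python
def load_extra_urls(text):
    """Parse a hypothetical extra-URLs file: one entry per line, '#' comments."""
    entries = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comment lines
        if entry not in entries:
            entries.append(entry)
    return entries


def merge_crawl_targets(discovered, extra):
    """Append extra entries to the discovered links, dropping duplicates."""
    # dict.fromkeys keeps first-seen order while removing repeats.
    return list(dict.fromkeys(list(discovered) + list(extra)))
```

Used together, merge_crawl_targets(links_from_crawler, 
load_extra_urls(file_contents)) yields the full list to fetch.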

J.Pietschmann

Re: Krysalis skin CSS images are not being crawled

Posted by Steven Noels <st...@outerthought.org>.

Marc Portier wrote:

>> A barebones parser, sure.  An entire browser?  To grab CSS elements, 
>> what more is needed besides the identification of the following:
>>
>>  @import
>>  background:
>>  background-image:
>>  "//" and "/* */"
>>
> 
> I could add one from the 'as-designed' outerthought-site skin:
> <style>
>   li {
>     list-style-image: url(art/bullet_arrow_list.gif)
>   }
> </style>

That SAC thing looks interesting, and maybe it can be packaged to emit 
SAX events which are then picked up by the LinkSerializer. As far as JS 
is concerned... oh well. I know _we are able_ to do the SAX-packaging of 
SAC, I don't know about the browser thing.
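
A minimal sketch of what that SAX packaging could look like, shown in 
Python for brevity rather than with the Java SAC and Cocoon APIs. It 
assumes the CSS references have already been extracted by some parser 
and only covers replaying them as SAX events. The element names "css" 
and "link", the function emit_css_links and the class HrefCollector 
(a stand-in for a link-gathering consumer) are all hypothetical.

```python
import xml.sax
from xml.sax.xmlreader import AttributesImpl


def emit_css_links(references, handler):
    """Replay CSS references as SAX events: one <link href="..."> each."""
    handler.startDocument()
    handler.startElement("css", AttributesImpl({}))
    for ref in references:
        handler.startElement("link", AttributesImpl({"href": ref}))
        handler.endElement("link")
    handler.endElement("css")
    handler.endDocument()


class HrefCollector(xml.sax.ContentHandler):
    """Stand-in consumer: records every href attribute it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        if "href" in attrs:
            self.hrefs.append(attrs["href"])
```

Because the consumer only sees ordinary SAX events, it does not need 
to know that the links originally came from a stylesheet.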

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org

Re: Krysalis skin CSS images are not being crawled

Posted by Marc Portier <mp...@outerthought.org>.

> 
> A barebones parser, sure.  An entire browser?  To grab CSS elements, 
> what more is needed besides the identification of the following:
> 
>  @import
>  background:
>  background-image:
>  "//" and "/* */"
> 

I could add one from the 'as-designed' outerthought-site skin:
<style>
   li {
     list-style-image: url(art/bullet_arrow_list.gif)
   }
</style>



the remark on the 'entire browser' pretty much comes from 
thinking about user defined skins that would exploit image 
roll-overs in javascript and the like (for which there is rhino, 
of course)

so 'entire browser' is more like the 'most-general-and-complete' 
wording for anything people would like to see happen through the 
design of their HTML, css, js...

but I take your argument: it should be narrowed down to only that 
subset which triggers another HTTP-request from the browser, so 
we can just add that to the list of links to crawl?

I guess starting from a 'java browser implementation (without 
rendering) that allows for some sort of CrawlerListener' would be 
preferred over assembling that very thing with Sac, rhino,...

regards,
-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
mpo@outerthought.org                              mpo@apache.org

Re: Krysalis skin CSS images are not being crawled

Posted by Miles Elam <mi...@geekspeak.org>.

Steven Noels wrote:

> Hm... I have been discussing this in the office already several times, 
> and most of the time we come up with the conclusion that doing it well 
> would require implementing an entire browser without a rendering 
> subsystem :(

A barebones parser, sure.  An entire browser?  To grab CSS elements, 
what more is needed besides the identification of the following:

  @import
  background:
  background-image:
  "//" and "/* */"

The important information is images and CSS files, no?  Everything else 
is noise to the parser.  Who cares if the parser misreads "color: red;"? 
Comments could be important for skipping items, but even this is 
optional.  What's the worst thing that happens?  If an element is 
commented out, the image it specifies is copied anyway.  Isn't it 
possible to make the assumption that the CSS is correct -- something a 
browser cannot do -- since we make the assumption that someone previewed 
the CSS in a browser before submitting for generation?

Unlike a browser, it need not know what type of resource it is, its MIME 
type, where on the page it goes, how big it is, etc.  Isn't "an entire 
browser without a rendering subsystem" overstating things just a tad?

- Miles

Re: Krysalis skin CSS images are not being crawled

Posted by Nicola Ken Barozzi <ni...@apache.org>.


J.Pietschmann wrote:
> Steven Noels wrote:
> 
>> The problem is that it would require us to embed a CSS parser inside 
>> the Cocoon crawler - 
> 
> 
> There are a few ready to use, see for example
>  http://www.w3.org/Style/CSS/SAC/

Thanks :-)

BTW, the Batik one is listed there too, which makes me think that it's a 
good solution.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Krysalis skin CSS images are not being crawled

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Steven Noels wrote:
> The 
> problem is that it would require us to embed a CSS parser inside the 
> Cocoon crawler - 

There are a few ready to use, see for example
  http://www.w3.org/Style/CSS/SAC/

J.Pietschmann

Re: Krysalis skin CSS images are not being crawled

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Steven Noels wrote:
> Nicola Ken Barozzi wrote:
> 
>> This problem will eventually go away, this is just a notice.
> 
> 
> It's one of those things which worries me. I don't think it will ever go 
> away, if we don't start thinking about how we could tackle this. The 
> problem is that it would require us to embed a CSS parser inside the 
> Cocoon crawler - 
> http://xml.apache.org/batik/javadoc/org/apache/batik/css/parser/package-summary.html 

Yes, in the current crawling system, all links must pass through the SAX stream.

So instead of

    <map:match pattern="**.css">
     <map:read src="resources/css/{1}.css" mime-type="text/css"/>
    </map:match>

we should do

    <map:match pattern="**.css">
      <map:generate type="batikcss" src="resources/css/{1}.css"/>
      <map:serialize type="css"/>
    </map:match>

And the problem should be resolved.

> Hm... I have been discussing this in the office already several times, 
> and most of the time we come up with the conclusion that doing it well 
> would require implementing an entire browser without a rendering 
> subsystem :(
> 
> </Steven>

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

Re: Krysalis skin CSS images are not being crawled

Posted by Steven Noels <st...@outerthought.org>.

Nicola Ken Barozzi wrote:

> This problem will eventually go away, this is just a notice.

It's one of those things which worries me. I don't think it will ever go 
away, if we don't start thinking about how we could tackle this. The 
problem is that it would require us to embed a CSS parser inside the 
Cocoon crawler - 
http://xml.apache.org/batik/javadoc/org/apache/batik/css/parser/package-summary.html

Hm... I have been discussing this in the office already several times, 
and most of the time we come up with the conclusion that doing it well 
would require implementing an entire browser without a rendering 
subsystem :(

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org