Posted to user@nutch.apache.org by Iain <ia...@idcl.co.uk> on 2006/08/10 10:39:39 UTC

Crawling flash

I want to include embedded flash in my crawls.

Despite (apparently successfully) including the parse-swf plugin, embedded
flash does not seem to be retrieved.  I’m assuming that the object tags are
not being parsed to find the .swf files.

Can anyone comment?

Thanks

Iain


RE: Crawling flash

Posted by Iain <ia...@idcl.co.uk>.
Thank you!

Iain
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: 14 August 2006 17:31
To: nutch-user@lucene.apache.org
Cc: iain@idcl.co.uk
Subject: Re: Crawling flash

Iain wrote:
> I don't suppose anyone on the list has ever managed to include a flash
> object in a crawl?
>
> There's a number of sites I need to crawl which use flash for navigation
> (and have HTML content.  Go figure!).
>
>   
>
> I want to include embedded flash in my crawls.
>
> Despite (apparently successfully) including the parse-swf plugin, embedded
> flash does not seem to be retrieved.  I’m assuming that the object tags are
> not being parsed to find the .swf  files.
>
> Can anyone comment?
>   

I can ;)

You will need to add some code to DOMContentUtils. Currently it skips 
<object> and <embed> tags, so that outlinks leading to the Flash content 
are never collected.

Instead, when the code encounters an <object> tag it should descend into 
<param> children, pick the one with <param name="src" 
value="myFlash.swf">, extract the value and make a new Outlink.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Crawling flash

Posted by Andrzej Bialecki <ab...@getopt.org>.
Iain wrote:
> I don't suppose anyone on the list has ever managed to include a flash
> object in a crawl?
>
> There's a number of sites I need to crawl which use flash for navigation
> (and have HTML content.  Go figure!).
>
>   
>
> I want to include embedded flash in my crawls.
>
> Despite (apparently successfully) including the parse-swf plugin, embedded
> flash does not seem to be retrieved.  I’m assuming that the object tags are
> not being parsed to find the .swf  files.
>
> Can anyone comment?
>   

I can ;)

You will need to add some code to DOMContentUtils. Currently it skips 
<object> and <embed> tags, so that outlinks leading to the Flash content 
are never collected.

Instead, when the code encounters an <object> tag it should descend into 
<param> children, pick the one with <param name="src" 
value="myFlash.swf">, extract the value and make a new Outlink.
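
The steps above can be sketched roughly as follows. This is a minimal
illustration and not the actual DOMContentUtils code: the class and method
names (FlashLinkSketch, collect, findFlashUrls) are hypothetical, it uses only
the standard org.w3c.dom API, and it returns plain URL strings where Nutch
would build an Outlink from each one. Two things go beyond the description
above and are assumptions: it also accepts <param name="movie">, which many
pages use for Flash, and it reads the src attribute of <embed> tags.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Nutch): collects Flash URLs from a DOM tree. */
class FlashLinkSketch {

    /** Walks the tree under node and appends resolved Flash URLs to out. */
    static void collect(Node node, URI base, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getNodeName();
            if ("object".equalsIgnoreCase(tag)) {
                // Descend into the <param> children of <object>.
                NodeList kids = el.getChildNodes();
                for (int i = 0; i < kids.getLength(); i++) {
                    Node kid = kids.item(i);
                    if (kid.getNodeType() != Node.ELEMENT_NODE
                            || !"param".equalsIgnoreCase(kid.getNodeName())) {
                        continue;
                    }
                    Element param = (Element) kid;
                    String name = param.getAttribute("name");
                    // "src" is from the description; "movie" is an added assumption.
                    if ("src".equalsIgnoreCase(name) || "movie".equalsIgnoreCase(name)) {
                        add(base, param.getAttribute("value"), out);
                    }
                }
            } else if ("embed".equalsIgnoreCase(tag)) {
                // Assumption: <embed> carries the Flash URL in its src attribute.
                add(base, el.getAttribute("src"), out);
            }
        }
        // Recurse so that nested <object> and <embed> tags are also visited.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), base, out);
        }
    }

    /** Resolves value against base and records it once; bad URLs are skipped. */
    private static void add(URI base, String value, List<String> out) {
        if (value == null || value.trim().isEmpty()) {
            return;
        }
        try {
            String url = base.resolve(value.trim()).toString();
            if (!out.contains(url)) {
                out.add(url); // in Nutch, an Outlink would be created here
            }
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI reference: ignore it.
        }
    }

    /** Convenience wrapper for well-formed XHTML input, for trying the sketch out. */
    static List<String> findFlashUrls(String xhtml, String baseUrl) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc, URI.create(baseUrl), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }
}
```

Tag names are compared case-insensitively because an HTML parser may report
them in upper case. Inside Nutch the DOM would come from the HTML parser
rather than from findFlashUrls, so only the collect logic would carry over.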

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Crawling flash

Posted by Iain <ia...@idcl.co.uk>.
I don't suppose anyone on the list has ever managed to include a flash
object in a crawl?

There's a number of sites I need to crawl which use flash for navigation
(and have HTML content.  Go figure!).

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I want to include embedded flash in my crawls.

Despite (apparently successfully) including the parse-swf plugin, embedded
flash does not seem to be retrieved.  I’m assuming that the object tags are
not being parsed to find the .swf  files.

Can anyone comment?

Thanks

Iain