Posted to user@nutch.apache.org by Iain <ia...@idcl.co.uk> on 2006/08/10 10:39:39 UTC
Crawling flash
I want to include embedded flash in my crawls.
Despite (apparently successfully) including the parse-swf plugin, embedded
flash does not seem to be retrieved. I'm assuming that the object tags are
not being parsed to find the .swf files.
Can anyone comment?
Thanks
Iain
RE: Crawling flash
Posted by Iain <ia...@idcl.co.uk>.
Thank you!
Iain
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: 14 August 2006 17:31
To: nutch-user@lucene.apache.org
Cc: iain@idcl.co.uk
Subject: Re: Crawling flash
Re: Crawling flash
Posted by Andrzej Bialecki <ab...@getopt.org>.
Iain wrote:
> I don't suppose anyone on the list has ever managed to include a flash
> object in a crawl?
>
> There's a number of sites I need to crawl which use flash for navigation
> (and have HTML content. Go figure!).
>
>
>
> I want to include embedded flash in my crawls.
>
> Despite (apparently successfully) including the parse-swf plugin, embedded
> flash does not seem to be retrieved. I’m assuming that the object tags are
> not being parsed to find the .swf files.
>
> Can anyone comment?
>
I can ;)
You will need to add some code to DOMContentUtils. Currently it skips
<object> and <embed> tags, so that outlinks leading to the Flash content
are never collected.
Instead, when the code encounters an <object> tag it should descend into
<param> children, pick the one with <param name="src"
value="myFlash.swf">, extract the value and make a new Outlink.
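A rough sketch of that traversal, for illustration only. It is not the actual DOMContentUtils code: the class name FlashLinkSketch and its methods are hypothetical, it uses the JDK's XML parser on a well-formed snippet rather than Nutch's HTML parser, and it returns plain URL strings instead of Outlink objects. Accepting name="movie" as well as name="src", and reading the src attribute of <embed>, are added assumptions about common Flash markup.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class FlashLinkSketch {

  /** Walks the DOM and collects URLs referenced by <object>/<param> and <embed>. */
  static void collect(URI base, Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String tag = el.getNodeName();
      if ("object".equalsIgnoreCase(tag)) {
        // Descend into the direct <param> children of <object>.
        NodeList kids = el.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
          Node kid = kids.item(i);
          if (kid.getNodeType() == Node.ELEMENT_NODE
              && "param".equalsIgnoreCase(kid.getNodeName())) {
            Element param = (Element) kid;
            String name = param.getAttribute("name");
            // "src" comes from the thread; "movie" is an added assumption.
            if ("src".equalsIgnoreCase(name) || "movie".equalsIgnoreCase(name)) {
              add(base, param.getAttribute("value"), out);
            }
          }
        }
      } else if ("embed".equalsIgnoreCase(tag)) {
        // Assumption: <embed> carries the Flash URL in its src attribute.
        add(base, el.getAttribute("src"), out);
      }
    }
    // Keep walking so nested <object> and <embed> elements are also visited.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(base, children.item(i), out);
    }
  }

  /** Resolves target against base and records it once; bad URLs are skipped. */
  static void add(URI base, String target, List<String> out) {
    if (target == null || target.isEmpty()) {
      return;
    }
    try {
      String resolved = base.resolve(target).toString();
      if (!out.contains(resolved)) {
        out.add(resolved);
      }
    } catch (IllegalArgumentException e) {
      // Malformed link target: ignore it rather than fail the whole page.
    }
  }

  /** Parses a well-formed (XHTML-style) page and returns the Flash URLs found. */
  static List<String> extract(String xhtml, String baseUrl) throws Exception {
    Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    List<String> out = new ArrayList<>();
    collect(URI.create(baseUrl), root, out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body><object width=\"400\" height=\"300\">"
        + "<param name=\"src\" value=\"myFlash.swf\"/>"
        + "<embed src=\"nav/menu.swf\"/>"
        + "</object></body></html>";
    for (String link : extract(page, "http://example.com/site/index.html")) {
      System.out.println(link);
    }
  }
}
```

Inside Nutch the same walk would sit in the outlink-collecting code of DOMContentUtils, with each resolved URL wrapped in an Outlink instead of being added to a list of strings. The exact Outlink constructor depends on the Nutch version, so check it against your source tree.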
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Crawling flash
Posted by Iain <ia...@idcl.co.uk>.
I don't suppose anyone on the list has ever managed to include a flash
object in a crawl?
There's a number of sites I need to crawl which use flash for navigation
(and have HTML content. Go figure!).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I want to include embedded flash in my crawls.
Despite (apparently successfully) including the parse-swf plugin, embedded
flash does not seem to be retrieved. I'm assuming that the object tags are
not being parsed to find the .swf files.
Can anyone comment?
Thanks
Iain