Posted to dev@nutch.apache.org by Tizy Ninan <ti...@gmail.com> on 2015/05/08 10:57:07 UTC

Crawl sites containing videos

Hi,

Is it possible to crawl the videos (e.g. YouTube videos) embedded in a
website? If so, what changes need to be made to enable video crawling? Will
it be similar to crawling images?

Kindly provide insights on this. Thanks in advance.

Thanks and Regards,
Tizy

Re: Crawl sites containing videos

Posted by Jeff Cocking <je...@gmail.com>.
With Nutch all things are possible......

To crawl/index videos on the general web, I would perform the following
steps.

1. Determine which video hosting services you want to index (e.g. YouTube,
Vimeo, GodTube, etc.)
2. Review each hosting service for its standard embedding approaches (e.g.
iframe)
3. Review popular WordPress, VB, phpBB plugins for their standard embedding
approaches. (This will provide good coverage of the general web.)
4. Determine the common video embedding patterns used on the web (iframe tag,
src=video URL).
5. Build a video parser/indexing tool. The tool will need to handle iframe
tags and src attributes. The HTML parser and Tika parser have fairly
straightforward code to build a new parser from.
6. Test like crazy with predefined web pages/test pages.
7. Share the video parser/index patch on the Nutch Jira or GitHub so others
can review, assist, and support open source software.
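
As a rough illustration of steps 4 and 5, the iframe/src extraction could
look something like the sketch below. It is written in Python only to show
the logic; an actual Nutch parser plugin would be written in Java. The host
list and the names EmbeddedVideoFinder and find_embedded_videos are
hypothetical and not part of Nutch.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical host list for illustration; extend it with the services from step 1.
VIDEO_HOSTS = {"www.youtube.com", "youtube.com", "player.vimeo.com"}


class EmbeddedVideoFinder(HTMLParser):
    """Collects src URLs of iframe tags that point at known video hosts."""

    def __init__(self):
        super().__init__()
        self.video_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Protocol-relative URLs (//host/path) are common in embed snippets.
        if src.startswith("//"):
            src = "https:" + src
        if urlparse(src).hostname in VIDEO_HOSTS:
            self.video_urls.append(src)


def find_embedded_videos(html):
    """Return the iframe src URLs in html that belong to a known video host."""
    finder = EmbeddedVideoFinder()
    finder.feed(html)
    return finder.video_urls
```

Iframes that point at other hosts (ads, widgets) are skipped, so the host
list is what decides the coverage.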

If you are planning on crawling the video hosting services directly, then
you will need to determine the HTML structure of each video hosting service
and build a parser with that in mind.

Hope this helps.....

jeff

Re: Crawl sites containing videos

Posted by Tizy Ninan <ti...@gmail.com>.
Hi,

Thank you both for the information. I will try it out with the information
provided.

Thanks and Regards,
Tizy


Re: Crawl sites containing videos

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
Hi Tizy,

Actually the modifications should be very similar. First you'll need a way of extracting the video URLs from the HTML. If the video is embedded using HTML5 it should be very straightforward; otherwise it gets a little more difficult, because you'll need to convert the URL that you put in your HTML page to an actual video URL that can be downloaded by Nutch. This conversion should be developed for each popular video sharing platform that you're targeting.

Depending on how much data you want to extract about the video and how much metadata the video has, you'll need more or fewer tweaks. The overall conclusion is that you'll need to do some work, but it's definitely possible.
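
To make the two cases concrete, here is a minimal sketch in Python (only to show the logic; a Nutch plugin itself would be Java). The first part collects src URLs from HTML5 video/source tags. The second maps an embed URL to the service's watch-page URL, which is just one possible normalization; resolving the actual media file is service-specific and not shown. The URL patterns and all names here (EMBED_RULES, embed_to_watch_url, Html5VideoFinder, html5_video_sources) are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Embed-URL patterns and the watch-page form each maps to.
# These reflect the commonly seen public URL layouts (an assumption here).
EMBED_RULES = [
    (re.compile(r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)"),
     "https://www.youtube.com/watch?v={}"),
    (re.compile(r"^https?://player\.vimeo\.com/video/(\d+)"),
     "https://vimeo.com/{}"),
]


def embed_to_watch_url(embed_url):
    """Return the watch-page URL for a known embed URL, or None if unknown."""
    for pattern, template in EMBED_RULES:
        match = pattern.match(embed_url)
        if match:
            return template.format(match.group(1))
    return None


class Html5VideoFinder(HTMLParser):
    """Collects src URLs from <video> tags and the <source> tags inside them."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_video = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.in_video = True
        if tag == "video" or (tag == "source" and self.in_video):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.sources.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "video":
            self.in_video = False


def html5_video_sources(html, base_url):
    """Return absolute URLs of HTML5 video sources found in html."""
    finder = Html5VideoFinder(base_url)
    finder.feed(html)
    return finder.sources
```

Each additional platform would need its own rule, which matches the per-platform work described above.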

Regards,
