Posted to dev@nutch.apache.org by Charan Shampur <ch...@gmail.com> on 2015/09/23 22:02:44 UTC

Nutch datasets : How to ??

Hello team,

I am new to working with Nutch.

I have a task of extracting the different image MIME subtypes and image
URLs by crawling through a list of URLs. My approach for this task is as
follows:

a) For image URLs:

After crawling with Nutch, use the nutchpy sequence reader to read from the
segments dataset (/segments/content/data), then write a Python script to
extract the image URLs embedded within various tags.

Is my approach correct, or is there a better way of doing it? I could
not find any other dataset created by Nutch that contains this
information.
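
For illustration, the extraction step in (a) could look roughly like the
sketch below. It assumes the HTML of each page has already been read out of
the segment as a string (the nutchpy reading step is not shown here). The
names ImageURLParser and extract_image_urls, and the list of file
extensions, are hypothetical and only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLParser(HTMLParser):
    """Collect image URLs from <img src> and from <a>/<link> hrefs
    that end in a known image extension (assumed list, adjust as needed)."""

    IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".svg", ".webp")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag in ("a", "link") and attrs.get("href"):
            href = attrs["href"]
            # Ignore the query string when checking the extension.
            if href.lower().split("?")[0].endswith(self.IMAGE_EXTENSIONS):
                self.image_urls.append(urljoin(self.base_url, href))


def extract_image_urls(html, base_url):
    """Return all image URLs found in one HTML document, in page order."""
    parser = ImageURLParser(base_url)
    parser.feed(html)
    return parser.image_urls
```

Calling extract_image_urls(page_html, page_url) once per crawled record
would give the image URLs for that page.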

b) For MIME types:

In the same segment dataset, search for the type and collect all subtypes
under the image type.

Is this the correct way of doing it? Is there a better approach than this?
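
As a rough sketch of step (b): assuming the content type of each record is
already available as a string such as "image/png" or
"text/html; charset=UTF-8" (how it is read from the segment is not shown),
the image subtypes could be tallied as below. The function name
image_subtypes is hypothetical.

```python
from collections import Counter


def image_subtypes(content_types):
    """Count subtypes of the 'image' MIME type in an iterable of
    content-type strings. Parameters after ';' are ignored."""
    counts = Counter()
    for value in content_types:
        if not value:
            continue  # skip records with no content type
        mime = value.split(";", 1)[0].strip().lower()
        main_type, _, subtype = mime.partition("/")
        if main_type == "image" and subtype:
            counts[subtype] += 1
    return counts
```

The keys of the returned counter are the distinct image subtypes seen in
the crawl, and the values show how often each one occurred.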

It would be really helpful if I could get some pointers to resources for
solving the above task.


Thanks,

Charan