Posted to dev@nutch.apache.org by Charan Shampur <ch...@gmail.com> on 2015/09/23 22:02:44 UTC
Nutch datasets : How to ??
Hello team,
I am new to working with Nutch.
My task is to extract the different image MIME subtypes and the image
URLs by crawling a list of URLs. My approach for this task is as
follows:
a) For image URLs:
After crawling with Nutch, use the nutchpy sequence reader to read the
segments dataset (/segments/content/data) and write a Python script to
extract the image URLs embedded within the various tags.
Is my approach correct, or is there a better way of doing it? I could
not find any other dataset created by Nutch that contains this
information.
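A minimal sketch of the extraction step in (a), assuming the page URL and its HTML have already been read out of the segment as Python strings (how nutchpy returns each record is not shown here). The names ImageURLParser and extract_image_urls are hypothetical, and only <img src=...> is handled; other tags that can carry images would need extra cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLParser(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the URL of the page.
                self.image_urls.append(urljoin(self.base_url, value))


def extract_image_urls(page_url, html):
    """Return absolute image URLs found in one page's HTML."""
    parser = ImageURLParser(page_url)
    parser.feed(html)
    return parser.image_urls
```

Calling extract_image_urls(page_url, html) once per crawled page and merging the results into a set would give the distinct image URLs for the whole crawl.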
b) For MIME types:
In the same segment dataset, search for the type and collect all subtypes
under the image type.
Is this the correct way of doing it? Is there a better approach than this?
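A minimal sketch of the counting step in (b), assuming the content-type string of each crawled record (for example "image/png") has already been pulled out of the segment into a Python iterable. The function name image_subtypes is hypothetical, and the way the type is stored in each record is an assumption.

```python
from collections import Counter


def image_subtypes(content_types):
    """Count subtypes of the image/* family in an iterable of type strings."""
    counts = Counter()
    for raw in content_types:
        if not raw:
            continue  # skip records with no recorded type
        # Drop parameters such as "; charset=..." and normalise case.
        mime = raw.split(";", 1)[0].strip().lower()
        main, _, sub = mime.partition("/")
        if main == "image" and sub:
            counts[sub] += 1
    return counts
```

The keys of the returned Counter are the distinct image subtypes seen in the crawl, and the values show how often each one occurred.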
It would be really helpful if I could get some pointers to resources for
solving the above task.
Thanks,
Charan