Posted to dev@nutch.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/09/21 04:56:52 UTC

Re: Questions regarding CS-572 assignment 1

Hi Charan,

Thanks for your questions. Please copy your emails to
dev@nutch.apache.org and subscribe there, as I believe you
will find more help on that list.

Here are the answers:

-----Original Message-----

From: Charan Shampur <sh...@usc.edu>
Date: Sunday, September 20, 2015 at 3:55 PM
To: jpluser <ma...@usc.edu>
Subject: Questions regarding CS-572 assignment 1

>Hello professor,
>
>
>Sorry to interrupt you. I have had a few questions on my mind for the
>last 2 days.
>Here they are:
>
>
>1) I was unable to find any guidelines for using nutchpy to extract data
>from the crawldb. Can you provide me with some pointers to resources that
>will help?
>

The README.md on nutchpy explains how to use it to read Sequence Files:

https://github.com/ContinuumIO/nutchpy/#running


Then, if you look up the Nutch Sequence File format:

http://wiki.apache.org/nutch/NutchFileFormats


You should be good.
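
As a rough starting point, here is a minimal sketch of reading the first few
records from a crawldb data file. It assumes nutchpy exposes
sequence_reader.head(n, path) as shown in its README, so please check the
README for the exact calls. The path and the peek_crawldb helper are
hypothetical names used only for illustration.

```python
def peek_crawldb(data_path, n=10, reader=None):
    """Return the first n records of a Nutch sequence file."""
    if reader is None:
        # Imported lazily: nutchpy needs a JVM and is only required
        # when reading a real crawl.
        import nutchpy
        # Assumption: sequence_reader.head(n, path) as in the nutchpy README.
        reader = nutchpy.sequence_reader
    return reader.head(n, data_path)


if __name__ == "__main__":
    # Hypothetical path; point this at the data file in your own crawl directory.
    path = "crawl/crawldb/current/part-00000/data"
    for record in peek_crawldb(path, n=5):
        print(record)
```

The optional reader argument is only there so the function can be tried out
with a stand-in object before nutchpy is installed.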

>
>2) How do we read or understand the data extracted by Nutch? I was able
>to collect the list of URLs that were crawled by running the readdb
>command.
>For the other data, how do we do it?

You read the data out of the Nutch DB using NutchPy. The readdb command is
a great tool (there are also tools to read the LinkDB), but you need to
write a program using NutchPy.

>
>
>3) Is there any API or command that interacts with the Nutch crawldb to get
>statistical data (MIME type, HTTP response, unfetched URLs, etc.)?

Yep, the data is stored in the Nutch data file formats linked above.
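
For quick status counts, bin/nutch readdb <crawldb> -stats prints a summary.
To compute them in your own program, something like the sketch below could
work. It assumes each record arrives as a (url, datum_text) pair and that the
datum text contains a line such as "Status: 2 (db_fetched)"; that layout is an
assumption, so verify it against what your reader actually returns. The sample
records are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of a CrawlDatum printed as text, e.g. "Status: 2 (db_fetched)".
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)", re.MULTILINE)


def tally_statuses(records):
    """Count status names over an iterable of (url, datum_text) pairs."""
    counts = Counter()
    for _url, datum in records:
        match = STATUS_RE.search(str(datum))
        # Records without a recognizable status line are counted as "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Made-up sample records standing in for real crawldb entries.
sample = [
    ("http://example.com/", "Version: 7\nStatus: 2 (db_fetched)\nScore: 1.0\n"),
    ("http://example.com/a", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.5\n"),
    ("http://example.com/b", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.5\n"),
]

if __name__ == "__main__":
    for name, count in tally_statuses(sample).most_common():
        print(name, count)
```

The same pattern (one regular expression per field, one Counter per
statistic) should extend to other fields once you have confirmed how they
appear in your records.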

>
>
>I have been reading through the nutch/wiki and was unable to figure it
>out.
>
>
>
>Professor, kindly help me in resolving these.
>
>
>Thanks,
>Charan

HTH.

Cheers,
Chris

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: mattmann@usc.edu
WWW: http://sunset.usc.edu/
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++