Posted to user@nutch.apache.org by sanjay singh <cj...@gmail.com> on 2015/10/02 08:22:14 UTC

Apache Nutch Output structure

Hi,
I am trying to crawl a set of websites using Apache Nutch. I
configured Nutch with the required parameters. After crawling, I got several
segments as output, which I merged into one segment.
However, I still cannot make sense of the file structure of the output or
the meaning of each part.
The merged segment contains the following directories:
content
crawl_fetch
crawl_generate
crawl_parse
parse_data
parse_text

Can someone please explain the significance of these directories, or point
me to documentation that explains them in detail?


-- 
Regards,
Sanjay Singh, PICT Pune

Re: Apache Nutch Output structure

Posted by sanjay singh <cj...@gmail.com>.

Hi Chris,
Thanks for the quick reply.
I had already gone through that page, but it gives only technical information
about the directories and nothing about how these folders relate to one
another or what they really mean in terms of crawled output.
For example, what does crawl_parse contain? Is it all the crawled data
parsed in terms of HTML tags, or does it just contain the URLs extracted
from the pages?

On Thu, Oct 1, 2015 at 11:32 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Please see:
>
> http://wiki.apache.org/nutch/NutchFileFormats


-- 
Regards,
Sanjay Singh, PICT Pune

Re: Apache Nutch Output structure

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Please see:

http://wiki.apache.org/nutch/NutchFileFormats

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




