
Posted to user@nutch.apache.org by Vijay Veluchamy <vi...@gmail.com> on 2016/04/05 16:22:38 UTC

How to read segment dump?

Hi Team,

I need to crawl a website using Apache Nutch. Currently, I am using Nutch
1.x.

I have followed the steps provided at the following URL up to the
'invertlinks' step.

https://wiki.apache.org/nutch/NutchTutorial

Then I used the 'readseg' command to dump the segments. The dump file was
created successfully.

Now I have the following questions:

1. Is the segment dump file the right file for reading the contents of a
website? If yes, how do I read the contents from the dump file? I am unable
to read it, as it looks encrypted.
2. If not, how can I read the contents of a website?

Thanks,
Vijay

Re: How to read segment dump?

Posted by Furkan KAMACI <fu...@gmail.com>.

Hi,

When you are done crawling, you can try the dump command. Its usage is as
follows:

    $ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
                     [-segment <segment>]
     -h,--help                show this help message
     -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                              all others. Defaults to all.
     -outputDir <outputDir>   output directory (which will be created) to host
                              the raw data
     -segment <segment>       the segment(s) to use

So you can run it like this:

    $ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir path and dumps all the crawled
pages into it in HTML format.
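
Once the pages are on disk, parsing them is ordinary HTML processing. As a
rough sketch only (it is not part of Nutch), the Python script below walks the
dump directory and pulls the title and link targets out of each file using the
standard library. The crawl/dump path comes from the command above; the class
and function names are made up for illustration, and the sketch assumes every
dumped file is UTF-8 HTML, which may not hold for your crawl.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(text):
    """Return (title, links) for one HTML document given as a string."""
    parser = TitleAndLinks()
    parser.feed(text)
    parser.close()
    return parser.title.strip(), parser.links


def parse_dump_dir(dump_dir):
    """Yield (path, title, links) for every file found under dump_dir."""
    for path in sorted(Path(dump_dir).rglob("*")):
        if path.is_file():
            # Assumption: files are UTF-8 text; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            title, links = parse_html(text)
            yield path, title, links
```

To try it, loop over parse_dump_dir("crawl/dump") and print each tuple. If you
dumped non-HTML files as well, restricting the dump with -mimetype first keeps
the parser from being fed images or PDFs.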

On the other hand, this may also be useful for your case:
https://wiki.apache.org/nutch/CommonCrawlDataDumper


Kind Regards,
Furkan KAMACI

On Tue, Apr 5, 2016 at 6:29 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hello - you should try the newer dump tool; it dumps HTML files as-is to
> a directory.
> Markus
>
>
>
> -----Original message-----
> > From:Vijay Veluchamy <vi...@gmail.com>
> > Sent: Tuesday 5th April 2016 17:24
> > To: user@nutch.apache.org
> > Subject: RE: How to read segment dump?
> >
> > Hi,
> >
> > I want to crawl a website and save it as HTML files. After that, I need to
> > parse them and extract their elements.
> >
> > Thanks,
> > Vijay
> > On Apr 5, 2016 8:37 PM, "Markus Jelsma" <ma...@openindex.io> wrote:
> >
> > > Hello, segment dumps are notoriously hard to comprehend. What information
> > > are you looking for? What do you mean by reading contents of a website?
> > > Markus
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Vijay Veluchamy <vi...@gmail.com>
> > > > Sent: Tuesday 5th April 2016 16:22
> > > > To: user@nutch.apache.org
> > > > Subject: How to read segment dump?
> > > >
> > > > Hi Team,
> > > >
> > > > I need to crawl a website using Apache Nutch. Currently, I am using
> > > > Nutch 1.x.
> > > >
> > > > I have followed the steps provided at the following URL up to the
> > > > 'invertlinks' step.
> > > >
> > > > https://wiki.apache.org/nutch/NutchTutorial
> > > >
> > > > Then I used the 'readseg' command to dump the segments. The dump file
> > > > was created successfully.
> > > >
> > > > Now I have the following questions:
> > > >
> > > > 1. Is the segment dump file the right file for reading the contents of a
> > > > website? If yes, how do I read the contents from the dump file? I am
> > > > unable to read it, as it looks encrypted.
> > > > 2. If not, how can I read the contents of a website?
> > > >
> > > > Thanks,
> > > > Vijay
> > > >
> > >
> >
>

RE: How to read segment dump?

Posted by Markus Jelsma <ma...@openindex.io>.

Hello - you should try the newer dump tool; it dumps HTML files as-is to a directory.
Markus

 
 

RE: How to read segment dump?

Posted by Vijay Veluchamy <vi...@gmail.com>.

Hi,

I want to crawl a website and save it as HTML files. After that, I need to
parse them and extract their elements.

Thanks,
Vijay

RE: How to read segment dump?

Posted by Markus Jelsma <ma...@openindex.io>.

Hello, segment dumps are notoriously hard to comprehend. What information are you looking for? What do you mean by reading the contents of a website?
Markus
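
For reference, the file written by 'readseg -dump' is plain text rather than
encrypted, but it interleaves crawl metadata, raw content and parsed text for
every URL, which is why it is hard to read. Below is a minimal Python sketch
for splitting it into per-URL records. It assumes each record starts with a
'Recno:: N' line followed by a 'URL:: ...' line, so check that against your
own dump file first; the function names are hypothetical.

```python
import re

# Assumed record markers; verify them against your own dump file.
RECORD_START = re.compile(r"^Recno:: (\d+)\s*$")
URL_LINE = re.compile(r"^URL:: (.*)$")


def split_records(lines):
    """Group dump lines into {"recno", "url", "body"} dicts, one per record."""
    records = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        start = RECORD_START.match(line)
        if start:
            current = {"recno": int(start.group(1)), "url": None, "body": []}
            records.append(current)
            continue
        if current is None:
            continue  # skip anything before the first record marker
        url = URL_LINE.match(line)
        if url and current["url"] is None:
            current["url"] = url.group(1).strip()
        else:
            current["body"].append(line)
    return records


def read_dump(path):
    """Read a dump file, replacing bytes that are not valid UTF-8."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return split_records(handle)
```

This only separates the records; a page whose own text happens to contain a
line matching the record marker would be split incorrectly, so treat it as a
starting point for inspection, not as a parser for the format.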

 
 