Posted to user@nutch.apache.org by Vijay Veluchamy <vi...@gmail.com> on 2016/04/05 16:22:38 UTC
How to read segment dump?
Hi Team,
I need to crawl a website using Apache Nutch. Currently, I am using Nutch
1.x.
I have followed the steps provided at the following URL up to the
'invertlink' step.
https://wiki.apache.org/nutch/NutchTutorial
Then I used the 'readseg' command to dump the segments. The dump file was
created successfully.
Now, I have the following questions.
1. Is the segment dump file the right file for reading the contents of a
website? If yes, how do I read the contents from the dump file? I am unable
to read it, as it looks encrypted.
2. If not, how can I read the contents of a website?
Thanks,
Vijay
Re: How to read segment dump?
Posted by Furkan KAMACI <fu...@gmail.com>.
Hi,
When you are done crawling, you can try the dump command. Its usage is as
follows:
  $ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
                   [-segment <segment>]
    -h,--help                show this help message
    -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                             all others. Defaults to all.
    -outputDir <outputDir>   output directory (which will be created) to host
                             the raw data
    -segment <segment>       the segment(s) to use
So, you can run it like this:
  $ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/
This creates a new directory at the -outputDir path and dumps all the
crawled pages into it in HTML format.
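Once the pages are on disk, they can be parsed with any HTML library. The following is only a minimal sketch in Python using the standard-library html.parser module, not something the dump tool itself provides. The names ElementCollector, parse_html and parse_dump_dir are hypothetical, and it assumes the dumped files carry an .html or .htm extension and can be decoded as UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class ElementCollector(HTMLParser):
    """Collects the page title and all link targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only real hrefs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(text):
    """Return (title, links) for one HTML document given as a string."""
    collector = ElementCollector()
    collector.feed(text)
    collector.close()
    return collector.title.strip(), collector.links


def parse_dump_dir(dump_dir):
    """Yield (path, title, links) for each HTML file under dump_dir.

    Assumption: dumped pages end in .html or .htm; other files are skipped.
    """
    for path in sorted(Path(dump_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            title, links = parse_html(text)
            yield path, title, links
```

For example, iterating over parse_dump_dir("crawl/dump") would give one (path, title, links) tuple per dumped page. Other elements can be collected by extending handle_starttag in the same way.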
On the other hand, this may also be useful for your case:
https://wiki.apache.org/nutch/CommonCrawlDataDumper
Kind Regards,
Furkan KAMACI
On Tue, Apr 5, 2016 at 6:29 PM, Markus Jelsma <ma...@openindex.io>
wrote:
> Hello - you should try the newer dump tool; it dumps HTML files as-is to
> a directory.
> Markus
RE: How to read segment dump?
Posted by Markus Jelsma <ma...@openindex.io>.
Hello - you should try the newer dump tool; it dumps HTML files as-is to a directory.
Markus
-----Original message-----
> From:Vijay Veluchamy <vi...@gmail.com>
> Sent: Tuesday 5th April 2016 17:24
> To: user@nutch.apache.org
> Subject: RE: How to read segment dump?
>
> Hi,
>
> I am looking to crawl a website and save its pages as HTML files. After
> that, I need to parse them and extract the elements in them.
>
> Thanks,
> Vijay
RE: How to read segment dump?
Posted by Vijay Veluchamy <vi...@gmail.com>.
Hi,
I am looking to crawl a website and save its pages as HTML files. After
that, I need to parse them and extract the elements in them.
Thanks,
Vijay
On Apr 5, 2016 8:37 PM, "Markus Jelsma" <ma...@openindex.io> wrote:
> Hello, segment dumps are notoriously hard to comprehend. What information
> are you looking for? What do you mean by reading contents of a website?
> Markus
RE: How to read segment dump?
Posted by Markus Jelsma <ma...@openindex.io>.
Hello, segment dumps are notoriously hard to comprehend. What information are you looking for? What do you mean by reading the contents of a website?
Markus
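For readers who still want to work with the text file written by 'readseg', the dump can be split into per-URL records. The sketch below is in Python and rests on an assumption about the layout: each record starts with a "Recno:: <n>" line followed by a "URL:: <url>" line, and sections such as "Content::" or "ParseText::" are introduced by a header alone on its own line. Check this against your own dump file first. The function name split_records is hypothetical, and a page body that happens to contain a line of the form "Word::" would confuse this simple splitter.

```python
import re


def split_records(dump_text):
    """Split a readseg text dump into a list of {"url", "sections"} dicts.

    Assumed layout (verify against a real dump):
      Recno:: 0
      URL:: http://example.com/

      ParseText::
      ...text...
    """
    records = []
    # Everything before the first "Recno::" line is discarded.
    chunks = re.split(r"^Recno:: \d+[ \t]*$", dump_text, flags=re.MULTILINE)[1:]
    for chunk in chunks:
        url_match = re.search(r"^URL:: (.*)$", chunk, flags=re.MULTILINE)
        # Splitting with a capture group yields [prefix, name1, body1, name2, body2, ...].
        parts = re.split(r"^(\w+)::[ \t]*$", chunk, flags=re.MULTILINE)
        sections = {
            name: body.strip() for name, body in zip(parts[1::2], parts[2::2])
        }
        records.append(
            {
                "url": url_match.group(1).strip() if url_match else None,
                "sections": sections,
            }
        )
    return records
```

Each returned record then exposes, for instance, record["sections"].get("ParseText") for the extracted page text, when that section is present in the dump.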