Posted to user@nutch.apache.org by Sameendra Samarawickrama <sm...@googlemail.com> on 2012/01/22 11:51:46 UTC

Getting html pages through a Nutch crawl (for a dataset)

Hi,
I am using Nutch to generate a small dataset of the web, a dataset on which I am
planning to run a focused crawler later.

I did a test crawl and I have the 'segments' folder built up. Now I need
to get the exact html pages it fetched from the seed url/s.

Is it possible to create a dataset this way? If so, how do I get those html
pages?

Thanks a lot!

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Markus Jelsma <ma...@openindex.io>.
It is in the big dump file output by the readseg command.
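
For illustration, a readseg invocation along these lines produces that dump (a
sketch assuming a Nutch 1.x command line; the timestamped segment directory is
a placeholder for whatever your crawl created under 'segments'):

bin/nutch readseg -dump crawl/segments/20120122120155 segdump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext

The -no* switches suppress everything except the raw Content records, so the
resulting segdump/dump file should hold only the fetched page bytes.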

> I need the content. :(
> 

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Sameendra Samarawickrama <sm...@googlemail.com>.
I need the content. :(

On Mon, Jan 23, 2012 at 9:47 PM, remi tassing <ta...@gmail.com> wrote:

> If you need the urls, then yes, you just need to further process that file.
>
> If you need the content of those html files, then I'm not sure how
> to do that.
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by remi tassing <ta...@gmail.com>.
If you need the urls, then yes, you just need to further process that file.

If you need the content of those html files, then I'm not sure how
to do that.

On Monday, January 23, 2012, Sameendra Samarawickrama <
smsamrc@googlemail.com> wrote:
> Yes, it has a dump file which contains 'CrawlDatums', and I found some html
> content in it, but to get the html pages out of it I think you will have to
> process it further, right? What if my crawl contains several thousand web
> pages; will that file contain the contents of all of them? Is this how it
> works?
>
> Thanks,
> Sameendra
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Sameendra Samarawickrama <sm...@googlemail.com>.
Yes, it has a dump file which contains 'CrawlDatums', and I found some html
content in it, but to get the html pages out of it I think you will have to
process it further, right? What if my crawl contains several thousand web
pages; will that file contain the contents of all of them? Is this how it
works?

Thanks,
Sameendra

On Mon, Jan 23, 2012 at 8:02 PM, remi tassing <ta...@gmail.com> wrote:

> Hi,
>
> in your output directory, you should see two files:
> 1. .part-00000.crc
> 2. part-00000
>
> Open the second one with a text editor and you should be able to see the
> crawled urls. If there is no html in there, perhaps you didn't
> crawl any.
>
> Remi
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by remi tassing <ta...@gmail.com>.
Hi,

in your output directory, you should see two files:
1. .part-00000.crc
2. part-00000

Open the second one with a text editor and you should be able to see the
crawled urls. If there is no html in there, perhaps you didn't
crawl any.

Remi

On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <
smsamrc@googlemail.com> wrote:

> Hi,
> I tried the readdb command, but I can't get the html pages with it.
> Thanks,
> Sameendra
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Sameendra Samarawickrama <sm...@googlemail.com>.
Hi,
I tried the readdb command, but I can't get the html pages with it.
Thanks,
Sameendra

On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <ta...@gmail.com> wrote:

> Hi Sameendra,
>
> read this page:  http://wiki.apache.org/nutch/bin/nutch_readdb
>
> For instance, the following command will read your database and output the
> crawled URLs to the directory output_dir:
>
> bin/nutch readdb crawl/crawldb -dump output_dir
>
> Remi
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by remi tassing <ta...@gmail.com>.
Hi Sameendra,

read this page:  http://wiki.apache.org/nutch/bin/nutch_readdb

For instance, the following command will read your database and output the
crawled URLs to the directory output_dir:

bin/nutch readdb crawl/crawldb -dump output_dir

Remi

On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> The best method is to read or dump the contents of your crawldb and work
> based on this.
>
> Please have a look on the wiki for using the readdb tool.
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Sameendra Samarawickrama <sm...@googlemail.com>.
I tried dumping several folders named with timestamps inside 'segments'
using the readseg tool.

In the dump folder I don't have any html pages, just a dump file (and another
crc file). The dump file contains 'CrawlDatums', and there is only one
'CrawlDatum' with 'Content'. Where is the content for the other
'CrawlDatums' (was the crawl not successful)? How do I get the
crawled html pages?
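
(For anyone post-processing such a dump later: below is a rough Python sketch
that splits it into one file per record. It assumes the dump delimits records
with 'Recno::' lines and labels each with a 'URL::' line, which is roughly how
the Nutch 1.x readseg output looks, but the markers can differ between
versions, so check your own dump first.)

#!/usr/bin/env python
# Split a `bin/nutch readseg -dump` text dump into one file per record.
# Marker assumptions: records begin with "Recno::" and carry a "URL::" line;
# adjust these if your Nutch version formats the dump differently.
import os
import re
import sys

def split_dump(dump_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(dump_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    count = 0
    # Everything before the first "Recno::" marker is skipped.
    for rec in re.split(r"(?m)^Recno::", text)[1:]:
        m = re.search(r"(?m)^URL::\s*(\S+)", rec)
        url = m.group(1) if m else "record-%d" % count
        # Turn the URL into a safe file name.
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", url)[:200]
        with open(os.path.join(out_dir, safe + ".txt"), "w", encoding="utf-8") as out:
            out.write(rec)
        count += 1
    return count

if __name__ == "__main__":
    # usage: python split_dump.py segdump/dump pages/
    print("wrote %d records" % split_dump(sys.argv[1], sys.argv[2]))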

On Sun, Jan 22, 2012 at 9:12 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> No, it's the readseg tool you need. It will dump, by default, all contents
> of the segment(s).
>



-- 
Regards,
Sameendra

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Markus Jelsma <ma...@openindex.io>.
No, it's the readseg tool you need. It will dump, by default, all contents of 
the segment(s). 
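
A minimal sketch of what that looks like on a Nutch 1.x install (the segment
directory name below is a placeholder):

bin/nutch readseg -list crawl/segments/20120122120155
bin/nutch readseg -dump crawl/segments/20120122120155 segdump

The -list form prints per-segment statistics; the -dump form should write a
plain-text file segdump/dump containing, per fetched URL, the CrawlDatum, the
raw Content (the html), and any parse text/data.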

> As Lewis mentioned, I dumped the crawldb using the readdb tool as below.
> 
> $ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir   (on Cygwin)
> 
> But the dump (outdir) contains only two files named '.part-0000.crc' and
> 'part-0000'.
> It doesn't have the html pages I wanted. What should I do?
> 

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Sameendra Samarawickrama <sm...@googlemail.com>.
As Lewis mentioned, I dumped the crawldb using the readdb tool as below.

$ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir   (on Cygwin)

But the dump (outdir) contains only two files named '.part-0000.crc' and
'part-0000'.
It doesn't have the html pages I wanted. What should I do?


On Sun, Jan 22, 2012 at 4:32 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> The best method is to read or dump the contents of your crawldb and work
> based on this.
>
> Please have a look on the wiki for using the readdb tool.
>

Re: Getting html pages through a Nutch crawl (for a dataset)

Posted by Lewis John Mcgibbney <le...@gmail.com>.
The best method is to read or dump the contents of your crawldb and work
based on this.

Please have a look on the wiki for using the readdb tool.

On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
smsamrc@googlemail.com> wrote:

> Hi,
> I am using Nutch to generate a small dataset of the web, a dataset on which I am
> planning to run a focused crawler later.
>
> I did a test crawl and I have the 'segments' folder built up. Now I need
> to get the exact html pages it fetched from the seed url/s.
>
> Is it possible to create a dataset this way? If so, how do I get those html
> pages?
>
> Thanks a lot!
>



-- 
*Lewis*