Posted to user@nutch.apache.org by "O. Olson" <ol...@yahoo.it> on 2009/09/24 20:54:24 UTC

Using Nutch for only retrieving HTML

Hi,
	I am new to Nutch. I would like to completely crawl an internal website and retrieve all of its HTML content. I don't intend to do any further processing with Nutch.
The website/content is rather large. By "crawl" I mean that I would go to a page, download/archive the HTML, extract the links from that page, and then download/archive those pages, and keep doing this until there are no new links.

Is this possible? Is this the right tool for this job, or are there other tools out there that would be more suited for my purpose?

Thanks,
O.O. 



      

RE: R: Using Nutch for only retrieving HTML

Posted by BELLINI ADAM <mb...@msn.com>.

Thanks dude, it works fine :)




> Date: Thu, 1 Oct 2009 20:05:09 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
> 
> BELLINI ADAM wrote:
> > hi,
> > but how to dump the content  ? i tried this command :
> > 
> > 
> > 
> > ./bin/nutch readseg -dump crawl/segments/20090903121951/content/  toto
> > 
> > and it said :
> > 
> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> > file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
> >   
> > 
> > but the crawl_generate is in this path :
> > 
> > /usr/local/nutch-1.0/crawl/segments/20091001120102
> > 
> > and not in this one :
> > 
> > /usr/local/nutch-1.0/crawl/segments/20091001120102/content
> > 
> > can you plz just give me the correct command ?
> 
> This command will dump just the content part:
> 
> ./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch 
> -nogenerate -noparse -noparsedata -noparsetext
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: R: Using Nutch for only retrieving HTML

Posted by Andrzej Bialecki <ab...@getopt.org>.
BELLINI ADAM wrote:
> hi,
> but how to dump the content  ? i tried this command :
> 
> 
> 
> ./bin/nutch readseg -dump crawl/segments/20090903121951/content/  toto
> 
> and it said :
> 
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate
>   
> 
> but the crawl_generate is in this path :
> 
> /usr/local/nutch-1.0/crawl/segments/20091001120102
> 
> and not in this one :
> 
> /usr/local/nutch-1.0/crawl/segments/20091001120102/content
> 
> can you plz just give me the correct command ?

This command will dump just the content part:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
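
A minimal sketch of running that content-only dump over every segment of a crawl; the crawl/ and dump/ paths are assumptions, so adjust them to your own layout:

for seg in crawl/segments/*; do
  # keep only the raw fetched content, skip the other segment parts
  ./bin/nutch readseg -dump "$seg" "dump/$(basename "$seg")" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
done

Each output directory should then contain a plain-text dump holding just the raw content of that segment.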


RE: R: Using Nutch for only retrieving HTML

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,
But how do I dump the content? I tried this command:

./bin/nutch readseg -dump crawl/segments/20090903121951/content/ toto

and it said:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/usr/local/nutch-1.0/crawl/segments/20091001120102/content/crawl_generate

but crawl_generate is in this path:

/usr/local/nutch-1.0/crawl/segments/20091001120102

and not in this one:

/usr/local/nutch-1.0/crawl/segments/20091001120102/content

Can you please just give me the correct command?

Thanks



> Date: Thu, 1 Oct 2009 18:16:43 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
> 
> BELLINI ADAM wrote:
> > hi,
> > thx for the advise,
> > but guess when u run the readseg command it will not retun the pages as is (as if browsed ).
> > i tried it and it returns  information about pages :
> > 
> > Recno:: 0
> > URL:: http://blabla.com/blabla.jsp
> > 
> > CrawlDatum::
> > Version: 7
> > Status: 67 (linked)
> > Fetch time: Mon Aug 31 16:11:26 EDT 2009
> > Modified time: Wed Dec 31 19:00:00 EST 1969
> > Retries since fetch: 0
> > Retry interval: 86400 seconds (1 days)
> > Score: 8.849112E-7
> > Signature: null
> > Metadata:
> > 
> > is there another way to get the source of the page as if it will be browsed ? i mean as if we run wget ?
> 
> The above record comes from <segmentDir>/crawl_parse part of segment. If 
> you dump the /content part then you will get the original raw content.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: R: Using Nutch for only retrieving HTML

Posted by Andrzej Bialecki <ab...@getopt.org>.
BELLINI ADAM wrote:
> hi,
> thx for the advise,
> but guess when u run the readseg command it will not retun the pages as is (as if browsed ).
> i tried it and it returns  information about pages :
> 
> Recno:: 0
> URL:: http://blabla.com/blabla.jsp
> 
> CrawlDatum::
> Version: 7
> Status: 67 (linked)
> Fetch time: Mon Aug 31 16:11:26 EDT 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 86400 seconds (1 days)
> Score: 8.849112E-7
> Signature: null
> Metadata:
> 
> is there another way to get the source of the page as if it will be browsed ? i mean as if we run wget ?

The above record comes from the <segmentDir>/crawl_parse part of the segment. If you dump the /content part, you will get the original raw content.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: R: Using Nutch for only retrieving HTML

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,
Thanks for the advice, but I guess when you run the readseg command it does not return the pages as-is (as if browsed).
I tried it and it returns information about the pages:

Recno:: 0
URL:: http://blabla.com/blabla.jsp

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Mon Aug 31 16:11:26 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 86400 seconds (1 days)
Score: 8.849112E-7
Signature: null
Metadata:

Is there another way to get the source of a page as it would be browsed, i.e. as if we ran wget?

Thanks



> Date: Wed, 30 Sep 2009 23:38:28 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: R: Using Nutch for only retriving HTML
> 
> BELLINI ADAM wrote:
> > 
> > me again,
> > 
> > i forgot to tell u the easiest way...
> > 
> > once the crawl is finished you can dump the whole db (it contains all the links to your html pages) in a text file..
> > 
> > ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile
> > 
> > and you can perfor the wget on this db and archive the files
> 
> I'd argue with this advice. The goal here is to obtain the HTML pages. 
> If you have crawled them, then why do it again? You already have their 
> content locally.
> 
> However, page content is NOT stored in crawldb, it's stored in segments. 
> So you need to dump the content from segments, and not the content of 
> crawldb.
> 
> The command 'bin/nutch readseg -dump <segmentName> <output>' should do 
> the trick.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: R: Using Nutch for only retrieving HTML

Posted by Andrzej Bialecki <ab...@getopt.org>.
BELLINI ADAM wrote:
> 
> me again,
> 
> i forgot to tell u the easiest way...
> 
> once the crawl is finished you can dump the whole db (it contains all the links to your html pages) in a text file..
> 
> ./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile
> 
> and you can perfor the wget on this db and archive the files

I'd argue against this advice. The goal here is to obtain the HTML pages. If you have already crawled them, why fetch them again? You already have their content locally.

However, page content is NOT stored in crawldb, it's stored in segments. 
So you need to dump the content from segments, and not the content of 
crawldb.

The command 'bin/nutch readseg -dump <segmentName> <output>' should do 
the trick.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
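
For reference, a bare-bones sketch of that segment dump; the segment timestamp below is only an example, so list crawl/segments/ to find your own:

# see which segments the crawl produced
ls crawl/segments/
# dump one segment (all parts, raw content included) to a text file
./bin/nutch readseg -dump crawl/segments/20090903121951 seg_dump

The seg_dump directory should then contain a plain-text dump of everything stored in that segment, page content included.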


RE: R: Using Nutch for only retrieving HTML

Posted by BELLINI ADAM <mb...@msn.com>.

Me again,

I forgot to tell you the easiest way...

Once the crawl is finished you can dump the whole db (it contains all the links to your HTML pages) to a text file:

./bin/nutch readdb crawl_folder/crawldb/ -dump DBtextFile

and then you can run wget against that dump and archive the downloaded files.



> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: R: Using Nutch for only retriving HTML
> Date: Wed, 30 Sep 2009 21:04:03 +0000
> 
> 
> hi 
> mabe you can run a crawl (dont forget to filter the pages just to keep html or htm files (you will do it at conf/crawl-urlfilter.txt) )
> after that you will go to the hadoop.log file and grep the sentence 'fetcher.Fetcher - fetching http' to get all the fetched urls.
> dont forget to sort the file and to make it uniq (command uniq -c) becoz sometimes the crawl try to fecth the poges several times if they  will not answer the first time.
> 
> when you have all your urls you can run wget on your file and archive the dowlowaded pages.
> 
> hope it could help.
> 
> 
> 
> 
> 
> > Date: Wed, 30 Sep 2009 20:46:50 +0000
> > From: olson_ord@yahoo.it
> > Subject: Re: R: Using Nutch for only retriving HTML
> > To: nutch-user@lucene.apache.org
> > 
> > Thanks Magnús and Susam for your responses and pointing me in the right direction. I think I would spend time over the next few weeks trying out Nutch over. I only needed the HTML – I don’t care if it is in the Database or in separate files. 
> > 
> > Thanks guys,
> > O.O. 
> > 
> > 
> > --- Mer 30/9/09, Magnús Skúlason <ma...@gmail.com> ha scritto:
> > 
> > > Da: Magnús Skúlason <ma...@gmail.com>
> > > Oggetto: Re: R: Using Nutch for only retriving HTML
> > > A: nutch-user@lucene.apache.org
> > > Data: Mercoledì 30 settembre 2009, 11:48
> > > Actually its quite easy to modify the
> > > parse-html filter to do this.
> > > 
> > > That is saving the HTML to a file or to some database, you
> > > could then
> > > configure it to skip all unnecessary plugins. I think it
> > > depends a lot on
> > > the other requirements you have whether using nutch for
> > > this task is the
> > > right way to go or not. If you can get by with wget -r then
> > > its probably an
> > > overkill to use nutch.
> > > 
> > > Best regards,
> > > Magnus
> > > 
> > > On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <su...@gmail.com>
> > > wrote:
> > > 
> > > > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <ol...@yahoo.it>
> > > wrote:
> > > > > Sorry for pushing this topic, but I would like to
> > > know if Nutch would
> > > > help me get the raw HTML in my situation described
> > > below.
> > > > >
> > > > > I am sure it would be a simple answer to those
> > > who know Nutch. If not
> > > > then I guess Nutch is the wrong tool for the job.
> > > > >
> > > > > Thanks,
> > > > > O. O.
> > > > >
> > > > >
> > > > > --- Gio 24/9/09, O. Olson <ol...@yahoo.it>
> > > ha scritto:
> > > > >
> > > > >> Da: O. Olson <ol...@yahoo.it>
> > > > >> Oggetto: Using Nutch for only retriving HTML
> > > > >> A: nutch-user@lucene.apache.org
> > > > >> Data: Giovedì 24 settembre 2009, 20:54
> > > > >> Hi,
> > > > >>     I am new to Nutch. I
> > > would like to
> > > > >> completely crawl through an Internal Website
> > > and retrieve
> > > > >> all the HTML Content. I don’t intend to do
> > > further
> > > > >> processing using Nutch.
> > > > >> The Website/Content is rather huge. By crawl,
> > > I mean that I
> > > > >> would go to a page, download/archive the
> > > HTML, get the links
> > > > >> from that page, and then download/archive
> > > those pages. I
> > > > >> would keep doing this till I don’t have any
> > > new links.
> > > >
> > > > I don't think it is possible to retrieve pages and
> > > store them as
> > > > separate files, one per page, without modifications in
> > > Nutch. I am not
> > > > sure though. Someone would correct me if I am wrong
> > > here. However, it
> > > > is easy to retrieve the HTML contents from the crawl
> > > DB using the
> > > > Nutch API. But from your post, it seems, you don't
> > > want to do this.
> > > >
> > > > >>
> > > > >> Is this possible? Is this the right tool for
> > > this job, or
> > > > >> are there other tools out there that would be
> > > more suited
> > > > >> for my purpose?
> > > >
> > > > I guess 'wget' is the tool you are looking for. You
> > > can use it with -r
> > > > option to recursively download pages and store them as
> > > separate files
> > > > on the hard disk, which is exactly what you need. You
> > > might want to
> > > > use the -np option too. It is available for Windows as
> > > well as Linux.
> > > >
> > > > Regards,
> > > > Susam Pal
> > > >
> > > 
> > 
> > 
> >       
 		 	   		  

RE: R: Using Nutch for only retrieving HTML

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,
Maybe you can run a crawl (don't forget to filter the pages so that you keep only html or htm files; you do that in conf/crawl-urlfilter.txt).
After that, go to the hadoop.log file and grep for the string 'fetcher.Fetcher - fetching http' to get all the fetched URLs.
Don't forget to sort the file and remove duplicates (sort | uniq), because sometimes the crawl tries to fetch a page several times when it does not answer the first time.

When you have all your URLs you can run wget on the file and archive the downloaded pages.

Hope it helps.
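
A rough sketch of that pipeline; the logs/hadoop.log path, the pages/ output directory, and the exact log line format are assumptions based on the description above:

# collect the fetched URLs from the Nutch log, de-duplicated
grep 'fetcher.Fetcher - fetching http' logs/hadoop.log \
  | sed 's/.*fetching //' \
  | sort -u > fetched-urls.txt

# download and archive each page, mirroring the URL paths on disk
wget --input-file=fetched-urls.txt --force-directories --directory-prefix=pages/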





> Date: Wed, 30 Sep 2009 20:46:50 +0000
> From: olson_ord@yahoo.it
> Subject: Re: R: Using Nutch for only retriving HTML
> To: nutch-user@lucene.apache.org
> 
> Thanks Magnús and Susam for your responses and pointing me in the right direction. I think I would spend time over the next few weeks trying out Nutch over. I only needed the HTML – I don’t care if it is in the Database or in separate files. 
> 
> Thanks guys,
> O.O. 
> 
> 
> --- Mer 30/9/09, Magnús Skúlason <ma...@gmail.com> ha scritto:
> 
> > Da: Magnús Skúlason <ma...@gmail.com>
> > Oggetto: Re: R: Using Nutch for only retriving HTML
> > A: nutch-user@lucene.apache.org
> > Data: Mercoledì 30 settembre 2009, 11:48
> > Actually its quite easy to modify the
> > parse-html filter to do this.
> > 
> > That is saving the HTML to a file or to some database, you
> > could then
> > configure it to skip all unnecessary plugins. I think it
> > depends a lot on
> > the other requirements you have whether using nutch for
> > this task is the
> > right way to go or not. If you can get by with wget -r then
> > its probably an
> > overkill to use nutch.
> > 
> > Best regards,
> > Magnus
> > 
> > On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <su...@gmail.com>
> > wrote:
> > 
> > > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <ol...@yahoo.it>
> > wrote:
> > > > Sorry for pushing this topic, but I would like to
> > know if Nutch would
> > > help me get the raw HTML in my situation described
> > below.
> > > >
> > > > I am sure it would be a simple answer to those
> > who know Nutch. If not
> > > then I guess Nutch is the wrong tool for the job.
> > > >
> > > > Thanks,
> > > > O. O.
> > > >
> > > >
> > > > --- Gio 24/9/09, O. Olson <ol...@yahoo.it>
> > ha scritto:
> > > >
> > > >> Da: O. Olson <ol...@yahoo.it>
> > > >> Oggetto: Using Nutch for only retriving HTML
> > > >> A: nutch-user@lucene.apache.org
> > > >> Data: Giovedì 24 settembre 2009, 20:54
> > > >> Hi,
> > > >>     I am new to Nutch. I
> > would like to
> > > >> completely crawl through an Internal Website
> > and retrieve
> > > >> all the HTML Content. I don’t intend to do
> > further
> > > >> processing using Nutch.
> > > >> The Website/Content is rather huge. By crawl,
> > I mean that I
> > > >> would go to a page, download/archive the
> > HTML, get the links
> > > >> from that page, and then download/archive
> > those pages. I
> > > >> would keep doing this till I don’t have any
> > new links.
> > >
> > > I don't think it is possible to retrieve pages and
> > store them as
> > > separate files, one per page, without modifications in
> > Nutch. I am not
> > > sure though. Someone would correct me if I am wrong
> > here. However, it
> > > is easy to retrieve the HTML contents from the crawl
> > DB using the
> > > Nutch API. But from your post, it seems, you don't
> > want to do this.
> > >
> > > >>
> > > >> Is this possible? Is this the right tool for
> > this job, or
> > > >> are there other tools out there that would be
> > more suited
> > > >> for my purpose?
> > >
> > > I guess 'wget' is the tool you are looking for. You
> > can use it with -r
> > > option to recursively download pages and store them as
> > separate files
> > > on the hard disk, which is exactly what you need. You
> > might want to
> > > use the -np option too. It is available for Windows as
> > well as Linux.
> > >
> > > Regards,
> > > Susam Pal
> > >
> > 
> 
> 
>       
 		 	   		  

Re: R: Using Nutch for only retrieving HTML

Posted by "O. Olson" <ol...@yahoo.it>.
Thanks Magnús and Susam for your responses and for pointing me in the right direction. I will spend some time over the next few weeks trying out Nutch. I only need the HTML – I don't care whether it ends up in a database or in separate files.

Thanks guys,
O.O. 


--- On Wed, 30/9/09, Magnús Skúlason <ma...@gmail.com> wrote:

> Da: Magnús Skúlason <ma...@gmail.com>
> Oggetto: Re: R: Using Nutch for only retriving HTML
> A: nutch-user@lucene.apache.org
> Data: Mercoledì 30 settembre 2009, 11:48
> Actually its quite easy to modify the
> parse-html filter to do this.
> 
> That is saving the HTML to a file or to some database, you
> could then
> configure it to skip all unnecessary plugins. I think it
> depends a lot on
> the other requirements you have whether using nutch for
> this task is the
> right way to go or not. If you can get by with wget -r then
> its probably an
> overkill to use nutch.
> 
> Best regards,
> Magnus
> 
> On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <su...@gmail.com>
> wrote:
> 
> > On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <ol...@yahoo.it>
> wrote:
> > > Sorry for pushing this topic, but I would like to
> know if Nutch would
> > help me get the raw HTML in my situation described
> below.
> > >
> > > I am sure it would be a simple answer to those
> who know Nutch. If not
> > then I guess Nutch is the wrong tool for the job.
> > >
> > > Thanks,
> > > O. O.
> > >
> > >
> > > --- Gio 24/9/09, O. Olson <ol...@yahoo.it>
> ha scritto:
> > >
> > >> Da: O. Olson <ol...@yahoo.it>
> > >> Oggetto: Using Nutch for only retriving HTML
> > >> A: nutch-user@lucene.apache.org
> > >> Data: Giovedì 24 settembre 2009, 20:54
> > >> Hi,
> > >>     I am new to Nutch. I
> would like to
> > >> completely crawl through an Internal Website
> and retrieve
> > >> all the HTML Content. I don’t intend to do
> further
> > >> processing using Nutch.
> > >> The Website/Content is rather huge. By crawl,
> I mean that I
> > >> would go to a page, download/archive the
> HTML, get the links
> > >> from that page, and then download/archive
> those pages. I
> > >> would keep doing this till I don’t have any
> new links.
> >
> > I don't think it is possible to retrieve pages and
> store them as
> > separate files, one per page, without modifications in
> Nutch. I am not
> > sure though. Someone would correct me if I am wrong
> here. However, it
> > is easy to retrieve the HTML contents from the crawl
> DB using the
> > Nutch API. But from your post, it seems, you don't
> want to do this.
> >
> > >>
> > >> Is this possible? Is this the right tool for
> this job, or
> > >> are there other tools out there that would be
> more suited
> > >> for my purpose?
> >
> > I guess 'wget' is the tool you are looking for. You
> can use it with -r
> > option to recursively download pages and store them as
> separate files
> > on the hard disk, which is exactly what you need. You
> might want to
> > use the -np option too. It is available for Windows as
> well as Linux.
> >
> > Regards,
> > Susam Pal
> >
> 


      

Re: R: Using Nutch for only retrieving HTML

Posted by Magnús Skúlason <ma...@gmail.com>.
Actually it's quite easy to modify the parse-html filter to do this.

That is, saving the HTML to a file or to some database; you could then configure Nutch to skip all the unnecessary plugins. I think it depends a lot on the other requirements you have whether using Nutch for this task is the right way to go or not. If you can get by with wget -r then using Nutch is probably overkill.

Best regards,
Magnus

On Tue, Sep 29, 2009 at 10:25 PM, Susam Pal <su...@gmail.com> wrote:

> On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <ol...@yahoo.it> wrote:
> > Sorry for pushing this topic, but I would like to know if Nutch would
> help me get the raw HTML in my situation described below.
> >
> > I am sure it would be a simple answer to those who know Nutch. If not
> then I guess Nutch is the wrong tool for the job.
> >
> > Thanks,
> > O. O.
> >
> >
> > --- Gio 24/9/09, O. Olson <ol...@yahoo.it> ha scritto:
> >
> >> Da: O. Olson <ol...@yahoo.it>
> >> Oggetto: Using Nutch for only retriving HTML
> >> A: nutch-user@lucene.apache.org
> >> Data: Giovedì 24 settembre 2009, 20:54
> >> Hi,
> >>     I am new to Nutch. I would like to
> >> completely crawl through an Internal Website and retrieve
> >> all the HTML Content. I don’t intend to do further
> >> processing using Nutch.
> >> The Website/Content is rather huge. By crawl, I mean that I
> >> would go to a page, download/archive the HTML, get the links
> >> from that page, and then download/archive those pages. I
> >> would keep doing this till I don’t have any new links.
>
> I don't think it is possible to retrieve pages and store them as
> separate files, one per page, without modifications in Nutch. I am not
> sure though. Someone would correct me if I am wrong here. However, it
> is easy to retrieve the HTML contents from the crawl DB using the
> Nutch API. But from your post, it seems, you don't want to do this.
>
> >>
> >> Is this possible? Is this the right tool for this job, or
> >> are there other tools out there that would be more suited
> >> for my purpose?
>
> I guess 'wget' is the tool you are looking for. You can use it with -r
> option to recursively download pages and store them as separate files
> on the hard disk, which is exactly what you need. You might want to
> use the -np option too. It is available for Windows as well as Linux.
>
> Regards,
> Susam Pal
>

Re: R: Using Nutch for only retrieving HTML

Posted by Susam Pal <su...@gmail.com>.
On Wed, Sep 30, 2009 at 1:39 AM, O. Olson <ol...@yahoo.it> wrote:
> Sorry for pushing this topic, but I would like to know if Nutch would help me get the raw HTML in my situation described below.
>
> I am sure it would be a simple answer to those who know Nutch. If not then I guess Nutch is the wrong tool for the job.
>
> Thanks,
> O. O.
>
>
> --- Gio 24/9/09, O. Olson <ol...@yahoo.it> ha scritto:
>
>> Da: O. Olson <ol...@yahoo.it>
>> Oggetto: Using Nutch for only retriving HTML
>> A: nutch-user@lucene.apache.org
>> Data: Giovedì 24 settembre 2009, 20:54
>> Hi,
>>     I am new to Nutch. I would like to
>> completely crawl through an Internal Website and retrieve
>> all the HTML Content. I don’t intend to do further
>> processing using Nutch.
>> The Website/Content is rather huge. By crawl, I mean that I
>> would go to a page, download/archive the HTML, get the links
>> from that page, and then download/archive those pages. I
>> would keep doing this till I don’t have any new links.

I don't think it is possible to retrieve pages and store them as separate files, one per page, without modifying Nutch. I am not sure though; someone will correct me if I am wrong. However, it is easy to retrieve the HTML contents from the crawl DB using the Nutch API. But from your post, it seems you don't want to do that.

>>
>> Is this possible? Is this the right tool for this job, or
>> are there other tools out there that would be more suited
>> for my purpose?

I guess 'wget' is the tool you are looking for. You can use it with the -r option to recursively download pages and store them as separate files on disk, which is exactly what you need. You might want to use the -np option too. It is available for Windows as well as Linux.

Regards,
Susam Pal
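
An illustrative invocation along those lines; the host name is a placeholder:

# -r follows links recursively, -np never climbs above the start directory,
# -A keeps only files with the listed suffixes
wget -r -np -A html,htm http://intranet.example.com/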

R: Using Nutch for only retrieving HTML

Posted by "O. Olson" <ol...@yahoo.it>.
Sorry for bumping this topic, but I would like to know whether Nutch would help me get the raw HTML in the situation described below.

I am sure this has a simple answer for those who know Nutch. If not, then I guess Nutch is the wrong tool for the job.

Thanks,
O. O. 


--- On Thu, 24/9/09, O. Olson <ol...@yahoo.it> wrote:

> Da: O. Olson <ol...@yahoo.it>
> Oggetto: Using Nutch for only retriving HTML
> A: nutch-user@lucene.apache.org
> Data: Giovedì 24 settembre 2009, 20:54
> Hi,
>     I am new to Nutch. I would like to
> completely crawl through an Internal Website and retrieve
> all the HTML Content. I don’t intend to do further
> processing using Nutch. 
> The Website/Content is rather huge. By crawl, I mean that I
> would go to a page, download/archive the HTML, get the links
> from that page, and then download/archive those pages. I
> would keep doing this till I don’t have any new links.
> 
> Is this possible? Is this the right tool for this job, or
> are there other tools out there that would be more suited
> for my purpose?
> 
> Thanks,
> O.O. 
> 
> 
> 
> 
>