
Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/08/11 17:16:13 UTC

CrawlTool - fetching only first page

I configured the classpath to include the \conf\ and \build\ folders
(the latter contains the plugins) and ran CrawlTool without any errors,
but it fetches only the first page and does not fetch linked pages.
Windows XP.

What am I missing?

RE: CrawlTool - fetching only first page

Posted by Fuad Efendi <fu...@efendi.ca>.

Sorry guys, some magic...
It works now with the standard batch script and depth=5.

I am still trying to configure Eclipse to run CrawlTool directly on
Windows XP.



RE: CrawlTool - fetching only first page

Posted by Fuad Efendi <fu...@efendi.ca>.

Something is wrong with org.apache.nutch.tools.CrawlTool; I noticed a
difference between versions.

With depth=1, version 0.6 fetches about 60 URLs directly linked from a
single main page (the "first level"). I disabled all filters, so it
should fetch everything.

I also tried with depth=5, with the same result.



RE: CrawlTool - fetching only first page

Posted by cn...@cetic.be.

1. A depth of 1 means that only the URLs in url.txt will be crawled.

2. If some URLs in url.txt are not fetched, check that your URL syntax
is correct.

3. Check your regex-urlfilter file and set the right regular expression.
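
To make points 1 and 3 concrete, here is a toy sketch in Python. It is
not Nutch code: the link graph, the rules, and the names crawl and
accept are made up for illustration. It assumes the reading of depth
given in point 1 (round 1 fetches only the seed list) and only imitates
the "+regex / -regex" style of a URL filter file, where the first
matching rule decides. The real CrawlTool may behave differently
between versions.

```python
import re
from typing import Dict, List, Set

# Hypothetical in-memory "web": page URL -> URLs it links to.
LINKS: Dict[str, List[str]] = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b.gif"],
    "http://example.com/a": ["http://example.com/c"],
    "http://example.com/c": [],
}

# Hypothetical filter rules: the first rule whose pattern is found in the
# URL decides ("+" accepts, "-" rejects); a URL matching no rule is rejected.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png)$")),
    ("+", re.compile(r"^http://example\.com/")),
]


def accept(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


def crawl(seeds: List[str], depth: int) -> List[str]:
    """Fetch in rounds: round 1 is the seed list, and each later round is
    the accepted, not-yet-fetched links found in the previous round."""
    fetched: List[str] = []
    seen: Set[str] = set()
    frontier = [u for u in seeds if accept(u)]
    for _ in range(depth):
        next_frontier: List[str] = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            fetched.append(url)
            for link in LINKS.get(url, []):
                if link not in seen and accept(link):
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched
```

With this toy data, crawl(seeds, 1) returns only the seed URL, while a
depth of 2 also returns the page linked from it. The .gif link is never
fetched because the first rule rejects it, which is the kind of thing
to look for in the filter file when linked pages are missing.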

Christophe Noel


RE: CrawlTool - fetching only first page

Posted by Fuad Efendi <fu...@efendi.ca>.

I noticed some changes in the CrawlTool class between 0.6 and 0.7, so
that is probably the cause...

Thanks



RE: CrawlTool - fetching only first page

Posted by Fuad Efendi <fu...@efendi.ca>.

I loaded the latest code, created nutch-0.7-dev, and ran the command
bin/nutch crawl url.txt -dir test.crawl -depth 1

It still does not work. With the same depth and url.txt, nutch-0.6
works and fetches about 30 files.




RE: CrawlTool - fetching only first page

Posted by Fuad Efendi <fu...@efendi.ca>.

Yes, I set the depth to 5 (I noticed that it creates 5 segments).
It fetches only the main URLs, without the linked pages.



Re: CrawlTool - fetching only first page

Posted by ni...@arcor.de.

Did you define a depth?
What is your exact command?

It should be something like

./nutch crawl urls -dir crawldir -threads 1 -depth 3

Nils 
