Posted to user@nutch.apache.org by Néstor <ro...@gmail.com> on 2016/10/03 15:49:15 UTC

crawling a subfolder

Hi,

I am using nutch for the first time and when I crawl www.mysite.com it
crawls for a while.
When I try to crawl a subfolder like www.mysite.com/mysubfolder it crawls
for about 1 sec.

my urls/seed.txt is set to
http://www.mysite.com/mysubfolder
my regex-urlfilter.txt uses the default except for the last 2 lines:

#+.
+^http://www.mysite.org/mysubfolder
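Two things stand out in that filter, for anyone reading along: the accept line says mysite.org while the seed URL is mysite.com, and with the default `+.` rule commented out, a URL that matches no rule at all is (as I understand RegexURLFilter) rejected — so a seed that slips past every pattern ends the crawl almost immediately, which would explain the 1-second crawl. The first-match-wins semantics can be sanity-checked outside Nutch with a small sketch (this is not Nutch's own code; the host below assumes the seed's www.mysite.com was intended):

```python
import re

# First-match-wins filtering, mimicking regex-urlfilter.txt semantics (a sketch,
# not Nutch's code): '+' accepts, '-' rejects, and no match at all also rejects.
rules = [
    ("+", re.compile(r"^http://www\.mysite\.com/mysubfolder")),
    ("-", re.compile(r".")),  # catch-all reject, like a final '-.' line
]

def accept(url):
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is filtered out

print(accept("http://www.mysite.com/mysubfolder/page.html"))  # True
print(accept("http://www.mysite.com/other/page.html"))        # False
print(accept("http://www.mysite.org/mysubfolder/"))           # False (wrong host)
```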


Also, when I try to access the results on http://mysite.com:8080/CLIPS using
Solr,
I only see 10 records.

What could I be missing?
How do I get all the records found?
Is there a way to look at the crawled data without Solr?
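(One likely explanation for the 10-record limit, assuming a stock Solr setup: select queries return rows=10 by default, so an explicit rows parameter, or start/rows paging, is needed to see more — e.g.:)

```
http://mysite.com:8080/CLIPS/select?q=*:*&rows=100&start=0
```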

Thanks,

-- 
Né§t☼r  *Authority gone to one's head is the greatest enemy of Truth*

RE: crawling a subfolder

Posted by Markus Jelsma <ma...@openindex.io>.
I think, as of 1.12, there is a parameter to disable the robots check; I am not sure. Check nutch-default.xml, it might be there.
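(If such a property exists in your release, it would go in nutch-site.xml; recent versions ship a robots whitelist property in nutch-default.xml, but the exact name and availability depend on the version, so treat this fragment as an assumption to verify against your own nutch-default.xml:)

```xml
<property>
  <name>http.robot.rules.whitelist</name>
  <value>www.mysite.com</value>
  <description>Comma-separated list of hostnames for which robots.txt
  rules are ignored. Name and behavior assumed -- check nutch-default.xml
  for your Nutch version. Use with care, and only on sites you own.</description>
</property>
```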
M.

 
 
-----Original message-----
> From:Nestor <ro...@gmail.com>
> Sent: Wednesday 5th October 2016 0:05
> To: user@nutch.apache.org
> Subject: Re: crawling a subfolder
> 
> OK, thanks for your help.
> I found out that part of my problem was that there was a robots.txt that
> would not allow me to crawl my site.
> The lessons and gotchas of learning Nutch.
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299593.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 

Re: crawling a subfolder

Posted by Nestor <ro...@gmail.com>.
OK, thanks for your help.
I found out that part of my problem was that there was a robots.txt that
would not allow me to crawl my site.
The lessons and gotchas of learning Nutch.
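For anyone hitting the same wall: a robots.txt can be checked outside Nutch with Python's urllib.robotparser before starting a crawl (a local sketch — the Disallow rule below is made up for illustration):

```python
from urllib import robotparser

# Parse a sample robots.txt without touching the network (rules are made up).
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /mysubfolder/
""".splitlines())

print(rp.can_fetch("*", "http://www.mysite.com/mysubfolder/page.html"))  # False
print(rp.can_fetch("*", "http://www.mysite.com/other/page.html"))        # True
```

To check a live site, rp.set_url("http://www.mysite.com/robots.txt") followed by rp.read() fetches the real file instead.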



--
View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299593.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: crawling a subfolder

Posted by KRIS MUSSHORN <mu...@comcast.net>.

Merge all segments into one directory, then dump the merged segment to readable text:

/home/whatever/nutch/bin/nutch mergesegs /home/whatever/nutch/crawl/merged /home/whatever/nutch/crawl/segments/*

/home/whatever/nutch/bin/nutch readseg -dump /home/whatever/nutch/crawl/merged/* nutchdump


----- Original Message -----

From: "Nestor" <ro...@gmail.com> 
To: user@nutch.apache.org 
Sent: Monday, October 3, 2016 7:50:31 PM 
Subject: Re: crawling a subfolder 

I am still not able to crawl just a subfolder and all of the folders below it.



-- 
View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299376.html 
Sent from the Nutch - User mailing list archive at Nabble.com. 


Re: crawling a subfolder

Posted by Nestor <ro...@gmail.com>.
I am still not able to crawl just a subfolder and all of the folders below it.



--
View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299376.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: crawling a subfolder

Posted by Nestor <ro...@gmail.com>.
I looked at the link you sent, tried it, and it failed.

Thanks,

$ bin/nutch mergesegs crawl/merged crawl/segments/*
Merging 1 segments to crawl/merged/20161003234422
SegmentMerger:   adding crawl/segments/20161003222933
SegmentMerger: using segment data from: content crawl_generate crawl_fetch
crawl_parse parse_data parse_text 
$ bin/nutch readseg -dump crawl/merged/* dumpedContent
SegmentReader: dump segment: crawl/merged/20161003234422
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/crawl_parse
Input path does not exist:
file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/content
Input path does not exist:
file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/parse_data
Input path does not exist:
file:/home/ubuntu/temtomcat/apache-nutch-1.7/runtime/local/crawl/merged/20161003234422/parse_text
        at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
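(From the log, the segment directory that readseg received lacks the expected crawl_parse/content/parse_data/parse_text subdirectories, so the merged data may not be where the `crawl/merged/*` glob points. A guess worth trying: list what mergesegs actually wrote and pass that exact segment path; `<actual-segment-dir>` below is a placeholder, not a real path:)

```
ls crawl/merged/20161003234422
bin/nutch readseg -dump crawl/merged/<actual-segment-dir> dumpedContent
```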




--
View this message in context: http://lucene.472066.n3.nabble.com/crawling-a-subfolder-tp4299300p4299375.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: crawling a subfolder

Posted by KRIS MUSSHORN <mu...@comcast.net>.
To read what is in Nutch, try mergesegs then readseg: 
http://stackoverflow.com/questions/7968534/dump-all-segments-from-nutch 

----- Original Message -----

From: "Néstor" <ro...@gmail.com> 
To: user@nutch.apache.org 
Sent: Monday, October 3, 2016 11:49:15 AM 
Subject: crawling a subfolder 

Hi, 

I am using nutch for the first time and when I crawl www.mysite.com it 
crawls for a while. 
When I try to crawl a subfolder like www.mysite.com/mysubfolder it crawls 
for about 1 sec. 

my urls/seed.txt is set to 
http://www.mysite.com/mysubfolder 
my regex-urlfilter.txt uses the default except for the last 2 lines: 

#+. 
+^http://www.mysite.org/mysubfolder 


Also, when I try to access the results on http://mysite.com:8080/CLIPS using 
Solr, 
I only see 10 records. 

What could I be missing? 
How do I get all the records found? 
Is there a way to look at the crawled data without Solr? 

Thanks, 

-- 
Né§t☼r *Authority gone to one's head is the greatest enemy of Truth*