Posted to dev@nutch.apache.org by Maohua Liu <ca...@gmail.com> on 2013/04/23 16:17:59 UTC

Error when running Nutch, please help

Hi,

These days I have been following the Nutch tutorial at http://wiki.apache.org/nutch/NutchTutorial, but I always get the following error message:

MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Injector: starting at 2013-04-23 22:00:46
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
2013-04-23 22:00:46.562 java[1047:1903] Unable to load realm info from SCDynamicStore
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-04-23 22:01:01, elapsed: 00:00:14
Tue Apr 23 22:01:01 CST 2013 : Iteration 1 of 2
Generating a new segment
2013-04-23 22:01:01.888 java[1055:1903] Unable to load realm info from SCDynamicStore
Generator: starting at 2013-04-23 22:01:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: TestCrawl/segments/20130423220110
Generator: finished at 2013-04-23 22:01:17, elapsed: 00:00:15
Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetching : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-04-23 22:01:18
Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
Fetcher Timelimit set for : 1366736478177
2013-04-23 22:01:18.308 java[1068:1903] Unable to load realm info from SCDynamicStore
Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
	at org.apache.hadoop.fs.Path.initialize(Path.java:148)
	at org.apache.hadoop.fs.Path.<init>(Path.java:126)
	at org.apache.hadoop.fs.Path.<init>(Path.java:50)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
	at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
	at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:105)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
	at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1368)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1341)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110
	at java.net.URI.checkPath(URI.java:1788)
	at java.net.URI.<init>(URI.java:734)
	at org.apache.hadoop.fs.Path.initialize(Path.java:145)
	... 30 more

All I did was follow the tutorial, step by step:
1. download nutch bin from: http://mirror.esocc.com/apache/nutch/1.6/apache-nutch-1.6-bin.zip
2. unzip and step into the dir: apache-nutch-1.6
3. in my home dir I set up JAVA_HOME in .bash_profile like:
JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export JAVA_HOME
4. changed the content of conf/nutch-site.xml to the following:
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>NutchSpider</value>
    </property>
</configuration>

5. under the dir apache-nutch-1.6, executed:
mkdir -p urls
cd urls
touch seed.txt
6. edit seed.txt with content:
http://nutch.apache.org/
7. then edit file conf/regex-urlfilter.txt and replace
# accept anything else
+.
with
+^http://([a-z0-9]*\.)*nutch.apache.org/
8. finally, I ran this command under the dir apache-nutch-1.6:
MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2

9. at the end it shows the error message mentioned above.
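For reference, the accept rule from step 7 can be sanity-checked with grep -E, which handles this simple pattern the same way as Nutch's Java regex (just a rough check, not part of the tutorial):

```shell
# Feed candidate URLs through the accept pattern from regex-urlfilter.txt;
# only URLs matching the rule survive the filter.
printf 'http://nutch.apache.org/\nhttp://example.com/\n' \
  | grep -E '^http://([a-z0-9]*\.)*nutch.apache.org/'
# prints only http://nutch.apache.org/
```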


Please help me solve this problem. Thanks very much.

my java version:
MaohuaLiu-MacBook-Pro:apache-nutch-1.6 carya$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)

Mac OS X version 10.7.5



Best Regards.
--------------------------------------
Maohua Liu
Email: carya.liu@gmail.com


Re: Error when running Nutch, please help

Posted by kiran chitturi <ch...@gmail.com>.
Hi,

I came across a similar situation when using Nutch on my Mac machine in
local mode. I changed line 125 in bin/crawl from

SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`

to

SEGMENT=`ls $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`

and the script worked properly

As other members suggested, the error is caused by that line in the
crawl script.
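The difference is easy to see in isolation. Here is a minimal sketch against a throwaway directory (assuming the tutorial's segments layout) showing that `ls` without `-l` yields the bare segment name, so no sed splitting is needed at all:

```shell
# Build a throwaway segments dir like the one Nutch's generate step creates.
CRAWL_PATH=$(mktemp -d)
mkdir -p "$CRAWL_PATH/segments/20130423220110"
# Plain ls prints only entry names, one per line, on both GNU and BSD.
SEGMENT=$(ls "$CRAWL_PATH/segments/" | egrep '20[0-9]+' | sort -n | tail -n 1)
echo "$SEGMENT"   # 20130423220110
rm -rf "$CRAWL_PATH"
```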

HTH


On Wed, Apr 24, 2013 at 8:34 AM, Maohua Liu <ca...@gmail.com> wrote:



-- 
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>


Re: Error when running Nutch, please help

Posted by feng lu <am...@gmail.com>.
Hi

Maybe this problem is caused by the crawl script. The SEGMENT variable is
set by this command:

 SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`

you can run this command in your terminal:

 ls -l TestCrawl/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1

Maybe it outputs something like this:

drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110

This command generates the segment name that is passed as an input
parameter to the fetcher.

But I don't know why it produces that output on your machine. The correct
SEGMENT value is probably 20130423220110.
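One likely explanation, offered as a guess: BSD sed (the default on Mac OS X) treats `\n` in a replacement as a literal `n`, while GNU sed inserts a newline. A quick check:

```shell
# GNU sed splits the input onto two lines; BSD/macOS sed leaves it on one
# line as "anb" -- which is exactly how 'n' characters could sneak into the
# segment name shown above.
printf 'a b\n' | sed -e 's/ /\n/g'
```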




On Tue, Apr 23, 2013 at 10:17 PM, Maohua Liu <ca...@gmail.com> wrote:



-- 
Don't Grow Old, Grow Up... :-)

Re: Error when running Nutch, please help

Posted by Tejas Patil <te...@gmail.com>.
The crawl script picks up the name of the segment created by Nutch's
generate phase using some shell code:

124 if [ $mode = "local" ]; then
125   SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
126 else
127   SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments | sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
128 fi
129
130 echo "Operating on segment : $SEGMENT"

For some reason, on your setup it gives incorrect output:
Operating on segment : drwxr-xr-xnn3ncaryannstaffnn102nn4n23n22:01n20130423220110

You are running in local mode, so line 125 above applies. Try running "ls
-l TestCrawl/segments/" in your shell. I think it is giving incompatible
output, which causes this. Ideally you should get something like the
example at [0].

[0] : http://www.computerhope.com/unix/uls.htm
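If the incompatibility turns out to be sed's handling of `\n` (a known BSD vs GNU difference), a portable sketch of the same extraction could use `tr`, which inserts a real newline on both platforms. This is only an illustration against a throwaway directory, not a patch to the script:

```shell
# Recreate a segments listing and extract the newest segment name with tr,
# which behaves identically on GNU and BSD systems.
CRAWL_PATH=$(mktemp -d)
mkdir -p "$CRAWL_PATH/segments/20130423220110"
# tr -s turns each run of spaces into one newline, then egrep keeps only
# fields that look like a segment timestamp.
SEGMENT=$(ls -l "$CRAWL_PATH/segments/" | tr -s ' ' '\n' | egrep '^20[0-9]+$' | sort -n | tail -n 1)
echo "$SEGMENT"   # 20130423220110
rm -rf "$CRAWL_PATH"
```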

Thanks,
Tejas







On Tue, Apr 23, 2013 at 7:17 AM, Maohua Liu <ca...@gmail.com> wrote:
