Posted to user@nutch.apache.org by Michael Levy <Lu...@gmail.com> on 2006/04/19 17:01:03 UTC

crawl command params misinterpreted under Solaris?

I hope someone can help me with this problem I'm having with crawling on
Solaris.  The same script works fine on Windows under Cygwin, but I need
to run it on Solaris.

This works fine:
  #bin/nutch crawl urls.txt
...it creates a directory named something like crawl-20060418105008, as
expected, along with a working index.

However, if I try to add any parameters beyond the root_url_file parameter,
I get the output below.  I'm really stumped.  The following does not create
a directory named FOO, but it does create a directory named something like
crawl-20060418105500.  Apparently it ignores the -dir FOO parameter.

Looking at the output, it seems that the whole string "urls.txt -dir FOO"
is being taken as the name of the urls file, rather than "-dir FOO" being
interpreted as options at all.  See the line "rootUrlFile = urls.txt -dir FOO";
it should just be "rootUrlFile = urls.txt", I think.


## bin/nutch crawl urls.txt -dir FOO
060418 105308 parsing 
file:/export/home/www/virtual/wiki/doc_root/nutch-0.7.2/conf/nutch-default.xml
060418 105308 parsing 
file:/export/home/www/virtual/wiki/doc_root/nutch-0.7.2/conf/crawl-tool.xml
060418 105308 parsing 
file:/export/home/www/virtual/wiki/doc_root/nutch-0.7.2/conf/nutch-site.xml
060418 105308 No FS indicated, using default:local
060418 105308 crawl started in: crawl-20060418105308
060418 105308 rootUrlFile = urls.txt -dir FOO
060418 105308 threads = 10
060418 105308 depth = 5
060418 105310 Created webdb at 
LocalFS,/export/home/www/virtual/wiki/doc_root/nutch-0.7.2/crawl-20060418105308/db
Exception in thread "main" java.io.FileNotFoundException: urls.txt -dir 
FOO (No such file or directory)
       at java.io.FileInputStream.open(Native Method)
       at java.io.FileInputStream.<init>(FileInputStream.java:106)
       at java.io.FileReader.<init>(FileReader.java:55)
       at 
org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:372)
       at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
       at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)




Re: crawl command params misinterpreted under Solaris?

Posted by Michael Levy <Lu...@gmail.com>.
I found a way to work around this problem.  Maybe this will help someone 
else. 

Apparently there is some problem with IFS in the sh shell I'm using.  If
I comment out these two lines in bin/nutch:
#IFS=
#unset IFS
it works fine.  If I leave "unset IFS" in, the script won't run and I get
the error "IFS: cannot unset".  (I noticed someone wrote to the list last
year asking about "IFS: cannot unset".)
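
Here's a tiny script that should show the same joining behavior on an
affected shell.  The comments are my guess at the mechanism, not anything
taken from bin/nutch itself:

  #!/bin/sh
  # Repro sketch (a guess at the mechanism): some old Bourne-style shells,
  # Solaris /bin/sh reportedly among them, stop splitting an unquoted $*
  # once IFS is empty, so every parameter reaches the command as one word.
  IFS=

  show() {
      echo "got $# argument(s); first is: $1"
  }

  # Forward this script's arguments, unquoted, the way a wrapper might.
  show $*

  # $ sh repro.sh urls.txt -dir FOO
  # old Solaris sh (reportedly): got 1 argument(s); first is: urls.txt -dir FOO
  # bash:                        got 3 argument(s); first is: urls.txt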

Once I commented out the lines "IFS=" and "unset IFS", the script ran OK
and the various params on the bin/nutch command line were interpreted
properly.  The output line contained only "rootUrlFile = urls.txt", as
you would expect, and not "rootUrlFile = urls.txt -dir FOO".
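
If you would rather keep the unset than delete it outright, a guarded
version might also work.  This is just a common portability idiom, not
something from the Nutch sources, and I haven't tried it on Solaris:

  # Probe in a subshell first; only unset IFS if this shell allows it.
  # Shells that reject it (like old Solaris sh) fail only the subshell.
  if (unset IFS) >/dev/null 2>&1; then
      unset IFS
  fi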

Another approach that works for me is to leave the script as is, still
setting and unsetting IFS, but to run it with bash instead of sh.
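
For example, invoking the unmodified script through bash directly:

  bash bin/nutch crawl urls.txt -dir FOO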

Michael Levy wrote:
> [...]