Posted to user@nutch.apache.org by BDalton <bi...@uniform.ca> on 2006/07/18 22:05:50 UTC

0.8 – Will not accept url list file on Windows

I get this error,

bin/nutch crawl url.txt -dir newcrawled -depth 2 >& crawl.log

Exception in thread "main" java.io.IOException: Input directory
d:/nutch3/urls/urls.txt in local is invalid.
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
	at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

-- 
View this message in context: http://www.nabble.com/0.8--Will-not-accept-url-list-file-on-Windows-tf1962714.html#a5385356
Sent from the Nutch - User forum at Nabble.com.


Re: 0.8 – Will not accept url list file on Windows

Posted by BDalton <bi...@uniform.ca>.
Ah! Thanks all.

On Windows, using Cygwin, the default NUTCH_HOME is at \\cygdrive\

I did have some malformed URLs in my test. Fixed, and everything is fine; I
just didn't know about the input and output changes in 0.8. I'm new to this
list, so hopefully I'll now keep up with the changes.



Re: 0.8 – Will not accept url list file on Windows

Posted by Sami Siren <ss...@gmail.com>.
Logging is also different in 0.8. By default it logs to the file 
$NUTCH_HOME/logs/hadoop.log, so you no longer need to capture stdout and 
stderr to a log file.
--
 Sami Siren
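The logging change above can be checked like this — a sketch only; the install path below is a placeholder, not a value from this thread:

```shell
# Placeholder install directory; point this at your real Nutch 0.8 checkout.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-0.8}"
LOGFILE="$NUTCH_HOME/logs/hadoop.log"
echo "crawl output is written to: $LOGFILE"
# To watch a crawl as it runs (instead of redirecting with '>& crawl.log'):
#   tail -f "$LOGFILE"
```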

BDalton wrote:

>Thank you, that seemed to fix the problem. Unfortunately, another problem
>followed.
>
>With the command: bin/nutch crawl urls1 -dir newcrawled -depth 2 >& crawl.log
>
>I now get a directory called “newcrawled”; however, the crawl.log is created
>empty, without any information. Also, the created index contains no data. No
>error messages. I’m using the July 18 nightly and have no problems with 0.7.2.
>
>
>Sami Siren-2 wrote:
>  
>
>>In 0.8 you submit a _directory_ containing urls.txt, not the file itself.
>>
>>So remove the /urls.txt part from your command line and it should go fine.
>>
>>--
>> Sami Siren
>>
>>BDalton wrote:
>>
>>    
>>
>>>I get this error,
>>>
>>>bin/nutch crawl url.txt -dir newcrawled -depth 2 >& crawl.log
>>>
>>>Exception in thread "main" java.io.IOException: Input directory
>>>d:/nutch3/urls/urls.txt in local is invalid.
>>>	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>>	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>>	at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>>	at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>>


Re: 0.8 - Will not accept url list file on Windows

Posted by Sudhi Seshachala <su...@yahoo.com>.
Please try this command:

  bin/nutch crawl search -dir /usr/data/crawl -depth 2 &> crawl.log &

where the search folder contains the files listing the URLs. The crawler
will write its data into the /usr/data/crawl/crawldb folder, with crawl.log
as the log file.

Hope this helps.
Thanks
Sudhi
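The layout Sudhi describes can be sketched like this (the seed URL is a made-up example; the actual crawl command is shown but not run):

```shell
# Create the seed directory the crawler is pointed at ("search" above).
mkdir -p search
# Each file in the directory lists one URL per line.
echo "http://lucene.apache.org/nutch/" > search/urls.txt
ls search
# The crawl itself would then be:
#   bin/nutch crawl search -dir /usr/data/crawl -depth 2 &> crawl.log &
```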
  


BDalton <bi...@uniform.ca> wrote:
  
Thank you, that seemed to fix the problem. Unfortunately, another problem
followed.

With the command: bin/nutch crawl urls1 -dir newcrawled -depth 2 >& crawl.log

I now get a directory called “newcrawled”; however, the crawl.log is created
empty, without any information. Also, the created index contains no data. No
error messages. I’m using the July 18 nightly and have no problems with 0.7.2.


Sami Siren-2 wrote:
> 
> In 0.8 you submit a _directory_ containing urls.txt, not the file itself.
> 
> So remove the /urls.txt part from your command line and it should go fine.
> 
> --
> Sami Siren
> 
> BDalton wrote:
> 
>>I get this error,
>>
>>bin/nutch crawl url.txt -dir newcrawled -depth 2 >& crawl.log
>>
>>Exception in thread "main" java.io.IOException: Input directory
>>d:/nutch3/urls/urls.txt in local is invalid.
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>> at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>

-- 
View this message in context: http://www.nabble.com/0.8--Will-not-accept-url-list-file-on-Windows-tf1962714.html#a5386778
Sent from the Nutch - User forum at Nabble.com.




Re: 0.8 – Will not accept url list file on Windows

Posted by BDalton <bi...@uniform.ca>.
Thank you, that seemed to fix the problem. Unfortunately, another problem
followed.

With the command: bin/nutch crawl urls1 -dir newcrawled -depth 2 >& crawl.log

I now get a directory called “newcrawled”; however, the crawl.log is created
empty, without any information. Also, the created index contains no data. No
error messages. I’m using the July 18 nightly and have no problems with 0.7.2.


Sami Siren-2 wrote:
> 
> In 0.8 you submit a _directory_ containing urls.txt, not the file itself.
> 
> So remove the /urls.txt part from your command line and it should go fine.
> 
> --
>  Sami Siren
> 
> BDalton wrote:
> 
>>I get this error,
>>
>>bin/nutch crawl url.txt -dir newcrawled -depth 2 >& crawl.log
>>
>>Exception in thread "main" java.io.IOException: Input directory
>>d:/nutch3/urls/urls.txt in local is invalid.
>>	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>	at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>>	at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>>



Re: 0.8 – Will not accept url list file on Windows

Posted by Sami Siren <ss...@gmail.com>.
In 0.8 you submit a _directory_ containing urls.txt, not the file itself.

So remove the /urls.txt part from your command line and it should go fine.

--
 Sami Siren

BDalton wrote:

>I get this error,
>
>bin/nutch crawl url.txt -dir newcrawled -depth 2 >& crawl.log
>
>Exception in thread "main" java.io.IOException: Input directory
>d:/nutch3/urls/urls.txt in local is invalid.
>	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>	at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>	at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
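Sami's fix amounts to moving the seed file into a directory and passing the directory to the crawler. A sketch, using paths that mirror the commands quoted in this thread (the seed URL is a made-up example, and the nutch invocation is shown but not run):

```shell
# Before (0.7.x style): bin/nutch crawl urls.txt ...
#   -> java.io.IOException: Input directory ... is invalid
# After (0.8): put the file inside a directory and pass the directory.
mkdir -p urls
echo "http://lucene.apache.org/nutch/" > urls/urls.txt
ls urls
# Corrected invocation:
#   bin/nutch crawl urls -dir newcrawled -depth 2
```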