Posted to user@nutch.apache.org by Tsengtan A Shuy <tt...@sbcglobal.net> on 2007/07/02 23:33:49 UTC

generate command fails in cygwin environment.

I followed the tutorial at
"http://lucene.apache.org/nutch/tutorial8.html#Intranet%3A+Running+the+Crawl"
and executed the "$ bin/nutch generate crawl/crawldb crawl/segments"
command in my Cygwin environment.
I got the following error message:
Generator: starting
Generator: segment: crawl/segments/20070702142541
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Input directory E:/cygwin/home/Administrator/nutch-0.8.1/crawl/crawldb/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)

Do you know how to solve this problem?
Any feedback will be much appreciated.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com

-----Original Message-----
From: Chris Hane [mailto:chrishane@gmail.com] 
Sent: Monday, July 02, 2007 12:45 PM
To: nutch-user@lucene.apache.org
Subject: Re: Adding meta data to searched documents

Enis - thanks for the pointer.

Enis Soztutar wrote:
> You can write index plugins. Please first read the (slightly outdated)
> tutorial and then check http://wiki.apache.org/nutch/PluginCentral.
> Optionally you may want to write html parse plugins depending on the 
> source of the data.
> 
> Chris Hane wrote:
>> I am looking to use nutch to crawl/index a website.  A lot of the 
>> pages have videos on them.  We have transcripts for the videos that we 
>> would like to be included for indexing; but we do not want to put the 
>> transcripts on the web pages.
>>
>> Is there a way to "add" this information to a given web page for 
>> purposes of indexing as part of the crawl process?  Maybe another 
>> point in the process before the index is generated?  I am hoping there 
>> is a point in the crawl process where I can add augmented content to a 
>> page in the nutch segment (rough thought based on very limited time 
>> spent looking at nutch).
>>
>> We are comfortable using java and can write custom code as needed.  I 
>> would appreciate any pointers on where to look in the nutch code.
>>
>> Thanks in advance,
>> Chris.....
>>
>