Posted to user@nutch.apache.org by brian <br...@gmail.com> on 2009/12/01 09:44:10 UTC

newbie questions

Hi,

I am using Nutch 1.0 under Windows XP with Cygwin, and ran a test
crawl.  It apparently worked, as I can see some data in my crawl
directory with the Luke tool (for which I am also looking for
documentation).  However, when I tried to use the Tomcat interface, it
doesn't work and I get this error in the log file:

2009-11-30 21:42:31,700 WARN  FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run 
program "whoami": CreateProcess error=2, The system cannot find the file 
specified


Also, I would like to know how to extract flat text files from the crawl data.

thanks,

Brian



Re: newbie questions

Posted by yangfeng <ye...@gmail.com>.
You should add the property below; the value is a user name followed by
a group, so substitute your own:

 <property>
   <name>hadoop.job.ugi</name>
   <value>rider,iamsolomon</value>
 </property>

That should fix it!
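
The LoginException in your log is Hadoop trying to run "whoami" to work
out who the current user is; under Tomcat on Windows XP that program is
usually not on the PATH, so the login fails. Setting hadoop.job.ugi
supplies the user and group directly, so the lookup is skipped. Put
together, a minimal nutch-site.xml would look something like this (just
a sketch, reusing the example user and group from above):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.job.ugi</name>
    <!-- format: user,group - use your own values -->
    <value>rider,iamsolomon</value>
  </property>
</configuration>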


Re: newbie questions

Posted by Mischa Tuffield <mi...@garlik.com>.
Hello Brian, 

You are getting a response from another newbie here, so I could be wrong (do excuse me if I am).

If you are attempting to serve a search index from the local filesystem, you need to have the following in your nutch-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

The fs.default.name property is required in nutch-site.xml when you build your .war file for deployment to Tomcat. It should be accompanied by the config below, which should point to the directory your index has been copied to; in my case it looks something like this:

 <property>
   <name>searcher.dir</name>
   <value>/home/nutch/nutch/service/crawl</value>
   <description>
   Path to root of crawl.  This directory is searched (in
   order) for either the file search-servers.txt, containing a list of
   distributed search servers, or the directory "index" containing
   merged indexes, or the directory "segments" containing segment
   indexes.
   </description>
 </property>
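
For reference, the crawl directory that bin/nutch crawl produces should
contain something like this (a typical Nutch 1.0 layout; yours may
differ slightly):

 $ ls /home/nutch/nutch/service/crawl
 crawldb  index  indexes  linkdb  segments

Also note that the conf files get copied into the .war, so a change to
nutch-site.xml only takes effect once the deployed copy is updated. One
way to do that (assuming your war was deployed as ROOT; adjust the path
to match your setup):

 # edit the copy inside the deployed webapp, then restart Tomcat
 $EDITOR $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
 $CATALINA_HOME/bin/shutdown.sh
 $CATALINA_HOME/bin/startup.sh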

Regarding your second question:

bin/nutch readdb yourcrawldir/crawldb -dump yourdumpdir -format csv

gives you a nice flat-file serialisation of your crawl database (note
that -dump needs an output directory; this dumps the per-URL fetch
metadata rather than the page text).
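
If it is the fetched page text you are after rather than the crawldb
metadata, the segment reader can dump that too; a sketch, assuming one
of the timestamped segment directories under your crawl (the segment
name below is made up):

bin/nutch readseg -dump yourcrawldir/segments/20091201094410 textdump

The output directory gets a plain-text dump of the content, fetch data
and parsed text for each URL; there are switches like -nocontent and
-noparsetext to trim what is included.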

I hope this helps, 

Mischa
On 1 Dec 2009, at 08:44, brian wrote:

> Also, I would like to know how to extract flat text files from the crawl data.

___________________________________
Mischa Tuffield
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD