Posted to user@nutch.apache.org by brian <br...@gmail.com> on 2009/12/01 09:44:10 UTC
newbie questions
Hi,
I am using Nutch 1.0 under Windows XP with Cygwin, and ran a test
crawl. It apparently worked, as I can see some data in my crawl
directory with the Luke tool (I am also looking for documentation for
this tool). However, when I tried to use the Tomcat interface, it did
not work and I got this error in the log file:
2009-11-30 21:42:31,700 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run
program "whoami": CreateProcess error=2, The system cannot find the file
specified
Also, I would like to know how to extract flat text files of the crawl data.
thanks,
Brian
Re: newbie questions
Posted by yangfeng <ye...@gmail.com>.
You should add the property below:
<property>
<name>hadoop.job.ugi</name>
<value>rider,iamsolomon</value>
</property>
Then it should be OK!
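For context, here is a hedged sketch of where this property might live. It assumes the property goes in conf/nutch-site.xml (the post does not name the file), wrapped in the usual Hadoop-style <configuration> root element. In Hadoop releases of that era, hadoop.job.ugi holds a comma-separated user name followed by one or more group names, so rider and iamsolomon are the poster's own values and should be replaced with yours. Setting it explicitly should avoid the "whoami" lookup that fails on Windows.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the file name conf/nutch-site.xml is an assumption -->
<configuration>
  <property>
    <name>hadoop.job.ugi</name>
    <!-- format: user,group[,group...]; these are the values from the post -->
    <value>rider,iamsolomon</value>
  </property>
</configuration>
```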
Re: newbie questions
Posted by Mischa Tuffield <mi...@garlik.com>.
Hello Brian,
You are getting a response from another newbie here, so I could be wrong (do excuse me if I am).
If you are attempting to serve a search index from the local filesystem, you need to have the following in your nutch-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
The fs.default.name property is required in nutch-site.xml when you build your .war file for deployment to Tomcat. It should be accompanied by the config below, which should point to the directory where your index has been copied to. In my case it looks something like this:
<property>
<name>searcher.dir</name>
<value>/home/nutch/nutch/service/crawl</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
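Put together, the two properties above could sit in one file as sketched below. This assumes the standard Hadoop-style layout with a single <configuration> root element. The searcher.dir value is the path from my setup and is only a placeholder for your own crawl directory.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml sketch: combines the two properties discussed above -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- use the local filesystem rather than HDFS -->
    <value>file:///</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <!-- placeholder path: point this at the root of your own crawl -->
    <value>/home/nutch/nutch/service/crawl</value>
  </property>
</configuration>
```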
Regarding your second question:
bin/nutch readdb yourcrawldir/crawldb -dump yourdumpdir -format csv
This gives you a nice flat-file serialisation of your crawl database (-dump takes an output directory, here yourdumpdir).
I hope this helps,
Mischa
On 1 Dec 2009, at 08:44, brian wrote:
> also, I would like to know how to extract flat text files of the crawl data.
___________________________________
Mischa Tuffield
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD