Posted to user@nutch.apache.org by Feng Ji <fe...@gmail.com> on 2006/09/04 19:18:29 UTC

how to combine two run's result for search

Hi there,

In Nutch 0.8, I have crawled two webDBs independently.

For each run, I did invertlinks and index, so each one is searchable on its own.

Now I want to combine them together for search. I tried the "merge" command to
merge the two indexes, but searching against the merged index output dir returns nothing.
Do I need to put the output dir in the same directory as the two crawl/ dirs above?

I wonder what the proper steps are to combine two separate runs into one searchable
result. Do I need to combine the two webdbs, merge the segments, and then run
invertlinks and index again?

Thanks for your time,

Michael

Re: selective index searching

Posted by iimchuckles <ii...@gmail.com>.
Hello All,

I've followed the instructions below and for the purpose of searching
multiple indexes at the same time it works great.

I have a crawl directory with multiple indexes beneath it
crawl/category1/section1
crawl/category1/section2
crawl/category2/section1
crawl/category2/section2

How can I search category1/section1 and category2/section1 for the term "help
text"?
Would using a multisearcher help here, or is there a way of specifying the
indexes you want to use?
Possibly create a new field for each item during the crawl/index ("cat:1,sec:2"),
then merge all indexes into one and search on the specified fields?
Am I traveling down the right path, or am I lost?

Thanks in advance,
Chad


Zaheed Haque wrote:
> 
> Hi:
> Assuming you have
> 
> index 1 at /data/crawl1
> index 2 at /data/crawl2
> 
> In nutch-site.xml
> searcher.dir = /data
> 
> Under /data you have a text file called search-servers.txt (I think; do
> check the nutch-site searcher.dir description please)
> 
> In the text file you will have the following
> 
> hostname1 portnumber
> hostname2 portnumber
> 
> example
> localhost 1234
> localhost 5678
> 
> Then you need to start
> bin/nutch server 1234 /data/crawl1 &
> and
> bin/nutch server 5678 /data/crawl2 &
> now try
> bin/nutch org.apache.nutch.search.NutchBean www
> you should see results :-)
> Cheers
> 
> 

-- 
View this message in context: http://www.nabble.com/how-to-combine-two-run%27s-result-for-search-tf2216447.html#a6178389
Sent from the Nutch - User forum at Nabble.com.
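
One way to get the selective searching asked about above, staying within the distributed-search recipe Zaheed describes, is to run one search server per index directory and list only the servers you want in search-servers.txt. The sketch below is illustrative only: the ports are made up, the section paths come from the layout in the question, and searcher.dir is assumed to point at a directory (here /data) that holds search-servers.txt.

# only the sections you want searched get an entry
cat > /data/search-servers.txt <<'EOF'
localhost 9001
localhost 9002
EOF

# one distributed search server per index directory
bin/nutch server 9001 crawl/category1/section1 &
bin/nutch server 9002 crawl/category2/section1 &

Searches issued through the web UI (or NutchBean) then only hit the servers listed in the file, so category1/section2 and category2/section2 stay out of the results.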


Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/18/06, Zaheed Haque <za...@gmail.com> wrote:
> Hi:
>
> I have just checked your flash movie.. quick observation you are
> running tomcat 4.1.31 and there is nothing you are doing that seems
> wrong. Anyway after starting the servers can you search using the
> following command
>
> bin/nutch org.apache.nutch.search.NutchBean bobdocs
>
> what do you get .. and what's in the logfile?
>
> If you get something then probably its tomcat 4.1.31 is  the problem.

tomi@potjeh ~/posao/nutch/novo/nutch-0.8 $ ./bin/nutch
org.apache.nutch.search.NutchBean bobdocs
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/search/NutchBean
tomi@potjeh ~/posao/nutch/novo/nutch-0.8 $

It doesn't really tell me if tomcat is the problem, does it? I've
added debug statements to the nutch script so I can check if my
CLASSPATH is correct. I have no idea why nutch can't find the
NutchBean class.
I have, however, checked out the nutch 0.8 and hadoop 0.5 sources from
the svn repository, imported them into an eclipse project and used the
DistributedSearch Client and Server "public static void main" methods.
My experiments showed that my problem is not with tomcat or the nutch
web UI, because the DistributedSearch.Client also returned 0 results
regardless of the query or combination of indexes. I've managed to
confirm that the Client sees all the search servers, but it simply
fails to return any results.
I also ran across something in the logs that I didn't see before. The
following is periodically output (regardless of what I'm doing in
eclipse, as long as the Client thread is active):

2006-09-18 13:55:30,352 INFO  searcher.DistributedSearch - STATS: 2
servers, 2 segments.
2006-09-18 13:55:40,539 INFO  searcher.DistributedSearch - Querying
segments from search servers...
2006-09-18 13:55:40,559 INFO  searcher.DistributedSearch - STATS: 2
servers, 2 segments.
2006-09-18 13:55:50,564 INFO  searcher.DistributedSearch - Querying
segments from search servers...

Going back to square one...am I building the crawls correctly?
./bin/nutch crawl urls -threads 15 -topN 10 -depth 3

Is it the fact that I'm doing an intranet crawl every time, instead of
the multi-step whole web crawl? What else, what am I missing?

t.n.a.
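
One likely culprit for the NoClassDefFoundError above is the class name itself: in Nutch 0.8 NutchBean appears to live in the org.apache.nutch.searcher package (the same package the DistributedSearch log lines come from), not org.apache.nutch.search. A quick check, assuming the command is run from the unpacked NUTCH_HOME so that bin/nutch can assemble its classpath from the release jars:

cd ~/posao/nutch/novo/nutch-0.8
# note "searcher", not "search", in the package name
./bin/nutch org.apache.nutch.searcher.NutchBean bobdocs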

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
Hi:

I have just checked your flash movie. A quick observation: you are
running Tomcat 4.1.31, and nothing you are doing seems wrong. Anyway,
after starting the servers, can you search using the following
command:

bin/nutch org.apache.nutch.search.NutchBean bobdocs

What do you get, and what's in the logfile?

If you get something, then Tomcat 4.1.31 is probably the problem.

/Zaheed


On 9/18/06, Tomi NA <he...@gmail.com> wrote:
> On 9/16/06, Tomi NA <he...@gmail.com> wrote:
> > On 9/15/06, Tomi NA <he...@gmail.com> wrote:
> > > On 9/14/06, Zaheed Haque <za...@gmail.com> wrote:
> > > > > Thats the way I set it up at first.
> > > > > This time, I started with a blank slate, unpacked nutch and tomcat,
> > > > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > > > > untouched.
> > > >
> > > > The above means that you have an empty nutch-site.xml under
> > > > webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> > > > property = crawl. Am I correct? cos you left the deployed web app
> > > > untouched? no?
> > >
> > > You are correct, the searcher.dir property is set to "crawl".
> > >
> > > > > I then pointed the "crawl" symlink in the current dir to point to the
> > > > > "crawls" directory, where my search-servers.txt (with two "localhost
> > > > > port" entries). In the "crawls" dir I also have two nutch-built
> > > > > indexes.
> > > >
> > > > If I remember it correctly I had some trouble with symlink once but I
> > > > don't exactly remember why.. maybe you can try without symlink..
> > >
> > > I tried renaming the directory "crawls" to "crawl", then running the
> > > servers like so:
> > > ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
> > > ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &
> > >
> > > > > Now, I start nutch distributed search servers on each index and start
> > > > > tomcat from the dir containing the "crawl" link. I get no results at
> > > > > all.
> > > > > If I change the link to point to "crawls/crawl1", the search works
> > > >
> > > > I am guessing the above is also a symlink.. hmm.. maybe it has
> > > > something to do with distributed search and symlink.. no?
> > >
> > > It doesn't appear to be the problem. I tried without symlinks without success.
> > >
> > > I'm going to document the problem better today, so maybe that will help.
> > > I'm having trouble believing what I'm trying to achieve is so
> > > problematic...nevertheless, I appreciate your  effort so far.
> >
> > I don't think I can document the problem better than I have here:
> > http://tna.sharanet.org/problem.html
> >
> > It's a 2-minute flash movie showing exactly what I'm doing. I'd very
> > much appreciate anyone taking a look at it, but especially Zaheed.
> > The only thing I forgot to display in the movie is my search-servers.txt:
> > localhost 8192
> > localhost 8193
> >
> > Now, what am I doing wrong?
> >
> > t.n.a.
>
> Anyone? Renaud, Zaheed, Feng?
>
> t.n.a.
>

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/16/06, Tomi NA <he...@gmail.com> wrote:
> On 9/15/06, Tomi NA <he...@gmail.com> wrote:
> > On 9/14/06, Zaheed Haque <za...@gmail.com> wrote:
> > > > Thats the way I set it up at first.
> > > > This time, I started with a blank slate, unpacked nutch and tomcat,
> > > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > > > untouched.
> > >
> > > The above means that you have an empty nutch-site.xml under
> > > webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> > > property = crawl. Am I correct? cos you left the deployed web app
> > > untouched? no?
> >
> > You are correct, the searcher.dir property is set to "crawl".
> >
> > > > I then pointed the "crawl" symlink in the current dir to point to the
> > > > "crawls" directory, where my search-servers.txt (with two "localhost
> > > > port" entries). In the "crawls" dir I also have two nutch-built
> > > > indexes.
> > >
> > > If I remember it correctly I had some trouble with symlink once but I
> > > don't exactly remember why.. maybe you can try without symlink..
> >
> > I tried renaming the directory "crawls" to "crawl", then running the
> > servers like so:
> > ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
> > ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &
> >
> > > > Now, I start nutch distributed search servers on each index and start
> > > > tomcat from the dir containing the "crawl" link. I get no results at
> > > > all.
> > > > If I change the link to point to "crawls/crawl1", the search works
> > >
> > > I am guessing the above is also a symlink.. hmm.. maybe it has
> > > something to do with distributed search and symlink.. no?
> >
> > It doesn't appear to be the problem. I tried without symlinks without success.
> >
> > I'm going to document the problem better today, so maybe that will help.
> > I'm having trouble believing what I'm trying to achieve is so
> > problematic...nevertheless, I appreciate your  effort so far.
>
> I don't think I can document the problem better than I have here:
> http://tna.sharanet.org/problem.html
>
> It's a 2-minute flash movie showing exactly what I'm doing. I'd very
> much appreciate anyone taking a look at it, but especially Zaheed.
> The only thing I forgot to display in the movie is my search-servers.txt:
> localhost 8192
> localhost 8193
>
> Now, what am I doing wrong?
>
> t.n.a.

Anyone? Renaud, Zaheed, Feng?

t.n.a.

Re: Nutch with Drupal (PHP)

Posted by Robert Douglass <ro...@robshouse.net>.
And thanks Paul, I wasn't aware of the possibility of running Drupal 
within Resin/Quercus. I'll have to give that a whirl!

-Robert

sub paul wrote:
> For those who want to stay all java..
>
> Take a look at Java PHP Implementation that Caucho has done in Resin.
> http://wiki.caucho.com/Quercus:_Drupal
>
> You can run Drupal in Resin and stay all java at the end of the day, 
> if you
> prefer.
>
> Thanks Rob.. will give it a whirl.
>
> Paul
>
>
> On 9/17/06, Robert Douglass <ro...@robshouse.net> wrote:
>>
>> Hi all,
>>
>> I wanted to bring to your attention two Drupal modules which provide a
>> way for you to integrate Nutch with Drupal. If you've never heard of it,
>> Drupal is a pretty decent PHP framework (sometimes called a CMS) which
>> is also very developer friendly.
>>
>> The first module is a client for OpenSearch RSS. This lets you have a
>> search form in your Drupal site that can use Nutch (or any other OpenRSS
>> provider) as the search backend.
>>
>> http://drupal.org/project/opensearchclient
>>
>> The second module is a Nutch integration module that allows you to
>> control the Nutch crawl/index lifecycle from your Drupal admin 
>> interface:
>>
>> http://drupal.org/project/nutch
>>
>> It is mostly just a set of bash scripts which can be triggered from
>> within Drupal. There is a lot of room for further development in this
>> module and I'd be most happy to get more Nutch experts working on it.
>>
>>
>> Anyway, perhaps some of you will find these modules useful.
>>
>> Cheers,
>>
>> Robert Douglass
>>
>>
>> PS. There is also an OpenSearchClient module which produces OpenSearch
>> RSS from Drupal's own search results, opening up the possibility of
>> having one site be the search provider for another.
>>
>> http://drupal.org/project/opensearch
>>
>


Re: Nutch with Drupal (PHP)

Posted by sub paul <su...@gmail.com>.
For those who want to stay all java..

Take a look at Java PHP Implementation that Caucho has done in Resin.
http://wiki.caucho.com/Quercus:_Drupal

You can run Drupal in Resin and stay all java at the end of the day, if you
prefer.

Thanks Rob.. will give it a whirl.

Paul


On 9/17/06, Robert Douglass <ro...@robshouse.net> wrote:
>
> Hi all,
>
> I wanted to bring to your attention two Drupal modules which provide a
> way for you to integrate Nutch with Drupal. If you've never heard of it,
> Drupal is a pretty decent PHP framework (sometimes called a CMS) which
> is also very developer friendly.
>
> The first module is a client for OpenSearch RSS. This lets you have a
> search form in your Drupal site that can use Nutch (or any other OpenRSS
> provider) as the search backend.
>
> http://drupal.org/project/opensearchclient
>
> The second module is a Nutch integration module that allows you to
> control the Nutch crawl/index lifecycle from your Drupal admin interface:
>
> http://drupal.org/project/nutch
>
> It is mostly just a set of bash scripts which can be triggered from
> within Drupal. There is a lot of room for further development in this
> module and I'd be most happy to get more Nutch experts working on it.
>
>
> Anyway, perhaps some of you will find these modules useful.
>
> Cheers,
>
> Robert Douglass
>
>
> PS. There is also an OpenSearchClient module which produces OpenSearch
> RSS from Drupal's own search results, opening up the possibility of
> having one site be the search provider for another.
>
> http://drupal.org/project/opensearch
>

Nutch with Drupal (PHP)

Posted by Robert Douglass <ro...@robshouse.net>.
Hi all,

I wanted to bring to your attention two Drupal modules which provide a 
way for you to integrate Nutch with Drupal. If you've never heard of it, 
Drupal is a pretty decent PHP framework (sometimes called a CMS) which 
is also very developer friendly.

The first module is a client for OpenSearch RSS. This lets you have a 
search form in your Drupal site that can use Nutch (or any other OpenRSS 
provider) as the search backend.

http://drupal.org/project/opensearchclient

The second module is a Nutch integration module that allows you to 
control the Nutch crawl/index lifecycle from your Drupal admin interface:

http://drupal.org/project/nutch

It is mostly just a set of bash scripts which can be triggered from 
within Drupal. There is a lot of room for further development in this 
module and I'd be most happy to get more Nutch experts working on it.


Anyway, perhaps some of you will find these modules useful.

Cheers,

Robert Douglass


PS. There is also an OpenSearchClient module which produces OpenSearch 
RSS from Drupal's own search results, opening up the possibility of 
having one site be the search provider for another.

http://drupal.org/project/opensearch

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/15/06, Tomi NA <he...@gmail.com> wrote:
> On 9/14/06, Zaheed Haque <za...@gmail.com> wrote:
> > > Thats the way I set it up at first.
> > > This time, I started with a blank slate, unpacked nutch and tomcat,
> > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > > untouched.
> >
> > The above means that you have an empty nutch-site.xml under
> > webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> > property = crawl. Am I correct? cos you left the deployed web app
> > untouched? no?
>
> You are correct, the searcher.dir property is set to "crawl".
>
> > > I then pointed the "crawl" symlink in the current dir to point to the
> > > "crawls" directory, where my search-servers.txt (with two "localhost
> > > port" entries). In the "crawls" dir I also have two nutch-built
> > > indexes.
> >
> > If I remember it correctly I had some trouble with symlink once but I
> > don't exactly remember why.. maybe you can try without symlink..
>
> I tried renaming the directory "crawls" to "crawl", then running the
> servers like so:
> ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
> ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &
>
> > > Now, I start nutch distributed search servers on each index and start
> > > tomcat from the dir containing the "crawl" link. I get no results at
> > > all.
> > > If I change the link to point to "crawls/crawl1", the search works
> >
> > I am guessing the above is also a symlink.. hmm.. maybe it has
> > something to do with distributed search and symlink.. no?
>
> It doesn't appear to be the problem. I tried without symlinks without success.
>
> I'm going to document the problem better today, so maybe that will help.
> I'm having trouble believing what I'm trying to achieve is so
> problematic...nevertheless, I appreciate your  effort so far.

I don't think I can document the problem better than I have here:
http://tna.sharanet.org/problem.html

It's a 2-minute flash movie showing exactly what I'm doing. I'd very
much appreciate anyone taking a look at it, but especially Zaheed.
The only thing I forgot to display in the movie is my search-servers.txt:
localhost 8192
localhost 8193

Now, what am I doing wrong?

t.n.a.

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/14/06, Zaheed Haque <za...@gmail.com> wrote:
> > Thats the way I set it up at first.
> > This time, I started with a blank slate, unpacked nutch and tomcat,
> > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > untouched.
>
> The above means that you have an empty nutch-site.xml under
> webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> property = crawl. Am I correct? cos you left the deployed web app
> untouched? no?

You are correct, the searcher.dir property is set to "crawl".

> > I then pointed the "crawl" symlink in the current dir to point to the
> > "crawls" directory, where my search-servers.txt (with two "localhost
> > port" entries). In the "crawls" dir I also have two nutch-built
> > indexes.
>
> If I remember it correctly I had some trouble with symlink once but I
> don't exactly remember why.. maybe you can try without symlink..

I tried renaming the directory "crawls" to "crawl", then running the
servers like so:
./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &

> > Now, I start nutch distributed search servers on each index and start
> > tomcat from the dir containing the "crawl" link. I get no results at
> > all.
> > If I change the link to point to "crawls/crawl1", the search works
>
> I am guessing the above is also a symlink.. hmm.. maybe it has
> something to do with distributed search and symlink.. no?

It doesn't appear to be the problem. I tried without symlinks without success.

I'm going to document the problem better today, so maybe that will help.
I'm having trouble believing what I'm trying to achieve is so
problematic...nevertheless, I appreciate your  effort so far.

t.n.a.

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
> Thats the way I set it up at first.
> This time, I started with a blank slate, unpacked nutch and tomcat,
> unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> untouched.

The above means that you have an empty nutch-site.xml under
webapps/ROOT and a nutch-default.xml with the searcher.dir
property set to "crawl". Am I correct, since you left the deployed web app
untouched?

> I then pointed the "crawl" symlink in the current dir to point to the
> "crawls" directory, where my search-servers.txt (with two "localhost
> port" entries). In the "crawls" dir I also have two nutch-built
> indexes.

If I remember correctly, I had some trouble with a symlink once, but I
don't remember exactly why. Maybe you can try without the symlink.

> Now, I start nutch distributed search servers on each index and start
> tomcat from the dir containing the "crawl" link. I get no results at
> all.
> If I change the link to point to "crawls/crawl1", the search works

I am guessing the above is also a symlink. Hmm, maybe it has
something to do with distributed search and symlinks?

> i.e. I get a couple of results. What seems to be the problem is
> inserting the distributed search server between the index and tomcat.
> Nothing I do makes the least bit of difference. :\

> t.n.a.
>

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/14/06, Zaheed Haque <za...@gmail.com> wrote:
> On 9/14/06, Tomi NA <he...@gmail.com> wrote:
> > On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> > > Hi:
> >
> > I have a problem or two with the described procedure...
> >
> > > Assuming you have
> > >
> > > index 1 at /data/crawl1
> > > index 2 at /data/crawl2
> >
> > Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
> > generate an index: luke says the index is valid and I can query it
> > using luke's interface.
> >
> > Does the "searcher.dir" value in nutch-(default|site).xml have any
> > impact on the way indexes are created?
>
> No it doesn't have any impact on index creation. searcher.dir value is
> for searching only. nutch-site.xml is where you should change..
> example...
>
> <property>
>   <name>searcher.dir</name>
>   <value> /home/myhome/crawls</value>
>   <description>
>   Path to root of index directories.  This directory is searched (in
>   order) for either the file search-servers.txt, containing a list of
>   distributed search servers, or the directory "index" containing
>   merged indexes, or the directory "segments" containing segment
>   indexes.
>   </description>
> </property>
>
> and the text file should be in this case ...
>
>  /home/myhome/crawls/search-servers.txt

That's the way I set it up at first.
This time, I started with a blank slate, unpacked Nutch and Tomcat,
unpacked nutch-0.8.war into webapps/ROOT, and left the deployed app
untouched.
I then pointed the "crawl" symlink in the current dir at the
"crawls" directory, where my search-servers.txt (with two "localhost
port" entries) lives. In the "crawls" dir I also have two Nutch-built
indexes.
Now, I start a Nutch distributed search server on each index and start
Tomcat from the dir containing the "crawl" link. I get no results at
all.
If I change the link to point to "crawls/crawl1", the search works,
i.e. I get a couple of results. What seems to be the problem is
inserting the distributed search server between the index and Tomcat.
Nothing I do makes the least bit of difference. :\

t.n.a.

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
On 9/14/06, Tomi NA <he...@gmail.com> wrote:
> On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> > Hi:
>
> I have a problem or two with the described procedure...
>
> > Assuming you have
> >
> > index 1 at /data/crawl1
> > index 2 at /data/crawl2
>
> Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
> generate an index: luke says the index is valid and I can query it
> using luke's interface.
>
> Does the "searcher.dir" value in nutch-(default|site).xml have any
> impact on the way indexes are created?

No, it doesn't have any impact on index creation. The searcher.dir value is
for searching only. nutch-site.xml is where you should change it.
Example:

<property>
  <name>searcher.dir</name>
  <value> /home/myhome/crawls</value>
  <description>
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

and the text file should be in this case ...

 /home/myhome/crawls/search-servers.txt


> > In nutch-site.xml
> > searcher.dir = /data
>
> This is the nutch-site.xml of the web UI?

Both. I mean tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml as
well as NUTCH_HOME/conf/nutch-site.xml.

The web application needs to know where the search-servers.txt file is if
you plan to use Tomcat to search.

> > Under /data you have a text file called search-server.txt (I think do
> > check nutch-site search.dir description please)
>
> /home/myhome/crawls/search-servers.txt
>
> > In the text file you will have the following
> >
> > hostname1 portnumber
> > hostname2 portnumber
> >
> > example
> > localhost 1234
> > localhost 5678
>
> I placed
> localhost 12567
> (just one instance, to test)
>
> > Then you need to start
> >
> > bin/nutch server 1234 /data/craw1 &
> >
> > and
> >
> > bin/nutch server 5678 /data/crawl2 &
>
> did that, using port 12567
> ./bin/nutch server 12567 /home/mydir/crawls/mycrawldir &
>
> > bin/nutch org.apache.nutch.search.NutchBean www
> >
> > you should see results :-)
>
> I get:
> ------------
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/nutch/search/NutchBean
> ------------
>
> Whats more, I get no results to any query I care to pass by the Web
> UI, which suggests the UI isn't connected to the underlying
> DistributedSearch server. :\
>
> Any hints, anyone?
>
> TIA,
> t.n.a.
>
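
Zaheed's point about keeping both copies of nutch-site.xml in sync can be collapsed into a few shell steps. This is a rough sketch, assuming NUTCH_HOME is the unpacked nutch-0.8 directory, CATALINA_HOME is the Tomcat installation with the Nutch webapp unpacked into webapps/ROOT, and /home/myhome/crawls is the searcher.dir value from the example above:

# minimal nutch-site.xml with searcher.dir pointing at the crawls directory
cat > $NUTCH_HOME/conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/myhome/crawls</value>
  </property>
</configuration>
EOF

# the web UI reads its own copy of the config, so mirror it there as well
cp $NUTCH_HOME/conf/nutch-site.xml \
   $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml

# searcher.dir must contain search-servers.txt, one "host port" pair per line
printf 'localhost 1234\nlocalhost 5678\n' > /home/myhome/crawls/search-servers.txt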

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> Hi:

I have a problem or two with the described procedure...

> Assuming you have
>
> index 1 at /data/crawl1
> index 2 at /data/crawl2

I used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
generate an index: Luke says the index is valid and I can query it
using Luke's interface.

Does the "searcher.dir" value in nutch-(default|site).xml have any
impact on the way indexes are created?

> In nutch-site.xml
> searcher.dir = /data

This is the nutch-site.xml of the web UI?

> Under /data you have a text file called search-server.txt (I think do
> check nutch-site search.dir description please)

/home/myhome/crawls/search-servers.txt

> In the text file you will have the following
>
> hostname1 portnumber
> hostname2 portnumber
>
> example
> localhost 1234
> localhost 5678

I placed
localhost 12567
(just one instance, to test)

> Then you need to start
>
> bin/nutch server 1234 /data/craw1 &
>
> and
>
> bin/nutch server 5678 /data/crawl2 &

did that, using port 12567
./bin/nutch server 12567 /home/mydir/crawls/mycrawldir &

> bin/nutch org.apache.nutch.search.NutchBean www
>
> you should see results :-)

I get:
------------
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/search/NutchBean
------------

What's more, I get no results for any query I pass via the Web
UI, which suggests the UI isn't connected to the underlying
DistributedSearch server. :\

Any hints, anyone?

TIA,
t.n.a.

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/6/06, Zaheed Haque <za...@gmail.com> wrote:
> On 9/6/06, Tomi NA <he...@gmail.com> wrote:
> > On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> > > Hi:
> >
> > > In the text file you will have the following
> > >
> > > hostname1 portnumber
> > > hostname2 portnumber
> > >
> > > example
> > > localhost 1234
> > > localhost 5678
> > >
> >
> > Does this work with nutch 0.7.2 or is it specific to the 0.8 release?
>
> I don't really know I have never tried 0.7. From the CVS it seems like it does
>
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.7.2/conf/nutch-default.xml?revision=390479&view=markup
>
> but I don't know if the command structures are the same..

Just thought you might know off the top of your head; I'll go try it out.

t.n.a.

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
On 9/6/06, Tomi NA <he...@gmail.com> wrote:
> On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> > Hi:
>
> > In the text file you will have the following
> >
> > hostname1 portnumber
> > hostname2 portnumber
> >
> > example
> > localhost 1234
> > localhost 5678
> >
>
> Does this work with nutch 0.7.2 or is it specific to the 0.8 release?

I don't really know; I have never tried 0.7. From the repository it seems like it does:

http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.7.2/conf/nutch-default.xml?revision=390479&view=markup

but I don't know if the command structures are the same.

cheers

Re: how to combine two run's result for search

Posted by Tomi NA <he...@gmail.com>.
On 9/5/06, Zaheed Haque <za...@gmail.com> wrote:
> Hi:

> In the text file you will have the following
>
> hostname1 portnumber
> hostname2 portnumber
>
> example
> localhost 1234
> localhost 5678
>

Does this work with nutch 0.7.2 or is it specific to the 0.8 release?

t.n.a.

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
On 9/6/06, Dennis Kubes <nu...@dragonflymc.com> wrote:
> Are those like the shuttle boards?  Smaller 1/4 size boxes?
>
Yes I was actually thinking about the following:

http://www.via.com.tw/en/initiatives/spearhead/clusterserver/

But put 4 boards in 1U like these guys did..

http://linitx.com/product_info.php?cPath=40&products_id=267
http://www2.multithread.co.uk/images/lex_quad_motherboard_1u_case_large.jpg

and have the disks in another cabinet. I don't think the price for
the hardware will be less, but the operating cost, i.e. hosting/power, will
be a lot less. I would really like to give this a try :-)

Cheers

Re: how to combine two run's result for search

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Are those like the shuttle boards?  Smaller 1/4 size boxes?

Dennis

Zaheed Haque wrote:
> Renaud:
>
> Yes or No!. I have done some testing as Dennis Kubes suggested and got
> similler results like his test. In short having 4 nutch search servers
> in one box but in 4 different disks with in my case 0.75 mil docs per
> disk. I had about 4 gig memory and 1 AMD 64 processor and it worked
> out rather ok. I need to do more testing to fine tune this cos this
> really brings the issue of cost. I have also thought about doing some
> testing with VIA EPIA boards. Maybe in the future :-)
>
> The problem I encountered is more this
>
> http://issues.apache.org/jira/browse/NUTCH-92
>
> but this will be solved sooner or later just a matter of time.
>
> Cheers
>
>
> On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
>> Zaheed,
>>
>> Thank you, that works good. Do you know if there is a big performance
>> overhead with starting 2 servers? As an alternative, we could use
>> Lucene's Multisearcher?
>>
>> -- Renaud
>>
>>
>> Zaheed Haque wrote:
>> > Hi:
>> >
>> > Assuming you have
>> >
>> > index 1 at /data/crawl1
>> > index 2 at /data/crawl2
>> >
>> > In nutch-site.xml
>> > searcher.dir = /data
>> >
>> > Under /data you have a text file called search-server.txt (I think do
>> > check nutch-site search.dir description please)
>> >
>> > In the text file you will have the following
>> >
>> > hostname1 portnumber
>> > hostname2 portnumber
>> >
>> > example
>> > localhost 1234
>> > localhost 5678
>> >
>> > Then you need to start
>> >
>> > bin/nutch server 1234 /data/craw1 &
>> >
>> > and
>> >
>> > bin/nutch server 5678 /data/crawl2 &
>> >
>> > now try
>> >
>> > bin/nutch org.apache.nutch.search.NutchBean www
>> >
>> > you should see results :-)
>> >
>> > Cheers
>> >
>> > On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
>> >> @Dennis,
>> >> Can you explain how to setup distributed search while storing the 2
>> >> indexes on the same local machine (if possible)?
>> >>
>> >> @Feng,
>> >> We created a shell script to merge 2 runs, let us know if that 
>> works for
>> >> you.
>> >> http://wiki.apache.org/nutch/MergeCrawl
>> >>
>> >> Renaud
>> >>
>> >>
>> >> Dennis Kubes wrote:
>> >> > You can keep the indexes separate and use the distributed search
>> >> > server, one per index or you can use the mergedb and mergesegs
>> >> > commands to merge the two runs into a single crawldb and a single
>> >> > segments then re-run the invertlinks and index to create a single
>> >> > index file which can then be searched.
>> >> >
>> >> > Dennis
>> >> >
>> >> > Feng Ji wrote:
>> >> >> Hi there,
>> >> >>
>> >> >> In Nutch 08, I have crawled down from two webDB independently.
>> >> >>
>> >> >> For each run, I did invertlinks and index. So each one is 
>> searchable.
>> >> >>
>> >> >> Now I want to combine them togeter for search. I tried "merge"
>> >> >> command to
>> >> >> merge two indexes, but the search for the result index output 
>> dir is
>> >> >> dull.
>> >> >> Do I need put output dir to the same directory as above two 
>> crawl/ ?
>> >> >>
>> >> >> I wonder what is proper steps to combine two seperate run into one
>> >> >> search
>> >> >> result. Do I need to combine two webdb, merge two segments and do
>> >> >> invertlinks and do index?
>> >> >>
>> >> >> thanks your time,
>> >> >>
>> >> >> Michael,
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Renaud Richardet
>> >> COO America
>> >> Wyona    -   Open Source Content Management   -   Apache Lenya
>> >> office +1 857 776-3195                  mobile +1 617 230 9112
>> >> renaud.richardet <at> wyona.com           http://www.wyona.com
>> >>
>> >>
>> >
>>
>> -- 
>> Renaud Richardet
>> COO America
>> Wyona    -   Open Source Content Management   -   Apache Lenya
>> office +1 857 776-3195                  mobile +1 617 230 9112
>> renaud.richardet <at> wyona.com           http://www.wyona.com
>>
>>

Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
Renaud:

Yes or no! I have done some testing as Dennis Kubes suggested and got
results similar to his. In short, I had 4 Nutch search servers
in one box but on 4 different disks, with in my case 0.75 million docs per
disk. I had about 4 GB of memory and 1 AMD64 processor, and it worked
out rather OK. I need to do more testing to fine-tune this, because it
really brings up the issue of cost. I have also thought about doing some
testing with VIA EPIA boards. Maybe in the future :-)

The problem I encountered is more this

http://issues.apache.org/jira/browse/NUTCH-92

but this will be solved sooner or later; it's just a matter of time.

Cheers


On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
> Zaheed,
>
> Thank you, that works good. Do you know if there is a big performance
> overhead with starting 2 servers? As an alternative, we could use
> Lucene's Multisearcher?
>
> -- Renaud
>
>
> Zaheed Haque wrote:
> > Hi:
> >
> > Assuming you have
> >
> > index 1 at /data/crawl1
> > index 2 at /data/crawl2
> >
> > In nutch-site.xml
> > searcher.dir = /data
> >
> > Under /data you have a text file called search-server.txt (I think do
> > check nutch-site search.dir description please)
> >
> > In the text file you will have the following
> >
> > hostname1 portnumber
> > hostname2 portnumber
> >
> > example
> > localhost 1234
> > localhost 5678
> >
> > Then you need to start
> >
> > bin/nutch server 1234 /data/craw1 &
> >
> > and
> >
> > bin/nutch server 5678 /data/crawl2 &
> >
> > now try
> >
> > bin/nutch org.apache.nutch.search.NutchBean www
> >
> > you should see results :-)
> >
> > Cheers
> >
> > On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
> >> @Dennis,
> >> Can you explain how to setup distributed search while storing the 2
> >> indexes on the same local machine (if possible)?
> >>
> >> @Feng,
> >> We created a shell script to merge 2 runs, let us know if that works for
> >> you.
> >> http://wiki.apache.org/nutch/MergeCrawl
> >>
> >> Renaud
> >>
> >>
> >> Dennis Kubes wrote:
> >> > You can keep the indexes separate and use the distributed search
> >> > server, one per index or you can use the mergedb and mergesegs
> >> > commands to merge the two runs into a single crawldb and a single
> >> > segments then re-run the invertlinks and index to create a single
> >> > index file which can then be searched.
> >> >
> >> > Dennis
> >> >
> >> > Feng Ji wrote:
> >> >> Hi there,
> >> >>
> >> >> In Nutch 08, I have crawled down from two webDB independently.
> >> >>
> >> >> For each run, I did invertlinks and index. So each one is searchable.
> >> >>
> >> >> Now I want to combine them togeter for search. I tried "merge"
> >> >> command to
> >> >> merge two indexes, but the search for the result index output dir is
> >> >> dull.
> >> >> Do I need put output dir to the same directory as above two crawl/ ?
> >> >>
> >> >> I wonder what is proper steps to combine two seperate run into one
> >> >> search
> >> >> result. Do I need to combine two webdb, merge two segments and do
> >> >> invertlinks and do index?
> >> >>
> >> >> thanks your time,
> >> >>
> >> >> Michael,
> >> >>
> >> >
> >>
> >> --
> >> Renaud Richardet
> >> COO America
> >> Wyona    -   Open Source Content Management   -   Apache Lenya
> >> office +1 857 776-3195                  mobile +1 617 230 9112
> >> renaud.richardet <at> wyona.com           http://www.wyona.com
> >>
> >>
> >
>
> --
> Renaud Richardet
> COO America
> Wyona    -   Open Source Content Management   -   Apache Lenya
> office +1 857 776-3195                  mobile +1 617 230 9112
> renaud.richardet <at> wyona.com           http://www.wyona.com
>
>

Re: how to combine two run's result for search

Posted by Renaud Richardet <re...@wyona.com>.
Zaheed,

Thank you, that works well. Do you know if there is a big performance
overhead in starting 2 servers? As an alternative, could we use
Lucene's MultiSearcher?

-- Renaud


Zaheed Haque wrote:
> Hi:
>
> Assuming you have
>
> index 1 at /data/crawl1
> index 2 at /data/crawl2
>
> In nutch-site.xml
> searcher.dir = /data
>
> Under /data you have a text file called search-server.txt (I think do
> check nutch-site search.dir description please)
>
> In the text file you will have the following
>
> hostname1 portnumber
> hostname2 portnumber
>
> example
> localhost 1234
> localhost 5678
>
> Then you need to start
>
> bin/nutch server 1234 /data/craw1 &
>
> and
>
> bin/nutch server 5678 /data/crawl2 &
>
> now try
>
> bin/nutch org.apache.nutch.search.NutchBean www
>
> you should see results :-)
>
> Cheers
>
> On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
>> @Dennis,
>> Can you explain how to setup distributed search while storing the 2
>> indexes on the same local machine (if possible)?
>>
>> @Feng,
>> We created a shell script to merge 2 runs, let us know if that works for
>> you.
>> http://wiki.apache.org/nutch/MergeCrawl
>>
>> Renaud
>>
>>
>> Dennis Kubes wrote:
>> > You can keep the indexes separate and use the distributed search
>> > server, one per index or you can use the mergedb and mergesegs
>> > commands to merge the two runs into a single crawldb and a single
>> > segments then re-run the invertlinks and index to create a single
>> > index file which can then be searched.
>> >
>> > Dennis
>> >
>> > Feng Ji wrote:
>> >> Hi there,
>> >>
>> >> In Nutch 08, I have crawled down from two webDB independently.
>> >>
>> >> For each run, I did invertlinks and index. So each one is searchable.
>> >>
>> >> Now I want to combine them togeter for search. I tried "merge"
>> >> command to
>> >> merge two indexes, but the search for the result index output dir is
>> >> dull.
>> >> Do I need put output dir to the same directory as above two crawl/ ?
>> >>
>> >> I wonder what is proper steps to combine two seperate run into one
>> >> search
>> >> result. Do I need to combine two webdb, merge two segments and do
>> >> invertlinks and do index?
>> >>
>> >> thanks your time,
>> >>
>> >> Michael,
>> >>
>> >
>>
>> -- 
>> Renaud Richardet
>> COO America
>> Wyona    -   Open Source Content Management   -   Apache Lenya
>> office +1 857 776-3195                  mobile +1 617 230 9112
>> renaud.richardet <at> wyona.com           http://www.wyona.com
>>
>>
>

-- 
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com


Re: how to combine two run's result for search

Posted by Zaheed Haque <za...@gmail.com>.
Hi:

Assuming you have

index 1 at /data/crawl1
index 2 at /data/crawl2

In nutch-site.xml
searcher.dir = /data

Under /data you have a text file called search-servers.txt (I think; do
check the nutch-site searcher.dir description please)

In the text file you will have the following

hostname1 portnumber
hostname2 portnumber

example
localhost 1234
localhost 5678

Then you need to start

bin/nutch server 1234 /data/crawl1 &

and

bin/nutch server 5678 /data/crawl2 &

now try

bin/nutch org.apache.nutch.search.NutchBean www

you should see results :-)

Cheers

On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
> @Dennis,
> Can you explain how to setup distributed search while storing the 2
> indexes on the same local machine (if possible)?
>
> @Feng,
> We created a shell script to merge 2 runs, let us know if that works for
> you.
> http://wiki.apache.org/nutch/MergeCrawl
>
> Renaud
>
>
> Dennis Kubes wrote:
> > You can keep the indexes separate and use the distributed search
> > server, one per index or you can use the mergedb and mergesegs
> > commands to merge the two runs into a single crawldb and a single
> > segments then re-run the invertlinks and index to create a single
> > index file which can then be searched.
> >
> > Dennis
> >
> > Feng Ji wrote:
> >> Hi there,
> >>
> >> In Nutch 08, I have crawled down from two webDB independently.
> >>
> >> For each run, I did invertlinks and index. So each one is searchable.
> >>
> >> Now I want to combine them togeter for search. I tried "merge"
> >> command to
> >> merge two indexes, but the search for the result index output dir is
> >> dull.
> >> Do I need put output dir to the same directory as above two crawl/ ?
> >>
> >> I wonder what is proper steps to combine two seperate run into one
> >> search
> >> result. Do I need to combine two webdb, merge two segments and do
> >> invertlinks and do index?
> >>
> >> thanks your time,
> >>
> >> Michael,
> >>
> >
>
> --
> Renaud Richardet
> COO America
> Wyona    -   Open Source Content Management   -   Apache Lenya
> office +1 857 776-3195                  mobile +1 617 230 9112
> renaud.richardet <at> wyona.com           http://www.wyona.com
>
>
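
Pulled together, the recipe above amounts to a handful of shell commands. A minimal sketch, assuming the two crawls already sit at /data/crawl1 and /data/crawl2 and that searcher.dir is set to /data as described; the NutchBean class is given here under org.apache.nutch.searcher, which is where it appears to live in the 0.8 release:

# one "host port" line per search server
cat > /data/search-servers.txt <<'EOF'
localhost 1234
localhost 5678
EOF

# one distributed search server per crawl directory
bin/nutch server 1234 /data/crawl1 &
bin/nutch server 5678 /data/crawl2 &

# query across both servers from the command line
bin/nutch org.apache.nutch.searcher.NutchBean www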

Re: how to combine two run's result for search

Posted by Feng Ji <fe...@gmail.com>.
Thanks, Renaud:

I figured out the same scenario as your script; it works well.

Michael


On 9/5/06, Renaud Richardet <re...@wyona.com> wrote:
>
> @Dennis,
> Can you explain how to setup distributed search while storing the 2
> indexes on the same local machine (if possible)?
>
> @Feng,
> We created a shell script to merge 2 runs, let us know if that works for
> you.
> http://wiki.apache.org/nutch/MergeCrawl
>
> Renaud
>
>
> Dennis Kubes wrote:
> > You can keep the indexes separate and use the distributed search
> > server, one per index or you can use the mergedb and mergesegs
> > commands to merge the two runs into a single crawldb and a single
> > segments then re-run the invertlinks and index to create a single
> > index file which can then be searched.
> >
> > Dennis
> >
> > Feng Ji wrote:
> >> Hi there,
> >>
> >> In Nutch 08, I have crawled down from two webDB independently.
> >>
> >> For each run, I did invertlinks and index. So each one is searchable.
> >>
> >> Now I want to combine them togeter for search. I tried "merge"
> >> command to
> >> merge two indexes, but the search for the result index output dir is
> >> dull.
> >> Do I need put output dir to the same directory as above two crawl/ ?
> >>
> >> I wonder what is proper steps to combine two seperate run into one
> >> search
> >> result. Do I need to combine two webdb, merge two segments and do
> >> invertlinks and do index?
> >>
> >> thanks your time,
> >>
> >> Michael,
> >>
> >
>
> --
> Renaud Richardet
> COO America
> Wyona    -   Open Source Content Management   -   Apache Lenya
> office +1 857 776-3195                  mobile +1 617 230 9112
> renaud.richardet <at> wyona.com           http://www.wyona.com
>
>

Re: how to combine two run's result for search

Posted by Renaud Richardet <re...@wyona.com>.
@Dennis,
Can you explain how to set up distributed search while storing the 2
indexes on the same local machine (if possible)?
 
@Feng,
We created a shell script to merge 2 runs, let us know if that works for 
you.
http://wiki.apache.org/nutch/MergeCrawl

Renaud


Dennis Kubes wrote:
> You can keep the indexes separate and use the distributed search 
> server, one per index or you can use the mergedb and mergesegs 
> commands to merge the two runs into a single crawldb and a single 
> segments then re-run the invertlinks and index to create a single 
> index file which can then be searched.
>
> Dennis
>
> Feng Ji wrote:
>> Hi there,
>>
>> In Nutch 08, I have crawled down from two webDB independently.
>>
>> For each run, I did invertlinks and index. So each one is searchable.
>>
>> Now I want to combine them togeter for search. I tried "merge" 
>> command to
>> merge two indexes, but the search for the result index output dir is 
>> dull.
>> Do I need put output dir to the same directory as above two crawl/ ?
>>
>> I wonder what is proper steps to combine two seperate run into one 
>> search
>> result. Do I need to combine two webdb, merge two segments and do
>> invertlinks and do index?
>>
>> thanks your time,
>>
>> Michael,
>>
>

-- 
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com


Re: how to combine two run's result for search

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can keep the indexes separate and use the distributed search server,
one per index, or you can use the mergedb and mergesegs commands to merge
the two runs into a single crawldb and a single segments directory, then re-run
invertlinks and index to create a single index, which can then be
searched.

Dennis

Feng Ji wrote:
> Hi there,
>
> In Nutch 08, I have crawled down from two webDB independently.
>
> For each run, I did invertlinks and index. So each one is searchable.
>
> Now I want to combine them togeter for search. I tried "merge" command to
> merge two indexes, but the search for the result index output dir is 
> dull.
> Do I need put output dir to the same directory as above two crawl/ ?
>
> I wonder what is proper steps to combine two seperate run into one search
> result. Do I need to combine two webdb, merge two segments and do
> invertlinks and do index?
>
> thanks your time,
>
> Michael,
>
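
The merge path Dennis describes, which the MergeCrawl script on the wiki automates, roughly follows the steps below. This is a sketch only, assuming two 0.8-style crawl directories crawl1/ and crawl2/ and an output directory merged/; check the usage output of each bin/nutch command for the exact arguments in your release:

# merge the two crawldbs into one
bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb

# merge all segments from both runs into a single segments directory
bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*

# rebuild the link database from the merged segments
bin/nutch invertlinks merged/linkdb -dir merged/segments

# index the merged data into a single set of indexes
bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*

# optionally remove duplicate pages across the new indexes
bin/nutch dedup merged/indexes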