Posted to user@nutch.apache.org by MagRaj <ma...@yahoo.com> on 2006/03/17 00:28:13 UTC

Searching specific domains

Hi,

I am a newbie to Nutch, and I would appreciate it if anyone could answer the
questions below:

1. I did an intranet crawl against 5 URLs that were defined in the flat
file url. Everything went fine, and the db, segments and index were created.
Is it possible to create a new segment (containing all the pages of that URL)
for each URL, so that I can run a search against a specific segment?
If so, how do I go about achieving it?

2. I have a requirement in which the search has to be done against a
specific domain/URL, dynamically. How can I achieve this? The segments and
indexes are created for all the URLs.

3. What is the difference between prefix-urlfilter and regex-urlfilter, and
what is each of them used for?

Thanks.



Re: Distributed Search - config issue?

Posted by mo...@richmondinformatics.com.
Many thanks Andrzej,

I have rebuilt and redeployed 0.8-dev (-r 374745), and the distributed
search is working.

I don't know if it's critical, but "ant test" on my patched distribution
freezes during test-core:

test-core:
   [delete] Deleting directory 
/usr/local/src/nutch-374745/nutch/build/test/data
    [mkdir] Created dir: /usr/local/src/nutch-374745/nutch/build/test/data
    [junit] Running org.apache.nutch.analysis.TestQueryParser
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.617 sec
    [junit] Running org.apache.nutch.fs.TestNutchFileSystem
    ....
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.275 sec
    [junit] Running org.apache.nutch.ndfs.TestNDFS


Also, since the Hadoop-based versions of Nutch address the problem I had here,
I guess no further patches will be released for pre-Hadoop versions. So I
guess it would be helpful to "publish" the fix here:

Index: src/java/org/apache/nutch/searcher/Query.java
===================================================================
--- src/java/org/apache/nutch/searcher/Query.java       (revision 374745)
+++ src/java/org/apache/nutch/searcher/Query.java       (working copy)
@@ -27,12 +27,13 @@

import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.util.NutchConf;
+import org.apache.nutch.util.NutchConfigurable;
import org.apache.nutch.analysis.NutchAnalysis;

import org.apache.nutch.io.Writable;

/** A Nutch query. */
-public final class Query implements Writable, Cloneable {
+public final class Query implements Writable, Cloneable, NutchConfigurable {
   public static final Logger LOG =
     LogFormatter.getLogger("org.apache.nutch.searcher.Query");

@@ -282,11 +283,22 @@
   private NutchConf nutchConf;

   private static final Clause[] CLAUSES_PROTO = new Clause[0];
+
+  public Query() {
+  }

   public Query(NutchConf nutchConf) {
       this.nutchConf = nutchConf;
   }

+  public void setConf(NutchConf conf) {
+    this.nutchConf = conf;
+  }
+
+  public NutchConf getConf() {
+    return nutchConf;
+  }
+
   /** Return all clauses. */
   public Clause[] getClauses() {
     return (Clause[])clauses.toArray(CLAUSES_PROTO);
Index: src/java/org/apache/nutch/searcher/DistributedSearch.java
===================================================================
--- src/java/org/apache/nutch/searcher/DistributedSearch.java   (revision 374745)
+++ src/java/org/apache/nutch/searcher/DistributedSearch.java   (working copy)
@@ -115,10 +115,10 @@
     /** Construct a client talking to the named servers. */
     public Client(InetSocketAddress[] addresses, NutchConf nutchConf) throws IOException {
       this.defaultAddresses = addresses;
+      this.nutchConf = nutchConf;
       updateSegments();
       setDaemon(true);
       start();
-      this.nutchConf = nutchConf;
     }

     private static final Method GET_SEGMENTS;
@@ -151,6 +151,8 @@
       int liveServers=0;
       int liveSegments=0;
       Vector liveAddresses=new Vector();
+       System.out.println("defaultAddresses=" + defaultAddresses);
+       System.out.println("defaultAddresses.length=" + defaultAddresses.length);

       // build segmentToAddress map
       Object[][] params = new Object[defaultAddresses.length][0];
Index: src/java/org/apache/nutch/ipc/Server.java
===================================================================
--- src/java/org/apache/nutch/ipc/Server.java   (revision 374745)
+++ src/java/org/apache/nutch/ipc/Server.java   (working copy)
@@ -34,6 +34,7 @@

import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.util.NutchConf;
+import org.apache.nutch.util.NutchConfigurable;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.UTF8;

@@ -223,6 +224,8 @@
       LOG.info(getName() + ": exiting");
     }
   }
+
+  private NutchConf conf;

   /** Constructs a server listening on the named port.  Parameters passed must
    * be of the named class.  The <code>handlerCount</handlerCount> determines
@@ -234,6 +237,7 @@
     this.handlerCount = handlerCount;
     this.maxQueuedCalls = handlerCount;
     this.timeout = nutchConf.getInt("ipc.client.timeout",10000);
+    this.conf = nutchConf;
   }

   /** Sets the timeout used for network i/o. */
@@ -280,6 +284,9 @@
     Writable param;                               // construct param
     try {
       param = (Writable)paramClass.newInstance();
+       if (param instanceof NutchConfigurable) {
+               ((NutchConfigurable)param).setConf(conf);
+       }
     } catch (InstantiationException e) {
       throw new RuntimeException(e.toString());
     } catch (IllegalAccessException e) {
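
For anyone skimming the diff, the Server.java hunk boils down to one pattern:
create the parameter object reflectively, then hand it the server's
configuration if it is able to accept one. Below is a minimal stand-alone
sketch of that pattern. SketchConf, SketchConfigurable, SketchQuery and
SketchServer are hypothetical stand-ins invented for illustration; they are not
the Nutch classes, and the real NutchConf and NutchConfigurable may differ in
detail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf: a tiny key/value configuration holder.
class SketchConf {
  private final Map<String, String> values = new HashMap<>();
  void set(String key, String value) { values.put(key, value); }
  String get(String key) { return values.get(key); }
}

// Hypothetical stand-in for NutchConfigurable.
interface SketchConfigurable {
  void setConf(SketchConf conf);
  SketchConf getConf();
}

// A parameter type with a public no-arg constructor, so that it can be
// instantiated reflectively (the patch adds such a constructor to Query).
class SketchQuery implements SketchConfigurable {
  private SketchConf conf;
  public SketchQuery() {}
  public void setConf(SketchConf conf) { this.conf = conf; }
  public SketchConf getConf() { return conf; }
}

class SketchServer {
  private final SketchConf conf;

  SketchServer(SketchConf conf) { this.conf = conf; }

  // Instantiate the parameter class, then inject the configuration when the
  // new object declares that it can receive one.
  Object newParam(Class<?> paramClass) {
    try {
      Object param = paramClass.getDeclaredConstructor().newInstance();
      if (param instanceof SketchConfigurable) {
        ((SketchConfigurable) param).setConf(conf);
      }
      return param;
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e.toString());
    }
  }
}

class ConfInjectionDemo {
  public static void main(String[] args) {
    SketchConf conf = new SketchConf();
    conf.set("ipc.client.timeout", "10000");
    SketchQuery query = (SketchQuery) new SketchServer(conf).newParam(SketchQuery.class);
    System.out.println("conf injected: " + (query.getConf() == conf));
  }
}
```

Without the instanceof check and the setConf call, an object created through
the no-arg constructor would carry a null configuration, which is the
situation the patch is meant to avoid.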

Many thanks, again, Andrzej

Monu Ogbe

Quoting Andrzej Bialecki <ab...@getopt.org>:

> Gal Nitzan wrote:
>> Make sure the following exists:
>>
>> 1. make sure your tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
>> fs.default.name value is local
>> 2. make sure the machine name in your /hosts/search-servers.txt is
>> registered in your /etc/hosts or use the IP.
>> 3. Make sure tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml searcher.dir
>> value is set to /hosts
>> 4. Make sure the Nutch user has access to /hosts
>>
>> On your search server machine tail the log. the connection from tomcat
>> should appear.
>>
>
> FYI, I helped Monu to solve the issue. It turns out that the 
> particular revision he is using was missing some later fixes, 
> specifically NUTCH-217 and related fixes to Hadoop. After applying 
> these patches it started working. Please be aware that if you use the 
> latest pre-Hadoop version you will face the same issue, i.e. it won't 
> work without patching.
>
....
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>




Re: Distributed Search - config issue?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:
> Make sure the following exists:
>
> 1. make sure your tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
> fs.default.name value is local
> 2. make sure the machine name in your /hosts/search-servers.txt is
> registered in your /etc/hosts or use the IP.
> 3. Make sure tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml searcher.dir
> value is set to /hosts
> 4. Make sure the Nutch user has access to /hosts 
>
> On your search server machine tail the log. the connection from tomcat
> should appear.
>   

FYI, I helped Monu to solve the issue. It turns out that the particular 
revision he is using was missing some later fixes, specifically 
NUTCH-217 and related fixes to Hadoop. After applying these patches it 
started working. Please be aware that if you use the latest pre-Hadoop 
version you will face the same issue, i.e. it won't work without patching.

> One more note... It is not wise to put IP addresses in your emails.
>   

Agreed.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Distributed Search - config issue?

Posted by Gal Nitzan <gn...@usa.net>.
Make sure of the following:

1. Make sure your tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
fs.default.name value is local.
2. Make sure the machine name in your /hosts/search-servers.txt is
registered in your /etc/hosts, or use the IP.
3. Make sure the tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
searcher.dir value is set to /hosts.
4. Make sure the Nutch user has access to /hosts.

On your search server machine, tail the log. The connection from tomcat
should appear.
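
Items 2 and 4 can also be checked from the search client machine with a few
lines of Java. This is only a rough sketch: the SearchServersCheck class and
its methods are made up for this example, and the parser simply assumes the
one "host port" pair per line layout shown in this thread, which is not
necessarily how Nutch itself reads the file.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class SearchServersCheck {

  // Parse "host port" lines; blank lines are skipped (an assumption of this sketch).
  static List<InetSocketAddress> parse(List<String> lines) {
    List<InetSocketAddress> servers = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      String[] parts = trimmed.split("\\s+");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Expected 'host port' but got: " + line);
      }
      // Keep the address unresolved so that parsing never touches the network.
      servers.add(InetSocketAddress.createUnresolved(parts[0], Integer.parseInt(parts[1])));
    }
    return servers;
  }

  // Try a plain TCP connect, roughly what "telnet host port" does.
  static boolean canConnect(String host, int port, int timeoutMillis) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      return true;
    } catch (IOException e) {
      // Covers unknown hosts, refused connections and timeouts.
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    String file = args.length > 0 ? args[0] : "/hosts/search-servers.txt";
    // Reading the file also fails loudly if the current user has no access to it.
    for (InetSocketAddress server : parse(Files.readAllLines(Paths.get(file)))) {
      boolean ok = canConnect(server.getHostString(), server.getPort(), 3000);
      System.out.println(server.getHostString() + ":" + server.getPort() + " reachable=" + ok);
    }
  }
}
```

Running it as the same user that runs tomcat exercises both the file
permissions and the name resolution in one go.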

One more note... It is not wise to put IP addresses in your emails.

Regards,

Gal

-----Original Message-----
From: monu.ogbe@richmondinformatics.com
[mailto:monu.ogbe@richmondinformatics.com] 
Sent: Friday, March 17, 2006 1:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: Distributed Search - config issue?

Hello Team,

Partial false alarm.

I have worked out that I get exactly the same error, if the nutch server is
NOT running!  So, perhaps my tomcat search client

-  is not finding the /hosts/search-servers.txt file; or
-  is not interpreting the "address port" line in it

I find that I CAN telnet from the command line to port 8081:

# telnet 193.203.240.118 8081
Trying 193.203.240.118...
Connected to nutch1.houxou.com (193.203.240.118).
Escape character is '^]'.

In this case, I get the following diagnostic output from the "nutch server"
console:

060317 112919 22 Server connection on port 8081 from 193.203.240.118: starting

However when the tomcat search client tries to search there is NO output from
the "nutch server" console.

Sounds like I'm getting closer to the problem, but help still gratefully
awaited! :)

Many thanks,

Monu Ogbe









Re: Distributed Search - config issue?

Posted by mo...@richmondinformatics.com.
Hello Team,

Partial false alarm.

I have worked out that I get exactly the same error, if the nutch server is
NOT running!  So, perhaps my tomcat search client

-  is not finding the /hosts/search-servers.txt file; or
-  is not interpreting the "address port" line in it

I find that I CAN telnet from the command line to port 8081:

# telnet 193.203.240.118 8081
Trying 193.203.240.118...
Connected to nutch1.houxou.com (193.203.240.118).
Escape character is '^]'.

In this case, I get the following diagnostic output from the "nutch server"
console:

060317 112919 22 Server connection on port 8081 from 193.203.240.118: starting

However when the tomcat search client tries to search there is NO output from
the "nutch server" console.
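
One possible reading, going by the patch posted in the follow-up message of
this thread: in the unpatched DistributedSearch.Client constructor the
nutchConf field is assigned only after updateSegments() has already run, so
the client could hit the NullPointerException before it ever opens a
connection, which would fit the silent server console. The sketch below
reproduces that ordering problem in isolation. DemoConf, BrokenClient and
FixedClient are hypothetical names invented here, not Nutch code, and reading
"ipc.client.timeout" inside updateSegments() is only an illustrative use of
the configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf.
class DemoConf {
  private final Map<String, Integer> values = new HashMap<>();
  void setInt(String name, int value) { values.put(name, value); }
  int getInt(String name, int defaultValue) { return values.getOrDefault(name, defaultValue); }
}

// Mirrors the unpatched ordering: the field is used before it is assigned.
class BrokenClient {
  private DemoConf conf;
  int timeout;

  BrokenClient(DemoConf conf) {
    updateSegments();   // runs while this.conf is still null
    this.conf = conf;   // too late
  }

  private void updateSegments() {
    timeout = conf.getInt("ipc.client.timeout", 10000);  // NullPointerException
  }
}

// Mirrors the patched ordering: assign the field first, then use it.
class FixedClient {
  private DemoConf conf;
  int timeout;

  FixedClient(DemoConf conf) {
    this.conf = conf;
    updateSegments();
  }

  private void updateSegments() {
    timeout = conf.getInt("ipc.client.timeout", 10000);
  }
}

class ConstructorOrderDemo {
  // Report whether running the action ends in a NullPointerException.
  static boolean throwsNpe(Runnable action) {
    try {
      action.run();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    DemoConf conf = new DemoConf();
    System.out.println("broken throws NPE: " + throwsNpe(() -> new BrokenClient(conf)));
    System.out.println("fixed timeout: " + new FixedClient(conf).timeout);
  }
}
```

If this reading is right, the failure is the same whether or not a server is
listening, because nothing is sent over the wire before the exception.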

Sounds like I'm getting closer to the problem, but help still gratefully
awaited! :)

Many thanks,

Monu Ogbe







Distributed Search - config issue?

Posted by mo...@richmondinformatics.com.
Hi Andrzej,

I am running 0.8-dev revision 374745.

Searching works fine when the tomcat search client's searcher.dir is
configured to point at the crawl directory as follows.

*** $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml contains:

	<property>
	  <name>searcher.dir</name>
	   <value>/home/nutch/nutch-0.8-dev-test/crawlA/</value>
	  <description>
	  Path to root of index directories.
	  </description>
	</property>

However, I get an error from the tomcat search client when I try to set up
distributed search using the following config:

	<property>
	  <name>searcher.dir</name>
	   <value>/hosts</value>
	  <description>
	  Path to root of index directories.
	  </description>
	</property>

*** /hosts/search-servers.txt contains:


nutch1.houxou.com 8081


*** crawl directory tree looks like this:

crawlA/
crawlA/linkdb
crawlA/linkdb/current
crawlA/linkdb/current/part-00000
crawlA/linkdb/current/part-00000/index
crawlA/linkdb/current/part-00000/data
crawlA/linkdb/current/part-00000/.data.crc
crawlA/linkdb/current/part-00000/.index.crc
crawlA/indexes
crawlA/indexes/part-00000
crawlA/indexes/part-00000/_2.f2
crawlA/indexes/part-00000/_2.tis
crawlA/indexes/part-00000/deletable
crawlA/indexes/part-00000/_2.f3
crawlA/indexes/part-00000/_2.frq
crawlA/indexes/part-00000/_2.f4
crawlA/indexes/part-00000/_2.tii
crawlA/indexes/part-00000/_2.fdt
crawlA/indexes/part-00000/index.done
crawlA/indexes/part-00000/_2.f1
crawlA/indexes/part-00000/_2.prx
crawlA/indexes/part-00000/_2.fnm
crawlA/indexes/part-00000/_2.f0
crawlA/indexes/part-00000/segments
crawlA/indexes/part-00000/_2.fdx
crawlA/crawldb
crawlA/crawldb/current
crawlA/crawldb/current/part-00000
crawlA/crawldb/current/part-00000/index
crawlA/crawldb/current/part-00000/data
crawlA/crawldb/current/part-00000/.data.crc
crawlA/crawldb/current/part-00000/.index.crc
crawlA/segments
crawlA/segments/20060316144827
crawlA/segments/20060316144827/crawl_generate
crawlA/segments/20060316144827/crawl_generate/part-00000
crawlA/segments/20060316144827/crawl_generate/.part-00000.crc
crawlA/segments/20060316144827/crawl_parse
crawlA/segments/20060316144827/crawl_parse/part-00000
crawlA/segments/20060316144827/crawl_parse/.part-00000.crc
crawlA/segments/20060316144827/parse_text
crawlA/segments/20060316144827/parse_text/part-00000
crawlA/segments/20060316144827/parse_text/part-00000/index
crawlA/segments/20060316144827/parse_text/part-00000/data
crawlA/segments/20060316144827/parse_text/part-00000/.data.crc
crawlA/segments/20060316144827/parse_text/part-00000/.index.crc
crawlA/segments/20060316144827/parse_data
crawlA/segments/20060316144827/parse_data/part-00000
crawlA/segments/20060316144827/parse_data/part-00000/index
crawlA/segments/20060316144827/parse_data/part-00000/data
crawlA/segments/20060316144827/parse_data/part-00000/.data.crc
crawlA/segments/20060316144827/parse_data/part-00000/.index.crc
crawlA/segments/20060316144827/content
crawlA/segments/20060316144827/content/part-00000
crawlA/segments/20060316144827/content/part-00000/index
crawlA/segments/20060316144827/content/part-00000/data
crawlA/segments/20060316144827/content/part-00000/.data.crc
crawlA/segments/20060316144827/content/part-00000/.index.crc
crawlA/segments/20060316144827/crawl_fetch
crawlA/segments/20060316144827/crawl_fetch/part-00000
crawlA/segments/20060316144827/crawl_fetch/part-00000/index
crawlA/segments/20060316144827/crawl_fetch/part-00000/data
crawlA/segments/20060316144827/crawl_fetch/part-00000/.data.crc
crawlA/segments/20060316144827/crawl_fetch/part-00000/.index.crc


*** Invoking the search server

I have tried invoking the search server pointing at the crawl directory,
crawlA, and, for good measure, also at the "indexes" directory within it.

	# bin/nutch server 8081 crawlA/indexes
or
	# bin/nutch server 8081 crawlA


*** The tomcat search client then produces the following output:

HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from
fulfilling this request.

exception

org.apache.jasper.JasperException
	org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:510)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause

java.lang.NullPointerException
	org.apache.nutch.ipc.RPC.call(RPC.java:162)
	org.apache.nutch.searcher.DistributedSearch$Client.updateSegments(DistributedSearch.java:157)
	org.apache.nutch.searcher.DistributedSearch$Client.<init>(DistributedSearch.java:118)
	org.apache.nutch.searcher.DistributedSearch$Client.<init>(DistributedSearch.java:92)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:98)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:80)
	org.apache.nutch.searcher.NutchBean.get(NutchBean.java:67)
	org.apache.jsp.search_jsp._jspService(search_jsp.java:108)
	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

note The full stack trace of the root cause is available in the Apache
Tomcat/5.5.16 logs.

*** The tomcat logs show

# cat /usr/local/tomcat/logs/localhost.2006-03-16.log

16-Mar-2006 21:27:00 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.NullPointerException
        at org.apache.nutch.ipc.RPC.call(RPC.java:162)
        at org.apache.nutch.searcher.DistributedSearch$Client.updateSegments(DistributedSearch.java:157)
        at org.apache.nutch.searcher.DistributedSearch$Client.<init>(DistributedSearch.java:118)
        at org.apache.nutch.searcher.DistributedSearch$Client.<init>(DistributedSearch.java:92)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:98)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:80)
        at org.apache.nutch.searcher.NutchBean.get(NutchBean.java:67)
        at org.apache.jsp.search_jsp._jspService(search_jsp.java:108)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:332)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:314)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:264)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

*** end

Is this a bug for which there is a patch, or are the directories in the wrong
places?

Many thanks,

Monu Ogbe


Re: Searching specific domains

Posted by Marko Bauhardt <mb...@media-style.com>.
On 17.03.2006 at 20:22, MagRaj wrote:

>
> Thanks Marko for your suggestion.
>
> But here is my problem. Find below the config files with the sample
> data I have:
>
> urls.txt has got 5 urls (just as an example)
> --------------------------------------------------------
> http://foo.com/broker/broker_name_1/
> http://foo.com/broker/broker_name_2/
> http://foo.com/broker/broker_name_3/
> http://foo.com/broker/broker_name_4/
> http://foo.com/broker/broker_name_5/
>


Ah, OK, I understand.


> I tried as you mentioned, but it didn't work.
> (site:foo.com/broker/broker_name_1 <Search_test>)

This does not work. The site field contains only the host, not
directories.
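
To illustrate the point, here is a minimal Python sketch (not Nutch code). It assumes, as stated above, that the site field holds only the host, and it shows one possible workaround: run the host-wide query and then narrow the hits by URL prefix in the webapp. The `hits` list and the helper names `site_of` and `filter_by_prefix` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlparse


def site_of(url):
    """Return only the host part of a URL (what a host-only field would hold)."""
    return urlparse(url).hostname


def filter_by_prefix(hits, prefix):
    """Keep only the hits whose URL starts with the given prefix."""
    return [hit for hit in hits if hit["url"].startswith(prefix)]


# Hypothetical hits, as they might come back from a query like "site:foo.com bar".
hits = [
    {"url": "http://foo.com/broker/broker_name_1/index.html", "title": "Broker 1"},
    {"url": "http://foo.com/broker/broker_name_2/index.html", "title": "Broker 2"},
    {"url": "http://foo.com/broker/broker_name_1/contact.html", "title": "Contact"},
]

# All URLs share one host, so a host-only field cannot tell the brokers apart.
hosts = {site_of(hit["url"]) for hit in hits}

# Narrow the result list to one broker's pages after the search has run.
broker_1 = filter_by_prefix(hits, "http://foo.com/broker/broker_name_1/")
for hit in broker_1:
    print(hit["url"])
```

Filtering after the search is simple, but it can leave result pages short when most hits belong to other brokers, so it is only a stopgap.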

>
> How can i implement the above requirement??

Hm. You can generate 5 segments, each one generated and fetched with a
different regex-urlfilter.txt:
segment1:
+foo.com/broker/broker_name_1
-.

segment2:
+foo.com/broker/broker_name_2
-.

etc.
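
The effect of such per-segment filter files can be tried out with a small Python simulation. This is a sketch, not Nutch's implementation: it assumes that rules are tried in order, that the first rule whose regex is found anywhere in the URL decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The functions `parse_rules` and `accepts` are hypothetical names. The pattern here adds a trailing slash and escapes the dot so that broker_name_1 does not also match something like broker_name_10.

```python
import re


def parse_rules(text):
    """Parse filter lines of the form '+regex' or '-regex'; skip blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is an accept rule (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


# Filter file for the first segment: accept one broker, reject everything else.
segment1_rules = parse_rules(r"""
# accept only broker_name_1 pages
+foo\.com/broker/broker_name_1/
# reject everything else
-.
""")

for url in (
    "http://foo.com/broker/broker_name_1/index.html",
    "http://foo.com/broker/broker_name_2/index.html",
):
    print(url, accepts(segment1_rules, url))
```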

After that, every segment contains only the URLs you want. You cannot
search a specific segment directly, but you can write an indexing
plugin that indexes the segment name. You can then filter the hits
down to those from a specific segment.
However, I don't think any of these hints is a really good solution,
because the workflow is very complicated.

Marko





Re: Searching specific domains

Posted by MagRaj <ma...@yahoo.com>.
Thanks Marko for your suggestion.

But here is my problem. Find below the config files with the sample data I
have:

urls.txt has got 5 urls (just as an example)
--------------------------------------------------------
http://foo.com/broker/broker_name_1/
http://foo.com/broker/broker_name_2/
http://foo.com/broker/broker_name_3/
http://foo.com/broker/broker_name_4/
http://foo.com/broker/broker_name_5/

crawl-urlfilter.txt contains the following
========================
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*foo.com/

I ran the crawl with the above 5 URLs and everything went fine. Now, when a
search is done from the broker_name_1 intranet homepage, I want it to cover
only the pages belonging to the broker_name_1 homepage. If the search is done
from foo.com itself, it should cover all the broker homepages.


I tried as you mentioned, but it didn't work.
(site:foo.com/broker/broker_name_1 <Search_test>)

How can I implement the above requirement? Is there anything that I need to
configure?
Any help on this would be appreciated.

Thanks.





Re: Searching specific domains

Posted by Marko Bauhardt <mb...@media-style.com>.
On 17.03.2006 at 00:28, MagRaj wrote:

> Is it possible to create a new segment(contains all the pages of that url)
> for each url??


You can use regex-urlfilter.txt to accept only the URLs you want,
but you would have to change regex-urlfilter.txt for every new segment.
A better way is to use the index field "site". Generate a new segment
(for all URLs), then fetch and index it. In your webapp you can then
limit the results with the "site" field.

e.g.
site:www.foo.com bar

This query searches for the word "bar" in the content of the URLs whose
site is www.foo.com.

Hope this helps,
Marko