Posted to user@nutch.apache.org by og...@yahoo.com on 2006/08/09 21:16:28 UTC

Re: [Nutch-general] Single DFS or alternative architectures for performance?

Hi Dennis,

I'd be curious about the outcome of your experiment, so please post a summary if you remember.

Thanks,
Otis


----- Original Message ----
From: Dennis Kubes <nu...@dragonflymc.com>
To: nutch-user@lucene.apache.org
Sent: Wednesday, August 9, 2006 10:39:38 AM
Subject: Re: [Nutch-general] Single DFS or alternative architectures for performance?

You wouldn't want to use the DFS for searching.  You would want to use
DFS/MapReduce for creating the index and slicing it up into index
segments of, say, 1-2 million pages each.  Those individual index
segments would then need to be moved to local file systems, each with a
search server running against that specific part of the index.  You
would then have the search client (usually a website) sit in front of
the search servers and use the search-servers.txt file to specify the
search servers it connects to.  The search client aggregates the
results from the multiple search servers and returns them to
the user.
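
The aggregation step described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Nutch's actual distributed search client: the class ShardMergeSketch, the ScoredHit type, and the mergeTopK method are hypothetical names made up for this example, and it assumes each search server has already returned its own list of scored hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: merge hits from several search servers into one ranked list. */
public class ShardMergeSketch {

    /** One search hit. Illustrative only; this is not Nutch's own Hit class. */
    public static final class ScoredHit {
        public final String server; // which search server returned the hit
        public final String url;    // document URL
        public final float score;   // relevance score reported by that server

        public ScoredHit(String server, String url, float score) {
            this.server = server;
            this.url = url;
            this.score = score;
        }
    }

    /** Collects every server's hits, sorts by descending score, keeps the first k. */
    public static List<ScoredHit> mergeTopK(List<List<ScoredHit>> perServer, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        List<ScoredHit> all = new ArrayList<>();
        for (List<ScoredHit> hits : perServer) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((ScoredHit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        // Assumed sample data: two servers, each searching its own index segment.
        List<ScoredHit> serverA = Arrays.asList(
                new ScoredHit("server-a", "http://example.com/a1", 0.9f),
                new ScoredHit("server-a", "http://example.com/a2", 0.4f));
        List<ScoredHit> serverB = Arrays.asList(
                new ScoredHit("server-b", "http://example.com/b1", 0.7f),
                new ScoredHit("server-b", "http://example.com/b2", 0.2f));

        for (ScoredHit hit : mergeTopK(Arrays.asList(serverA, serverB), 3)) {
            System.out.println(hit.score + " " + hit.url + " (" + hit.server + ")");
        }
    }
}
```

The sketch simply sorts on the reported score. Scores computed against separate indexes are not always directly comparable, so treat the ordering as an approximation.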

We are currently using 1 million pages per index segment, although others
on the list have stated that they have gotten up to 2 million pages
without problems.  Beyond that, queries tend to slow down because of
the time it takes to read an individual index segment.  We have
been running a separate server for each index segment but are
currently experimenting with a single search server with many
small disks (say 10 x 20 GB), each disk holding one index segment.  I
don't know yet whether that will work.
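
As a rough planning aid for the segment sizes mentioned above, the number of index segments (and so search servers or disks) for a crawl is a ceiling division. The class SegmentPlanSketch and the 25 million page crawl size are assumptions made up for this example; only the 1 and 2 million pages-per-segment figures come from the message.

```java
/** Hypothetical sketch: how many index segments a crawl needs at a given segment size. */
public class SegmentPlanSketch {

    /** Ceiling division: a partly filled last segment still needs its own slot. */
    public static long segmentsNeeded(long totalPages, long pagesPerSegment) {
        if (totalPages < 0 || pagesPerSegment <= 0) {
            throw new IllegalArgumentException("need totalPages >= 0 and pagesPerSegment > 0");
        }
        return (totalPages + pagesPerSegment - 1) / pagesPerSegment;
    }

    public static void main(String[] args) {
        long totalPages = 25_000_000L; // assumed crawl size, for illustration only

        System.out.println(segmentsNeeded(totalPages, 1_000_000L)); // prints 25
        System.out.println(segmentsNeeded(totalPages, 2_000_000L)); // prints 13
    }
}
```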

Dennis

Murat Ali Bayir wrote:
> Hi everybody,
>
> Does a system with a single DFS (crawl, parse, index, search, etc.
> all on one DFS) have performance problems in the search part? What
> if two DFS instances were used: one for search (getting summaries,
> etc.) and the other for the remaining Nutch operations (fetch,
> parse, index, etc.)? Or are there alternative architectures for
> systems that perform all the Nutch functions concurrently on one DFS?



Re: HTMLParseFilter is not called by ParseSegment (nutch parse command)

Posted by Bipin Parmar <bi...@yahoo.com>.
Hi,

Please ignore my earlier question about the parse
command and the HtmlParseFilter plugin. It was my mistake:
plugins implementing HtmlParseFilter are called
during parse.

Thank you,

Bipin

--- Bipin Parmar <bi...@yahoo.com> wrote:

> Hi,
> 
> I have written a plugin implementing the
> org.apache.nutch.parse.HtmlParseFilter extension
> point. When I execute "fetch", it is called as
> expected.
> 
> When I execute "fetch -noParsing", it is not
> called. I think this is how it is supposed to work.
> 
> However, when I execute "parse", I expected my
> HtmlParseFilter plugin to be called, but it is not.
> The parse of the segment itself completes
> successfully.
> 
> Shouldn't "parse" call plugins that implement
> HtmlParseFilter?
> 
> I use the same nutch-default.xml for both the fetch
> and parse commands. I tried changing
> parse-plugins.xml by adding my plugin to the
> "text/html" content type, but it did not help.
> 
> Please help!
> 
> Thank you,
> 
> Bipin
> I am using the nutch-nightly build dated 08/07/2006.
> 


HTMLParseFilter is not called by ParseSegment (nutch parse command)

Posted by Bipin Parmar <bi...@yahoo.com>.
Hi,

I have written a plugin implementing the
org.apache.nutch.parse.HtmlParseFilter extension
point. When I execute "fetch", it is called as
expected.

When I execute "fetch -noParsing", it is not
called. I think this is how it is supposed to work.

However, when I execute "parse", I expected my
HtmlParseFilter plugin to be called, but it is not.
The parse of the segment itself completes
successfully.

Shouldn't "parse" call plugins that implement
HtmlParseFilter?

I use the same nutch-default.xml for both the fetch
and parse commands. I tried changing
parse-plugins.xml by adding my plugin to the
"text/html" content type, but it did not help.

Please help!

Thank you,

Bipin
I am using the nutch-nightly build dated 08/07/2006.