Posted to user@nutch.apache.org by Chetan Sahasrabudhe <Ch...@KPITCummins.com> on 2005/05/17 10:03:41 UTC

Distributed installation

Hello,
 
    I am planning to set up a distributed Nutch installation.
Can anyone point me to a document or link that explains the steps?

Regards 
Chetan 
_______________________________ 

Tel +91-20-5652 5000 ext 2513

KPITCummins Infosystems Limited 
Hinjwadi
Pune INDIA 
_______________________________ 

 



Re: Distributed installation

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
If I press the F5 (refresh) key for a minute in my IE, the situation is 
the same as in my previous mail: the backends' CPU usage stays at 95% for 
a long time.

Re: Distributed installation

Posted by Doug Cutting <cu...@nutch.org>.
Piotr Kosiorowski wrote:
> 2) As the behavior of the system will change, the old version of the 
> search code should probably also be supported. I have to investigate how 
> to handle both versions of the code at the same time without creating a 
> mess.

I think this would be a great addition.  Do not worry too much about 
keeping the old version of the code working.  We're still pre-1.0.

Doug

Re: Distributed installation

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
So it looks like my load-balancing solution is interesting to someone 
besides me, so I will try to port it in the next one to two weeks.

There are two more things left:
1) Currently, when one of the servers in the group fails to respond, it 
is ignored and search results are returned without the data present in 
segments from that server (see org.apache.nutch.ipc.Client.call(Writable[] 
params, InetSocketAddress[] addresses)). This behavior is a problem for a 
load-balancing solution: because a second request by the same user can be 
handled by a different set of servers, we can get duplicated search 
results (when the user wants to see the next page of results) or missing 
information (when the user wants to see an explanation or cached data).
In my opinion, in a load-balancing solution such a situation should be 
considered an error, and another set of servers should be used to handle 
the user's query (see the sketch after this list).
To support both the new and the old behavior, some changes to the RPC and 
Client classes are required. I used to override an inherited method in 
DistributedClient.Client, but because of recent changes it is no longer 
possible to do that without affecting RPC and Client in the ipc package.

2) As the behavior of the system will change, the old version of the 
search code should probably also be supported. I have to investigate how 
to handle both versions of the code at the same time without creating a 
mess.
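
To make point 1) concrete, here is a minimal, hypothetical sketch of the 
fail-fast behavior (this is not the actual org.apache.nutch.ipc.Client 
API; the class and method names are invented for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailFastGroupCall {

    /** Signals that a backend in the group failed, so the caller should
     *  retry the whole query against a different server group. */
    public static class GroupFailedException extends Exception {
        public GroupFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Calls every backend in parallel. Unlike the current behavior, a
     *  single failure aborts the whole call instead of silently
     *  returning partial results. */
    public static <T> List<T> callAll(List<Callable<T>> backends)
            throws InterruptedException, GroupFailedException {
        ExecutorService pool = Executors.newFixedThreadPool(backends.size());
        try {
            List<Future<T>> futures = pool.invokeAll(backends);
            List<T> results = new ArrayList<T>(futures.size());
            for (Future<T> future : futures) {
                try {
                    results.add(future.get()); // surfaces any backend failure
                } catch (ExecutionException e) {
                    throw new GroupFailedException(
                        "backend failed, retry on another group", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}

The current code effectively swallows the failure and keeps the partial 
results; turning it into an error is what lets the load balancer keep one 
consistent server group per user query.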

I will post my first results for comments as soon as I have it working.

Regards,
Piotr

Andrzej Bialecki wrote:
> Piotr Kosiorowski wrote:
> [..]
>> So if there is enough interest for it I can downgrade it to JDK 1.4 (I 
>> am using java.util.concurrent) and send it as a patch.
> 
> That would be very nice! concurrent-1.3.4.jar is already a part of the 
> Nutch distribution - I don't remember how different it is from JDK5, but 
> this should ease the porting effort...
> 



Re: Distributed installation

Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:
> Hello Stefan,
> 
> I have already written a component that implements this round robin 
> searching functionality some time ago - but right now it is not working 
[..]

> 
> So if there is enough interest for it I can downgrade it to JDK 1.4 (I 
> am using java.util.concurrent) and send it as a patch.

That would be very nice! concurrent-1.3.4.jar is already a part of the 
Nutch distribution - I don't remember how different it is from JDK5, but 
this should ease the porting effort...



-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: [Nutch-dev] Re: Please help: Tomcat problem, Paginating with optimization (Like google)

Posted by yoursoft <yo...@freemail.hu>.
Thanks again for the squid idea; it solved my problem. I will use squid 
on the other web servers too, since it dramatically decreases CPU load.

Byron Miller wrote:

>I use Resin with Apache 1.3.x as well. I've never had
>any luck running Resin/Tomcat by themselves.
>
>I have also had great luck using squid as a
>proxy/caching server in front of both. That helped
>boost queries per second nicely by keeping much of the
>TCP overhead off the JVM/Apache.
>
>--- "yoursoft@freemail.hu" <yo...@freemail.hu>
>wrote:
>
>  
>
>>Dear Olaf,
>>
>>Thanks for answer. I found the following:
>>If I use Tomcat or Resin only, the server always
>>broken with large 
>>queries. In the Tomcat manager I found that there
>>are many thread with 
>>'S' status with long time.
>>I analize these threads with 'netstat -anp' command
>>from linux prompt. I 
>>found these connections in 'CLOSE_WAIT' status. This
>>is present that, 
>>the client don't answer the CLOSE status, when the
>>server send the 
>>answer out (for e.g. close the browser before full
>>answer arrive).
>>
>>
>>    
>>
>
>__________________________________________________
>Do You Yahoo!?
>Tired of spam?  Yahoo! Mail has the best spam protection around 
>http://mail.yahoo.com 
>
>
>  
>


Re: [Nutch-dev] Re: Please help: Tomcat problem, Paginating with optimization (Like google)

Posted by Byron Miller <by...@yahoo.com>.
I use Resin with Apache 1.3.x as well. I've never had
any luck running Resin/Tomcat by themselves.

I have also had great luck using squid as a
proxy/caching server in front of both. That helped
boost queries per second nicely by keeping much of the
TCP overhead off the JVM/Apache.
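
For anyone wanting to try the same thing, a rough sketch of what the 
squid side can look like (squid 2.x accelerator directives; the backend 
host and port here are made up, not my actual config, so check your own 
squid version):

# squid.conf fragment: squid on port 80 accelerating one Resin/Tomcat
# backend listening on 127.0.0.1:8080 (values are illustrative).
http_port 80
httpd_accel_host 127.0.0.1
httpd_accel_port 8080
httpd_accel_single_host on
httpd_accel_with_proxy off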

--- "yoursoft@freemail.hu" <yo...@freemail.hu>
wrote:

> Dear Olaf,
> 
> Thanks for answer. I found the following:
> If I use Tomcat or Resin only, the server always
> broken with large 
> queries. In the Tomcat manager I found that there
> are many thread with 
> 'S' status with long time.
> I analize these threads with 'netstat -anp' command
> from linux prompt. I 
> found these connections in 'CLOSE_WAIT' status. This
> is present that, 
> the client don't answer the CLOSE status, when the
> server send the 
> answer out (for e.g. close the browser before full
> answer arrive).
> 
>

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: [Nutch-dev] Re: Please help: Tomcat problem, Paginating with optimization (Like google)

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Olaf,

Thanks for the answer. I found the following:
If I use Tomcat or Resin alone, the server always breaks down under heavy 
query load. In the Tomcat manager I found many threads stuck in 'S' 
(service) status for a long time.
I analyzed these threads with the 'netstat -anp' command from the Linux 
prompt and found the connections in 'CLOSE_WAIT' status. This means the 
client never acknowledged the close when the server sent the answer out 
(e.g. the browser was closed before the full answer arrived).
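
For reference, bounding such dead connections on the Tomcat side usually 
comes down to the connector settings; a hypothetical Tomcat 5.x 
server.xml fragment (the values are examples, not my actual settings):

<!-- A finite connectionTimeout (in ms) lets the connector reclaim
     threads whose clients have gone away without closing properly. -->
<Connector port="80" maxThreads="250" connectionTimeout="20000" />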

My solution:
I put an Apache2 with jk2 in front of Tomcat, and the problem is solved. 
With Apache2 the answer times are also better than with Tomcat alone: 
before, the CPU usage was 50-90%; now it is only 1-10%. I also have 
Apache cache the static content (gif, html, css etc. - I have many such 
items), and I think Apache serves static content better than Resin or 
Tomcat do.
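
As a rough sketch of the jk2 side (section names follow mod_jk2's 
workers2.properties format; the host, port, and URI pattern are 
assumptions, not my actual configuration):

# workers2.properties: route matching requests over AJP13 to Tomcat.
[channel.socket:localhost:8009]
host=localhost
port=8009

[ajp13:localhost:8009]
channel=channel.socket:localhost:8009

[uri:/*]
worker=ajp13:localhost:8009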

Regards,
    Ferenc

Olaf Thiele wrote:

>Hi Ferenc,
>does the error happen only with your own servlet, or
>did you experience it with the standard interface as well?
>
>Furthermore, you report NullPointerExceptions. Where exactly
>do they happen? The stack trace (printStackTrace) should tell you.
>That would be a good starting point.
>
>Hope this helps
>Olaf


Re: Please help: Tomcat problem, Paginating with optimization (Like google)

Posted by Olaf Thiele <ol...@gmail.com>.
Hi Ferenc,
does the error happen only with your own servlet, or
did you experience it with the standard interface as well?

Furthermore, you report NullPointerExceptions. Where exactly
do they happen? The stack trace (printStackTrace) should tell you.
That would be a good starting point.

Hope this helps
Olaf


On 5/23/05, yoursoft@freemail.hu <yo...@freemail.hu> wrote:
> [..]


-- 

<SimpleHuman gender="male">
   <Physical name="Olaf Thiele" />
   <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>

Please help: Tomcat problem, Paginating with optimization (Like google)

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Users,

I have the following problem:

I use Tomcat 5.5.7 on port 80.
There is no problem at 1-2 queries / 10 secs.

When I use it with more users (5-6 queries / sec), the frontend runs for 
e.g. 30 minutes at a CPU load of 1-2%, and after that the load goes up to 
60-90%. The query rate does not increase during this time (e.g. in the 
last test it was 3 queries / 5 sec at the critical point), and at the 
critical time the backends' CPU usage is at most 10-20%.
On the frontend, the Tomcat manager shows many threads in 'service' 
status for a long time (e.g. 586 sec).
After 10 minutes, catalina.out shows the bean.search time increasing to 
over 10-40 sec, and after some minutes there are many messages with 
'NullPointerException'.

I rewrote the JSP pages as servlets.
Here is the source of my Search.java, which implements 'google'-style 
paginating. The source code is optimized (minimal object creation, 
sb.append, etc.):
Beginning of java code
------------------------------------------------------------------------------
package org.nutch;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.ServletException;
import org.apache.velocity.exception.*;
import org.apache.velocity.Template;
import org.apache.velocity.app.Velocity;
import org.apache.velocity.context.Context;
import org.apache.velocity.servlet.VelocityServlet;
import net.nutch.html.Entities;
import java.util.*;
import net.nutch.searcher.*;
import net.nutch.plugin.*;
import java.io.*;
import java.net.*;

public class Search extends VelocityServlet {

    private NutchBean bean;

    public void init() throws ServletException {
        try {
            bean = NutchBean.get(this.getServletContext());
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    private static final String getStringParameter(String name, String 
charset, HttpServletRequest req) {
        String value = req.getParameter(name);
        if (value == null) {
            value = "";
        }
        try {
            value = new String( value.getBytes("8859_1"), charset );
        } catch (Exception e) { /* keep the undecoded value */ }
        return value;
    }

    public Template handleRequest( HttpServletRequest request,
    HttpServletResponse response,
    Context par_tartalom ) throws java.io.IOException, ServletException {
        Template loc_template = null;

        if (bean == null) { return loc_template; }

        int start = 0;          // first hit to display
        String startString = request.getParameter("start");
        if (startString != null) {
            start = Integer.parseInt(startString);
            if (start < 0) {
                start = 0;
            }
        }
        int hitsPerPage = 10;          // number of hits to display
        String hitsString = request.getParameter("hitsPerPage");
        if (hitsString != null) {
            hitsPerPage = Integer.parseInt(hitsString);
        }

        // get the character encoding to use when interpreting request values
        String charset = request.getParameter("charset");
        if (charset == null) {
            charset = "UTF8";
        }

        // get query from request
        String queryString = getStringParameter("query", charset, request);

        try {
            StringBuffer sb = new StringBuffer();
            StringBuffer sb2 = new StringBuffer();
            StringBuffer sb3 = new StringBuffer();

            // Query string for html
            String htmlQueryString = Entities.encode(queryString);
            String htmlQueryStringISO = URLEncoder.encode(queryString, 
"ISO-8859-2");
            String htmlQueryStringUTF = URLEncoder.encode(queryString, 
"UTF8");

            // Get more parameters
            if (hitsPerPage >100) { // No more hitsPerPage than 100
                hitsPerPage = 10;
            }
            int hitsPerSite = 2;          // max hits per site
            String hitsPerSiteString = request.getParameter("hitsPerSite");
            if (hitsPerSiteString != null) {
                hitsPerSite = Integer.parseInt(hitsPerSiteString);
            }

            Query query = Query.parse(queryString);

            int hitsLength = 0;
            int end = 0;
            int length = 0;
            log(sb.append("Query: 
").append(request.getRemoteAddr()).append(" - 
").append(queryString).toString());
            long startTime = System.currentTimeMillis();
            Hits hits = bean.search(query, start + hitsPerPage, 
hitsPerSite);
            hitsLength = (int)hits.getTotal();
            int length2 = hits.getLength();
            par_tartalom.put("TIME", 
String.valueOf((System.currentTimeMillis()-startTime) / 1000.0) );
            if (length2 <= start) { // If there are no hits after 'start'
                start = length2 - (length2 % hitsPerPage);
                if (length2 == start) {
                    start = start - hitsPerPage;
                    if (start < 0) start = 0;
                }
                end = length2;
            } else { // If after 'start' there are some hits
                end = length2 < start+hitsPerPage ? length2 : 
start+hitsPerPage;
            }
            length = end-start;
            Hit[] show = hits.getHits(start, end-start);
            HitDetails[] details = bean.getDetails(show);
            String[] summaries = bean.getSummary(details, query);
            sb.setLength(0);
            log(sb.append("Query: 
").append(request.getRemoteAddr()).append(" - 
").append(queryString).append(" - Total hits: 
").append(hits.getTotal()).append(" - Time: 
").append(System.currentTimeMillis()-startTime).toString());

            par_tartalom.put("START", new Long((end==0)?0:(start+1)) ); 
// Start of hits
            par_tartalom.put("END", new Long(end)); // End of hits
            par_tartalom.put("CNT", new Long(hits.getTotal())); // Count 
of hits
            par_tartalom.put("QRY", htmlQueryString); // UTF8
            par_tartalom.put("QRY2", htmlQueryStringISO); // ISO charset

            // ******************************************************
            // List Hits
            // ******************************************************
            sb.setLength(0);
            sb2.setLength(0);
            sb3.setLength(0);
            sb3.append("?idx=");
            Hit hit = null;
            HitDetails detail = null;
            String title = null;
            String url = null;
            String summary = null;
            for (int i = 0; i < length; i++) { // display the hits
                hit = show[i];
                detail = details[i];
                title = detail.getValue("title");
                url = detail.getValue("url");
                summary = summaries[i];
                sb3.setLength(5);
                sb3.append( hit.getIndexNo() ).append("&id=").append( 
hit.getIndexDocNo() );

                if (title == null || title.equals("")) { // use url for docs w/o title
                    title = url;
                }
                ... Same with search.jsp ...
            }
            if (length2 <= start + hitsPerPage && hitsLength != 0) { // if fewer than hitsPerPage remain
                sb.append("<span style=\"FONT-WEIGHT: bold; color: black;\">This is the last page.</span><br><br>");
                hitsLength = length2; // paginating length
            }
            par_tartalom.put("LIST", sb.toString());

            // ******************************************************
            // Paginating
            // ******************************************************
            int pageNumber = 0;
            sb.setLength(0);
            sb2.setLength(0);
            sb2.append("&hitsPerPage=");
            sb2.append(hitsPerPage);
            sb3.setLength(0);
            sb3.append("<a 
href=\"Keres?query=").append(htmlQueryStringUTF).append("&start=");

            // Prev (<<)
            sb.append("<td width=60 class=pages align=right>");
            if (start>0) {
                long prevStart = start-hitsPerPage;
                prevStart = prevStart > 0 ? prevStart : 0;
                sb.append(sb3); // query
                sb.append(prevStart); // start
                sb.append(sb2); // others
                sb.append("\" class=pages><font class=arrow><<</font> << 
</a>");
            }
            sb.append("</td><td class=pages width=250 align=center>");

            if (hitsLength > hitsPerPage ) { // If there are more pages
                if (start >= 9 * hitsPerPage) { // from page 10 (from 90)
                    int startPageNumber = start-(4 * hitsPerPage);
                    if (startPageNumber < 0) startPageNumber = 0;
                    pageNumber = (startPageNumber-1) / hitsPerPage+1;
                    for (int i = startPageNumber; i < hitsLength; i = i 
+ hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b>&nbsp;</font>");
                        } else {
                            sb.append(sb3);// query
                            sb.append(i); // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a>&nbsp;");
                        }
                        if (i >= startPageNumber+8*hitsPerPage) {
                            break;
                        }
                    }
                } else { // still within the first 9 pages
                    for (int i = 0; i < hitsLength; i = i + hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b>&nbsp;</font>");
                        } else {
                            sb.append(sb3);// query
                            sb.append(i);  // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a>&nbsp;");
                        }
                        if (pageNumber >= 10) {
                            break;
                        }
                    }
                }
            }
            sb.append("</td><td width=100 class=pages>");

            // next (>>)
            if (hitsLength > start+hitsPerPage) {
                if (end <hitsLength) {
                    sb.append(sb3); // query
                    sb.append(end); // start
                    sb.append(sb2); // others
                    sb.append("\" class=pages> >> <font 
class=arrow>>></font></a>");
                }
            }
            sb.append("</td>");

            par_tartalom.put("PAGES", sb.toString());

            loc_template = Velocity.getTemplate("search.vm");
//        } catch (ArrayIndexOutOfBoundsException ae){
//            log(request.getRemoteAddr()+" - forwarding - "+queryString);
//        }
        } catch (Exception e) {
            log(e.toString()+" - "+request.getRemoteAddr()+" - "+queryString);
            //            throw new ServletException(e);
        }
        return loc_template;
    }
}
------------------------------------------------------------------------------
End of java code.

Do you have any idea what the source of the problem might be?

Best Regards,
    Ferenc

Re: [Nutch-dev] Re: Distributed installation

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
I am also interested in your patch.

Thanks for it:
    Ferenc

> So if there is enough interest for it I can downgrade it to JDK 1.4
> (I am using java.util.concurrent) and send it as a patch.

>
> Sounds great, at least I have a lot of interest!!!
> :-)
>
> Thanks,
> Stefan



Re: Distributed installation

Posted by Stefan Groschupf <sg...@media-style.com>.
> So if there is enough interest for it I can downgrade it to JDK 1.4  
> (I am using java.util.concurrent) and send it as a patch.

Sounds great, at least I have a lot of interest!!!
:-)

Thanks,
Stefan

Re: Distributed installation

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Stefan,

I wrote a component implementing this round-robin searching functionality 
some time ago, but right now it does not work correctly with the latest 
Nutch SVN code; I plan to update it.
It was done inside a modified NutchBean: it selected the group of servers 
to be used for each request in round-robin fashion, and in case of 
failure it moved the current server group to an inactive pool and retried 
using another group.
A separate recovery thread checked from time to time whether the inactive 
pool contained any groups and tried to recover them.
So in addition to load-balancing all requests among a cluster of search 
server groups, it also provided fault tolerance (automatic detection of 
inactive nodes, with recovery); a sketch of the idea follows below.
Our plan was to use two or more Tomcat servers, each with a NutchBean 
configured to use all search server groups. This removes the single 
point of failure during search.
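
A minimal, hypothetical sketch of that selection logic (this is not the 
actual patch; the names and structure are invented for illustration):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Round-robin selection over search server groups, with failed groups
 *  parked in an inactive pool for a recovery thread to probe. */
public class GroupBalancer {

    public interface ServerGroup {
        boolean ping();          // cheap liveness check used by recovery
        String search(String q); // stand-in for the real search call
    }

    private final Queue<ServerGroup> active =
        new ConcurrentLinkedQueue<ServerGroup>();
    private final Queue<ServerGroup> inactive =
        new ConcurrentLinkedQueue<ServerGroup>();

    public GroupBalancer(Iterable<ServerGroup> groups) {
        for (ServerGroup g : groups) active.add(g);
    }

    /** Try groups in round-robin order; park a failing group and move on. */
    public String search(String query) {
        ServerGroup g;
        while ((g = active.poll()) != null) {
            try {
                String result = g.search(query);
                active.add(g);   // back to the end of the rotation
                return result;
            } catch (RuntimeException e) {
                inactive.add(g); // park it for the recovery thread
            }
        }
        throw new IllegalStateException("no active search server groups");
    }

    /** One pass of the recovery thread: retry parked groups. */
    public void recoverOnce() {
        int n = inactive.size();
        for (int i = 0; i < n; i++) {
            ServerGroup g = inactive.poll();
            if (g == null) break;
            if (g.ping()) active.add(g); // healthy again
            else inactive.add(g);        // still down, keep it parked
        }
    }
}

A background thread calling recoverOnce() from time to time gives roughly 
the recovery behavior described above.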

So if there is enough interest for it I can downgrade it to JDK 1.4 (I 
am using java.util.concurrent) and send it as a patch.

Regards,
Piotr



Stefan Groschupf wrote:
> I notice similar behaviors.
> I guess the backend servers are not answering fast enough.
> I was thinking about having multiple search server groups with identical 
> content and then querying the groups in round-robin style.
> What do people think about this idea?
> 
> It is already easy to set up multiple Tomcats that use different search 
> servers and simply split the traffic by adding 2 or n IPs to your DNS 
> for the same domain.
> 
> Stefan
> 
> [..]


Re: [Nutch-dev] Re: Distributed installation

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Stefan,
Thanks for the suggestions.
I fixed the 'paginating' in my source: it no longer queries all of the 
"pages" (1-10), only the first. If a user clicks on a page that has no 
hits, I forward them back to the last page.
I use servlets with Velocity, not JSP; I completely rewrote the JSP pages 
and made some optimizations, etc.

Stefan Groschupf wrote:

> Ferenc,
> you can fix this easily. Just hardcode the hitsPerPage in the jsp and 
> count the queries per ip to limit them.
> [..]


Re: [Nutch-dev] Re: Distributed installation

Posted by Stefan Groschupf <sg...@media-style.com>.
Ferenc,
you can fix this easily: just hardcode the hitsPerPage in the JSP and 
count the queries per IP to limit them.
I notice Google does not answer queries if the HTTP headers are not 
correct, and the agent identification must be correct as well.
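
A minimal sketch of the per-IP counting (hypothetical code, not something 
from Nutch; tune the limits yourself):

import java.util.HashMap;
import java.util.Map;

/** Tiny per-IP limiter: at most maxQueries per IP in each time window. */
public class IpThrottle {

    private final int maxQueries;
    private final long windowMillis;
    private long windowStart = System.currentTimeMillis();
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public IpThrottle(int maxQueries, long windowMillis) {
        this.maxQueries = maxQueries;
        this.windowMillis = windowMillis;
    }

    /** Returns false when this IP has used up its budget for the window. */
    public synchronized boolean allow(String ip) {
        long now = System.currentTimeMillis();
        if (now - windowStart > windowMillis) { // start a fresh window
            counts.clear();
            windowStart = now;
        }
        Integer seen = counts.get(ip);
        int n = (seen == null) ? 0 : seen.intValue();
        if (n >= maxQueries) {
            return false;                       // over the limit: reject
        }
        counts.put(ip, Integer.valueOf(n + 1));
        return true;
    }
}

In the servlet you would check throttle.allow(request.getRemoteAddr()) at 
the top of the request handler and return an error page when it is false.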
Stefan
On 19.05.2005, at 08:58, yoursoft@freemail.hu wrote:

> [..]


Re: Distributed installation

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Stephan,

Thanks for you fast answer.
I think there are some general 'security' hole in nutch. E.g.: if I make 
queries with hitsPerPage=10000, or if the user press F5 key in IE for 
long time.

In my situation the problem is 'paginating' like google (pages: 1-10). 
If the isTotalIsExact() results false -> research with hitsPerPage * 10.
I think I will set maxHitsPerSite value to 0 for a week, and I will try 
to reanalize how to reprograming the 'paginating'.
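
A hedged sketch of closing the hitsPerPage hole (the parameter name is 
from my Search.java servlet; the limits are just examples):

/** Clamp a numeric request parameter to a sane range; bad or absurd
 *  input falls back to the default instead of driving a huge query. */
public class ParamGuard {
    public static int boundedIntParam(String raw, int dflt, int max) {
        if (raw == null) {
            return dflt;
        }
        try {
            int v = Integer.parseInt(raw.trim());
            return (v < 1 || v > max) ? dflt : v;
        } catch (NumberFormatException e) {
            return dflt;
        }
    }
}

For example:
hitsPerPage = ParamGuard.boundedIntParam(request.getParameter("hitsPerPage"), 10, 100);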

Thanks, Ferenc

Stefan Groschupf wrote:

> I notice similar behaviors.
> [..]


Re: Distributed installation

Posted by Stefan Groschupf <sg...@media-style.com>.
I notice similar behaviors.
I guess the backend servers are not answering fast enough.
I was thinking about having multiple search server groups with identical 
content and then querying the groups in round-robin style.
What do people think about this idea?

It is already easy to set up multiple Tomcats that use different search 
servers and simply split the traffic by adding 2 or n IPs to your DNS for 
the same domain.
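
For illustration, a hypothetical BIND zone fragment for the "2 or n IPs" 
trick (the name and addresses are invented):

; Two A records for the same name: resolvers rotate the order, which
; gives simple round-robin across two tomcat frontends.
search.example.com.   300  IN  A  192.0.2.10
search.example.com.   300  IN  A  192.0.2.11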


Stefan

On 18.05.2005, at 16:59, yoursoft@freemail.hu wrote:

> [..]

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net




Re: Distributed installation

Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Users!

Firstly, sorry for my bad English.
I read Stefan's great documentation at 
http://wiki.media-style.com/display/nutchDocu/.
I built a frontend (P4, 3 GByte RAM, Tomcat 5.5.7, Java 1.4.08) with 3 
backends holding 12 million pages (4 million / backend; AMD64, 4 GByte 
RAM, 64-bit Linux with JDK 1.5_03).

When I start using it with 3-5 queries / sec, after 1-2 minutes the 
frontend no longer answers requests.
In the Tomcat manager / status I see many busy threads (150, and it keeps 
increasing; now 241), all in Stage 'S' (Service).

The backends' usage: top shows 40-60% CPU.
The frontend's usage: 5% CPU.

Do you have any idea what the problem is?

Best Regards,
    Ferenc



Re: Distributed installation

Posted by Giovanni Novelli <gi...@gmail.com>.
There is something wrong with that link
(http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever):

Internal Server Error
The server encountered an internal error or misconfiguration and was
unable to complete your request.

Please contact the server administrator, webmaster@media-style.com and
inform them of the time the error occurred, and anything you might
have done that may have caused the error.

More information about this error may be available in the server error log.

Apache/1.3.26 Server at wiki.media-style.com Port 80

On 5/18/05, Stefan Groschupf <sg...@media-style.com> wrote:
> try:
> http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever
> 
> [..]

Re: Distributed installation

Posted by Stefan Groschupf <sg...@media-style.com>.
try:
http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever

On 17.05.2005, at 10:03, Chetan Sahasrabudhe wrote:

> Hello,
>
>     I am planning to set up a distributed Nutch installation.
> Can anyone point me to a document or link that explains the steps?
>
> Regards
> Chetan
> [..]
-----------information technology-------------------
company:     http://www.media-style.com
forum:           http://www.text-mining.org
blog:	             http://www.find23.net