Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/20 14:44:45 UTC
Crawler/Fetcher Questions
I've noticed a few things that I'm puzzled about with nutch.
When I just do a "nutch crawl" and give it a directory, it creates 3
folders off the root: "db", "index", and "segments".
On the other hand, I can just create a root directory by hand:
-Make 2 folders inside it, "segments" and "db"
-Create an empty web db
-Copy my segments from an existing crawl into the new segments folder
-Run updatedb
-Run index on those newly copied segments
(I've been using this method to combine multiple crawls of single
sites into 1 repository.)
It seems to work fine, but I do not get an "index" folder like the one
"nutch crawl" creates. What is the index folder? Is it OK that I don't
have it? Everything appears to be working.
Second question, which is not as important:
I've been tracking the size of the folders containing the crawls I'm
doing. They seem to grow to, say, 20 MB, then drop to 2 MB and slowly
grow again. Where does this drastic reduction come from? I just hope
I am not losing documents.
Thanks in advance.
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
This is a good answer. Thanks for it,
Ferenc
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Piotr Kosiorowski <pk...@gmail.com>.
If I were you, I would split the 13 million pages into 3 equal or
nearly equal parts and distribute them over the backend servers,
without worrying about how many pages are not correctly indexed in
these segments. I would assume the non-indexed pages are distributed
equally across all segments. This is all a very rough estimate; to go
into details, you would have to take into account the average number
of tokens per page in each segment, and probably the distribution of
tokens across segments.
So to sum up, I would make the rough assumption that all segments
share the distribution features that search speed depends on, and try
it out by splitting into equal parts. Only if that did not work as
expected would I start to think about how to optimize it.
Regards
Piotr
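Piotr's suggestion above can be sketched as code. The greedy scheme below is my own illustration (class name and per-segment page counts are assumptions, not numbers from this thread): it assigns each segment to the backend with the smallest load so far, largest segments first.

```java
import java.util.*;

public class SegmentBalancer {
    /**
     * Greedily assign segments to backends so raw page counts stay
     * balanced: sort segments by size (largest first), then always give
     * the next segment to the backend with the fewest pages so far.
     * Returns, per backend, the list of segment indexes assigned to it.
     */
    public static List<List<Integer>> balance(int[] segmentPages, int backends) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < backends; i++) assignment.add(new ArrayList<>());
        long[] load = new long[backends];

        // Process segments in descending size order for a tighter balance.
        Integer[] order = new Integer[segmentPages.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> segmentPages[y] - segmentPages[x]);

        for (int seg : order) {
            int target = 0;
            for (int b = 1; b < backends; b++) {
                if (load[b] < load[target]) target = b;
            }
            assignment.get(target).add(seg);
            load[target] += segmentPages[seg];
        }
        return assignment;
    }
}
```

This only balances raw page counts and ignores the token-distribution effects Piotr mentions, which is exactly the rough first cut he recommends.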
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Piotr,
Thanks for the answer, but I don't understand how to calculate how
many segments to put on one backend.
How do I calculate the page numbers? In my case, segread reports 13
million pages in the segments, but searching for 'http' returns only
7.5 million.
I have 3 backends, and I would like to balance the segments between them.
On the server I can't use the lukeall tool, because there is no
graphical interface, and copying all segments to a local machine to
view them with lukeall is too much work.
Regards,
Ferenc
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi Ferenc,
'bin/nutch segread -list' reports the number of entries in the fetcher
output, so if the data is not corrupted, it should report the total
number of entries generated during fetchlist generation. Luke, on the
other hand, reports the number of documents in the Lucene index, so it
includes only pages that were correctly processed: it will not count
pages that were not fetched because of errors, pages that were not
parsed successfully, etc. And this is the number returned when you
search for "http", because only correctly indexed pages are
searchable.
Regards
Piotr
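Piotr's distinction (segread counts fetchlist entries; the index counts only the pages that were successfully fetched, parsed, and indexed) suggests a rough way to estimate the searchable load per backend, assuming failures are spread evenly. A minimal sketch; the class and method names are mine:

```java
public class IndexEstimate {
    /**
     * Estimate searchable pages on one backend: scale that backend's raw
     * (segread) page count by the overall indexed fraction, assuming
     * fetch/parse failures are distributed evenly across segments.
     */
    public static long searchablePerBackend(long segreadTotal, long searchableTotal,
                                            long pagesOnBackend) {
        double indexedFraction = (double) searchableTotal / segreadTotal;
        return Math.round(pagesOnBackend * indexedFraction);
    }
}
```

With the numbers from this thread (13 million segread entries, 7.5 million searchable), a backend holding a third of the raw pages would serve roughly 2.5 million searchable pages.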
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Chirag and Byron,
Thanks for the suggestions, but I don't have any problem with other
applications under Tomcat. The problem occurs only with Nutch.
There is a free version of Resin; is it truly better than Tomcat?
Dear Chirag, you wrote to give the backend 1 GB of memory per 1
million pages. How do I calculate the number of pages in the segments?
If I use the 'bin/nutch segread -list' tool, it says a segment has
500000 pages in it.
If I use the 'lukeall.jar' tool, it says there are 420105 records in
that segment.
If I use the 'lukeall.jar' undelete function, there are 438000 records
in the same segment.
If I search for 'http' with the web search engine, it reports the same
number as 'lukeall.jar'.
Which number should I use to calculate pages per backend?
I think my solution for the 'paginating' is better than the others
reported. Any comments?
Thanks, Ferenc
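The "1 GB of memory per 1 million pages" rule of thumb discussed above turns into simple arithmetic once the pages are spread over the backends. A sketch of that calculation (class and method names are mine):

```java
public class HeapSizing {
    /**
     * Heap estimate in GB for one backend under the rule of thumb from
     * this thread: roughly 1 GB of memory per million pages served.
     */
    public static double heapGbPerBackend(long totalPages, int backends) {
        double pagesPerBackend = (double) totalPages / backends;
        return pagesPerBackend / 1000000.0;  // 1 GB per million pages
    }
}
```

For 13 million pages on 3 backends this gives about 4.3 GB per backend; starting from the searchable count (7.5 million) instead gives about 2.5 GB, which is why the choice of count matters.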
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Byron Miller <By...@compaid.com>.
Not that this fixes your Tomcat issues, but I have nothing but good
things to say about Resin.
It handles the load really well, is easy to manage, and is pretty
lightweight for what it does.
I have never had much luck with Tomcat, and believe me, I've tried
many times to go back.
Just my 2 cents.
-byron
-----Original Message-----
From: "yoursoft@freemail.hu" <yo...@freemail.hu>
To: nutch-user@incubator.apache.org
Date: Mon, 23 May 2005 14:53:25 +0200
Subject: Please help: Tomcat problem, Paginating with optimization (Like Google)
Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Users,
I have the following problem:
I use Tomcat 5.5.7 on port 80.
There is no problem with 1-2 queries / 10 secs.
When I use it with more users (5-6 queries / sec), the frontend runs
for e.g. 30 minutes at 1-2% CPU load, and then the load goes up to
60-90%. The query rate does not increase at this point (e.g. in the
last test it was 3 queries / 5 sec at the critical moment). During
this critical period the backends' CPU usage is at most 10-20%.
On the frontend, the Tomcat manager shows many threads in 'service
status' for a long time (e.g. 586 sec).
After 10 minutes, catalina.out shows the bean.search time increasing
to over 10-40 sec, and after a few more minutes there are many
'NullPointerException' messages.
I have rewritten the JSP pages as servlets.
Here is the source of my Search.java, with 'Google'-style paginating.
The source code is optimized (minimal object creation, sb.append, etc.):
Beginning of Java code
------------------------------------------------------------------------------
package org.nutch;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.ServletException;
import org.apache.velocity.exception.*;
import org.apache.velocity.Template;
import org.apache.velocity.app.Velocity;
import org.apache.velocity.context.Context;
import org.apache.velocity.servlet.VelocityServlet;
import net.nutch.html.Entities;
import java.util.*;
import net.nutch.searcher.*;
import net.nutch.plugin.*;
import java.io.*;
import java.net.*;

public class Search extends VelocityServlet {

    private NutchBean bean;

    public void init() throws ServletException {
        try {
            bean = NutchBean.get(this.getServletContext());
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    private static final String getStringParameter(String name, String charset,
                                                   HttpServletRequest req) {
        String value = req.getParameter(name);
        if (value == null) {
            value = "";
        }
        try {
            // Re-decode the raw ISO-8859-1 bytes with the requested charset.
            value = new String(value.getBytes("8859_1"), charset);
        } catch (Exception e) { /* keep the original value */ }
        return value;
    }

    public Template handleRequest(HttpServletRequest request,
                                  HttpServletResponse response,
                                  Context par_tartalom)
            throws java.io.IOException, ServletException {
        Template loc_template = null;
        if (bean == null) { return loc_template; }

        int start = 0;                  // first hit to display
        String startString = request.getParameter("start");
        if (startString != null) {
            start = Integer.parseInt(startString);
            if (start < 0) {
                start = 0;
            }
        }

        int hitsPerPage = 10;           // number of hits to display
        String hitsString = request.getParameter("hitsPerPage");
        if (hitsString != null) {
            hitsPerPage = Integer.parseInt(hitsString);
        }

        // Get the character encoding to use when interpreting request values.
        String charset = request.getParameter("charset");
        if (charset == null) {
            charset = "UTF8";
        }

        // Get query from request.
        String queryString = getStringParameter("query", charset, request);
        try {
            StringBuffer sb = new StringBuffer();
            StringBuffer sb2 = new StringBuffer();
            StringBuffer sb3 = new StringBuffer();

            // Query string for HTML.
            String htmlQueryString = Entities.encode(queryString);
            String htmlQueryStringISO = URLEncoder.encode(queryString, "ISO-8859-2");
            String htmlQueryStringUTF = URLEncoder.encode(queryString, "UTF8");

            // Get more parameters.
            if (hitsPerPage > 100) {    // no more than 100 hits per page
                hitsPerPage = 10;
            }
            int hitsPerSite = 2;        // max hits per site
            String hitsPerSiteString = request.getParameter("hitsPerSite");
            if (hitsPerSiteString != null) {
                hitsPerSite = Integer.parseInt(hitsPerSiteString);
            }

            Query query = Query.parse(queryString);
            int hitsLength = 0;
            int end = 0;
            int length = 0;
            log(sb.append("Query: ").append(request.getRemoteAddr())
                  .append(" - ").append(queryString).toString());

            long startTime = System.currentTimeMillis();
            Hits hits = bean.search(query, start + hitsPerPage, hitsPerSite);
            hitsLength = (int) hits.getTotal();
            int length2 = hits.getLength();
            par_tartalom.put("TIME",
                String.valueOf((System.currentTimeMillis() - startTime) / 1000.0));

            if (length2 <= start) {     // no hits after 'start'
                start = length2 - (length2 % hitsPerPage);
                if (length2 == start) {
                    start = start - hitsPerPage;
                    if (start < 0) start = 0;
                }
                end = length2;
            } else {                    // some hits after 'start'
                end = length2 < start + hitsPerPage ? length2 : start + hitsPerPage;
            }
            length = end - start;

            Hit[] show = hits.getHits(start, end - start);
            HitDetails[] details = bean.getDetails(show);
            String[] summaries = bean.getSummary(details, query);

            sb.setLength(0);
            log(sb.append("Query: ").append(request.getRemoteAddr())
                  .append(" - ").append(queryString)
                  .append(" - Total hits: ").append(hits.getTotal())
                  .append(" - Time: ").append(System.currentTimeMillis() - startTime)
                  .toString());

            par_tartalom.put("START", new Long((end == 0) ? 0 : (start + 1))); // start of hits
            par_tartalom.put("END", new Long(end));              // end of hits
            par_tartalom.put("CNT", new Long(hits.getTotal()));  // count of hits
            par_tartalom.put("QRY", htmlQueryString);            // UTF8
            par_tartalom.put("QRY2", htmlQueryStringISO);        // ISO charset

            // ******************************************************
            // List hits
            // ******************************************************
            sb.setLength(0);
            sb2.setLength(0);
            sb3.setLength(0);
            sb3.append("?idx=");
            Hit hit = null;
            HitDetails detail = null;
            String title = null;
            String url = null;
            String summary = null;
            for (int i = 0, j; i < length; i++) {  // display the hits
                hit = show[i];
                detail = details[i];
                title = detail.getValue("title");
                url = detail.getValue("url");
                summary = summaries[i];
                sb3.setLength(5);
                sb3.append(hit.getIndexNo()).append("&id=").append(hit.getIndexDocNo());
                if (title == null || title.equals("")) {  // use URL for docs w/o title
                    title = url;
                }
                ... Same with search.jsp ...
            }
            if (length2 <= start + hitsPerPage && hitsLength != 0) {
                // fewer hits than hitsPerPage remain
                sb.append("<span style=\"FONT-WEIGHT: bold; color: black;\">"
                          + "This is the last page.</span><br><br>");
                hitsLength = length2;  // paginating length
            }
            par_tartalom.put("LIST", sb.toString());

            // ******************************************************
            // Paginating
            // ******************************************************
            int pageNumber = 0;
            sb.setLength(0);
            sb2.setLength(0);
            sb2.append("&hitsPerPage=");
            sb2.append(hitsPerPage);
            sb3.setLength(0);
            sb3.append("<a href=\"Keres?query=").append(htmlQueryStringUTF)
               .append("&start=");

            // Prev (<<)
            sb.append("<td width=60 class=pages align=right>");
            if (start > 0) {
                long prevStart = start - hitsPerPage;
                prevStart = prevStart > 0 ? prevStart : 0;
                sb.append(sb3);       // query
                sb.append(prevStart); // start
                sb.append(sb2);       // others
                sb.append("\" class=pages><font class=arrow><<</font> << </a>");
            }
            sb.append("</td><td class=pages width=250 align=center>");

            if (hitsLength > hitsPerPage) {      // if there are more pages
                if (start >= 9 * hitsPerPage) {  // from page 10 (from 90)
                    int startPageNumber = start - (4 * hitsPerPage);
                    if (startPageNumber < 0) startPageNumber = 0;
                    pageNumber = (startPageNumber - 1) / hitsPerPage + 1;
                    for (int i = startPageNumber; i < hitsLength; i = i + hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b> </font>");
                        } else {
                            sb.append(sb3); // query
                            sb.append(i);   // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a> ");
                        }
                        if (i >= startPageNumber + 8 * hitsPerPage) {
                            break;
                        }
                    }
                } else {  // within the first 9 pages
                    for (int i = 0; i < hitsLength; i = i + hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b> </font>");
                        } else {
                            sb.append(sb3); // query
                            sb.append(i);   // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a> ");
                        }
                        if (pageNumber >= 10) {
                            break;
                        }
                    }
                }
            }
            sb.append("</td><td width=100 class=pages>");

            // Next (>>)
            if (hitsLength > start + hitsPerPage) {
                if (end < hitsLength) {
                    sb.append(sb3); // query
                    sb.append(end); // start
                    sb.append(sb2); // others
                    sb.append("\" class=pages> >> <font class=arrow>>></font></a>");
                }
            }
            sb.append("</td>");
            par_tartalom.put("PAGES", sb.toString());

            loc_template = Velocity.getTemplate("search.vm");
        // } catch (ArrayIndexOutOfBoundsException ae) {
        //     log(request.getRemoteAddr() + " - forwarding - " + queryString);
        } catch (Exception e) {
            log(e.toString() + " - " + request.getRemoteAddr() + " - " + queryString);
            // throw new ServletException(e);
        }
        return loc_template;
    }
}
------------------------------------------------------------------------------
End of Java code.
Do you have any idea what the source of the problem might be?
Best Regards,
Ferenc
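The subtlest part of Search.java above is the block that clamps 'start' back to the last full page when a request asks for hits beyond the end of the result list. Extracted into a standalone method (the class and method names are mine) so the edge cases can be checked in isolation:

```java
public class PageWindow {
    /**
     * Compute the [start, end) window of hits to display. If 'start'
     * points past the available hits, clamp it back to the beginning of
     * the last non-empty page; mirrors the length2/start/hitsPerPage
     * logic in Search.java's handleRequest.
     */
    public static int[] window(int available, int start, int hitsPerPage) {
        int end;
        if (available <= start) {          // user paged past the last hit
            start = available - (available % hitsPerPage);
            if (available == start) {      // landed exactly on a page boundary
                start = start - hitsPerPage;
                if (start < 0) start = 0;
            }
            end = available;
        } else {
            end = Math.min(available, start + hitsPerPage);
        }
        return new int[] { start, end };
    }
}
```

For example, with 25 available hits, 10 per page, and start=40, the window comes back as [20, 25): the last partial page rather than an empty one.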
Re: Crawler/Fetcher Questions
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
1. If you want an 'index' folder, use 'bin/nutch merge'; this will
merge the indexes from 'segments/*/index' into the 'index' folder.
2. I had the same situation when I used the default value (1) of
'fetcher.threads.per.host' in nutch-site.xml.
In that case you can find some "can't connect to host" error messages
in the fetcher.log.
Re: Crawler/Fetcher Questions
Posted by Byron Miller <By...@compaid.com>.
The index folder is created when you merge indexes. It is not needed,
but it usually enhances performance. Crawl is probably merging the
indexes automagically, while the manual process won't.
During crawl/segment creation and indexing, tons of files get created,
and the optimize process goes through and cleans this up a bit.
-byron
-----Original Message-----
From: Ian Reardon <ir...@gmail.com>
To: nutch-user@incubator.apache.org
Date: Fri, 20 May 2005 08:44:45 -0400
Subject: Crawler/Fetcher Questions