Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/20 14:44:45 UTC
Crawler/Fetcher Questions
I've noticed a few things that I'm puzzled about with nutch.
When I just do a "nutch crawl" and give it a directory, it creates 3
folders off the root: "db", "index", and "segments".
On the other hand, I can just create a root directory by hand:
-Make 2 folders inside it, "segments" and "db"
-Create an empty web db
-Copy my segments from an existing crawl into the new segments folder
-Run updatedb
-Run index on those newly copied segments
(I've been using this method to combine multiple crawls of single
sites into 1 repository.)
It seems to work fine, but I do not get an "index" folder like the one
"nutch crawl" creates. What is the index folder? Is it OK that I don't
have it? Everything appears to be working.
Second question, which is not as important:
I've been tracking the size of the folders containing the crawls I'm
doing. They seem to grow to, say, 20 MB, then drop to 2 MB and slowly
grow again. Where does this drastic reduction come from? I just hope
I am not losing documents.
Thanks in advance.
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
This is a good answer. Thanks for it,
Ferenc
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Piotr Kosiorowski <pk...@gmail.com>.
If I were you, I would split the 13 million pages into 3 equal or
nearly equal parts and distribute them over the backend servers,
without worrying about how many pages are not correctly indexed in
these segments. I would assume the non-indexed pages are distributed
equally across all segments. This is all a very rough estimate; to go
into details, you would have to take into account the average number
of tokens per page in each segment, and probably the distribution of
tokens across segments.
So to sum up, I would make the rough assumption that all segments
share the distribution features that search speed depends on, and try
it out by splitting into equal parts. Only if that did not work as
expected would I start to think about how to optimize it.
Regards
Piotr
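Piotr's suggestion above can be sketched as code. The greedy scheme below is my own illustration (class name and per-segment page counts are assumptions, not numbers from this thread): it assigns each segment to the backend with the smallest load so far, largest segments first.

```java
import java.util.*;

public class SegmentBalancer {
    /**
     * Greedily assign segments to backends so raw page counts stay
     * balanced: sort segments by size (largest first), then always give
     * the next segment to the backend with the fewest pages so far.
     * Returns, per backend, the list of segment indexes assigned to it.
     */
    public static List<List<Integer>> balance(int[] segmentPages, int backends) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < backends; i++) assignment.add(new ArrayList<>());
        long[] load = new long[backends];

        // Process segments in descending size order for a tighter balance.
        Integer[] order = new Integer[segmentPages.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> segmentPages[y] - segmentPages[x]);

        for (int seg : order) {
            int target = 0;
            for (int b = 1; b < backends; b++) {
                if (load[b] < load[target]) target = b;
            }
            assignment.get(target).add(seg);
            load[target] += segmentPages[seg];
        }
        return assignment;
    }
}
```

This only balances raw page counts and ignores the token-distribution effects Piotr mentions, which is exactly the rough first cut he recommends.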
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Hi Piotr,
Thanks for the answer, but I don't understand how to calculate how
many segments to put on one backend.
How do I calculate the page numbers? In my case, segread reports 13
million pages in the segments, but searching for 'http' returns only
7.5 million.
I have 3 backends, and I would like to balance the segments between them.
On the server I can't use the lukeall tool, because there is no
graphical interface, and copying all segments to a local machine to
view them with lukeall is too much work.
Regards,
Ferenc
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi Ferenc,
'bin/nutch segread -list' reports the number of entries in the fetcher
output, so if the data is not corrupted, it should report the total
number of entries generated during fetchlist generation. Luke, on the
other hand, reports the number of documents in the Lucene index, so it
includes only pages that were correctly processed: it will not count
pages that were not fetched because of errors, pages that were not
parsed successfully, etc. And this is the number returned when you
search for "http", because only correctly indexed pages are
searchable.
Regards
Piotr
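Piotr's distinction (segread counts fetchlist entries; the index counts only the pages that were successfully fetched, parsed, and indexed) suggests a rough way to estimate the searchable load per backend, assuming failures are spread evenly. A minimal sketch; the class and method names are mine:

```java
public class IndexEstimate {
    /**
     * Estimate searchable pages on one backend: scale that backend's raw
     * (segread) page count by the overall indexed fraction, assuming
     * fetch/parse failures are distributed evenly across segments.
     */
    public static long searchablePerBackend(long segreadTotal, long searchableTotal,
                                            long pagesOnBackend) {
        double indexedFraction = (double) searchableTotal / segreadTotal;
        return Math.round(pagesOnBackend * indexedFraction);
    }
}
```

With the numbers from this thread (13 million segread entries, 7.5 million searchable), a backend holding a third of the raw pages would serve roughly 2.5 million searchable pages.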
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Chirag and Byron,
Thanks for the suggestions, but I don't have any problem with other
applications under Tomcat. The problem occurs only with Nutch.
There is a free version of Resin; is it truly better than Tomcat?
Dear Chirag, you wrote to give the backend 1 GB of memory per 1
million pages. How do I calculate the number of pages in the segments?
If I use the 'bin/nutch segread -list' tool, it says a segment has
500000 pages in it.
If I use the 'lukeall.jar' tool, it says there are 420105 records in
that segment.
If I use the 'lukeall.jar' undelete function, there are 438000 records
in the same segment.
If I search for 'http' with the web search engine, it reports the same
number as 'lukeall.jar'.
Which number should I use to calculate pages per backend?
I think my solution for the 'paginating' is better than the others
reported. Any comments?
Thanks, Ferenc
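The "1 GB of memory per 1 million pages" rule of thumb discussed above turns into simple arithmetic once the pages are spread over the backends. A sketch of that calculation (class and method names are mine):

```java
public class HeapSizing {
    /**
     * Heap estimate in GB for one backend under the rule of thumb from
     * this thread: roughly 1 GB of memory per million pages served.
     */
    public static double heapGbPerBackend(long totalPages, int backends) {
        double pagesPerBackend = (double) totalPages / backends;
        return pagesPerBackend / 1000000.0;  // 1 GB per million pages
    }
}
```

For 13 million pages on 3 backends this gives about 4.3 GB per backend; starting from the searchable count (7.5 million) instead gives about 2.5 GB, which is why the choice of count matters.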
Re: Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by Byron Miller <By...@compaid.com>.
Not that this fixes your Tomcat issues, but I have nothing but good
things to say about Resin.
It handles the load really well, is easy to manage, and is pretty
lightweight for what it does.
I have never had much luck with Tomcat, and believe me, I've tried
many times to go back.
Just my 2 cents.
-byron
-----Original Message-----
From: "yoursoft@freemail.hu" <yo...@freemail.hu>
To: nutch-user@incubator.apache.org
Date: Mon, 23 May 2005 14:53:25 +0200
Subject: Please help: Tomcat problem, Paginating with optimization (Like Google)
Please help: Tomcat problem, Paginating with optimization (Like Google)
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
Dear Users,
I have the following problem:
I use Tomcat 5.5.7 on port 80.
There is no problem with 1-2 queries / 10 secs.
When I use it with more users (5-6 queries / sec), the frontend runs
for e.g. 30 minutes at 1-2% CPU load, and then the load goes up to
60-90%. The query rate does not increase at this point (e.g. in the
last test it was 3 queries / 5 sec at the critical moment). During
this critical period the backends' CPU usage is at most 10-20%.
On the frontend, the Tomcat manager shows many threads in 'service
status' for a long time (e.g. 586 sec).
After 10 minutes, catalina.out shows the bean.search time increasing
to over 10-40 sec, and after a few more minutes there are many
'NullPointerException' messages.
I have rewritten the JSP pages as servlets.
Here is the source of my Search.java, with 'Google'-style paginating.
The source code is optimized (minimal object creation, sb.append, etc.):
Beginning of Java code
------------------------------------------------------------------------------
package org.nutch;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.ServletException;
import org.apache.velocity.exception.*;
import org.apache.velocity.Template;
import org.apache.velocity.app.Velocity;
import org.apache.velocity.context.Context;
import org.apache.velocity.servlet.VelocityServlet;
import net.nutch.html.Entities;
import java.util.*;
import net.nutch.searcher.*;
import net.nutch.plugin.*;
import java.io.*;
import java.net.*;

public class Search extends VelocityServlet {

    private NutchBean bean;

    public void init() throws ServletException {
        try {
            bean = NutchBean.get(this.getServletContext());
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }

    private static final String getStringParameter(String name, String charset,
                                                   HttpServletRequest req) {
        String value = req.getParameter(name);
        if (value == null) {
            value = "";
        }
        try {
            // Re-decode the raw ISO-8859-1 bytes with the requested charset.
            value = new String(value.getBytes("8859_1"), charset);
        } catch (Exception e) { /* keep the original value */ }
        return value;
    }

    public Template handleRequest(HttpServletRequest request,
                                  HttpServletResponse response,
                                  Context par_tartalom)
            throws java.io.IOException, ServletException {
        Template loc_template = null;
        if (bean == null) { return loc_template; }

        int start = 0;                  // first hit to display
        String startString = request.getParameter("start");
        if (startString != null) {
            start = Integer.parseInt(startString);
            if (start < 0) {
                start = 0;
            }
        }

        int hitsPerPage = 10;           // number of hits to display
        String hitsString = request.getParameter("hitsPerPage");
        if (hitsString != null) {
            hitsPerPage = Integer.parseInt(hitsString);
        }

        // Get the character encoding to use when interpreting request values.
        String charset = request.getParameter("charset");
        if (charset == null) {
            charset = "UTF8";
        }

        // Get query from request.
        String queryString = getStringParameter("query", charset, request);
        try {
            StringBuffer sb = new StringBuffer();
            StringBuffer sb2 = new StringBuffer();
            StringBuffer sb3 = new StringBuffer();

            // Query string for HTML.
            String htmlQueryString = Entities.encode(queryString);
            String htmlQueryStringISO = URLEncoder.encode(queryString, "ISO-8859-2");
            String htmlQueryStringUTF = URLEncoder.encode(queryString, "UTF8");

            // Get more parameters.
            if (hitsPerPage > 100) {    // no more than 100 hits per page
                hitsPerPage = 10;
            }
            int hitsPerSite = 2;        // max hits per site
            String hitsPerSiteString = request.getParameter("hitsPerSite");
            if (hitsPerSiteString != null) {
                hitsPerSite = Integer.parseInt(hitsPerSiteString);
            }

            Query query = Query.parse(queryString);
            int hitsLength = 0;
            int end = 0;
            int length = 0;
            log(sb.append("Query: ").append(request.getRemoteAddr())
                  .append(" - ").append(queryString).toString());

            long startTime = System.currentTimeMillis();
            Hits hits = bean.search(query, start + hitsPerPage, hitsPerSite);
            hitsLength = (int) hits.getTotal();
            int length2 = hits.getLength();
            par_tartalom.put("TIME",
                String.valueOf((System.currentTimeMillis() - startTime) / 1000.0));

            if (length2 <= start) {     // no hits after 'start'
                start = length2 - (length2 % hitsPerPage);
                if (length2 == start) {
                    start = start - hitsPerPage;
                    if (start < 0) start = 0;
                }
                end = length2;
            } else {                    // some hits after 'start'
                end = length2 < start + hitsPerPage ? length2 : start + hitsPerPage;
            }
            length = end - start;

            Hit[] show = hits.getHits(start, end - start);
            HitDetails[] details = bean.getDetails(show);
            String[] summaries = bean.getSummary(details, query);

            sb.setLength(0);
            log(sb.append("Query: ").append(request.getRemoteAddr())
                  .append(" - ").append(queryString)
                  .append(" - Total hits: ").append(hits.getTotal())
                  .append(" - Time: ").append(System.currentTimeMillis() - startTime)
                  .toString());

            par_tartalom.put("START", new Long((end == 0) ? 0 : (start + 1))); // start of hits
            par_tartalom.put("END", new Long(end));              // end of hits
            par_tartalom.put("CNT", new Long(hits.getTotal()));  // count of hits
            par_tartalom.put("QRY", htmlQueryString);            // UTF8
            par_tartalom.put("QRY2", htmlQueryStringISO);        // ISO charset

            // ******************************************************
            // List hits
            // ******************************************************
            sb.setLength(0);
            sb2.setLength(0);
            sb3.setLength(0);
            sb3.append("?idx=");
            Hit hit = null;
            HitDetails detail = null;
            String title = null;
            String url = null;
            String summary = null;
            for (int i = 0, j; i < length; i++) {  // display the hits
                hit = show[i];
                detail = details[i];
                title = detail.getValue("title");
                url = detail.getValue("url");
                summary = summaries[i];
                sb3.setLength(5);
                sb3.append(hit.getIndexNo()).append("&id=").append(hit.getIndexDocNo());
                if (title == null || title.equals("")) {  // use URL for docs w/o title
                    title = url;
                }
                ... Same with search.jsp ...
            }
            if (length2 <= start + hitsPerPage && hitsLength != 0) {
                // fewer hits than hitsPerPage remain
                sb.append("<span style=\"FONT-WEIGHT: bold; color: black;\">"
                          + "This is the last page.</span><br><br>");
                hitsLength = length2;  // paginating length
            }
            par_tartalom.put("LIST", sb.toString());

            // ******************************************************
            // Paginating
            // ******************************************************
            int pageNumber = 0;
            sb.setLength(0);
            sb2.setLength(0);
            sb2.append("&hitsPerPage=");
            sb2.append(hitsPerPage);
            sb3.setLength(0);
            sb3.append("<a href=\"Keres?query=").append(htmlQueryStringUTF)
               .append("&start=");

            // Prev (<<)
            sb.append("<td width=60 class=pages align=right>");
            if (start > 0) {
                long prevStart = start - hitsPerPage;
                prevStart = prevStart > 0 ? prevStart : 0;
                sb.append(sb3);       // query
                sb.append(prevStart); // start
                sb.append(sb2);       // others
                sb.append("\" class=pages><font class=arrow><<</font> << </a>");
            }
            sb.append("</td><td class=pages width=250 align=center>");

            if (hitsLength > hitsPerPage) {      // if there are more pages
                if (start >= 9 * hitsPerPage) {  // from page 10 (from 90)
                    int startPageNumber = start - (4 * hitsPerPage);
                    if (startPageNumber < 0) startPageNumber = 0;
                    pageNumber = (startPageNumber - 1) / hitsPerPage + 1;
                    for (int i = startPageNumber; i < hitsLength; i = i + hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b> </font>");
                        } else {
                            sb.append(sb3); // query
                            sb.append(i);   // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a> ");
                        }
                        if (i >= startPageNumber + 8 * hitsPerPage) {
                            break;
                        }
                    }
                } else {  // within the first 9 pages
                    for (int i = 0; i < hitsLength; i = i + hitsPerPage) {
                        pageNumber++;
                        if (start == i) {
                            sb.append("<font class=active_nr><b>");
                            sb.append(pageNumber);
                            sb.append("</b> </font>");
                        } else {
                            sb.append(sb3); // query
                            sb.append(i);   // start
                            sb.append(sb2); // others
                            sb.append("\" class=inactive_nr>");
                            sb.append(pageNumber);
                            sb.append("</a> ");
                        }
                        if (pageNumber >= 10) {
                            break;
                        }
                    }
                }
            }
            sb.append("</td><td width=100 class=pages>");

            // Next (>>)
            if (hitsLength > start + hitsPerPage) {
                if (end < hitsLength) {
                    sb.append(sb3); // query
                    sb.append(end); // start
                    sb.append(sb2); // others
                    sb.append("\" class=pages> >> <font class=arrow>>></font></a>");
                }
            }
            sb.append("</td>");
            par_tartalom.put("PAGES", sb.toString());

            loc_template = Velocity.getTemplate("search.vm");
        // } catch (ArrayIndexOutOfBoundsException ae) {
        //     log(request.getRemoteAddr() + " - forwarding - " + queryString);
        } catch (Exception e) {
            log(e.toString() + " - " + request.getRemoteAddr() + " - " + queryString);
            // throw new ServletException(e);
        }
        return loc_template;
    }
}
------------------------------------------------------------------------------
End of Java code.
Do you have any idea what the source of the problem might be?
Best Regards,
Ferenc
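The subtlest part of Search.java above is the block that clamps 'start' back to the last full page when a request asks for hits beyond the end of the result list. Extracted into a standalone method (the class and method names are mine) so the edge cases can be checked in isolation:

```java
public class PageWindow {
    /**
     * Compute the [start, end) window of hits to display. If 'start'
     * points past the available hits, clamp it back to the beginning of
     * the last non-empty page; mirrors the length2/start/hitsPerPage
     * logic in Search.java's handleRequest.
     */
    public static int[] window(int available, int start, int hitsPerPage) {
        int end;
        if (available <= start) {          // user paged past the last hit
            start = available - (available % hitsPerPage);
            if (available == start) {      // landed exactly on a page boundary
                start = start - hitsPerPage;
                if (start < 0) start = 0;
            }
            end = available;
        } else {
            end = Math.min(available, start + hitsPerPage);
        }
        return new int[] { start, end };
    }
}
```

For example, with 25 available hits, 10 per page, and start=40, the window comes back as [20, 25): the last partial page rather than an empty one.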
Re: Crawler/Fetcher Questions
Posted by "yoursoft@freemail.hu" <yo...@freemail.hu>.
1. If you want an 'index' folder, use 'bin/nutch merge'; this will
merge the indexes from 'segments/*/index' into the 'index' folder.
2. I had the same situation when I used the default value (1) of
'fetcher.threads.per.host' in nutch-site.xml.
In that case you can find some "can't connect to host" error messages
in the fetcher.log.
Re: Crawler/Fetcher Questions
Posted by Byron Miller <By...@compaid.com>.
The index folder is created when you merge indexes. It is not needed,
but it usually enhances performance. Crawl is probably merging the
indexes automagically, while the manual process won't.
During crawl/segment creation and indexing, tons of files get created,
and the optimize process goes through and cleans this up a bit.
-byron
-----Original Message-----
From: Ian Reardon <ir...@gmail.com>
To: nutch-user@incubator.apache.org
Date: Fri, 20 May 2005 08:44:45 -0400
Subject: Crawler/Fetcher Questions