
Posted to user@nutch.apache.org by Alexandre <al...@gmail.com> on 2012/09/19 13:14:59 UTC

Recrawling and segment cleanup

Hi,

we are currently running into a small problem with the segment folders
created during crawling.

Our situation is as follows:
We are trying to set up a Nutch crawler that crawls / recrawls on a regular
basis with a fixed depth. How to set this up is already clear to us and it is
working as intended.
(http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html)

Our general solution looks (from the process point of view) like this:

  1. Inject
  Loop Recrawl {
      Loop (depth) {
        2. Generate
        3. Fetch
        4. Parse
        5. UpdateDB
      }
    6. InvertLinks
    7. SOLRIndex
    8. SOLRDedup
  }

The problem we now have is that a new segment (folder) is created for each
crawl / recrawl and for each depth loop (which is in fact nothing other than
a normal crawl).

Our main questions now are:
1) when can we delete / possibly merge these segment folders, and
2) what are they used for in the future?

For now we automatically delete all segment folders after each complete
crawl (after each step 8. SOLRDedup) and it seems to work fine for us. Does
this even make sense?

I think we have to admit that we are not entirely aware of what kind of
information is contained within the crawl DB and the segment folder.

Thanks a lot for your help in advance and kind regards,
Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-tp4008865.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Recrawling and segment cleanup

Posted by Alexandre <al...@gmail.com>.

Thanks for your explanation. It is clearer to me now.




RE: Recrawling and segment cleanup

Posted by Markus Jelsma <ma...@openindex.io>.


 
 
-----Original message-----
> From:Alexandre <al...@gmail.com>
> Sent: Wed 19-Sep-2012 13:18
> To: user@nutch.apache.org
> Subject: Recrawling and segment cleanup
> 
> Hi,
> 
> we are currently running into a small problem with the segment folders
> created during crawling.
> 
> Our situation is as follows:
> We are trying to set up a Nutch crawler that crawls / recrawls on a regular
> basis with a fixed depth. How to set this up is already clear to us and it is
> working as intended.
> (http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html)
> 
> Our general solution looks (from the process point of view) like this:
> 
>   1. Inject
>   Loop Recrawl {
>       Loop (depth) {
>         2. Generate
>         3. Fetch
>         4. Parse
>         5. UpdateDB
>       }
>     6. InvertLinks
>     7. SOLRIndex
>     8. SOLRDedup
>   }
> 
> The problem we now have is that a new segment (folder) is created for each
> crawl / recrawl and for each depth loop (which is in fact nothing other than
> a normal crawl).
> 
> Our main questions now are:
>    1) when can we delete / possibly merge these segment folders, and

You can merge them whenever you want. We merge all segments daily and monthly because we may have to reindex occasionally.
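
A minimal sketch of how such a daily merge could be prepared, assuming the segment folders carry Nutch's usual timestamp names (yyyyMMddHHmmss). The helper names and the crawl/segments and crawl/merged paths are hypothetical; the script only builds the mergesegs command line and does not run it.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: segment directories are named by creation time, e.g. 20120919131459.
SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"


def group_segments_by_day(segment_names):
    """Map 'YYYY-MM-DD' to the sorted segment names created on that day."""
    groups = defaultdict(list)
    for name in segment_names:
        try:
            created = datetime.strptime(name, SEGMENT_NAME_FORMAT)
        except ValueError:
            continue  # skip anything that is not a timestamped segment folder
        groups[created.strftime("%Y-%m-%d")].append(name)
    return {day: sorted(names) for day, names in groups.items()}


def merge_command(day, names, segments_dir="crawl/segments", out_root="crawl/merged"):
    """Build (but do not run) a mergesegs call for one day's segments.

    The directory layout is a hypothetical example, not something Nutch requires.
    """
    inputs = [f"{segments_dir}/{name}" for name in names]
    return ["bin/nutch", "mergesegs", f"{out_root}/{day}"] + inputs


if __name__ == "__main__":
    found = ["20120919131459", "20120919101500", "20120918090000"]
    for day, names in sorted(group_segments_by_day(found).items()):
        print(" ".join(merge_command(day, names)))
```

The original segments would only be removed once the merged output has been checked, since the merge writes a new segment and leaves its inputs in place.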

>    2) what are they used for in the future.

They are only used for reindexing or for rebuilding data structures such as the crawldb, webgraph or linkdb.

> 
> For now we automatically delete all segment folders after each complete
> crawl (after each step 8. SOLRDedup) and it seems to work fine for us. Does
> this even make sense?

Sure. If you don't need them.

> 
> I think we have to admit that we are not entirely aware of what kind of
> information is contained within the crawl DB and the segment folder.

All of the databases contain <url, object> key/value pairs. The CrawlDB contains the state of every URL, and the segments contain structures such as the generated fetch list, info on the fetched records, parse data (outlinks and such) and parsed text. All this information is key/value based.
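
To make the key/value picture concrete, here is a toy model in Python. It is not Nutch's actual code or on-disk format: the Segment class, the update_db function and the plain status strings are simplifications invented for illustration. It only shows why the CrawlDB keeps the URL state once a segment has been folded in, so a segment that has already been indexed is needed again only for reindexing or rebuilding.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for one segment folder; every part is keyed by URL."""
    name: str
    crawl_generate: dict = field(default_factory=dict)  # url -> fetch list entry
    crawl_fetch: dict = field(default_factory=dict)     # url -> fetch status
    parse_data: dict = field(default_factory=dict)      # url -> list of outlinks
    parse_text: dict = field(default_factory=dict)      # url -> extracted text


def update_db(crawldb, segment):
    """Fold a segment into the crawldb (url -> state), loosely like step 5."""
    for url, status in segment.crawl_fetch.items():
        crawldb[url] = status  # fetched URLs take the new status
    for outlinks in segment.parse_data.values():
        for link in outlinks:
            crawldb.setdefault(link, "db_unfetched")  # newly discovered URLs
    return crawldb


if __name__ == "__main__":
    seg = Segment(name="20120919131459")
    seg.crawl_fetch["http://example.org/"] = "db_fetched"
    seg.parse_data["http://example.org/"] = ["http://example.org/a"]
    print(update_db({"http://example.org/": "db_unfetched"}, seg))
```

After update_db has run, the dictionary no longer depends on the Segment object, which mirrors why deleting indexed segments does not lose crawl state.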

> 
> Thanks a lot for your help in advance and kind regards,
> Alex
> 