Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/11/09 16:23:28 UTC

SegmentMerger behavior

Hello all,

when I have segments from two crawls, the first one from the initial
crawl and the second one from a recrawl, how will they be merged?

I mean:

*) When site A has changed between the crawls, what content will be in
the merged segment: the old one, the new one, or both?

Thanks :)


Re: SegmentMerger behavior

Posted by Marek Bachmann <m....@uni-kassel.de>.
On 09.11.2011 18:03, Andrzej Bialecki wrote:
> On 09/11/2011 16:30, Marek Bachmann wrote:
>> Am 09.11.2011 16:27, schrieb Markus Jelsma:
>>> the most recent item
>>>
>>> On Wednesday 09 November 2011 16:23:28 Marek Bachmann wrote:
>>>> Hello all,
>>>>
>>>> when I have segments from two crawls, the first one from initial
>>>> crawling and the second on from recrawl, how will they be merged?
>>>>
>>>> I mean:
>>>>
>>>> *) When site A has changed between the crawl, what content will be in
>>>> the merged segment. The old one or the new one (or both)?
>>>>
>>>> Thanks :)
>>>
>>
>> Thank you! :)
>>
> 
> Note: please consult the javadocs for SegmentMerger. Timestamps of some
> parts of segments are difficult to determine, so the "latest" means
> "coming from a segment with a name in highest lexicographic order".
> 
> In practice, if your segments are named after a timestamp, all things
> should work ok. However, if you rename the latest segment to e.g.
> 0000-most-recent then results will be not what you expected.
> 

Thank you, Andrzej, for the advice! :) I won't rename them since I need
the timestamp structure for finding the ongoing one in my crawl
scripts. So it should work for me.

Re: SegmentMerger behavior

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 09/11/2011 16:30, Marek Bachmann wrote:
> Am 09.11.2011 16:27, schrieb Markus Jelsma:
>> the most recent item
>>
>> On Wednesday 09 November 2011 16:23:28 Marek Bachmann wrote:
>>> Hello all,
>>>
>>> when I have segments from two crawls, the first one from initial
>>> crawling and the second on from recrawl, how will they be merged?
>>>
>>> I mean:
>>>
>>> *) When site A has changed between the crawl, what content will be in
>>> the merged segment. The old one or the new one (or both)?
>>>
>>> Thanks :)
>>
>
> Thank you! :)
>

Note: please consult the javadocs for SegmentMerger. Timestamps of some 
parts of segments are difficult to determine, so the "latest" means 
"coming from a segment with a name in highest lexicographic order".

In practice, if your segments are named after a timestamp, everything
should work fine. However, if you rename the latest segment to e.g.
0000-most-recent, it no longer sorts highest and the results will not
be what you expect.
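
To make the ordering rule concrete, here is a minimal sketch of plain
lexicographic comparison on segment names. It is not SegmentMerger's code,
only an illustration of the string ordering described above; the helper
latest_segment and the sample names are hypothetical.

```python
def latest_segment(segment_names):
    """Return the name that sorts highest in lexicographic order.

    Hypothetical helper: it only mimics the "highest name wins" rule
    described above, not the actual merge logic.
    """
    return max(segment_names)


# Assumed timestamp-style names: the newer crawl sorts highest.
timestamped = ["20111101093000", "20111109162328"]

# Same two crawls, but the newer segment was renamed by hand.
renamed = ["20111101093000", "0000-most-recent"]

print(latest_segment(timestamped))
print(latest_segment(renamed))
```

With the timestamp-style names the newer segment is picked. After the
rename, "0000-most-recent" sorts below "20111101093000", so the older
segment would count as the "latest" one.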

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: SegmentMerger behavior

Posted by Marek Bachmann <m....@uni-kassel.de>.
Am 09.11.2011 16:27, schrieb Markus Jelsma:
> the most recent item
>
> On Wednesday 09 November 2011 16:23:28 Marek Bachmann wrote:
>> Hello all,
>>
>> when I have segments from two crawls, the first one from initial
>> crawling and the second on from recrawl, how will they be merged?
>>
>> I mean:
>>
>> *) When site A has changed between the crawl, what content will be in
>> the merged segment. The old one or the new one (or both)?
>>
>> Thanks :)
>

Thank you! :)

Re: SegmentMerger behavior

Posted by Markus Jelsma <ma...@openindex.io>.
the most recent item

On Wednesday 09 November 2011 16:23:28 Marek Bachmann wrote:
> Hello all,
> 
> when I have segments from two crawls, the first one from initial
> crawling and the second on from recrawl, how will they be merged?
> 
> I mean:
> 
> *) When site A has changed between the crawl, what content will be in
> the merged segment. The old one or the new one (or both)?
> 
> Thanks :)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350