Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/11/09 16:23:28 UTC
SegmentMerger behavior
Hello all,
when I have segments from two crawls, the first one from the initial
crawl and the second one from a recrawl, how will they be merged?
I mean:
*) When site A has changed between the crawls, what content will be in
the merged segment: the old one, the new one, or both?
Thanks :)
Re: SegmentMerger behavior
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 09.11.2011 18:03, Andrzej Bialecki wrote:
> Note: please consult the javadocs for SegmentMerger. Timestamps of some
> parts of segments are difficult to determine, so the "latest" means
> "coming from a segment with a name in highest lexicographic order".
>
> In practice, if your segments are named after a timestamp, everything
> should work fine. However, if you rename the latest segment to e.g.
> 0000-most-recent, the results will not be what you expect.
>
Thank you, Andrzej, for the advice! :) I won't rename them, since I need
the timestamp structure for finding the ongoing one in my crawl
scripts. So it should work for me.
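For readers with a similar setup, here is a rough sketch of how a crawl script might pick the newest segment by name. It is only an illustration, not taken from any actual script in this thread: the function name `newest_segment` is made up, and the 14-digit yyyyMMddHHmmss directory name pattern is an assumption about how the segments are named.

```python
import re
from pathlib import Path

# Assumption: segment directories are named with 14 digits (yyyyMMddHHmmss).
SEGMENT_NAME = re.compile(r"^\d{14}$")


def newest_segment(segments_dir):
    """Return the segment directory whose name sorts highest, or None.

    Hypothetical helper: it compares directory names as plain strings,
    which matches chronological order only for timestamp-style names.
    """
    candidates = [
        p for p in Path(segments_dir).iterdir()
        if p.is_dir() and SEGMENT_NAME.match(p.name)
    ]
    return max(candidates, key=lambda p: p.name, default=None)
```

Directories that do not match the pattern (for example a renamed segment) are simply ignored by this sketch, so a rename would hide that segment from the script as well.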
Re: SegmentMerger behavior
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 09/11/2011 16:30, Marek Bachmann wrote:
> Thank you! :)
>
Note: please consult the javadocs for SegmentMerger. Timestamps of some
parts of segments are difficult to determine, so the "latest" means
"coming from a segment with a name in highest lexicographic order".
In practice, if your segments are named after a timestamp, everything
should work fine. However, if you rename the latest segment to e.g.
0000-most-recent, the results will not be what you expect.
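To illustrate the naming rule, here is a minimal Python sketch. It is not Nutch code and does not call SegmentMerger; the function `pick_latest` and the sample segment names and contents are made up for illustration. It only shows how "highest lexicographic order" behaves for timestamp-style names compared with a renamed segment.

```python
def pick_latest(versions):
    """Return the content whose segment name sorts highest lexicographically.

    `versions` maps a segment name to the content fetched for one URL.
    This mimics the rule described above; it is not the SegmentMerger code.
    """
    latest_name = max(versions)  # plain string comparison, not a date comparison
    return versions[latest_name]


# Timestamp-style names: string order matches chronological order.
timestamped = {
    "20111101120000": "old content of site A",
    "20111109120000": "new content of site A",
}
print(pick_latest(timestamped))

# Hypothetical rename of the newest segment: "0000-most-recent" now sorts
# below the timestamped name, so the older segment wins.
renamed = {
    "20111101120000": "old content of site A",
    "0000-most-recent": "new content of site A",
}
print(pick_latest(renamed))
```

With the timestamped names the newer content is selected; after the rename the older content is selected, because "0" sorts before "2" in a string comparison.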
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: SegmentMerger behavior
Posted by Marek Bachmann <m....@uni-kassel.de>.
On 09.11.2011 16:27, Markus Jelsma wrote:
> the most recent item
>
Thank you! :)
Re: SegmentMerger behavior
Posted by Markus Jelsma <ma...@openindex.io>.
The most recent item.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350