Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/08/02 23:13:49 UTC

0.8 Recrawl script updated

Just letting everyone know that I updated the recrawl script on the
Wiki. It now merges the created segments, then deletes the old segments
to prevent unneeded data from accumulating on the hard drive.
  Matt

http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
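
A minimal sketch of the merge-then-delete step described above, written in
Python for illustration rather than copied from the Wiki script. The
helper names (existing_segments, merge_then_prune), the merged_segments
directory, and the dry_run switch are all hypothetical. The
"mergesegs <output_dir> -dir <segments_dir>" invocation is an assumption
about the Nutch 0.8 command line and should be checked against your
installation.

```python
from pathlib import Path
import shutil
import subprocess


def existing_segments(segments_dir: Path) -> list[Path]:
    """Return the segment directories, oldest first.

    Nutch names segments by timestamp (e.g. 20060802231349), so sorting
    by name is also a chronological sort.
    """
    return sorted(p for p in segments_dir.iterdir() if p.is_dir())


def merge_then_prune(crawl_dir: Path, nutch: str = "bin/nutch",
                     dry_run: bool = True) -> list[Path]:
    """Merge all segments into one, then delete the segments that were merged.

    Returns the list of old segments that are (or would be) deleted.
    """
    segments_dir = crawl_dir / "segments"
    merged_dir = crawl_dir / "merged_segments"  # hypothetical temp location
    to_delete = existing_segments(segments_dir)

    # Assumed Nutch 0.8 usage: mergesegs <output_dir> -dir <segments_dir>
    merge_cmd = [nutch, "mergesegs", str(merged_dir), "-dir", str(segments_dir)]

    if dry_run:
        print("would run:", " ".join(merge_cmd))
        for seg in to_delete:
            print("would delete:", seg)
        return to_delete

    # check=True raises if the merge fails, so nothing is deleted
    # unless the merged segment was actually written.
    subprocess.run(merge_cmd, check=True)

    for seg in to_delete:
        shutil.rmtree(seg)

    # Move the merged segment(s) back so the segments directory stays usable.
    for merged in merged_dir.iterdir():
        shutil.move(str(merged), str(segments_dir / merged.name))
    shutil.rmtree(merged_dir)
    return to_delete
```

Calling merge_then_prune(Path("crawl")) with the default dry_run=True only
prints what would happen. Deleting only after the merge command has
succeeded is the important design choice here: a failed merge then leaves
the original segments untouched.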

Re: 0.8 Recrawl script updated

Posted by Matthew Holt <mh...@redhat.com>.
Lukas,
   Thanks for your e-mail. I assumed I could drop the $depth oldest
segments because I first merge them all into one segment (which I don't
drop). Is that assumption incorrect, and can it cause problems in the
future? If so, I'll go back to the original version of my script, which
kept all the segments without merging. However, if that is the case, it
seems the large number of segments being kept will become a problem
after enough recrawls.

 Thanks,
  Matt

Lukas Vlcek wrote:
> Hi Matthew,
>
> I am curious about one thing. How do you know you can just drop the
> $depth oldest segments at the end? I haven't studied the Nutch code
> regarding this topic yet, but I thought that a segment can be dropped
> once you are sure that all its content has already been crawled in
> some newer segments (which should be checked via some function or
> script, which hasn't been implemented yet, to my knowledge).
>
> Also, I don't think this question has been discussed on the dev/user
> lists in detail yet, so I just wanted to ask for your opinion. The
> situation could get even more complicated if people add the -topN
> parameter to the script (which can happen because some might prefer
> crawling in ten smaller batches over two huge crawls for various
> technical reasons).
>
> Anyway, never mind if you don't want to bother with my silly question :-)
>
> Regards,
> Lukas

Re: 0.8 Recrawl script updated

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi Matthew,

I am curious about one thing. How do you know you can just drop the
$depth oldest segments at the end? I haven't studied the Nutch code
regarding this topic yet, but I thought that a segment can be dropped
once you are sure that all its content has already been crawled in
some newer segments (which should be checked via some function or
script, which hasn't been implemented yet, to my knowledge).

Also, I don't think this question has been discussed on the dev/user
lists in detail yet, so I just wanted to ask for your opinion. The
situation could get even more complicated if people add the -topN
parameter to the script (which can happen because some might prefer
crawling in ten smaller batches over two huge crawls for various
technical reasons).

Anyway, never mind if you don't want to bother with my silly question :-)

Regards,
Lukas
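
The check Lukas describes can be sketched as a set comparison. This is
only an illustration in Python under simplifying assumptions: each
segment is modelled as a plain set of fetched URLs, the function names
(uncovered_urls, safe_to_drop) and the example URLs are hypothetical, and
how the URL sets would be extracted from real Nutch segments is left out.

```python
def uncovered_urls(old_segment: set[str],
                   newer_segments: list[set[str]]) -> set[str]:
    """URLs fetched in old_segment that no newer segment has fetched again."""
    covered = set().union(*newer_segments)
    return old_segment - covered


def safe_to_drop(old_segment: set[str],
                 newer_segments: list[set[str]]) -> bool:
    """An old segment is safe to drop only if newer segments cover all of it."""
    return not uncovered_urls(old_segment, newer_segments)


# Hypothetical example data.
old = {"http://intranet/a", "http://intranet/b", "http://intranet/c"}

# A full recrawl refetches everything, so the old segment is redundant.
full_recrawl = [{"http://intranet/a", "http://intranet/b",
                 "http://intranet/c", "http://intranet/d"}]

# A recrawl limited by -topN may refetch only some of the old URLs.
limited_recrawl = [{"http://intranet/a", "http://intranet/b"}]

print(safe_to_drop(old, full_recrawl))
print(uncovered_urls(old, limited_recrawl))
```

With the limited recrawl, http://intranet/c exists only in the old
segment, so dropping that segment without merging it first would lose
the page. This is why a -topN limit makes the question harder.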

Re: 0.8 Recrawl script updated

Posted by Matthew Holt <mh...@redhat.com>.
Last email regarding this script. I found a sporadic bug in it (I think
it only affected some setups). Since it would cause problems some of the
time, I refactored the script. I'd suggest you redownload the script if
you are using it.

Matt

Re: 0.8 Recrawl script updated

Posted by Matthew Holt <mh...@redhat.com>.
I'm currently pretty busy at work. If I have time, I'll do it later.

A working version of the 0.8 recrawl script is online now. I
temporarily modified it on the website yesterday when I ran into some
problems, but I have tested it further and the working code is posted
now. So if you got it off the website at any time yesterday, I would
redownload the script.

Matt

Re: 0.8 Recrawl script updated

Posted by Lourival Júnior <ju...@gmail.com>.
Hi Matthew!

Could you update the script for version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work very
well...

Regards!

-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com