Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/04/28 13:43:54 UTC

Update schema to get solrdedup working again

Hi devs,

The Solr schema must be updated as well to get dedup to work in 1.3. This is
because in December last year index-basic seems to have been updated to write
properly formatted dates to Solr, but the schema field was still a long.

Somehow Solr accepted the input (this is a bug) but cannot cope with the
output, nor could Nutch convert the date to the internally used long (which it
now can). The remaining issue is to update the field to use date instead of
long. But this will certainly break existing Solr setups because of field
incompatibility.
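
As an illustration only (this is not the committed patch), the conversion
between the internally used long and a Solr-style date string could be
sketched in Java roughly as follows. The class and method names are
hypothetical, and the sketch assumes the long holds epoch milliseconds and
that the date is written as an ISO 8601 UTC timestamp.

```java
import java.time.Instant;

// Hypothetical helper, not part of Nutch: converts between epoch
// milliseconds and an ISO 8601 UTC timestamp string.
public class SolrDateSketch {

    // Format epoch milliseconds as an ISO 8601 UTC timestamp.
    static String toSolrDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Parse an ISO 8601 UTC timestamp back into epoch milliseconds.
    static long fromSolrDate(String solrDate) {
        return Instant.parse(solrDate).toEpochMilli();
    }

    public static void main(String[] args) {
        long timestamp = System.currentTimeMillis();
        String formatted = toSolrDate(timestamp);
        System.out.println(formatted);
        // The round trip should give back the original long.
        System.out.println(fromSolrDate(formatted) == timestamp);
    }
}
```

A field declared as a long cannot hold the string form, which is why the
schema field type has to change along with the indexer.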

I propose to update the field regardless of current Solr setups, on the
grounds that 1) an index can always be recreated from segments and 2) the
current indexer assumes the Solr bug remains in 3.1 and higher as well.

I haven't tested it with 3.1 but the bug is in 1.4.1 for sure.

Thoughts?

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Update schema to get solrdedup working again

Posted by Julien Nioche <li...@gmail.com>.
Resending to dev@nutch - had sent to Markus only


>
>> We still need to do
>> something about the moreindexing filter.
>>
>> https://issues.apache.org/jira/browse/NUTCH-985
>>
>
> For now a quick fix for the moreindexingfilter would be OK, but we can
> maybe create a new issue for 1.4 and rely on Date objects everywhere, then
> format them properly in the SOLRWriter. We could of course do the latter now,
> but since I have no time to do it in the short term and don't want to twist
> your arm, I'll let you decide.



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Update schema to get solrdedup working again

Posted by Markus Jelsma <ma...@openindex.io>.
Don't worry, the sun is shining! The change is committed. We still need to do 
something about the moreindexing filter.

https://issues.apache.org/jira/browse/NUTCH-985

On Thursday 05 May 2011 15:34:56 Julien Nioche wrote:
> Hi Markus,
> 
> Sorry for the late reply. Definitely +1 to change to Date in the schema, it
> is the right thing to do and it's also the right time to do it
> 
> Thanks
> 
> Julien
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Update schema to get solrdedup working again

Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,

Sorry for the late reply. Definitely +1 to change to Date in the schema, it
is the right thing to do and it's also the right time to do it

Thanks

Julien





-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com