Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/04/28 13:43:54 UTC
Update schema to get solrdedup working again
Hi devs,
The Solr schema must be updated as well to get dedup to work in 1.3. This is
because in December last year index-basic seems to have been updated to write
properly formatted dates to Solr, but the schema field was still a long.
Somehow Solr accepted the input (this is a bug) but cannot cope with the
output, nor could Nutch convert the date to the internally used long (which it
now can). The remaining issue is to update the field to use date instead of
long. But this will certainly break existing Solr setups because of field
incompatibility.
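
To illustrate the conversion mentioned above (a Solr date string back to the long that dedup works with), here is a minimal sketch. It is not the code that was committed to Nutch: the class and method names are hypothetical, and it assumes the field value arrives either as a legacy numeric string or as a UTC date string such as 2011-04-28T13:43:54Z. It uses java.time, which requires Java 8 or later.

```java
import java.time.Instant;

// Hypothetical helper, for illustration only; not part of Nutch.
final class SolrDateToLongSketch {

    /** Parses a UTC date string such as 2011-04-28T13:43:54Z into epoch milliseconds. */
    static long toEpochMillis(String solrDate) {
        // Instant.parse expects ISO-8601 in UTC and throws
        // DateTimeParseException on malformed input.
        return Instant.parse(solrDate).toEpochMilli();
    }

    /** Accepts either a legacy long value or a date string (assumed input shapes). */
    static long parseTimestamp(String value) {
        try {
            // Old indexes: the field still holds a plain long.
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            // New indexes: the field holds a formatted date.
            return toEpochMillis(value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("2011-04-28T13:43:54Z"));
        System.out.println(parseTimestamp("1304000000000"));
    }
}
```

Accepting both shapes is only useful during a transition; once the schema field is a date and the index has been rebuilt, the numeric branch should never be reached.
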
I propose to update the field regardless of current Solr setups, on the
assumption that 1) an index can always be recreated from segments and 2)
the current indexer assumes the Solr bug remains in 3.1 and higher as well.
I haven't tested it with 3.1, but the bug is in 1.4.1 for sure.
Thoughts?
Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Update schema to get solrdedup working again
Posted by Julien Nioche <li...@gmail.com>.
Resending to dev@nutch - had sent to markus only
> We still need to do
> something about the moreindexing filter.
>
> https://issues.apache.org/jira/browse/NUTCH-985

For now a quick fix for the moreindexingfilter would be OK, but we could
create a new issue for 1.4 to rely on Date objects everywhere and then
format them properly in the SOLRWriter. We could of course do the latter now,
but since I have no time to do it in the short term and don't want to twist
your arm, I'll let you decide.
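
As a rough sketch of that idea (carry Date objects through the indexing filters and format them only at the point of writing to Solr), something like the following could work. The class and method names are hypothetical and this is not the actual SOLRWriter code; it assumes Solr's UTC date format and uses java.time (Java 8 or later).

```java
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper, for illustration only; not the real SOLRWriter.
final class SolrDateFormatSketch {

    /** Formats a Date as a UTC ISO-8601 string, e.g. 1970-01-01T00:00:00Z. */
    static String toSolrDate(Date date) {
        // ISO_INSTANT always renders in UTC with a trailing Z.
        return DateTimeFormatter.ISO_INSTANT.format(date.toInstant());
    }

    /** Passes non-date values through unchanged; formats Date values. */
    static Object prepareFieldValue(Object value) {
        return (value instanceof Date) ? toSolrDate((Date) value) : value;
    }

    public static void main(String[] args) {
        System.out.println(prepareFieldValue(new Date(0L)));
        System.out.println(prepareFieldValue("some text"));
    }
}
```

Formatting in one place would mean individual filters no longer need to agree on a string representation.
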
--
*
*Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: Update schema to get solrdedup working again
Posted by Markus Jelsma <ma...@openindex.io>.
Don't worry, the sun is shining! The change is committed. We still need to do
something about the moreindexing filter.
https://issues.apache.org/jira/browse/NUTCH-985
On Thursday 05 May 2011 15:34:56 Julien Nioche wrote:
> Hi Markus,
>
> Sorry for the late reply. Definitely +1 to change to Date in the schema, it
> is the right thing to do and it's also the right time to do it
>
> Thanks
>
> Julien
Re: Update schema to get solrdedup working again
Posted by Julien Nioche <li...@gmail.com>.
Hi Markus,
Sorry for the late reply. Definitely +1 to changing the field to Date in the
schema; it is the right thing to do and it's also the right time to do it.
Thanks
Julien