Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/06 23:50:13 UTC

Understanding mapping of field characteristics to index structure

Hi,

Simple question but currently unclear to me...

I know whether a field, e.g. 'host', is going to be stored and/or
indexed, as all I need to do is look this up or define it within my
schema. However, what about tokenised? This seems (to me anyway) to be
shrouded in mystery :0|

Any thoughts? Thank you

Best
Lewis

-- 
Lewis

RE: Understanding mapping of field characteristics to index structure

Posted by Markus Jelsma <ma...@openindex.io>.
That stuff isn't being used, but it is shipped with the default Solr example schema of early 4.0. The default 4.0 schema has changed a lot since then.

I'd rather ship a small Nutch-specific config without all the default Solr fieldTypes that aren't being used anyway.
 
 
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Tue 07-Aug-2012 13:43
> To: Markus Jelsma <ma...@openindex.io>
> Cc: dev@nutch.apache.org
> Subject: Re: Understanding mapping of field characteristics to index structure
> 
> Hi Markus,
> Thanks for getting back on this one last night. Please see comments inline.
> 
> On Mon, Aug 6, 2012 at 11:12 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Hi,
> > Tokenization depends on whether an analyzer is used for the field ... should be boosted separately.
> >
> 
> Thanks for clarifying, all is now crystal clear.
> 
> > About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
> >
> 
> OK, so what about all of the additional config in the schema-solr4.xml
> file which resides above the actual field definitions? E.g. the
> tokenisation, etc. parts you discuss above.
> I also think it is too ambiguous (and slightly pointless) to maintain
> two schemas (unless of course someone can provide justification). I
> think (in all distributions moving forward) we should aim to simplify
> this and encapsulate all required field and configuration definitions
> in a single schema.xml...
> 
> Lewis
> 

Re: Understanding mapping of field characteristics to index structure

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,
Thanks for getting back on this one last night. Please see comments inline.

On Mon, Aug 6, 2012 at 11:12 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi,
> Tokenization depends on whether an analyzer is used for the field ... should be boosted separately.
>

Thanks for clarifying, all is now crystal clear.

> About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
>

OK, so what about all of the additional config in the schema-solr4.xml
file which resides above the actual field definitions? E.g. the
tokenisation, etc. parts you discuss above.
I also think it is too ambiguous (and slightly pointless) to maintain
two schemas (unless of course someone can provide justification). I
think (in all distributions moving forward) we should aim to simplify
this and encapsulate all required field and configuration definitions
in a single schema.xml...

Lewis

RE: Understanding mapping of field characteristics to index structure

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Tokenization depends on whether an analyzer is used for the field (non-primitive types), and how the value is tokenized depends on which tokenizer is defined. Tokenizing a hostname doesn't really make sense with the default available tokenizers, but you can use a KeywordTokenizer with a WordDelimiterFilter to split it into domains (TLD, SLD, etc). But having a TLD in the same field isn't very useful for boosting and query-time analysis of search words: people usually don't search for a TLD, and if they do it should be boosted separately.
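
As a rough illustration of the KeywordTokenizer plus WordDelimiterFilter idea, a field type along the following lines could be declared in schema.xml. This is only a sketch: the names "host_parts" (both the type and the field) are hypothetical and not part of any shipped Nutch schema, and the filter options would need tuning for a real index.

```xml
<!-- Hypothetical field type; the name "host_parts" is an assumption. -->
<fieldType name="host_parts" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Emit the whole hostname as one single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Split that token on delimiters such as "." and "-",
         keeping the original hostname as well. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above. -->
<field name="host_parts" type="host_parts" stored="false" indexed="true"/>
```

With a type like this, a value such as www.example.com should be indexed as its individual parts alongside the full hostname. The analysis screen in the Solr admin UI is the easiest way to check what a given chain actually emits.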

About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
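
For reference, the version field as it appears in the Solr 4 example schema looks roughly like the fragment below. Whether a "long" field type is already declared in the Nutch schema.xml is an assumption here; if it is not, the fieldType line would be needed too.

```xml
<!-- Field type for the version field; only needed if "long" is not
     already declared in the schema (assumption). -->
<fieldType name="long" class="solr.TrieLongField"
           precisionStep="0" positionIncrementGap="0"/>

<!-- Internal version field used by SolrCloud (and the update log). -->
<field name="_version_" type="long" indexed="true" stored="true"/>
```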

Cheers


 
 
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Tue 07-Aug-2012 00:03
> To: dev@nutch.apache.org
> Subject: Re: Understanding mapping of field characteristics to index structure
> 
> Mmmm...
> 
> I think I opened a small can of worms here regarding consistency
> between schema.xml and schema-solr4.xml.
> 
> There are discrepancies between some fields as to their structural
> characteristics. This is something which I think we should make
> consistent between schemas... no?
> 
> An example would be the content field (used in index-basic) which
> appears as stored and indexed in schema-solr4.xml but not stored in
> schema.xml
> 
> Lewis
> 
> On Mon, Aug 6, 2012 at 10:50 PM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
> > Hi,
> >
> > Simple question but currently unclear to me...
> >
> > I know whether a field, e.g. 'host', is going to be stored and/or
> > indexed, as all I need to do is look this up or define it within my
> > schema. However, what about tokenised? This seems (to me anyway) to be
> > shrouded in mystery :0|
> >
> > Any thoughts? Thank you
> >
> > Best
> > Lewis
> >
> > --
> > Lewis
> 
> 
> 
> -- 
> Lewis
> 

Re: Understanding mapping of field characteristics to index structure

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Mmmm...

I think I opened a small can of worms here regarding consistency
between schema.xml and schema-solr4.xml.

There are discrepancies between some fields as to their structural
characteristics. This is something which I think we should make
consistent between schemas... no?

An example would be the content field (used in index-basic) which
appears as stored and indexed in schema-solr4.xml but not stored in
schema.xml
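
To make the discrepancy concrete, the two definitions would differ roughly as sketched below. The type name "text" is an assumption used for illustration and may not match either file; only the stored/indexed difference is taken from the description above.

```xml
<!-- schema.xml: indexed for search, but the raw text is not stored. -->
<field name="content" type="text" stored="false" indexed="true"/>

<!-- schema-solr4.xml: the same field, indexed and also stored. -->
<field name="content" type="text" stored="true" indexed="true"/>
```

Storing the content makes it retrievable in search results (e.g. for highlighting) at the cost of a larger index, so whichever value is chosen should be applied to both files.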

Lewis

On Mon, Aug 6, 2012 at 10:50 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi,
>
> Simple question but currently unclear to me...
>
> I know whether a field, e.g. 'host', is going to be stored and/or
> indexed, as all I need to do is look this up or define it within my
> schema. However, what about tokenised? This seems (to me anyway) to be
> shrouded in mystery :0|
>
> Any thoughts? Thank you
>
> Best
> Lewis
>
> --
> Lewis



-- 
Lewis