Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/08/06 23:50:13 UTC
Understanding mapping of field characteristics to index structure
Hi,
Simple question but currently unclear to me...
I know if a field e.g. 'host' is going to be stored and/or indexed as
all I need to do is look this up or define it within my schema,
however what about tokenised? This seems (to me anyway) to be shrouded
in mystery :0|
Any thoughts? Thank you
Best
Lewis
--
Lewis
RE: Understanding mapping of field characteristics to index structure
Posted by Markus Jelsma <ma...@openindex.io>.
That stuff isn't being used; it ships with the default Solr example schema from early 4.0. The 4.0 default schema has changed a lot since then.
I'd rather ship a small Nutch-specific config without all the default Solr fieldTypes that aren't being used anyway.
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Tue 07-Aug-2012 13:43
> To: Markus Jelsma <ma...@openindex.io>
> Cc: dev@nutch.apache.org
> Subject: Re: Understanding mapping of field characteristics to index structure
>
> Hi Markus,
> Thanks for getting back on this one last night. Please see comments inline.
>
> On Mon, Aug 6, 2012 at 11:12 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > Hi,
> > Tokenization depends on whether an analyzer is used for the field ... should be boosted separately.
> >
>
> Thanks for clarifying, all is now crystal clear.
>
> > About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
> >
>
> OK so what about all of the additional config in the schema-solr4.xml
> file which resides above the actual field definitions? E.g. the
> tokenisation, etc. parts you discuss above.
> I also think it is too ambiguous (and slightly pointless) to maintain
> two schemas (unless of course someone can provide justification). I
> think (in all distributions moving forward) we should aim to simplify
> this and encapsulate all required field and configuration definitions
> in a single schema.xml...
>
> Lewis
>
Re: Understanding mapping of field characteristics to index structure
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,
Thanks for getting back on this one last night. Please see comments inline.
On Mon, Aug 6, 2012 at 11:12 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi,
> Tokenization depends on whether an analyzer is used for the field ... should be boosted separately.
>
Thanks for clarifying, all is now crystal clear.
> About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
>
OK so what about all of the additional config in the schema-solr4.xml
file which resides above the actual field definitions? E.g. the
tokenisation, etc. parts you discuss above.
I also think it is too ambiguous (and slightly pointless) to maintain
two schemas (unless of course someone can provide justification). I
think (in all distributions moving forward) we should aim to simplify
this and encapsulate all required field and configuration definitions
in a single schema.xml...
Lewis
RE: Understanding mapping of field characteristics to index structure
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Tokenization depends on whether an analyzer is used for the field (non-primitive types), and how the text is tokenized depends on which tokenizer is defined. Tokenizing a hostname doesn't really make sense with the default available tokenizers, but you can use a KeywordTokenizer with a WordDelimiterFilter to split it into domains (TLD, SLD, etc). But having a TLD in the same field isn't very useful for boosting and query-time analysis of search words: people usually don't search for a TLD, and if they do it should be boosted separately.
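As a rough illustration of the above, a field type along these lines would keep the hostname as a single token and then split it on the dots. This is only a sketch: the names `host_parts` and `host_tokenized` are made up for the example, and the filter attributes shown are one plausible choice rather than anything taken from the Nutch schemas.

```xml
<!-- Hypothetical field type, not part of the shipped Nutch schema -->
<fieldType name="host_parts" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- KeywordTokenizer emits the whole input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- WordDelimiterFilter then splits that token on non-alphanumeric characters such as '.' -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A field using it; the field name is hypothetical -->
<field name="host_tokenized" type="host_parts" stored="false" indexed="true"/>
```

With something like this, a value such as www.example.com should be indexed as the original string plus the parts www, example and com. A field of a primitive type such as solr.StrField has no analyzer, so it is indexed as a single untokenized value.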
About the Solr4 schema, it wasn't introduced as a Solr4-compatible version of the default schema.xml file and I think it should be removed in favour of updating the schema.xml to Solr4. The only change I can think of is adding the version field that is mandatory for SolrCloud. The schema version is 1.5, which the default schema already has.
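For reference, the change described here would look roughly like the fragment below. The `_version_` field name is the one Solr 4 uses; the `long` field type definition and the schema name are assumptions about what the schema already declares, so adjust them to whatever exists in your schema.xml.

```xml
<!-- Schema name is assumed; version 1.5 is the value mentioned above -->
<schema name="nutch" version="1.5">
  <types>
    <!-- Assumed to exist already; shown only so the fragment is self-contained -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
               positionIncrementGap="0"/>
  </types>
  <fields>
    <!-- Version field required by SolrCloud -->
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <!-- ... existing Nutch fields ... -->
  </fields>
</schema>
```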
Cheers
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Tue 07-Aug-2012 00:03
> To: dev@nutch.apache.org
> Subject: Re: Understanding mapping of field characteristics to index structure
>
> Mmmm...
>
> I think I opened a small can of worms here regarding consistency
> between schema.xml and schema-solr4.xml.
>
> There are discrepancies between some fields as to their structural
> characteristics. This is something which I think we should make
> consistent between schemas... no?
>
> An example would be the content field (used in index-basic) which
> appears as stored and indexed in schema-solr4.xml but not stored in
> schema.xml
>
> Lewis
>
> On Mon, Aug 6, 2012 at 10:50 PM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
> > Hi,
> >
> > Simple question but currently unclear to me...
> >
> > I know if a field e.g. 'host' is going to be stored and/or indexed as
> > all I need to do is look this up or define it within my schema,
> > however what about tokenised? This seems (to me anyway) to be shrouded
> > in mystery :0|
> >
> > Any thoughts? Thank you
> >
> > Best
> > Lewis
> >
> > --
> > Lewis
>
>
>
> --
> Lewis
>
Re: Understanding mapping of field characteristics to index structure
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Mmmm...
I think I opened a small can of worms here regarding consistency
between schema.xml and schema-solr4.xml.
There are discrepancies between some fields as to their structural
characteristics. This is something which I think we should make
consistent between schemas... no?
An example would be the content field (used in index-basic) which
appears as stored and indexed in schema-solr4.xml but not stored in
schema.xml
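To make the discrepancy concrete, the two definitions would differ roughly as follows. The stored and indexed values reflect the description above; the type name `text` is a placeholder, since the actual type names in the two files are not quoted in this thread.

```xml
<!-- schema-solr4.xml: content is both stored and indexed -->
<field name="content" type="text" stored="true" indexed="true"/>

<!-- schema.xml: content is indexed but not stored -->
<field name="content" type="text" stored="false" indexed="true"/>
```

Here stored controls whether the original value can be returned in search results, while indexed controls whether the field is searchable.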
Lewis
On Mon, Aug 6, 2012 at 10:50 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> Hi,
>
> Simple question but currently unclear to me...
>
> I know if a field e.g. 'host' is going to be stored and/or indexed as
> all I need to do is look this up or define it within my schema,
> however what about tokenised? This seems (to me anyway) to be shrouded
> in mystery :0|
>
> Any thoughts? Thank you
>
> Best
> Lewis
>
> --
> Lewis
--
Lewis