Posted to solr-user@lucene.apache.org by Dennis Gearon <ge...@sbcglobal.net> on 2010/12/21 05:03:36 UTC

Recap on derived objects in Solr Index, 'schema in a can'

Based on more searches and manual consolidation, I've put together a summary below of 
the ideas already suggested for this. The last item in the summary
seems to be an interesting, low-technical-cost way of doing it.

Basically, it treats the index like a 'BigTable', a la "No SQL".

Erick Erickson pointed out: 
"...but there's absolutely no requirement 
that all documents in SOLR have the same fields..."

I guess I don't have the right understanding of what goes into a Document
in Solr. Is it just a set of fields, each with its own independent field type
declaration/id, its name, and its content?

So even though there's a schema for an index, one could ignore it and
just throw any other named fields, types, and content at it at document-addition 
time?

So if I wanted to search on a base set of fields that all documents have, I could then
additionally filter based on the (this might be the wrong use of the term) dynamic fields?
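
For concreteness, the kind of request I have in mind would look roughly like the SolrJ
sketch below. Every field name in it is made up, and the 'division_s' filter is the part
I'd expect to come from a dynamic field.

    import org.apache.solr.client.solrj.SolrQuery;

    public class BasePlusExtendedQuery {
        public static void main(String[] args) {
            // Match on the base fields every document shares, then narrow by a
            // type-specific field that only some documents carry.
            SolrQuery q = new SolrQuery();
            q.setQuery("title:coffee AND city:portland");  // base fields (made up)
            q.addFilterQuery("division_s:espresso_carts"); // extended/dynamic field (made up)
            System.out.println(q);  // prints the encoded q/fq parameters
        }
    }

Documents that don't have the extended field at all simply never match the filter, so
the base set stays searchable across everything.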






Original thread that I started:
----------------------------------------
http://lucene.472066.n3.nabble.com/A-schema-inside-a-Solr-Schema-Schema-in-a-can-tt2103260.html

-----------------------------------------------------------------------------------------------------

Repeat of the problem (not actual ratios or numbers, i.e., it could be WORSE!):
-----------------------------------------------------------------------------------------------------


1/ Base object of some kind, x number of fields
2/ Derived objects representing divisions in a company, different customer bases, etc.,
      each having 2 additional, unique fields
3/ Assume 1000 such derived object types
4/ A 'flattened' index would have the x base-object fields,
    ****and 2000**** additional fields

 
================================================
Solutions Posited
-----------------------

A/ First thought: multi-value columns as key/value pairs.
      1/ It's difficult to access individual items longer than one 'word'
             for querying in multivalued fields.
      2/ All sorts of statistical stuff probably wouldn't apply?
      3/ (James Dayer said:) There's also one "gotcha" we've experienced when searching across
             multi-valued fields: SOLR will match across field occurrences.
             In the example below, if you were to search q=contrib_name:(james AND smith),
             you will get this record back. It matches one name from one contributor and
             another name from a different contributor. This is not what our users want.

             As a work-around, I am converting these to phrase queries with slop:
             "james smith"~50 ... Just use a slop # smaller than your positionIncrementGap
             and bigger than the # of terms entered. This will prevent the cross-field matches
             yet allow the words to occur in any order. (See the sketch after this list.)

             The problem with this approach is that Lucene doesn't support wildcards in phrases.
B/ Dynamic fields were suggested, but I am not sure exactly how they
        work, and the person who suggested them was not sure it would work, either.
C/ Different field naming conventions were suggested where field types were similar.
        I can't predict that.
D/ Found this old thread, and it had other suggestions:
       1/ Use multiple cores, one for each record type/schema, and aggregate them
           during the query.
       2/ Use a fixed number of additional fields x 2. Each additional field is actually
           a pair of fields: the first of the pair gives the column name, the second gives
           the data (see the sketch after this list).
            a) Although I like this, I wonder how many extra fields to use;
            b) it was pointed out that relevancy and other statistical criteria for queries
               might suffer.
       3/ Index the different objects exactly as they are, i.e. as Erick Erickson said:
           "I'm not entirely sure this is germane, but there's absolutely no requirement
           that all documents in SOLR have the same fields. So it's possible for you to
           index the "wildly different content" in "wildly different fields" <G>. Then
           searching for screen:LCD would be straightforward."...
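
To make two of those concrete, here is a rough SolrJ sketch of the phrase-slop workaround
from A/3 and the generic name/value pairs from D/2. Everything in it is hypothetical (the
field names, the document, the slop value), and it assumes the schema declares, or a
dynamicField pattern covers, each field used.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrInputDocument;

    public class SchemaInACanSketch {
        public static void main(String[] args) {
            // D/2: a fixed number of generic slots, each slot being a pair of
            // fields -- the first names the "column", the second carries the data.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "record-42");
            doc.addField("title", "base object fields go here");
            doc.addField("extra_name_1", "warranty_months");
            doc.addField("extra_value_1", "24");

            // A/3: a phrase query with slop instead of AND, so both terms must
            // fall inside one value of the multi-valued field. The slop must be
            // bigger than the number of terms and smaller than the field's
            // positionIncrementGap (100 in the stock example schema).
            SolrQuery byContributor = new SolrQuery();
            byContributor.setQuery("contrib_name:\"james smith\"~50");

            System.out.println(doc);
            System.out.println(byContributor);
        }
    }

The weakness of D/2 shows up at query time: you have to know which numbered slot a given
'column' landed in, which is part of why relevancy and faceting get awkward.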
Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.


Re: Recap on derived objects in Solr Index, 'schema in a can'

Posted by Dennis Gearon <ge...@sbcglobal.net>.
I think my partner and I are just going to have to play with both cores and 
dynamic fields.

If multiple cores are queried, and the schemas match up in order and position for 
the base fields, do the 'extra' fields in the different cores just show up in the 
result set under their field names? A query against the different cores, with 'base 
attributes' and 'extended attributes', has to be tailored for each core, right? 
I.e., so it doesn't ask for fields that don't exist?

(That could be handled by making the query a server-side language object, with 
inheritance for the extended fields.)
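
Something like the sketch below is what I mean by tailoring the query per core. The core
name, URL, and fields are all made up, and I'm assuming the SolrJ client class of this era
(CommonsHttpSolrServer; later releases renamed it).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class PerCoreQuery {
        public static void main(String[] args) throws Exception {
            // One core per derived type, all sharing the base fields.
            SolrServer divisionCore = new CommonsHttpSolrServer(
                    "http://localhost:8983/solr/division_core");

            SolrQuery q = new SolrQuery("city:portland");   // base fields only
            q.addFilterQuery("division_s:espresso_carts");  // field only this core has

            for (SolrDocument d : divisionCore.query(q).getResults()) {
                // Stored "extra" fields simply show up under their own names,
                // next to the base fields, for the documents that have them.
                System.out.println(d.getFieldNames() + " -> " + d.getFieldValue("id"));
            }
        }
    }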

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.




Re: Recap on derived objects in Solr Index, 'schema in a can'

Posted by Lance Norskog <go...@gmail.com>.
A dynamic field just means that the schema allows any field with a
name matching the wildcard. That's all.

There is no support for referring to all of the existing fields in the
wildcard. That is, there is no support for "*_en:word" as a field
search. Nor is there any kind of grouping for facets. The feature for
addressing a particular field in some of the parameters does not
support wildcards. If you add wildcard fields, you have to remember
what they are.
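
For example, assuming the schema declares something like
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>, the made-up fields
below are accepted at index time only because their names match the wildcard, and nothing
but your own code keeps track of which ones exist:

    import org.apache.solr.common.SolrInputDocument;

    public class DynamicFieldSketch {
        public static void main(String[] args) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "product-7");
            doc.addField("screen_s", "LCD");                    // matches *_s
            doc.addField("division_s", "consumer electronics"); // matches *_s
            // There is no "*_s:word" search and no wildcard in per-field
            // parameters; the application has to remember these names.
            System.out.println(doc);
        }
    }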


-- 
Lance Norskog
goksron@gmail.com

Re: Recap on derived objects in Solr Index, 'schema in a can'

Posted by Dennis Gearon <ge...@sbcglobal.net>.
I'm open to cores, if that's the faster way to do things (for indexing, querying, and 
keeping things mentally straight).

But from what you say below, the eventual goal of the site would mean either 100 extra 
'generic' fields or thousands to hundreds of thousands of cores.
Cores are probably easier to administer for security and give more accurate 
querying?

What is the relationship between dynamic fields and the schema?

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.




Re: Recap on derived objects in Solr Index, 'schema in a can'

Posted by Erick Erickson <er...@gmail.com>.
No, one cannot ignore the schema. If you try to add a field not in the schema you
get an error. One could, however, use any arbitrary subset of the fields defined in
the schema for any particular #document# in the index. Say your schema had fields
f1, f2, f3...f10. You could have fields f1-f5 in one doc, fields f6-f10 in another,
and f1, f4, f9 in another, and so on.

The only field(s) that #must# be in a document are the required="true" fields.

There's no real penalty for omitting fields from particular documents. This allows
you to store "special" documents that aren't part of normal searches.

You could, for instance, use a document to store meta-information about your index
that had whatever meaning you wanted, in a field(s) that *no* other document had.
Your app could then read that "special" document and make use of that info.
Searches on "normal" documents wouldn't return that doc, etc.

You could effectively have N indexes contained in one index, where a document in
each logical sub-index had fields disjoint from the other logical sub-indexes.
Why you'd do something like that rather than use cores is a very good question,
but you #could# do it that way...

All this is much different from a database, where there are penalties for defining
a large number of unused fields.

Whether doing this is wise or not, given the particular problem you're trying to
solve, is another discussion <G>..
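
As a rough SolrJ sketch of that (the field names are illustrative, and each one still
has to be declared in the schema or matched by a dynamicField pattern):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DisjointFieldSubsets {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Two logical sub-indexes in one physical index: the documents
            // share only the required id field, everything else is disjoint.
            SolrInputDocument laptop = new SolrInputDocument();
            laptop.addField("id", "laptop-1");
            laptop.addField("screen", "LCD");

            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", "book-1");
            book.addField("author", "Someone Else");

            // A "special" document carrying index-level metadata in a field
            // no normal document uses; ordinary searches never return it.
            SolrInputDocument meta = new SolrInputDocument();
            meta.addField("id", "index-metadata");
            meta.addField("index_note", "loaded 2010-12-22");

            solr.add(laptop);
            solr.add(book);
            solr.add(meta);
            solr.commit();
        }
    }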

Best
Erick
