Posted to solr-user@lucene.apache.org by Tim Terlegård <ti...@gmail.com> on 2010/12/21 14:13:17 UTC

Consequences for using multivalued on all fields

In our application we use dynamic fields and there can be about 50 of
them and there can be up to 100 million documents.

Are there any disadvantages having multivalued=true on all fields in
the schema? An admin of the application can specify dynamic fields and
if they should be indexed or stored. The question is whether we gain
anything by letting them choose multivalued as well, or whether it just
adds complexity to the user interface.

Thanks,
Tim

Michigan Information Retrieval Enthusiasts Group Quarterly Meetup - May 19th 2011

Posted by "Provalov, Ivan" <Iv...@cengage.com>.

Our next IR Meetup is at Cengage Learning on May 19, 2011. Please RSVP here:
http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group/events/17567795/

Presentations:
1. Bayesian Language Model
This talk presents a Bayesian language model, originally described by (Teh 2006), which uses a hierarchical Pitman-Yor process to describe the distribution of n-grams in an n-gram language model and which allows for a Bayesian back-off and smoothing strategy. The language model, which assumes a power-law prior over the n-gram space, compares favorably with language models based upon state-of-the-art empirical n-gram smoothing techniques. Because the background needed to understand the model is somewhat difficult and mostly does not appear in (Teh 2006), that material is also presented in some detail. In particular, background information related to the Dirichlet distribution and the Dirichlet process is given. The Dirichlet process is then related to the Pitman-Yor process, and the hierarchical Pitman-Yor process is also presented.

2. Using GATE for Word Polarity in Context Classification
GATE (General Architecture for Text Engineering) is open source software for creating text processing workflows. Core GATE includes tools for solving many text engineering issues: modeling and persistence of specialized data structures; measurement, evaluation, benchmarking; visualization and editing of annotations, ontologies, parse trees, etc.; extraction of training instances for machine learning; pluggable machine learning implementations. This tutorial will show how to use GATE for advanced machine learning applications. Detecting word polarity in context will be used as an example to show some of the GATE features. The tutorial project is based on the latest sentiment analysis research, specifically the work by Theresa Wilson, Janyce Wiebe, and Paul Hoffmann, "Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis", 2009. Using different features (words, part of speech, negations, etc.), an SVM classifier is trained and evaluated.

Thank you,

Ivan Provalov

Re: Consequences for using multivalued on all fields

Posted by Geert-Jan Brits <gb...@gmail.com>.

You should be aware that the behavior of sorting on a multi-valued field is
undefined. After all, which of the multiple values should be used for
sorting?
So if you need sorting on the field, you shouldn't make it multi-valued.
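
To make the ambiguity concrete, here is a minimal sketch in plain Python (not Solr code). The documents and the "price" field are made up for illustration; the point is only that the result order depends on which of the values you decide to compare.

```python
# Toy documents: "price" is multi-valued, so each document carries a list.
# The field name and values are hypothetical.
docs = [
    {"id": "a", "price": [5, 40]},
    {"id": "b", "price": [10, 20]},
]

def sort_ids(documents, pick):
    """Return document ids ordered by the value `pick` selects from each list."""
    return [d["id"] for d in sorted(documents, key=lambda d: pick(d["price"]))]

by_min = sort_ids(docs, min)  # compares 5 and 10
by_max = sort_ids(docs, max)  # compares 40 and 20
print(by_min, by_max)
```

The two orderings disagree, and nothing in the field definition says which one is wanted.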

Geert-Jan


Re: Consequences for using multivalued on all fields

Posted by "J.J. Larrea" <jj...@panix.com>.

Someone please correct me if I am wrong, but as far as I am aware index format is identical in either case.

One benefit of allowing one to specify a field as single-valued is similar to specifying that a field is required: Providing a safeguard that index data conforms to requirements.  So making all fields multivalued forgoes that integrity check for fields which by definition should be singular.

Also, depending on the response writer (and, for the XMLResponseWriter, on the requested response version; see http://wiki.apache.org/solr/XMLResponseFormat), the multi-valued setting can determine whether the document values returned from a query will be scalars (e.g. <str name="year">2010</str>) or arrays of scalars (<arr name="year"><str>2010</str></arr>), regardless of how many values are actually stored.
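
A client that has to cope with either shape can normalize both to a list. The following is a rough sketch using only Python's standard library; the helper name field_values and the tiny <doc> fragments are assumptions for illustration, not part of any Solr client.

```python
import xml.etree.ElementTree as ET

def field_values(doc_elem, name):
    """Return the values of field `name` as a list, whether the response
    carries it as a scalar element or as an <arr> of scalars."""
    for child in doc_elem:
        if child.get("name") != name:
            continue
        if child.tag == "arr":
            return [item.text for item in child]
        return [child.text]
    return []  # field absent from this document

# Hypothetical fragments mirroring the two shapes described above.
single = ET.fromstring('<doc><str name="year">2010</str></doc>')
multi = ET.fromstring('<doc><arr name="year"><str>2010</str></arr></doc>')

print(field_values(single, "year"), field_values(multi, "year"))
```

With a helper like this the calling code does not need to know how the field was declared in the schema.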

But the most significant gotcha of not specifying the actual arity (1 or N) arises if any of those fields is used for field faceting. By default, the field-faceting logic chooses a different algorithm depending on whether the field is multi-valued, and the default choice for multi-valued is only appropriate for a small set of enumerated values, since it creates a filter query for each value in the set. This can have a profound effect on Solr memory utilization. So if you are not relying on the field arity setting to select the algorithm, you or your users might need to specify it explicitly with the f.<field>.facet.method argument; see http://wiki.apache.org/solr/SolrFacetingOverview for more info.
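
As a sketch of what specifying it explicitly could look like from a client, the snippet below builds the query string with a per-field facet.method override. The helper name facet_params and the field name category_s are hypothetical; enum and fc are the facet.method values described on the faceting wiki page.

```python
from urllib.parse import urlencode

def facet_params(query, facet_fields, methods=None):
    """Build Solr query parameters for field faceting.

    `methods` maps a field name to an explicit facet.method value,
    emitted as f.<field>.facet.method.
    """
    params = [("q", query), ("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", field))
    for field, method in (methods or {}).items():
        params.append((f"f.{field}.facet.method", method))
    return urlencode(params)

# "category_s" is a made-up dynamic field name.
qs = facet_params("*:*", ["category_s"], {"category_s": "fc"})
print(qs)
```

Which method suits a given field depends on how many distinct values it holds, so it is worth measuring memory use rather than assuming.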

So while all-multivalued isn't a showstopper, if it were up to me I'd want to give users the option to specify arity and whether the field is required.

- J.J.


Re: Consequences for using multivalued on all fields

Posted by Erick Erickson <er...@gmail.com>.

PositionIncrementGap for multiValued fields is, perhaps, the most interesting
difference. One of the drivers here is, say, indexing across some boundary
that you don't want phrases or near clauses to match. For instance, say you
have text with sentences, and your requirement is that phrases don't match
across sentence boundaries. One way to handle that is to add successive
sentences to a multivalued field and define that field with a large
increment gap.

But otherwise, as far as I know, there's no difference worth mentioning
between indexing a bunch of stuff as one long string or breaking it up into
multiple segments in a multivalued field with the increment gap set to 1,
except for edge cases like the sorting thing Geert-Jan mentions.
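
Here is a toy model of the sentence-boundary idea in plain Python. It is not Lucene's actual position arithmetic, just a simplified illustration: positions advance by 1 inside a value and jump by an extra gap between values, and a two-word phrase matches only when the words sit at adjacent positions. The function names and the sample sentences are made up.

```python
def token_positions(values, gap):
    """Return (token, position) pairs for a multivalued field in a toy model:
    positions advance by 1 inside a value and by an extra `gap` between values."""
    out = []
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # jump over the boundary between two values
        for token in value.lower().split():
            pos += 1
            out.append((token, pos))
    return out

def phrase_matches(indexed, first, second):
    """True if `second` occurs at the position right after `first`."""
    where = set(indexed)
    return any((second, p + 1) in where for t, p in indexed if t == first)

sentences = ["the cat sat", "dogs bark loudly"]
one_string = token_positions([" ".join(sentences)], gap=100)  # one long value
separate = token_positions(sentences, gap=100)                # one value per sentence

print(phrase_matches(one_string, "sat", "dogs"))  # phrase spans the boundary
print(phrase_matches(separate, "sat", "dogs"))
```

In this model the phrase "sat dogs" matches when everything is indexed as one string but not when each sentence is its own value with a large gap, while phrases inside a sentence still match either way.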

Best
Erick


Re: Consequences for using multivalued on all fields

Posted by Dennis Gearon <ge...@sbcglobal.net>.

Thank you for the input. You might have seen my posts about doing a flexible
schema for derived objects. Sounds like dynamic fields might be the ticket.

We'll be ready to test the idea in about a month, maybe 3 weeks. I'll post a
comment about it when it gets there.

I don't know if I would gain anything, but I think that ALL booleans that were
NOT in the base object but were in the derived objects could be put into one
field as textually positioned key:value pairs, at least for search purposes.

Since the derived object would have its own additional methods, one of those
methods could be to 'unserialize' the 'boolean column'. In fact, that could be a
base object function: empty boolean column values just end up not populating
any extra base object attributes.
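
A rough sketch of that idea in Python, under assumptions the post does not spell out: the key:value token format, the function names, and the attribute names has_pool and is_rental are all invented for illustration.

```python
def serialize_flags(flags):
    """Pack boolean attributes into one space-separated string of key:value tokens."""
    return " ".join(
        f"{name}:{str(value).lower()}" for name, value in sorted(flags.items())
    )

def unserialize_flags(text):
    """Rebuild the dict of booleans; an empty field yields no extra attributes."""
    flags = {}
    for token in text.split():
        name, _, value = token.partition(":")
        flags[name] = (value == "true")
    return flags

# Hypothetical derived-object booleans.
packed = serialize_flags({"has_pool": True, "is_rental": False})
print(packed)
print(unserialize_flags(packed))
```

Each token could then be searched as a whole term, provided the field's analyzer keeps "name:value" together, which would need checking against the chosen field type.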

 Dennis Gearon



Re: Consequences for using multivalued on all fields

Posted by kenf_nc <ke...@realestate.com>.

I have about 30 million documents and with the exception of the Unique ID,
Type and a couple of date fields, every document is made of dynamic fields.
Now, I only have maybe 1 in 5 being multi-value, but search and facet
performance doesn't look appreciably different from a fixed schema solution.
I don't do some of the fancier things (highlighting, spell check, etc.), and I
use a lot more string or lowercase field types than I do Text (so not as
many fully tokenized fields), which probably helps with performance.

The only disadvantage I know of is dealing with field names at runtime.
Depending on your architecture, you don't really know what your document
looks like until you have it in a result set. For what I'm doing, that isn't
a problem.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Consequences-for-using-multivalued-on-all-fields-tp2125867p2126120.html
Sent from the Solr - User mailing list archive at Nabble.com.