Posted to dev@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2013/02/12 20:58:39 UTC

New Lucene features and Solr indexes

I'm interested in knowing which of the extremely new Lucene features I 
can use in Solr and what criteria I should use when deciding where to 
use them.

Some of these, like compressed stored fields and compressed termvectors, 
are being turned on by default, which is awesome.  I'm already running a 
4.2 snapshot, so I've got those in place.

One thing that I know I would like to do is use the new BloomFilter for 
a couple of my fields that contain only unique values.  Last time I 
checked (which was before the 4.1 release), if you added the 
lucene-codecs jar, Solr had a BloomFilter postings format, but didn't 
have any way to specify the underlying format.  See SOLR-3950 and 
LUCENE-4394.

Another new feature that is coming soon to Solr is DocValues - 
SOLR-3855.  Looking at the issue, I was not able to tell what situations 
would be appropriate for using the feature.  The patch includes notes in 
the example schema about using it on the popularity and manu_exact 
fields, but nothing about why those fields are good choices.  If you use 
docvalues, do you still have to store the field if you want it in 
results?  I think I remember reading something about it being able to 
replace stored fields.

These are the features I can think of at the moment.  There may be 
others, so feel free to fill in the blanks.

Thanks,
Shawn

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: New Lucene features and Solr indexes

Posted by Jack Krupansky <ja...@basetechnology.com>.

It seems as if you are using the text field analyzer to "clean up" or 
"normalize" the values for that field, but generally an analyzer is mapping 
from source terms to index terms, with the expectation that the index 
term(s) may be radically different from the source terms, and generally, 
tokenizing the input stream as well.

Maybe this is simply a question of best practices for using analyzers for 
"analysis" as opposed to the cleanup/normalization that an update processor 
would normally do. In other words, situations where the analyzer is used as 
a poor man's update processor for what otherwise would/should be simple 
string fields.
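
As a rough illustration of the update-processor approach described above, the sketch below trims the key at index time and keeps it in a plain string field instead of an analyzed TextField. The field name "id" and the chain name "normalize-id" are assumptions made for the example, and docValues="true" assumes a build that includes the SOLR-3855 patch. Case folding is not covered here; it would need a separate processor or client-side handling, because a string field applies no analysis to query input.

```xml
<!-- schema.xml (sketch): plain string type, no analyzer involved -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- "id" is an assumed field name; docValues assumes SOLR-3855 is applied -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" docValues="true"/>

<!-- solrconfig.xml (sketch): the chain name "normalize-id" is hypothetical -->
<updateRequestProcessorChain name="normalize-id">
  <!-- strips leading and trailing whitespace from the id value before indexing -->
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only takes effect if the update handler selects it, for example through the update.chain request parameter.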

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Saturday, February 16, 2013 11:43 AM
To: dev@lucene.apache.org
Subject: Re: New Lucene features and Solr indexes

Re: New Lucene features and Solr indexes

Posted by Shawn Heisey <so...@elyograg.org>.

2/14/2013 8:26 AM, Adrien Grand wrote:
>> This suggests that adding docvalues to the uniqueKey field would be a good
>> idea for distributed searching in general, since the first phase of a
>> distributed search only retrieves that field and score.  That assumes of
>> course that the docvalues are fully utilized for retrieving fields during
>> that initial phase.
>
> Right, this would likely improve performance given that doc values
> (even if disk-based) are more likely to be in memory than stored
> fields. Another (better?) approach would be to use the internal Lucene
> doc IDs for distributed search (I assumed there was an open JIRA issue
> to do that but I can't find it).

Related to this ... I have been watching SOLR-3855.  I notice that 
TextField is not listed on the supported types.  Is that likely to 
change in the future, or is there a fundamental issue there?

My uniqueKey field uses the following fieldType definition:

     <!-- lowercases the entire field value -->
     <fieldType name="lowercase" class="solr.TextField"
                sortMissingLast="true" positionIncrementGap="0" omitNorms="true">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.ICUFoldingFilterFactory"/>
         <filter class="solr.TrimFilterFactory"/>
       </analyzer>
     </fieldType>

I'm about 95% sure that the source value from MySQL will never contain 
lowercase characters and probably does not actually need to be trimmed, 
but we want to be able to search when an uppercase value is entered. 
Would I have to give up that capability to get docvalues on this field?
Does the current SOLR-3855 patch take advantage of docvalues for the
first phase of a distributed search when they are present, as we
discussed earlier?

Thanks,
Shawn

Re: New Lucene features and Solr indexes

Posted by Adrien Grand <jp...@gmail.com>.

On Wed, Feb 13, 2013 at 4:18 PM, Shawn Heisey <so...@elyograg.org> wrote:
> On 2/13/2013 2:42 AM, Adrien Grand wrote:
>> Doc values are like FieldCache except that you don't need to uninvert
>> values from the inverted index whenever you open a new Reader. I think
>> there are two reasons why you would like to turn doc values on:
>
>
> Confession -- that's almost gibberish to me!  At my current level of
> understanding, the pieces make some semblance of sense, but the whole thing
> falls apart before my head grasps it.  My fault, not yours. :)

What it means is that doc values achieve the same goal as the field
cache (the ability to quickly access the value of a given field for
any document) except that the hard work is done at indexing time
rather than whenever a new IndexReader is opened.  This is generally a
better trade-off and I think the field cache is eventually going to be
deprecated or even removed (5.0 maybe?).

> This suggests that adding docvalues to the uniqueKey field would be a good
> idea for distributed searching in general, since the first phase of a
> distributed search only retrieves that field and score.  That assumes of
> course that the docvalues are fully utilized for retrieving fields during
> that initial phase.

Right, this would likely improve performance given that doc values
(even if disk-based) are more likely to be in memory than stored
fields. Another (better?) approach would be to use the internal Lucene
doc IDs for distributed search (I assumed there was an open JIRA issue
to do that but I can't find it).

-- 
Adrien

Re: New Lucene features and Solr indexes

Posted by Shawn Heisey <so...@elyograg.org>.

On 2/13/2013 2:42 AM, Adrien Grand wrote:
> Doc values are like FieldCache except that you don't need to uninvert
> values from the inverted index whenever you open a new Reader. I think
> there are two reasons why you would like to turn doc values on:

Confession -- that's almost gibberish to me!  At my current level of 
understanding, the pieces make some semblance of sense, but the whole 
thing falls apart before my head grasps it.  My fault, not yours. :)

>   - if you are indexing a field only for faceting, sorting or grouping
> (not searching), setting indexed=false and docValues=true will provide
> the same functionality and be lighter, both at indexing time (no need
> to invert the field) and when opening a new IndexReader (no need to
> uninvert the field),

I have some fields that mostly get used for sorting.  The most common 
field used for sorting is a seconds-since-epoch timestamp simply stored 
as a long (source is MySQL bigint).  We have another copy of it in tdate 
format that we use for date range searches.  I'll need to ask whether 
they are using it for searching or filtering before I make the long 
version indexed=false.
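
A minimal sketch of how the two timestamp copies described above might be declared once the sort-only one no longer needs to be searchable. The field names post_epoch and post_date are invented for illustration, and the type names long and tdate are assumed to match the stock example schema. Depending on the Solr version, a single-valued docValues field may also need to be required or carry a default value, so the exact rules should be checked against SOLR-3855.

```xml
<!-- sort-only copy of the timestamp: not inverted, so not searchable;
     doc values are written at index time instead of uninverting on reader open -->
<field name="post_epoch" type="long" indexed="false" stored="false" docValues="true"/>

<!-- searchable copy, used for date range queries and filters -->
<field name="post_date" type="tdate" indexed="true" stored="true"/>
```

If anything still searches or filters on the long version, it would have to stay indexed="true"; adding docValues="true" alongside that remains an option.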

>   - if the field is also used for searching, turning doc values on will
> give your Lucene index a little more work at indexing time (not a big
> deal in my opinion) but it will be faster to open (especially
> interesting if you're doing near-realtime search) and likely more
> memory-efficient.

I have a lot more index headroom thanks to stored/termvector 
compression.  My indexes fit entirely in available RAM now!  Even before 
the upgrade, not all of the index data was being cached, so I still had 
free RAM, so I have plenty of room for index growth.  I just have to 
convince them to start using the upgraded index copy so I can upgrade 
the other one.

> However doc values are useless for searching, so there is no need to
> turn them on on a field which is used solely for searching.
>
> Similarly to stored fields, doc values could help you retrieve the
> value of a field, but the trade-off is very different: stored fields
> are better at retrieving many fields of a single document efficiently
> while doc values are good at retrieving one field for a lot of
> documents efficiently. So if you want to get a field's value in the
> response, you should keep setting stored=true. There might be
> optimizations in the future for example if you're only asking for a
> single field which has doc values, but this will be transparent to
> you.

This suggests that adding docvalues to the uniqueKey field would be a 
good idea for distributed searching in general, since the first phase of 
a distributed search only retrieves that field and score.  That assumes 
of course that the docvalues are fully utilized for retrieving fields 
during that initial phase.

Generally when we search, we retrieve all stored fields, so I will keep 
those around.  We already don't store every field, and advances we've 
made on the client side will probably allow me to stop storing more of 
them, further reducing our index size.

Thanks,
Shawn

Re: New Lucene features and Solr indexes

Posted by mark harwood <ma...@yahoo.co.uk>.

>>Instead of making other APIs to accommodate BloomFilter's current
>>brokenness: remove its custom per-field logic so it works with
>>PerFieldPostingsFormat, like every other PF.

I've not looked at it in a while, but I'm pretty certain that, like every other PF, you can go ahead and use PerFieldPF with the Bloom filter just fine.

What was broken was (is?) that in this configuration PFPF isn't smart enough to avoid creating twice as many files as is required - see LUCENE-4093.
Until that is resolved (and I have noted my pessimism about that being fixed easily), BloomPF contains an optimisation for those that want to avoid this inefficiency.
The use of that optimisation is entirely optional for users.
Internally to BloomPF, the implementation of that optimisation is trivial - if a null bloom set is returned for a given field it ignores the usual bloom filtering logic and delegates directly to the wrapped codec.
You can choose to implement a BloomFilterFactory that adds this field-choice optimisation or, more simply, run the default PerFieldPF-managed configuration and live with the increased number of files.

Arguably, the inefficiencies of the PerFieldPF framework are the real issue to be addressed here.

>>I brought this up before it was committed, and I was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for moving BloomPF forward: http://goo.gl/mxtP9
Those options were:
1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (LUCENE-4093, but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected.
I don't think there's been any movement on 2) so I guess you're still happy with option 1)? I recall you didn't think the business of extra files was that much of a concern: http://goo.gl/eJWo3

(Incidentally, probably best following up on the relevant Jiras rather than here)

Cheers
Mark

________________________________
 From: Robert Muir <rc...@gmail.com>
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 13:01
Subject: Re: New Lucene features and Solr indexes

Re: New Lucene features and Solr indexes

Posted by mark harwood <ma...@yahoo.co.uk>.

>>should be a stupid simple postings format like any other postings format with a default configuration

It does have a default config. It just needs a PF delegate in the constructor just like Pulsing....
Like Rob said:
>>In other words, it should work just like pulsing.


So far so good.

Now where people are getting upset (for no particularly good reason in my view) around per-field stuff: if you really, really want to you can supply a subclass of BloomFilterFactory to your BloomPF constructor which allows customised control over choice of hashing algo, bitset sizing and saturation policies if the DefaultBloomFilterFactory fails to make the right choices. 99.99999% of people will not do this. The reason it is a factory object and not some dumb settings is that it is called on a per-segment basis with state info that is useful context in making sizing choices.

Now, (horror of horrors), the factory's API is passed a FieldInfo object in the method designed to produce a bitset. It is conceivable that some rogue agents could choose to implement some per-field decisions here if the same BloomPF instance was registered to handle >1 field. In addition, BloomPF has some common-sense defensive coding that checks if the factory returns null for the bitset - in which case it delegates all calls un-bloomed directly to the delegate codec.

None of this prevents the use of BloomPF with the prescribed PerFieldPF manner for handling field-specific choices.

I happen to use a custom BloomFilterFactory to implement a more efficient indexing pipeline than the prescribed PerFieldPF route of implementing all per-field policies "up high" in the stack -  but none of that is at the cost of a clean BloomPF API or with any unnecessary duplication of PerFieldPF logic. 

If anything needs changing here there may be a case for providing a convenience class that weds BloomPF and a default choice of Lucene40 codec so it can help with whatever Solr and other config-driven engines may need, i.e. zero-arg constructors if that's how their registry of codecs works.

Cheers
Mark

________________________________
 From: Uwe Schindler <uw...@thetaphi.de>
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 16:47
Subject: RE: New Lucene features and Solr indexes
 
RE: New Lucene features and Solr indexes

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Shawn,

I also argued about this at the time it was committed. I fully agree with Robert: the current API is not in good shape!
I have the same feeling: Bloom Postings should be a stupid simple postings format like any other postings format with a default configuration. If you really want to change its configuration, you can subclass it as a separate postings format.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Shawn Heisey [mailto:solr@elyograg.org]
> Sent: Wednesday, February 13, 2013 3:59 PM
> To: dev@lucene.apache.org
> Subject: Re: New Lucene features and Solr indexes

Re: New Lucene features and Solr indexes

Posted by Shawn Heisey <so...@elyograg.org>.

>> BloomFilterPostingsFormat is a little special compared to other
>> postings formats because it can wrap any postings format. So maybe it
>> should require special support, like an additional attribute in the
>> field type definition?
>
> -1
>
> Instead of making other APIs to accommodate BloomFilter's current
> brokenness: remove its custom per-field logic so it works with
> PerFieldPostingsFormat, like every other PF.
>
> In other words, it should work just like pulsing.
>
> I brought this up before it was committed, and I was ignored. That's
> fine, but I'll be damned if i let its incorrect design complicate
> other parts of the codebase too. I'd rather it continue to stay
> difficult to integrate and continue walking its current path to an
> open source death instead.

Robert,

I have to send you a general thank you for your dedication to the 
quality of this project, and for your amazing ability to seemingly keep 
the entire design for Lucene in your head at all times.

I'm not sure what exactly you want to die here, or what you think would 
be the best option for me, the Solr end-user.  Is BloomFilter something 
that's not worth pursuing, or would you just like it to be integrated in 
a different way?

Thanks,
Shawn


Re: New Lucene features and Solr indexes

Posted by Yonik Seeley <yo...@lucidworks.com>.

On Wed, Feb 13, 2013 at 8:01 AM, Robert Muir <rc...@gmail.com> wrote:
> On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand <jp...@gmail.com> wrote:
>> Hi Shawn,
>>
>> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey <so...@elyograg.org> wrote:
>>> Some of these, like compressed stored fields and compressed termvectors, are
>>> being turned on by default, which is awesome.  I'm already running a 4.2
>>> snapshot, so I've got those in place.
>>
>> Excellent!
>>
>>> One thing that I know I would like to do is use the new BloomFilter for a
>>> couple of my fields that contain only unique values.  Last time I checked
>>> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
>>> had a BloomFilter postings format, but didn't have any way to specify the
>>> underlying format.  See SOLR-3950 and LUCENE-4394.
>>
>> BloomFilterPostingsFormat is a little special compared to other
>> postings formats because it can wrap any postings format. So maybe it
>> should require special support, like an additional attribute in the
>> field type definition?
>
> -1
>
> Instead of making other APIs to accommodate BloomFilter's current
> brokenness: remove its custom per-field logic so it works with
> PerFieldPostingsFormat, like every other PF.
>
> In other words, it should work just like pulsing.
>
> I brought this up before it was committed, and I was ignored. That's
> fine, but I'll be damned if i let its incorrect design complicate
> other parts of the codebase too. I'd rather it continue to stay
> difficult to integrate and continue walking its current path to an
> open source death instead.

Would your desired changes in bloom postings format change the high
level interface in Solr (i.e. specifying bloom on a field or fieldType
in the schema)?
If not, any currently needed work-around seems more like an
implementation detail.

-Yonik
http://lucidworks.com

Re: New Lucene features and Solr indexes

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand <jp...@gmail.com> wrote:
> Hi Shawn,
>
> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey <so...@elyograg.org> wrote:
>> Some of these, like compressed stored fields and compressed termvectors, are
>> being turned on by default, which is awesome.  I'm already running a 4.2
>> snapshot, so I've got those in place.
>
> Excellent!
>
>> One thing that I know I would like to do is use the new BloomFilter for a
>> couple of my fields that contain only unique values.  Last time I checked
>> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
>> had a BloomFilter postings format, but didn't have any way to specify the
>> underlying format.  See SOLR-3950 and LUCENE-4394.
>
> BloomFilterPostingsFormat is a little special compared to other
> postings formats because it can wrap any postings format. So maybe it
> should require special support, like an additional attribute in the
> field type definition?

-1

Instead of making other APIs to accommodate BloomFilter's current
brokenness: remove its custom per-field logic so it works with
PerFieldPostingsFormat, like every other PF.

In other words, it should work just like pulsing.

I brought this up before it was committed, and I was ignored. That's
fine, but I'll be damned if i let its incorrect design complicate
other parts of the codebase too. I'd rather it continue to stay
difficult to integrate and continue walking its current path to an
open source death instead.
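
To make "work just like pulsing" concrete from the Solr side, here is a hedged sketch of how a wrapper-style postings format is picked purely by name in the schema, with no extra attribute for the delegate. It assumes the lucene-codecs jar is on the classpath and that the Pulsing41 name matches the Lucene version in use. The second field type is purely hypothetical: "ExampleBloom" is an invented name standing in for a Bloom format that ships with a fixed delegate, and nothing in this thread confirms such a format exists.

```xml
<!-- solrconfig.xml (sketch): lets schema.xml choose formats per field type -->
<codecFactory class="solr.SchemaCodecFactory"/>

<!-- schema.xml (sketch): a wrapping format selected by its registered name only -->
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41"/>

<!-- HYPOTHETICAL: "ExampleBloom" is an invented format name used only to show
     what the same one-attribute configuration would look like for Bloom -->
<fieldType name="string_bloom" class="solr.StrField" postingsFormat="ExampleBloom"/>
```

Under that design the schema syntax would stay the same whichever delegate the Bloom format wraps, so no Bloom-specific attribute would be needed on the field type.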

Re: New Lucene features and Solr indexes

Posted by Adrien Grand <jp...@gmail.com>.

Hi Shawn,

On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey <so...@elyograg.org> wrote:
> Some of these, like compressed stored fields and compressed termvectors, are
> being turned on by default, which is awesome.  I'm already running a 4.2
> snapshot, so I've got those in place.

Excellent!

> One thing that I know I would like to do is use the new BloomFilter for a
> couple of my fields that contain only unique values.  Last time I checked
> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
> had a BloomFilter postings format, but didn't have any way to specify the
> underlying format.  See SOLR-3950 and LUCENE-4394.

BloomFilterPostingsFormat is a little special compared to other
postings formats because it can wrap any postings format. So maybe it
should require special support, like an additional attribute in the
field type definition?

> Another new feature that is coming soon to Solr is DocValues - SOLR-3855.
> Looking at the issue, I was not able to tell what situations would be
> appropriate for using the feature.

Doc values are like FieldCache except that you don't need to uninvert
values from the inverted index whenever you open a new Reader. I think
there are two reasons why you would like to turn doc values on:
 - if you are indexing a field only for faceting, sorting or grouping
(not searching), setting indexed=false and docValues=true will provide
the same functionality and be lighter, both at indexing time (no need
to invert the field) and when opening a new IndexReader (no need to
uninvert the field),
 - if the field is also used for searching, turning doc values on will
give your Lucene index a little more work at indexing time (not a big
deal in my opinion) but it will be faster to open (especially
interesting if you're doing near-realtime search) and likely more
memory-efficient.

However doc values are useless for searching, so there is no need to
turn them on on a field which is used solely for searching.

Similarly to stored fields, doc values could help you retrieve the
value of a field, but the trade-off is very different: stored fields
are better at retrieving many fields of a single document efficiently
while doc values are good at retrieving one field for a lot of
documents efficiently. So if you want to get a field's value in the
response, you should keep setting stored=true. There might be
optimizations in the future for example if you're only asking for a
single field which has doc values, but this will be transparent to
you.
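
The guidance above could translate into schema entries along these lines. This is only a sketch: every field name is invented, the type names string and text_general are assumed to exist as in the stock example schema, and the docValues attribute assumes the SOLR-3855 patch.

```xml
<!-- faceting, sorting or grouping only: skip the inverted index, keep doc values -->
<field name="category_facet" type="string" indexed="false" stored="false" docValues="true"/>

<!-- searched and also sorted or faceted: both structures are built at index time -->
<field name="brand" type="string" indexed="true" stored="false" docValues="true"/>

<!-- searched only: doc values would add nothing here -->
<field name="body" type="text_general" indexed="true" stored="false"/>

<!-- returned in results: keep stored="true", whether or not doc values are on -->
<field name="title_display" type="string" indexed="false" stored="true"/>
```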

--
Adrien
