Posted to dev@lucene.apache.org by Eric Pugh <ep...@opensourceconnections.com> on 2010/08/24 18:03:17 UTC

Should analysis.jsp honor maxFieldLength

Hi all,

I have maxFieldLength set to 10000 in solrconfig.xml, but was playing around with a really large document (the King James Bible) in analysis.jsp. I hacked analysis.jsp to show me the number of terms at each filter, along with the headers, without turning everything on by checking the verbose box.

My results, shown in this screenshot: http://img.skitch.com/20100824-t36rq45i2wfimwyd53gwiqebdy.png, seem to confirm that maxFieldLength is NOT honored by analysis.jsp.

But it seems to me that folks using analysis.jsp would expect the process to match exactly what happens when a document is indexed. In my specific case, it took me a while to realize that my indexing results differed from the analysis.jsp results because indexing only looked at the first 10000 tokens, while analysis looked at all 101561. A horizontal table of 10,000 cells kind of looks like a horizontal table of 101,561 cells!

Would it make sense to parse the text through the DocInverterPerField in analysis.jsp?  Or to maybe just modify the getTokens method in analysis.jsp to only parse maxFieldLength tokens?  I think I can do it via looking up the SolrCore, and doing core.getSolrConfig().mainIndexConfig.maxFieldLength
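
A minimal sketch of the second option (stop reading once maxFieldLength tokens have been collected), written as plain Java rather than against the real analysis.jsp or Lucene classes. The class name TokenLimitSketch, the limit method, and the Iterator<String> standing in for a TokenStream are all assumptions for illustration; the actual getTokens method works on Lucene token attributes, and the limit would still have to be looked up from the core's config.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model only: tokens are plain strings, not Lucene TokenStream attributes.
final class TokenLimitSketch {

    /** Collects at most maxTokens tokens from the stream, then stops reading. */
    static List<String> limit(Iterator<String> tokens, int maxTokens) {
        List<String> kept = new ArrayList<>();
        while (kept.size() < maxTokens && tokens.hasNext()) {
            kept.add(tokens.next());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("in", "the", "beginning", "god", "created");
        // With a limit of 3, only the first three tokens are kept.
        System.out.println(limit(text.iterator(), 3)); // prints [in, the, beginning]
    }
}
```

The point of the sketch is only that the cap is applied while reading, so the rest of the stream is never consumed or displayed.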


Eric

-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Should analysis.jsp honor maxFieldLength

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Aug 26, 2010 at 6:06 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> we could conceivably support having LimitTokenCountFilter added implicitly
> even if that option isn't used, via some syntax like you suggest -- but
> honestly i think it's better to just let the user declare it like any
> other filter -- that way it doesn't have any special semantics, and they
> can put it at the end (limit tokens) or at the beginning (limit tokens from
> the tokenizer, but allow more tokens to be added by synonyms, WDF, etc...)
>
>
Honestly I think limiting the number of tokens like this makes way more
sense being just an analysis thing, rather than in IndexWriter.

I think we should consider, say, deprecating the IndexWriter option in 3.x
and removing it in 4.0.

I see Mike asked about this on
https://issues.apache.org/jira/browse/LUCENE-2295 but I think his question
got "lost"

-- 
Robert Muir
rcmuir@gmail.com

Re: Should analysis.jsp honor maxFieldLength

Posted by Chris Hostetter <ho...@fucit.org>.
This can be dealt with in a lot of different ways in Solr -- even if
Lucene removes all support for IndexWriter.maxFieldLength, Solr can
still support it by wrapping every analyzer with a LimitTokenCountFilter
if that config option is used.

we could conceivably support having LimitTokenCountFilter added implicitly
even if that option isn't used, via some syntax like you suggest -- but
honestly i think it's better to just let the user declare it like any
other filter -- that way it doesn't have any special semantics, and they
can put it at the end (limit tokens) or at the beginning (limit tokens from
the tokenizer, but allow more tokens to be added by synonyms, WDF, etc...)
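
To make the placement difference concrete, here is a toy model in plain Java. It is not the real Lucene filter chain: strings stand in for tokens, and the expand method is a hypothetical stand-in for any filter that adds tokens (synonyms, WDF, and so on). The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: strings stand in for tokens, and expand() stands in for
// a filter that adds tokens (synonyms, WordDelimiterFilter, ...).
final class LimitPlacementSketch {

    /** Keeps the first max tokens (or all of them if there are fewer). */
    static List<String> limit(List<String> tokens, int max) {
        int end = Math.max(0, Math.min(max, tokens.size()));
        return new ArrayList<>(tokens.subList(0, end));
    }

    /** Hypothetical expansion: emits one extra token after every "tv". */
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            if (t.equals("tv")) {
                out.add("television");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("tv", "guide", "tv", "listings");
        // Limit at the end: the cap applies to everything the chain emits.
        System.out.println(limit(expand(text), 3)); // prints [tv, television, guide]
        // Limit at the beginning: the cap applies to tokenizer output only,
        // so later filters can still add tokens beyond the cap.
        System.out.println(expand(limit(text, 3))); // prints [tv, television, guide, tv, television]
    }
}
```

With the same cap of 3, the first ordering yields three tokens and the second yields five, which is the trade-off between the two placements.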

: What about an option to override this on a per field-type and/or per 
: field basis. Then the global setting could still be default:


-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!



Re: Should analysis.jsp honor maxFieldLength

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
What about an option to override this on a per-field-type and/or per-field basis? Then the global setting could still be the default:

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100" maxLength="100000">
OR
   <field name="teaser" type="text" indexed="true" stored="true" maxLength="100000"/>
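
One way such an override could resolve, sketched in plain Java. Nothing here is existing Solr API: the maxLength attribute is only the proposal above, and the precedence order (field first, then field type, then the global setting) is an assumption of this sketch, as are the class and method names.

```java
// Hypothetical precedence for the proposed maxLength attribute:
// per-field value, else field-type value, else the global maxFieldLength.
final class MaxLengthResolver {

    /** A null argument means "not set at this level". */
    static int effectiveMaxLength(Integer fieldMax, Integer fieldTypeMax, int globalMax) {
        if (fieldMax != null) {
            return fieldMax;
        }
        if (fieldTypeMax != null) {
            return fieldTypeMax;
        }
        return globalMax;
    }

    public static void main(String[] args) {
        // The "teaser" field sets its own limit; other fields fall back to the global one.
        System.out.println(effectiveMaxLength(100000, null, 10000)); // prints 100000
        System.out.println(effectiveMaxLength(null, null, 10000));   // prints 10000
    }
}
```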

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 24. aug. 2010, at 20.56, Eric Pugh wrote:

> I did always think that the global maxFieldLength was odd.  In one project I have, 10,000 is fine except for 1 field that I would like to bump up to 100,000, and there isn't (as far as I know) a way to do that.  Is there any real negative effect to swapping to maxFieldLength of 100,000 (with the caveat that the auto truncation won't be working!)?   
> 
> The filter approach that you pointed out does make sense; the only worry I have is that it might make building analyzers more complex.  One of the things I treasure about Solr is how many decisions it makes for you out of the box that are right so very often, and therefore how simple it is.  If every user needs to think about maxFieldLength from day one, then that might make life more complex.
> 
> Eric
> 
> On Aug 24, 2010, at 2:44 PM, Robert Muir wrote:
> 
>> 
>> 
>> On Tue, Aug 24, 2010 at 2:29 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:
>> I created a patch file at https://issues.apache.org/jira/browse/SOLR-2086.  I went with the simplest approach since I didn't want to confuse things by having extra filters being added to what the user created.  However, either approach would work!
>> 
>> 
>> 
>> One idea here was that this maxFieldLength might be going away: see https://issues.apache.org/jira/browse/LUCENE-2295 for more information (though I notice it's still not listed as deprecated?).
>> 
>> But for now it's worth mentioning: the filter is more flexible; for example, it supports per-field configuration (and of course if you use the filter instead, which you can do now, it will automatically work in analysis.jsp).
>> 
>>  
>> -- 
>> Robert Muir
>> rcmuir@gmail.com
> 




Re: Should analysis.jsp honor maxFieldLength

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I did always think that the global maxFieldLength was odd.  In one project I have, 10,000 is fine except for 1 field that I would like to bump up to 100,000, and there isn't (as far as I know) a way to do that.  Is there any real negative effect to swapping to maxFieldLength of 100,000 (with the caveat that the auto truncation won't be working!)?   

The filter approach that you pointed out does make sense; the only worry I have is that it might make building analyzers more complex.  One of the things I treasure about Solr is how many decisions it makes for you out of the box that are right so very often, and therefore how simple it is.  If every user needs to think about maxFieldLength from day one, then that might make life more complex.

Eric

On Aug 24, 2010, at 2:44 PM, Robert Muir wrote:

> 
> 
> On Tue, Aug 24, 2010 at 2:29 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:
> I created a patch file at https://issues.apache.org/jira/browse/SOLR-2086.  I went with the simplest approach since I didn't want to confuse things by having extra filters being added to what the user created.  However, either approach would work!
> 
> 
> 
> One idea here was that this maxFieldLength might be going away: see https://issues.apache.org/jira/browse/LUCENE-2295 for more information (though I notice it's still not listed as deprecated?).
> 
> But for now it's worth mentioning: the filter is more flexible; for example, it supports per-field configuration (and of course if you use the filter instead, which you can do now, it will automatically work in analysis.jsp).
> 
>  
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: Should analysis.jsp honor maxFieldLength

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Aug 24, 2010 at 2:29 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:

> I created a patch file at https://issues.apache.org/jira/browse/SOLR-2086.
>  I went with the simplest approach since I didn't want to confuse things by
> having extra filters being added to what the user created.  However, either
> approach would work!
>
>
>
One idea here was that this maxFieldLength might be going away: see
https://issues.apache.org/jira/browse/LUCENE-2295 for more information
(though I notice it's still not listed as deprecated?).

But for now it's worth mentioning: the filter is more flexible; for example,
it supports per-field configuration (and of course if you use the filter
instead, which you can do now, it will automatically work in analysis.jsp).


-- 
Robert Muir
rcmuir@gmail.com

Re: Should analysis.jsp honor maxFieldLength

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I created a patch file at https://issues.apache.org/jira/browse/SOLR-2086.  I went with the simplest approach since I didn't want to confuse things by having extra filters being added to what the user created.  However, either approach would work!

On Aug 24, 2010, at 12:18 PM, Robert Muir wrote:

> 
> On Tue, Aug 24, 2010 at 12:03 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:
> Hi all,
> 
> I have maxFieldLength set to 10000 in solrconfig.xml, but was playing around with a really large document (the King James Bible) in analysis.jsp. I hacked analysis.jsp to show me the number of terms at each filter, along with the headers, without turning everything on by checking the verbose box.
> 
> My results, shown in this screenshot: http://img.skitch.com/20100824-t36rq45i2wfimwyd53gwiqebdy.png, seem to confirm that maxFieldLength is NOT honored by analysis.jsp.
> 
> 
> Separate from whether or not analysis.jsp should do this (I happen to think the closer to "reality" it is, the better), I think the easiest implementation would be to wrap the entire stream with LimitTokenCountFilter:
> 
> /**
>  * This TokenFilter limits the number of tokens while indexing. It is
>  * a replacement for the maximum field length setting inside {@link org.apache.lucene.index.IndexWriter}.
>  */
>  
> If I remember correctly, it's not exactly the same as maxFieldLength, but it's pretty close.
> 
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: Should analysis.jsp honor maxFieldLength

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Aug 24, 2010 at 12:03 PM, Eric Pugh <epugh@opensourceconnections.com> wrote:

> Hi all,
>
> I have maxFieldLength set to 10000 in solrconfig.xml, but was playing
> around with a really large document (the King James Bible) in analysis.jsp.
> I hacked analysis.jsp to show me the number of terms at each filter, along
> with the headers, without turning everything on by checking the verbose box.
>
> My results, shown in this screenshot:
> http://img.skitch.com/20100824-t36rq45i2wfimwyd53gwiqebdy.png, seem to
> confirm that maxFieldLength is NOT honored by analysis.jsp.
>
>
Separate from whether or not analysis.jsp should do this (I happen to think
the closer to "reality" it is, the better), I think the easiest
implementation would be to wrap the entire stream with
LimitTokenCountFilter:

/**
 * This TokenFilter limits the number of tokens while indexing. It is
 * a replacement for the maximum field length setting inside
 * {@link org.apache.lucene.index.IndexWriter}.
 */

If I remember correctly, it's not exactly the same as maxFieldLength, but
it's pretty close.

-- 
Robert Muir
rcmuir@gmail.com