Posted to solr-user@lucene.apache.org by Pierre JdlF <pi...@gmail.com> on 2012/01/31 16:00:12 UTC

ShingleFilterFactory not indexing the whole doc, where is the limit ?

I'm trying to index word n-grams using solr.ShingleFilterFactory
(storing their positions and offsets):
...
    <fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="3"
                maxShingleSize="5" outputUnigrams="false" tokenSeparator="_"/>
      </analyzer>
...
<field name="textengram" type="edge_ngram" indexed="true"
       stored="true" multiValued="false" termVectors="true"
       termPositions="true" termOffsets="true"/>
...
I'm testing it with a (big?) HTML document [1,300,000 chars] with lots of tags.
Looking at the index (using the Schema Browser web interface), I can see
that some n-grams were indexed (8939),
but they appear to come only from the beginning of the
document (the first 1/8 of it).

Other fields index the whole doc without problems,
so I was wondering whether solr.ShingleFilterFactory has a limit:
- a maximum size for the text it can handle?
- a maximum number of n-grams it produces?

Note that if I try lower values, such as minShingleSize="2" maxShingleSize="3",
I get 6465 n-grams (corresponding to the first 1/5 of the doc).

I thought the sky was the limit!
Any ideas?

-- 
+ Pierre

Re: ShingleFilterFactory not indexing the whole doc, where is the limit ?

Posted by Pierre JdlF <pi...@gmail.com>.
Works now! Thanks a lot.
... I guess until a document comes along with more than 2,147,483,647 chars.
Good night,
+ Pierre

On Tue, Jan 31, 2012 at 5:23 PM, Ahmet Arslan <io...@yahoo.com> wrote:
>
> It could be the maxFieldLength setting in solrconfig.xml. Set it to <maxFieldLength>2147483647</maxFieldLength>.

Re: ShingleFilterFactory not indexing the whole doc, where is the limit ?

Posted by Ahmet Arslan <io...@yahoo.com>.

It could be the maxFieldLength setting in solrconfig.xml. Set it to <maxFieldLength>2147483647</maxFieldLength>.
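
For context, here is a rough sketch of where this setting usually lives. It assumes a Solr 3.x-era solrconfig.xml, where maxFieldLength can appear in the <indexDefaults> and/or <mainIndex> sections and the stock example value is 10000. The setting caps the number of tokens indexed per field, which would be consistent with shingles coming only from the start of a large document. The surrounding elements below are assumptions, not taken from the thread, and your file may be laid out differently.

```xml
<!-- solrconfig.xml (sketch; surrounding layout is assumed, not from the thread) -->
<config>
  <indexDefaults>
    <!-- Maximum number of tokens indexed per field.
         The stock example config typically ships with 10000. -->
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>

  <mainIndex>
    <!-- If the setting is also present here, it likely takes precedence
         over indexDefaults, so it may need raising in both places. -->
    <maxFieldLength>2147483647</maxFieldLength>
  </mainIndex>
</config>
```

After changing the value, Solr needs a restart and the document has to be re-indexed, since content already truncated at index time is not recovered.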