Posted to solr-user@lucene.apache.org by Peter Spam <ps...@mac.com> on 2011/11/04 08:12:49 UTC

Proper analyzer / tokenizer for syslog data?

Example data:
01/23/2011 05:12:34 [Test] a=1; hello_there=50; data=[1,5,30%];

I would love to be able to just "grep" the data, i.e. if I search for "ello", it finds and returns "ello", and if I search for "hello_there=5", it matches too.

Here's what I'm using now:

   <fieldType name="text_sy" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
     </analyzer>
   </fieldType>

The problem with this is that if I search for a substring, I don't get anything back.  For example, searching for "ello" or "*ello*" returns nothing.  Any ideas?

http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400


Thanks!
Pete

Re: Proper analyzer / tokenizer for syslog data?

Posted by Peter Spam <ps...@mac.com>.
Wow, I tried with minGramSize=1 and maxGramSize=1000 (I want someone to be able to search on any substring, just like "grep"), and the index is multiple orders of magnitude larger than my data!
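
A rough count suggests why the index grows so much. As a sketch, assume a single token of length n (an illustrative symbol, not from the thread), minGramSize=1, and a maxGramSize of at least n, so that the n-gram filter emits every substring of the token:

```latex
\begin{align*}
\text{grams per token} &= \sum_{k=1}^{n} (n-k+1) = \frac{n(n+1)}{2} \\
\text{gram characters per token} &= \sum_{k=1}^{n} k\,(n-k+1) = \frac{n(n+1)(n+2)}{6}
\end{align*}
% Example: the 15-character token "hello_there=50;" (as a whitespace tokenizer
% would produce it) gives 15*16/2 = 120 grams and 15*16*17/6 = 680 characters.
```

The real on-disk size also depends on how many grams repeat across documents and on index compression, so this only indicates the growth rate: quadratic in gram count and cubic in gram characters per token.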

There's got to be a better way to support full grep-like searching?


Thanks!
Pete

On Nov 4, 2011, at 1:20 AM, Ahmet Arslan wrote:

> 
> For substring matching, NGramFilterFactory is required at index time.
> 
> <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
> 
> Plus, you may want to use WhitespaceTokenizerFactory instead of StandardTokenizerFactory. The Analysis admin page displays the behavior of each tokenizer.


Re: Proper analyzer / tokenizer for syslog data?

Posted by Ahmet Arslan <io...@yahoo.com>.

For substring matching, NGramFilterFactory is required at index time.

<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>

Plus, you may want to use WhitespaceTokenizerFactory instead of StandardTokenizerFactory. The Analysis admin page displays the behavior of each tokenizer.
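
Putting the two suggestions above together, a field type might look roughly like the following. This is only a sketch: the name "text_sy_ngram" is made up for illustration, and splitting the analyzer into separate index and query chains is an assumption based on the "at index time" remark, not something spelled out in the thread.

```xml
<fieldType name="text_sy_ngram" class="solr.TextField">
  <!-- Index time: split on whitespace only, lowercase,
       then emit every substring of 1 to 15 characters -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <!-- Query time: no n-gram filter, so the search text is kept whole
       and compared against the grams stored in the index -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this setup, a plain query such as "ello" or "hello_there=5" (13 characters) should match without wildcards, because both are substrings of the token "hello_there=50;" and fit within maxGramSize. A search term longer than maxGramSize would have no matching gram, so the limit has to cover the longest substring you expect to search for.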