Posted to dev@lucene.apache.org by Mathieu Lecarme <ma...@garambrogne.net> on 2008/04/06 19:23:58 UTC

shingles and punctuations

The new ShingleFilter is very helpful for fetching groups of words, but
it doesn't handle punctuation or any other kind of separator.
If you feed it multiple sentences, you get shingles that start in one
sentence and end in the next.
To avoid that, you could look at token positions (character offsets): if
there is more than one character between a token and the previous one,
the gap should be punctuation (or a typo).
Any suggestions for producing only shingles that stay within one sentence?

M.
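
To make the problem concrete, here is a minimal plain-Java sketch with no Lucene classes involved. The names OffsetGapDemo, Tok, tokenize, startsSentence and bigrams are hypothetical and exist only for this illustration. It builds two-word shingles from tokens that carry character offsets, shows that a naive pass produces a shingle spanning the full stop, and then applies the "more than one character since the previous token" heuristic to skip it.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetGapDemo {
    /** A word with its character offsets in the original text (end is exclusive). */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on anything that is not a letter or digit, recording offsets. */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        while (i < s.length()) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) {
                i++;
            }
            out.add(new Tok(s.substring(start, i), start, i));
        }
        return out;
    }

    /** True when there is no previous token or more than one character separates the two. */
    static boolean startsSentence(Tok prev, Tok cur) {
        return prev == null || cur.start - prev.end > 1;
    }

    /** Two-word shingles; when respectGaps is true, pairs that straddle a gap are skipped. */
    static List<String> bigrams(List<Tok> toks, boolean respectGaps) {
        List<String> out = new ArrayList<String>();
        for (int i = 1; i < toks.size(); i++) {
            Tok prev = toks.get(i - 1);
            Tok cur = toks.get(i);
            if (respectGaps && startsSentence(prev, cur)) {
                continue;
            }
            out.add(prev.text + " " + cur.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = tokenize("One test. Two tests.");
        System.out.println(bigrams(toks, false)); // [One test, test Two, Two tests]
        System.out.println(bigrams(toks, true));  // [One test, Two tests]
    }
}
```

The heuristic is deliberately crude: a double space between two words triggers it just as a full stop does, which is the "(or a typo)" case mentioned above.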

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: shingles and punctuations

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

Setting a flag in a filter is easy:

8<-------------------

package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

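/**
  * Flags the first token of each sentence. A token counts as a sentence
  * start when it is the first token of the stream, or when more than one
  * character separates it from the previous token.
  *
  * @author Mathieu Lecarme
  */
public class SentenceCutterFilter extends TokenFilter {
   public static final int FLAG = 42;
   // Last token returned, used to measure the gap before the next one.
   private Token previous = null;

   // Public so the filter can be chained from outside this package.
   public SentenceCutterFilter(TokenStream input) {
     super(input);
   }

   public Token next() throws IOException {
     Token current = input.next();
     if (current == null)
       return null;
     // Note: setFlags replaces any flags already set upstream.
     if (previous == null
         || (current.startOffset() - previous.endOffset()) > 1)
       current.setFlags(FLAG);
     previous = current;
     return current;
   }
}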

8<-------------------
and using it at the right place is tricky:
8<-------------------

     String test = "This is a test, a big test";
     TokenStream stream =
       new StopFilter(
         new ShingleFilter(
           new SentenceCutterFilter(
             new LowerCaseFilter(
               new ISOLatin1AccentFilter(
                   new StandardTokenizer(new StringReader(test))))), 3),
       new String[]{"is", "a"});

8<-------------------

I must be too tired, though, because I can't manage to patch
ShingleFilter to handle the flag.
I guess the flag should be a single bit, tested with a mask.

M.
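
A minimal sketch of the bit-and-mask idea in plain Java, with no Lucene dependency. The class name FlagBits and both constants are hypothetical. With a Lucene Token the same pattern would presumably read token.setFlags(token.getFlags() | bit), assuming the getFlags()/setFlags(int) accessors on Token.

```java
final class FlagBits {
    // Hypothetical bit assignments: each filter owns exactly one bit.
    static final int SENTENCE_START = 1;      // binary 0001
    static final int OTHER_FILTER = 1 << 1;   // binary 0010

    /** Returns flags with the given bit switched on, leaving the other bits untouched. */
    static int set(int flags, int bit) {
        return flags | bit;
    }

    /** True when the given bit is on, whatever the other bits are. */
    static boolean isSet(int flags, int bit) {
        return (flags & bit) != 0;
    }

    public static void main(String[] args) {
        int flags = 0;
        flags = set(flags, OTHER_FILTER);
        flags = set(flags, SENTENCE_START);
        System.out.println(isSet(flags, SENTENCE_START)); // true
        System.out.println(flags == SENTENCE_START);      // false: equality breaks once another bit is set
    }
}
```

A value such as 42 sets three bits at once (32 + 8 + 2), so it cannot coexist cleanly with other single-bit flags; a power of two avoids that.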



On 6 Apr 08, at 22:53, Grant Ingersoll wrote:
> For now, it's up to your app to know, unfortunately :-(  I think the
> WikipediaTokenizer is the only one using flags currently in Lucene.

Re: shingles and punctuations

Posted by Grant Ingersoll <gs...@apache.org>.

For now, it's up to your app to know, unfortunately :-(  I think the
WikipediaTokenizer is the only one using flags currently in Lucene.


On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:

> I'll use Token flags to mark the first token of a sentence, but how
> does that work? How are flag collisions avoided? To keep it simple,
> I'll take 1 as the flag, but what happens if another filter uses the
> same flag?

Re: shingles and punctuations

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

I'll use Token flags to mark the first token of a sentence, but how
does that work? How are flag collisions avoided? To keep it simple, I'll
take 1 as the flag, but what happens if another filter uses the same flag?

M.

On 6 Apr 08, at 20:13, Grant Ingersoll wrote:
> I think you need sentence detection to take place further upstream.   
> Then you could use the Token type or Token flags to indicate  
> punctuation, sentences, whatever and we could patch the shingle  
> filter to ignore these things, or break and move onto the next one.

Re: shingles and punctuations

Posted by Grant Ingersoll <gs...@apache.org>.

I think you need sentence detection to take place further upstream.
Then you could use the Token type or Token flags to indicate
punctuation, sentences, whatever, and we could patch the shingle filter
to ignore these things, or break and move on to the next one.

-Grant
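
The "break and move on" behaviour can be sketched in plain Java as follows. This is not a patch to ShingleFilter: the class FlagAwareShingles, the Tok type and the SENTENCE_START bit are hypothetical, the flag is assumed to have been set by some upstream step, and the sketch emits only multi-word shingles (no single words).

```java
import java.util.ArrayList;
import java.util.List;

final class FlagAwareShingles {
    // Hypothetical flag bit, assumed to be set upstream on each sentence-opening token.
    static final int SENTENCE_START = 1;

    static final class Tok {
        final String text;
        final int flags;
        Tok(String text, int flags) {
            this.text = text;
            this.flags = flags;
        }
    }

    /** Emits shingles of 2..maxSize words that never extend across a sentence-start token. */
    static List<String> shingles(List<Tok> toks, int maxSize) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < toks.size(); i++) {
            StringBuilder sb = new StringBuilder(toks.get(i).text);
            for (int j = i + 1; j < toks.size() && j - i < maxSize; j++) {
                // A flagged token opens a new sentence: stop growing this shingle.
                if ((toks.get(j).flags & SENTENCE_START) != 0) {
                    break;
                }
                sb.append(' ').append(toks.get(j).text);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = new ArrayList<Tok>();
        toks.add(new Tok("one", SENTENCE_START));
        toks.add(new Tok("test", 0));
        toks.add(new Tok("two", SENTENCE_START));
        toks.add(new Tok("more", 0));
        toks.add(new Tok("tests", 0));
        System.out.println(shingles(toks, 3));
        // [one test, two more, two more tests, more tests]
    }
}
```

Testing the flag with a mask rather than with equality means the check keeps working if other components set other bits on the same token.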

On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:

> The new ShingleFilter is very helpful for fetching groups of words, but
> it doesn't handle punctuation or any other kind of separator.
> If you feed it multiple sentences, you get shingles that start in one
> sentence and end in the next.
> To avoid that, you could look at token positions (character offsets): if
> there is more than one character between a token and the previous one,
> the gap should be punctuation (or a typo).
> Any suggestions for producing only shingles that stay within one sentence?

RE: shingles and punctuations

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Mathieu,

From the class comment for ShingleFilter:

  This filter handles position increments > 1 by inserting
  filler tokens (tokens with termtext "_"). It does not
  handle a position increment of 0.

You could use this feature by setting (in an upstream filter) the positionIncrement of each sentence-starting word to be at least as large as the maximum shingle size.  This would result in sentence-ending shingles like ". _" and sentence-beginning shingles like "_ Word".

Steve
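
Why this works can be shown with a small plain-Java model. It is only a sketch of the behaviour described in the class comment, not ShingleFilter's actual code; FillerShingles, Tok, withFillers and shingles are hypothetical names. With the real filter, the increment would presumably be set through Token.setPositionIncrement(int) in the upstream filter.

```java
import java.util.ArrayList;
import java.util.List;

final class FillerShingles {
    static final String FILLER = "_";

    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /** Expands each position increment greater than 1 into (posInc - 1) filler words. */
    static List<String> withFillers(List<Tok> toks) {
        List<String> out = new ArrayList<String>();
        for (Tok t : toks) {
            for (int k = 1; k < t.posInc; k++) {
                out.add(FILLER);
            }
            out.add(t.text);
        }
        return out;
    }

    /** All shingles of exactly size words over the expanded word list. */
    static List<String> shingles(List<String> words, int size) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + size <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        int maxShingleSize = 2;
        List<Tok> toks = new ArrayList<Tok>();
        toks.add(new Tok("one", 1));
        toks.add(new Tok("test", 1));
        toks.add(new Tok("two", maxShingleSize)); // sentence start: increment = max shingle size
        toks.add(new Tok("tests", 1));
        System.out.println(shingles(withFillers(toks), maxShingleSize));
        // [one test, test _, _ two, two tests]
    }
}
```

An increment of n leaves n - 1 fillers between the two sentences, so a shingle would need at least n + 1 tokens to contain a real word from each side. With n equal to the maximum shingle size, no such shingle exists.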

On 04/06/2008 at 1:23 PM, Mathieu Lecarme wrote:
> The new ShingleFilter is very helpful for fetching groups of words, but
> it doesn't handle punctuation or any other kind of separator.
> If you feed it multiple sentences, you get shingles that start in one
> sentence and end in the next.
> To avoid that, you could look at token positions (character offsets): if
> there is more than one character between a token and the previous one,
> the gap should be punctuation (or a typo).
> Any suggestions for producing only shingles that stay within one sentence?