Posted to java-user@lucene.apache.org by luther blisset <sa...@gmail.com> on 2009/07/24 11:41:45 UTC

Removing diacritics with ISOLatin1AccentFilter

Hi folks,
I just upgraded the Hibernate Search library in my app, so I also had to
upgrade Lucene from version 2.2 to 2.4.
In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure
out how it works.
I use a TwoWayFieldBridge to index the data, and this is my set method:

public void set(String s, Object o, Document document, Field.Store store,
        Field.Index index, Float aFloat) {

    // MyObject has a field name
    MyObject objectToIndex;

    // casting from Object to MyObject
    try {
        objectToIndex = MyObject.class.cast(o);
    } catch (ClassCastException cEx) {}

    if (objectToIndex.getName() != null) {

        ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
                new StandardTokenizer(new StringReader(objectToIndex.getName())));
        filter.removeAccents(objectToIndex.getName().toCharArray(),
                objectToIndex.getName().length());
        Field name = new Field("name",
                String.valueOf(objectToIndex.getName()).toLowerCase(),
                Field.Store.YES, Field.Index.UN_TOKENIZED);

        document.add(name);
    }
}


But it doesn't work: if I pass an accented word for the property
objectToIndex.getName(), the accent remains :(
I think there is something wrong in how I create the new instance
of ISOLatin1AccentFilter, but I can't get it to work properly.
Could someone help me?
Thanks a lot
-- 
View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Removing diacritics with ISOLatin1AccentFilter

Posted by luther blisset <sa...@gmail.com>.

I'm trying to index all the words without accents.
I do the same when querying: I remove the accents and lowercase the
search term.
Why should I pass the string through the analyzer?
What is wrong if I don't pass it through the analyzer,
and what are the benefits?
I'm just a newbie with Lucene.
Thanks a lot for your reply :]





-- 
View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24643036.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Removing diacritics with ISOLatin1AccentFilter

Posted by Simon Willnauer <si...@googlemail.com>.

I don't really understand what you are trying to do. Do you just
want to remove the accents from the string and index it without passing
it through an analyzer? (Field.Index.UN_TOKENIZED will not pass the
field value to an analyzer.)

If you pass an array to ISOLatin1AccentFilter#removeAccents(), the
processed chars are written to a private internal char array inside
the ISOLatin1AccentFilter, so you cannot call removeAccents on its own
to get the accent-free result back. What you could do as a dirty
workaround is the following:
String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
    // anonymous Tokenizer that emits the whole input as a single token
    new Tokenizer(new StringReader(foo)) {
        private boolean isRead = false;

        public Token next(final Token reusableToken) throws IOException {
            if (isRead) {
                return null; // the single token was already returned
            }
            BufferedReader reader = new BufferedReader(this.input);
            StringBuilder builder = new StringBuilder();

            // read the complete input into one string
            char[] buffer = new char[1024];
            int read = -1;
            while ((read = reader.read(buffer)) > 0) {
                builder.append(buffer, 0, read);
            }
            reusableToken.setTermText(builder.toString());
            isRead = true;
            return reusableToken;
        }
    });
// the filter removes the accents from the token it pulls from the tokenizer
Token t = filter.next();
String foo_without_accents = t.term();
System.out.println(foo_without_accents);
yields: HAllo HAllo HAllo HAllo HAllo


simon
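
A small, Lucene-free sketch of the point made above about the private
internal array. BufferingFolder is a hypothetical stand-in (it is not the
real ISOLatin1AccentFilter and only handles the character Ä); it assumes,
as described in this thread, that removeAccents(char[], int) writes into an
internal buffer and returns nothing. Because String.toCharArray() returns a
copy, neither the call nor the buffer can change the original String, which
would explain why the field was still indexed with the accent.

```java
public class InternalBufferDemo {

    // Hypothetical stand-in for a filter that folds accents into a
    // private buffer. NOT the Lucene class; it only maps 'Ä' to 'A'.
    public static final class BufferingFolder {
        private char[] output = new char[256];
        private int outputPos;

        // Writes the folded characters into the internal buffer only.
        public void removeAccents(char[] input, int length) {
            if (output.length < length) {
                output = new char[length];
            }
            outputPos = 0;
            for (int i = 0; i < length; i++) {
                char c = input[i];
                output[outputPos++] = (c == '\u00C4') ? 'A' : c; // \u00C4 is 'Ä'
            }
        }

        // Accessor added for this demo so the buffer can be inspected.
        public String result() {
            return new String(output, 0, outputPos);
        }
    }

    public static void main(String[] args) {
        String name = "H\u00C4llo"; // "HÄllo"
        BufferingFolder folder = new BufferingFolder();

        // toCharArray() hands the folder a copy of the characters.
        folder.removeAccents(name.toCharArray(), name.length());

        // The original String is immutable and still contains the accent.
        System.out.println(name.equals("H\u00C4llo")); // prints "true"
        // Only the folder's own buffer holds the folded text.
        System.out.println(folder.result());           // prints "HAllo"
    }
}
```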

Re: Removing diacritics with ISOLatin1AccentFilter

Posted by luther blisset <sa...@gmail.com>.

Yes, Ahmet Arslan, this works!
I've just tested it and it works nicely.
Thanks a lot.




Ahmet Arslan wrote:
> 
> 
> Or alternatively:
> 
> String test = "HÄllo HÄllo HÄllo HÄllo HÄllo";
> 
> ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
>                 KeywordTokenizer(new StringReader(test)));
> 
>     final Token reusableToken = new Token();
>     Token nextToken;
>         
>     if ((nextToken = filter.next(reusableToken)) != null)
>          System.out.print(nextToken.term());
>         
>     filter.close();
> 
> 
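
For the un-tokenized case discussed in this thread, where the same
clean-up is applied by hand to the indexed value and to the search term, a
possible alternative that does not go through a Lucene filter at all is the
JDK's java.text.Normalizer (Java 6 and later). This is a different technique
from the one above, shown only as a sketch: the class and method names
(AccentStripper, stripAccents, normalizeForIndex) are made up for
illustration. Characters with no canonical decomposition, such as ø or ß,
are left unchanged, so the result may differ from what ISOLatin1AccentFilter
produces.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentStripper {

    // Decomposes accented characters (NFD) and drops the combining marks.
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Hypothetical helper: use the same call for the indexed value and
    // for the search term so both sides are normalized identically.
    public static String normalizeForIndex(String input) {
        return stripAccents(input).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "HÄllo HÄllo" written with a Unicode escape to avoid encoding issues
        System.out.println(stripAccents("H\u00C4llo H\u00C4llo"));      // prints "HAllo HAllo"
        System.out.println(normalizeForIndex("H\u00C4llo H\u00C4llo")); // prints "hallo hallo"
    }
}
```

The value returned by normalizeForIndex could then be passed to the Field
constructor in place of the raw name, with the search term run through the
same method before querying.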

-- 
View this message in context: http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24643074.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

