Posted to java-user@lucene.apache.org by Jaime <j....@estructure.es> on 2016/06/23 15:47:25 UTC

Preprocess input text before tokenizing

Hello,

I want to change the input text before tokenizing. I think I just need 
to use some characters as word separators, and maybe remove some others 
completely.

I was planning to use MappingCharFilterFactory to replace some chars 
with " " and others with "", but I feel like I'm not on the right track.

First, I've implemented a custom analyzer to use my custom tokenizer. My 
idea was to inherit from StandardTokenizer and call 
MappingCharFilterFactory.create(reader) from within setReader.

However, setReader is final, so I can't override it.

Is there a better way to do this?
In any case, how should I use MappingCharFilter if I really need it?
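
[Editor's sketch] The transformation described above (some characters become separators, others disappear) can be illustrated without Lucene as a reader wrapper. This is only a minimal sketch under stated assumptions: MappingReader and filter are hypothetical names, not Lucene classes, and unlike a real char filter this wrapper does not track offsets back to the original text or handle skip/mark.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;

/** Hypothetical reader that replaces or deletes single characters on the fly. */
final class MappingReader extends FilterReader {
    private final Map<Character, String> mapping;
    private String pending = "";   // replacement text not yet handed out
    private int pendingPos = 0;

    MappingReader(Reader in, Map<Character, String> mapping) {
        super(in);
        this.mapping = mapping;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (pendingPos < pending.length()) {
                return pending.charAt(pendingPos++);
            }
            int c = in.read();
            if (c < 0) {
                return -1;                    // end of input
            }
            String replacement = mapping.get((char) c);
            if (replacement == null) {
                return c;                     // unmapped: pass through unchanged
            }
            pending = replacement;            // "" means the character is dropped
            pendingPos = 0;
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c < 0) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Convenience helper for trying a mapping on a string. */
    static String filter(String text, Map<Character, String> mapping) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new MappingReader(new StringReader(text), mapping)) {
            int c;
            while ((c = r.read()) >= 0) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For example, MappingReader.filter("foo_bar-baz", Map.of('_', " ", '-', "")) maps the underscore to a space and deletes the hyphen, so a tokenizer reading the result would see two words.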

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Preprocess input text before tokenizing

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Jaime,

Please see org.apache.lucene.analysis.custom.CustomAnalyzer.builder() to create custom analyzers using a builder-style API.

Ahmet


Re: Preprocess input text before tokenizing

Posted by Jaime <j....@estructure.es> on 2016/06/24.

Thank you very much, that seems to solve my issue.

However, I find this a little cumbersome. I need to filter the text 
before any tokenizing takes place, so I have to implement a filtered 
version of every analyzer I'm using (StandardAnalyzer and 
SpanishAnalyzer and a custom analyzer right now).

If I need to support another analyzer in the future (a very plausible 
possibility), I will need to create another version of that analyzer. 
Whenever any of those analyzers changes, I will need to reapply the 
changes manually.

Isn't there a better way to do this?
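
[Editor's sketch] One general way to avoid a filtered copy of every analyzer is a decorator that applies the pre-filter to the reader and then delegates to whichever analyzer it wraps. The code below only illustrates that pattern with stand-in types: TextAnalyzer, PreFilteringAnalyzer and WrapperDemo are hypothetical, not Lucene classes. Lucene's AnalyzerWrapper, with its wrapReader hook, looks like the closest built-in counterpart, but check the Javadoc for your version before relying on it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical minimal analyzer contract: characters in, tokens out. */
interface TextAnalyzer {
    List<String> analyze(Reader reader);
}

/** Decorator: applies one reader filter, then delegates to any analyzer. */
final class PreFilteringAnalyzer implements TextAnalyzer {
    private final TextAnalyzer delegate;
    private final UnaryOperator<Reader> readerFilter;

    PreFilteringAnalyzer(TextAnalyzer delegate, UnaryOperator<Reader> readerFilter) {
        this.delegate = delegate;
        this.readerFilter = readerFilter;
    }

    @Override
    public List<String> analyze(Reader reader) {
        return delegate.analyze(readerFilter.apply(reader));
    }
}

final class WrapperDemo {
    /** Two stand-in analyzers that know nothing about the pre-filter. */
    static final TextAnalyzer WHITESPACE = reader -> split(drain(reader));
    static final TextAnalyzer LOWERCASE =
            reader -> split(drain(reader).toLowerCase(Locale.ROOT));

    /** The shared pre-filter: '_' becomes a separator, '#' is removed. */
    static Reader separators(Reader reader) {
        return new StringReader(drain(reader).replace('_', ' ').replace("#", ""));
    }

    /** Wraps the given analyzer with the shared pre-filter and runs it. */
    static List<String> tokens(TextAnalyzer analyzer, String text) {
        return new PreFilteringAnalyzer(analyzer, WrapperDemo::separators)
                .analyze(new StringReader(text));
    }

    private static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    private static String drain(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same PreFilteringAnalyzer wraps both stand-in analyzers unchanged, so supporting another analyzer later would mean wrapping it, not rewriting it.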


-- 
Jaime Pardos
ESTRUCTURE MEDIA SYSTEMS, S.L.
Avda. de Madrid nº 120 nave 10, 28500, Arganda del Rey, MADRID,
j.pardos@estructure.es
910088429


Re: Preprocess input text before tokenizing

Posted by Ahmet Arslan <io...@yahoo.com.INVALID> on 2016/06/23.

Hi,

CharFilters (zero or more of them) are the way to manipulate text before it reaches the tokenizer.
I think initReader is the method where you want to plug in char filters.
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java

Ahmet
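
[Editor's sketch] The hook described above can be modelled in plain Java. In Lucene the method is Analyzer.initReader(String fieldName, Reader reader). The classes below are hypothetical stand-ins that only mimic its shape: the base class hands the reader to an overridable hook before tokenizing, and a subclass overrides the hook to rewrite the text. A real char filter would stream the input and correct offsets instead of buffering the whole text as this sketch does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an analyzer with an overridable reader hook. */
abstract class HookedAnalyzer {
    /** Hook: subclasses may wrap or replace the reader; identity by default. */
    protected Reader initReader(Reader reader) {
        return reader;
    }

    /** Runs the hook, then a whitespace split standing in for a real tokenizer. */
    public final List<String> analyze(String text) {
        String filtered = drain(initReader(new StringReader(text)));
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Reads the whole reader into a string and closes it. */
    static String drain(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Treats '_' and '/' as separators and deletes apostrophes before tokenizing. */
final class SeparatorAwareAnalyzer extends HookedAnalyzer {
    @Override
    protected Reader initReader(Reader reader) {
        String text = drain(reader)
                .replace('_', ' ')
                .replace('/', ' ')
                .replace("'", "");
        return new StringReader(text);
    }
}
```

With this sketch, new SeparatorAwareAnalyzer().analyze("foo_bar/baz o'clock") rewrites the text before the split, so the tokenizer never sees the mapped characters.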
