Posted to java-user@lucene.apache.org by "Gordin, Ira" <ir...@sap.com> on 2018/07/31 14:07:33 UTC

Search in lines, so need to index lines?

Hi all,

As I understand it, Lucene matches queries against individual tokens. For example, if I use WhitespaceTokenizer and search with the regular expression /.*nice day.*/, I will never find anything. Am I correct?
In my project I need to find matches inside lines rather than inside words, so I am considering tokenizing by line. How should I implement this?
I would really appreciate any other ideas or implementations.

Thanks in advance,
Ira
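
The behaviour asked about above can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene code. It assumes that a term-level regular expression has to match an entire token, and the names TokenRegexDemo, whitespaceTokens, lineTokens and anyTokenMatches are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TokenRegexDemo {

    // Splits on runs of whitespace, roughly what a whitespace tokenizer emits.
    static List<String> whitespaceTokens(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // Splits on line breaks, so every non-empty line becomes one token.
    static List<String> lineTokens(String text) {
        return Arrays.stream(text.split("[\\r\\n]+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // True if the regex matches at least one token as a whole
    // (assumption: this mirrors how a regex is compared against indexed terms).
    static boolean anyTokenMatches(List<String> tokens, String regex) {
        Pattern p = Pattern.compile(regex);
        return tokens.stream().anyMatch(t -> p.matcher(t).matches());
    }

    public static void main(String[] args) {
        String text = "have a nice day\nsee you soon";
        String regex = ".*nice day.*";
        // No whitespace token contains a space, so nothing can match.
        System.out.println(anyTokenMatches(whitespaceTokens(text), regex)); // false
        // With one token per line, the first line matches.
        System.out.println(anyTokenMatches(lineTokens(text), regex));       // true
    }
}
```

Under that assumption the answer to "Am I correct?" is yes: a pattern containing a space cannot match tokens that were split on whitespace.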

Re: Search in lines, so need to index lines?

Posted by Robert Muir <rc...@gmail.com>.

http://man7.org/linux/man-pages/man1/grep.1.html

On Wed, Aug 1, 2018 at 7:01 AM, Gordin, Ira <ir...@sap.com> wrote:
> Hi Tomoko,
>
> I need to search in many files and we use Lucene for this purpose.
>
> Thanks,
> Ira

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Search in lines, so need to index lines?

Posted by Michael Sokolov <ms...@gmail.com>.

It sounds as if you must integrate a regex search into an existing
framework. You have good ammunition here for explaining why this may not be
a good idea, since the performance will not be good. However, if you must do
it, you may want to consider whether you can augment your queries with
additional constraints to make the problem tractable. For example, you could
index letter n-grams and then search for those implied by the user's regex.
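
The n-gram idea above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not Lucene code: the class TrigramPrefilter and its methods are hypothetical, it indexes character trigrams per line, and the literal text that every match must contain is passed in by the caller instead of being derived from the regex automatically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

class TrigramPrefilter {
    // Maps each character trigram to the ids of the lines that contain it.
    private final Map<String, Set<Integer>> index = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    void add(String line) {
        int id = lines.size();
        lines.add(line);
        for (String g : trigrams(line)) {
            index.computeIfAbsent(g, k -> new HashSet<>()).add(id);
        }
    }

    // requiredLiteral: text that every regex match must contain
    // (assumption: supplied by the caller, not extracted from the regex).
    List<String> search(String regex, String requiredLiteral) {
        Set<Integer> candidates = null;
        for (String g : trigrams(requiredLiteral)) {
            Set<Integer> ids = index.getOrDefault(g, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        if (candidates == null) {
            // Literal shorter than three characters: no constraint, scan every line.
            candidates = new TreeSet<>();
            for (int i = 0; i < lines.size(); i++) {
                candidates.add(i);
            }
        }
        // Only the surviving candidate lines are checked with the real regex.
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (int id : candidates) {
            if (p.matcher(lines.get(id)).find()) {
                hits.add(lines.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrigramPrefilter idx = new TrigramPrefilter();
        idx.add("have a nice day");
        idx.add("nice try");
        idx.add("a very nice daydream");
        idx.add("nothing here");
        System.out.println(idx.search("nice da[a-z]+", "nice da"));
    }
}
```

The trigram lookup is the "additional constraint": it narrows the set of lines cheaply, and the expensive regex runs only on what is left.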

On Wed, Aug 1, 2018, 4:35 AM Tomoko Uchida <to...@gmail.com>
wrote:

> Ira,
>
> I do not understand your requirements, but essentially Lucene is not meant for
> regex searching.
> If you are not satisfied with the Java standard library, there are tools for
> fast regular expression search, for example:
> https://github.com/google/re2j
>
> And yes, the grep command would be the best tool for you.
>
> Tomoko

Re: Search in lines, so need to index lines?

Posted by Tomoko Uchida <to...@gmail.com>.

> what Lucene is good for and meant to be used for, and what it is not good
> for and not meant to be used for?

I think that question is too broad to answer.

If you need comprehensive information, see the official site and
documentation:
http://lucene.apache.org/core/
http://lucene.apache.org/core/7_4_0/index.html

As for the questions in your previous posts, Lucene is (originally) a
full-text search engine, not a regex engine. If you need a regex engine, you
should look for the right tool for the task.

Regards,
Tomoko



On Thu, Aug 2, 2018 at 15:52, Gordin, Ira <ir...@sap.com> wrote:

> Hi all,
>
> Would you mind explaining, and/or sending some links that explain, what
> Lucene is good for and meant to be used for, and what it is not good for and
> not meant to be used for?
>
> Thanks a lot in advance,
> Ira


-- 
Tomoko Uchida

RE: Search in lines, so need to index lines?

Posted by "Gordin, Ira" <ir...@sap.com>.

Hi all,

Would you mind explaining, and/or sending some links that explain, what Lucene is good for and meant to be used for, and what it is not good for and not meant to be used for?

Thanks a lot in advance,
Ira

-----Original Message-----
From: Tomoko Uchida <to...@gmail.com> 
Sent: Wednesday, August 1, 2018 2:35 PM
To: java-user@lucene.apache.org
Subject: Re: Search in lines, so need to index lines?

Ira,

I do not understand your requirements, but essentially Lucene is not meant for
regex searching.
If you are not satisfied with the Java standard library, there are tools for
fast regular expression search, for example:
https://github.com/google/re2j

And yes, the grep command would be the best tool for you.

Tomoko

Re: Search in lines, so need to index lines?

Posted by Tomoko Uchida <to...@gmail.com>.

Ira,

I do not understand your requirements, but essentially Lucene is not meant for
regex searching.
If you are not satisfied with the Java standard library, there are tools for
fast regular expression search, for example:
https://github.com/google/re2j

And yes, the grep command would be the best tool for you.

Tomoko

On Wed, Aug 1, 2018 at 20:01, Gordin, Ira <ir...@sap.com> wrote:

> Hi Tomoko,
>
> I need to search in many files and we use Lucene for this purpose.
>
> Thanks,
> Ira


-- 
Tomoko Uchida

RE: Search in lines, so need to index lines?

Posted by "Gordin, Ira" <ir...@sap.com>.

Hi Tomoko,

I need to search in many files and we use Lucene for this purpose.

Thanks,
Ira

-----Original Message-----
From: Tomoko Uchida <to...@gmail.com> 
Sent: Wednesday, August 1, 2018 1:49 PM
To: java-user@lucene.apache.org
Subject: Re: Search in lines, so need to index lines?

Hi Ira,

> I am trying to implement regex search in file

Why are you using Lucene for regular expression search?
Couldn't you implement this simply with the java.util.regex package?

Regards,
Tomoko

Re: Search in lines, so need to index lines?

Posted by Tomoko Uchida <to...@gmail.com>.

Hi Ira,

> I am trying to implement regex search in file

Why are you using Lucene for regular expression search?
Couldn't you implement this simply with the java.util.regex package?

Regards,
Tomoko
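
A minimal sketch of the java.util.regex approach suggested above: walk a directory, read each file line by line, and report the lines in which the pattern is found. The class name LineGrep and the "file:lineNumber:line" output format are made up for this illustration, and the sketch assumes that all files are UTF-8 text and small enough to read into memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LineGrep {

    // Returns "file:lineNumber:line" for every line in which the regex is found.
    static List<String> grep(Path root, String regex) throws IOException {
        Pattern p = Pattern.compile(regex);
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path file : files) {
            // Assumption: every file is UTF-8 text.
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            for (int i = 0; i < lines.size(); i++) {
                // find() matches anywhere inside the line, as an editor search does.
                if (p.matcher(lines.get(i)).find()) {
                    hits.add(file.getFileName() + ":" + (i + 1) + ":" + lines.get(i));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java LineGrep <directory> <regex>");
            return;
        }
        grep(Paths.get(args[0]), args[1]).forEach(System.out::println);
    }
}
```

This scans every file on every search, so it trades query speed for simplicity. Whether that is acceptable depends on how many files there are.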

On Wed, Aug 1, 2018 at 0:18, Gordin, Ira <ir...@sap.com> wrote:

> Hi Uwe,
>
> I am trying to implement regex search in files, the same as in editors such
> as Notepad++.
>
> Thanks,
> Ira

-- 
Tomoko Uchida

RE: Search in lines, so need to index lines?

Posted by "Gordin, Ira" <ir...@sap.com>.

Hi Uwe,

I am trying to implement regex search in files, the same as in editors such as Notepad++.

Thanks,
Ira

-----Original Message-----
From: Uwe Schindler <uw...@thetaphi.de> 
Sent: Tuesday, July 31, 2018 6:12 PM
To: java-user@lucene.apache.org
Subject: RE: Search in lines, so need to index lines?

Hi,

You need to create your own tokenizer that splits the text into tokens at \n or \r. Instead of using WhitespaceTokenizer, you can use:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch -> ch == '\r' || ch == '\n');

But I would first think about how to implement the whole thing correctly. Using a regular expression as the "default" query is slow and does not look correct. What are you trying to do?

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

RE: Search in lines, so need to index lines?

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

You need to create your own tokenizer that splits the text into tokens at \n or \r. Instead of using WhitespaceTokenizer, you can use:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch -> ch == '\r' || ch == '\n');

But I would first think about how to implement the whole thing correctly. Using a regular expression as the "default" query is slow and does not look correct. What are you trying to do?

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
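
To see which tokens a separator predicate like the one above yields, here is a plain-Java sketch that imitates separator-based splitting. It does not call Lucene at all. The class SeparatorSplit is hypothetical, and the sketch assumes that separator characters are dropped and that empty tokens are never emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class SeparatorSplit {

    // Cuts the text at every code point for which isSeparator is true.
    // Separators are dropped and empty tokens are skipped.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same predicate as in the tokenizer line above.
        List<String> lines = split("have a nice day\r\nsee you soon\n",
                ch -> ch == '\r' || ch == '\n');
        System.out.println(lines); // [have a nice day, see you soon]
    }
}
```

Each line becomes a single token, so a pattern such as /.*nice day.*/ has a whole line to match against, Windows line endings included.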

> -----Original Message-----
> From: Gordin, Ira <ir...@sap.com>
> Sent: Tuesday, July 31, 2018 4:08 PM
> To: java-user@lucene.apache.org
> Subject: Search in lines, so need to index lines?
> 
> Hi all,
> 
> As I understand it, Lucene matches queries against individual tokens. For
> example, if I use WhitespaceTokenizer and search with the regular expression
> /.*nice day.*/, I will never find anything. Am I correct?
> In my project I need to find matches inside lines rather than inside words,
> so I am considering tokenizing by line. How should I implement this?
> I would really appreciate any other ideas or implementations.
> 
> Thanks in advance,
> Ira
> 


