Posted to java-user@lucene.apache.org by "Gordin, Ira" <ir...@sap.com> on 2018/06/26 07:57:42 UTC

How to search code files for words that contain a given substring?

Hi all,
I have started working on a project that currently searches code files for words that contain a given substring.
Currently it uses WhitespaceTokenizer and a regex query that wraps the searched substring with '.*'.
For example, if one searches for 'a', the query will be '/.*a.*/'. This way, in the text 'Mama loves banana', it will find the tokens 'Mama' and 'banana'.
I now need to get the start and end positions of the matched tokens within the line, as well as the line number.
With TokenStream I can get the start and end positions of 'Mama' and 'banana' in the full text, but I need the positions of 'a'.
I see two options.
Option 1: perform an additional search within each returned token.
Option 2: use NGramTokenizer or NGramTokenFilter (I am not sure which), in the hope that I will then get the positions of 'a' from the TokenStream.
An additional question: how can I get the line numbers and the positions inside the line?
Many thanks in advance for your help,
Ira
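
Option 1 above could look roughly like the following sketch. It is plain Java with no Lucene dependency, and it assumes the token text and its start offset in the full text have already been obtained elsewhere (for example from a TokenStream). The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (not part of Lucene): locates a substring inside one matched token.
final class SubstringOffsets {

    /**
     * Returns {start, end} pairs (end exclusive) for every occurrence of needle in token.
     * Each pair is shifted by tokenStart, the token's start offset in the full text.
     */
    static List<int[]> find(String token, int tokenStart, String needle) {
        List<int[]> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // an empty needle would match everywhere; treat it as no match
        }
        int from = 0;
        while (true) {
            int idx = token.indexOf(needle, from);
            if (idx < 0) {
                break;
            }
            hits.add(new int[] {tokenStart + idx, tokenStart + idx + needle.length()});
            from = idx + 1; // advance by one so overlapping occurrences are reported too
        }
        return hits;
    }

    public static void main(String[] args) {
        // 'banana' starts at offset 11 in 'Mama loves banana'
        for (int[] hit : find("banana", 11, "a")) {
            System.out.println(hit[0] + "-" + hit[1]);
        }
    }
}
```

The extra scan is linear in the token length, so for whitespace-delimited tokens in code files it is likely to be cheap compared with the regex query itself.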

Re: How to search code files for words that contain a given substring?

Posted by Michael Sokolov <ms...@gmail.com>.

If you need to get back line numbers and the regex does not span lines, you
could consider indexing each line as a separate document.
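
A minimal sketch of the preparation step for this suggestion: splitting a file's text into per-line records, each of which could then be indexed as its own document. This is plain Java, not Lucene code; the `LineSplitter` and `Line` names are hypothetical, and how the line number is stored in the index (for instance as a stored field) is left open here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: turns file content into per-line records that carry their line number.
final class LineSplitter {

    static final class Line {
        final int number;      // 1-based line number
        final int startOffset; // offset of the line's first character in the full text
        final String text;     // line content without the line terminator

        Line(int number, int startOffset, String text) {
            this.number = number;
            this.startOffset = startOffset;
            this.text = text;
        }
    }

    static List<Line> split(String content) {
        List<Line> lines = new ArrayList<>();
        int start = 0;
        int number = 1;
        while (start < content.length()) {
            int nl = content.indexOf('\n', start);
            int end = (nl < 0) ? content.length() : nl;
            String text = content.substring(start, end);
            if (text.endsWith("\r")) {
                text = text.substring(0, text.length() - 1); // tolerate Windows line endings
            }
            lines.add(new Line(number++, start, text));
            if (nl < 0) {
                break; // last line had no terminator
            }
            start = nl + 1;
        }
        return lines;
    }
}
```

With one document per line, an offset reported for a match is already relative to the start of that line, so no separate offset-to-line conversion is needed.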

Re: How to search code files for words that contain a given substring?

Posted by Mikhail Khludnev <mk...@apache.org>.

I mean that you need offsets rather than positions, but I don't have anything
definite to suggest.

-- 
Sincerely yours
Mikhail Khludnev
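
Once character offsets into the full text are available, the line number and the position inside the line can be derived from them. The following is a sketch in plain Java under that assumption; it is not a Lucene API, and the `LineIndex` class is hypothetical. It records where each line starts and binary-searches that table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: maps a 0-based character offset to a 1-based line and column.
final class LineIndex {

    private final int[] lineStarts; // offset of the first character of each line, ascending

    LineIndex(String content) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < content.length(); i++) {
            if (content.charAt(i) == '\n') {
                starts.add(i + 1); // the next line begins right after the newline
            }
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Returns {line, column}, both 1-based, for a non-negative character offset. */
    int[] locate(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one: the line that contains the offset
        }
        return new int[] {idx + 1, offset - lineStarts[idx] + 1};
    }
}
```

For example, in the text 'Mama loves', a newline, then 'banana', the first 'a' of 'banana' sits at offset 12, which `locate` maps to line 2, column 2.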

RE: How to search code files for words that contain a given substring?

Posted by "Gordin, Ira" <ir...@sap.com>.

Hello Mikhail,

I see in the link you sent that PositionIncrementAttribute determines the position of a token relative to the previous token in a TokenStream, and that it is used in phrase searching.
I am not doing phrase searching.
Would you mind explaining how it can help me?

Thanks,
Ira

Re: How to search code files for words that contain a given substring?

Posted by Mikhail Khludnev <mk...@apache.org>.

Hello, Ira.
Note the difference between offset
https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/OffsetAttribute.html
and
position
https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
in Lucene terminology.
Please make sure you don't rebuild existing functionality
https://lucene.apache.org/core/7_3_1/highlighter/org/apache/lucene/search/highlight/package-summary.html#package.description

-- 
Sincerely yours
Mikhail Khludnev
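
To make the offset/position distinction concrete, here is a small plain-Java illustration. It is a hand-written whitespace splitter, not Lucene's WhitespaceTokenizer, and the names are hypothetical. The position is the token's ordinal in the stream, while the start and end offsets are character indices into the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows a token's ordinal position next to its character offsets.
final class OffsetVsPosition {

    static final class Token {
        final String term;
        final int position;    // ordinal of the token in the stream, starting at 0
        final int startOffset; // index of the token's first character in the text
        final int endOffset;   // index just past the token's last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++; // skip the whitespace between tokens
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++; // consume the token's characters
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Mama loves banana")) {
            System.out.println(t.term + " position=" + t.position
                    + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

For 'Mama loves banana', the token 'banana' is the third token (position 2 in this sketch) but spans characters 11 to 17, and it is the character span that is needed to report where a match sits in a line.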