
Posted to java-user@lucene.apache.org by Diego Fernandez <di...@redhat.com> on 2014/02/14 19:42:32 UTC

Extending StandardTokenizer Jflex to not split on '/'

Hi guys, this is my first time posting on the Lucene list, so hello everyone.

I really like the way the StandardTokenizer works; however, I'd like it not to split tokens on / (forward slash).  I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand the rules, but I'm either misunderstanding or missing something.  If I understand correctly, the symbols in MidLetter keep it from splitting a token as long as there are alphabetic characters on both sides.  I tried adding the forward slash to the MidLetter and MidLetterSupp rules (in different combinations), but it still seems to split on it.

Does anyone have any tips or ideas?

Thanks

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Extending StandardTokenizer Jflex to not split on '/'

Posted by Diego Fernandez <di...@redhat.com>.

Thanks again for the help.  Upon further investigation I found out we weren't using our custom version of the analyzer, which explains why it wasn't doing what I thought it should.  When I have time to get back to it I'll reconfigure it to use our tokenizer.

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



Re: Extending StandardTokenizer Jflex to not split on '/'

Posted by Steve Rowe <sa...@gmail.com>.

Sorry, Diego, the generated scanner diff doesn't tell me anything.

Since I was able to successfully make changes to the open source and get
the desired behavior, I'm guessing you're: a) not using the same (versions
of) tools as me; b) not using the same (version of the) source as me; or c)
not testing what you think you're testing.

So:

What version of Lucene?  What version of JFlex?  Are you using the Lucene
build system, or some other mechanism to generate the scanner?  (If so,
what is it?)

What other changes have you made?  (If you send me your grammar, I'll test
it locally.)

Can you give an example of an input that should be split but isn't?

Are you sure you're testing the scanner generated from the modified grammar?
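
One way to check the last question is to ask the JVM where a class was actually loaded from. The following is a minimal sketch using only the JDK; `ClassOrigin` and `describe` are hypothetical names, not part of Lucene, and the class to inspect is whatever tokenizer implementation your analyzer really instantiates.

```java
import java.security.CodeSource;

final class ClassOrigin {
    /** Reports the jar or directory a class was loaded from, if the JVM exposes it. */
    static String describe(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        String where = (source == null || source.getLocation() == null)
                ? "(no code source reported, e.g. a JDK class)"
                : source.getLocation().toString();
        return cls.getName() + " <- " + where;
    }

    public static void main(String[] args) {
        // Replace with the tokenizer class under test to see which jar supplies it.
        System.out.println(describe(ClassOrigin.class));
        System.out.println(describe(String.class));
    }
}
```

If the reported location is a stock Lucene jar rather than your own build output, the modified grammar is probably not the one being exercised.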



On Mon, Feb 17, 2014 at 5:04 PM, Diego Fernandez <di...@redhat.com> wrote:

> Hey Steve, thanks for the quick reply.  I didn't have a chance to test
> again until today.  In our Lucene build, we had already made some
> customization to the JFlex file and it re-generates the java file whenever
> we build our project.  Unfortunately, it is still not working for me.  I
> diffed the generated java file before and after the JFlex change and here's
> the result:
>
>
> *** 71,77 ****
>     private static final String ZZ_CMAP_PACKED =
>       "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> !     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
>       "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
>       "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
>       "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
> --- 71,77 ----
>      */
>     private static final String ZZ_CMAP_PACKED =
>       "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
> !     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
>       "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
>       "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
>       "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+

Re: Extending StandardTokenizer Jflex to not split on '/'

Posted by Diego Fernandez <di...@redhat.com>.

Hey Steve, thanks for the quick reply.  I didn't have a chance to test again until today.  In our Lucene build, we had already made some customizations to the JFlex file, and it regenerates the Java file whenever we build our project.  Unfortunately, it is still not working for me.  I diffed the generated Java file before and after the JFlex change and here's the result:


*** 71,77 ****
    private static final String ZZ_CMAP_PACKED = 
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
--- 71,77 ----
     */
    private static final String ZZ_CMAP_PACKED = 
      "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150"+
!     "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152"+
      "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141"+
      "\1\0\1\164\4\154\1\166\2\132\1\157\3\132\1\171\1\160\1\170"+
      "\2\132\1\167\1\132\1\165\3\132\1\153\2\132\57\0\1\132\2\0"+
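
The changed line can be read more easily once the packed string is expanded. The sketch below assumes the run-length layout of (count, value) character pairs that JFlex-generated scanners use when unpacking ZZ_CMAP_PACKED; `CMapSketch`, `STARRED`, `DASHED` and `unpack` are hypothetical names, and only the first three lines of each side of the diff are decoded.

```java
final class CMapSketch {
    // First three lines of ZZ_CMAP_PACKED from the "***" side of the diff.
    static final String STARRED =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\136\1\152" +
        "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141";

    // The same three lines from the "---" side of the diff.
    static final String DASHED =
        "\11\0\1\156\1\161\1\0\1\151\1\161\22\0\1\156\2\0\1\150" +
        "\3\0\1\140\1\162\1\163\2\0\1\137\1\136\1\140\1\0\1\152" +
        "\11\155\1\136\1\137\5\0\6\154\21\132\1\153\2\132\4\0\1\141";

    /** Expands (count, value) pairs into one character-class number per code point. */
    static int[] unpack(String packed) {
        int total = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            total += packed.charAt(i);
        }
        int[] map = new int[total];
        int j = 0;
        for (int i = 0; i < packed.length(); i += 2) {
            int count = packed.charAt(i);
            int value = packed.charAt(i + 1);
            for (int k = 0; k < count; k++) {
                map[j++] = value;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        int[] starred = unpack(STARRED);
        int[] dashed = unpack(DASHED);
        System.out.println("'/' class: " + starred['/'] + " vs " + dashed['/']); // 94 vs 0
        System.out.println("':' class: " + starred[':'] + " vs " + dashed[':']); // 94 vs 94
    }
}
```

Under that assumption, the only entry that differs is code point 47, the forward slash: on the `***` side it shares class 94 with ':', and on the `---` side it falls into class 0. That suggests the grammar edit did reach one of the two generated files.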


Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



Re: Extending StandardTokenizer Jflex to not split on '/'

Posted by Steve Rowe <sa...@gmail.com>.

Welcome Diego,

I think you’re right about MidLetter - adding a char to it should disable splitting on that char, as long as there is a letter on both sides of it.  (If you’d like that behavior to be extended to numeric digits, you should use MidNumLet instead.)

I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex (compressed whitespace diff below):

    -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
    +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running ‘ant jflex’ under lucene/analysis/common/, and the following text was split as indicated (I tested by adding the method below to TestStandardAnalyzer.java):

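  public void testMidLetterSlash() throws Exception {
    // Slashes between letters stay inside the token; leading and trailing slashes are dropped.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
                                  new String[]{ "one/two/three", "four" });
    // A slash next to a digit still splits, because "/" was added to MidLetter, not MidNumLet.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
                                 new String[] { "1", "two", "3" });
  }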

So it works for me - are you regenerating the scanner (‘ant jflex’)?

FYI, I found a bug when I was testing the above: “http://example.com” is left intact when “/” is added to MidLetter, but it shouldn’t be; although ‘:’ and ‘/’ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in “http://example.com” being split into “http” and “example.com”.  Further testing indicates that this is a problem for MidLetter, MidNumLet and MidNum.  I’ve filed an issue: <https://issues.apache.org/jira/browse/LUCENE-5447>.
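
The letter-on-both-sides requirement can be illustrated with a small stand-alone sketch. It is a rough approximation of the intended behavior, not the JFlex grammar or the full UAX #29 rules; `MidLetterSketch` and `tokenize` are hypothetical names, and treating '.' as a mid-letter character in the last call is a simplification made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MidLetterSketch {
    /** Splits text into tokens, keeping a mid-letter char only when letters surround it. */
    static List<String> tokenize(String text, Set<Character> midLetter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letterBefore = i > 0 && Character.isLetter(text.charAt(i - 1));
            boolean letterAfter = i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (midLetter.contains(c) && letterBefore && letterAfter) {
                current.append(c); // joins the letters on either side into one token
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // any other char ends the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> slash = Collections.singleton('/');
        System.out.println(tokenize("/one/two/three/ four", slash)); // [one/two/three, four]
        System.out.println(tokenize("1/two/3", slash));              // [1, two, 3]
        Set<Character> urlChars = new HashSet<>(Arrays.asList(':', '/', '.'));
        System.out.println(tokenize("http://example.com", urlChars)); // [http, example.com]
    }
}
```

The last call shows the split described above as the expected result, not what the generated scanner produced at the time.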

Steve
