
Posted to java-user@lucene.apache.org by Weiwei Wang <ww...@gmail.com> on 2009/12/11 06:30:43 UTC

Recover special terms from StandardTokenizer

Hi, all,
     I designed an FTP search engine based on Lucene and made a few
modifications to the StandardTokenizer.
My problem is:
  C++ is tokenized as c by StandardTokenizer, and I want to recover the
original term from the TokenStream that StandardTokenizer produces.

What should I do?

-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Another problem:

How should I deal with c++c++ or c++abc when using MappingCharFilter?

If I use a NormalizeCharMap entry ("c++" --> "cplusplus"), the analyzed result
is cpluspluscplusplus or cplusplusabc, which is not what I want.

If I use a NormalizeCharMap entry ("c++" --> "cplusplus$"), the offsets will
not be correct.
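
The behaviour asked for here (separate tokens for c++c++ and c++abc, with offsets that still point into the original text) can be sketched in plain Java without Lucene. This is only an illustration under assumptions: the class OffsetMappingSketch, its analyze method, and the simple letters-or-digits tokenizer are hypothetical stand-ins, not Lucene's MappingCharFilter or StandardTokenizer. The idea is to remember, for every character written to the mapped text, which span of the original text it came from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT Lucene code, only an illustration of offset tracking.
class OffsetMappingSketch {

    /** A term whose start/end offsets refer to the ORIGINAL (unmapped) text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + start + "," + end + ") [" + term + "]";
        }
    }

    /** Replaces every case-insensitive match of 'from' with 'to', then tokenizes. */
    static List<Token> analyze(String original, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("'from' must not be empty");
        }
        StringBuilder mapped = new StringBuilder();
        List<Integer> origStart = new ArrayList<>(); // original start per mapped char
        List<Integer> origEnd = new ArrayList<>();   // original end per mapped char

        int i = 0;
        while (i < original.length()) {
            if (original.regionMatches(true, i, from, 0, from.length())) {
                // Every replacement char points back at the whole matched span.
                for (int k = 0; k < to.length(); k++) {
                    mapped.append(to.charAt(k));
                    origStart.add(i);
                    origEnd.add(i + from.length());
                }
                i += from.length();
            } else {
                // Per-char lowercasing keeps the length, so offsets stay aligned.
                mapped.append(Character.toLowerCase(original.charAt(i)));
                origStart.add(i);
                origEnd.add(i + 1);
                i++;
            }
        }

        // Stand-in tokenizer: a token is a maximal run of letters or digits.
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < mapped.length()) {
            if (!Character.isLetterOrDigit(mapped.charAt(pos))) {
                pos++;
                continue;
            }
            int begin = pos;
            while (pos < mapped.length() && Character.isLetterOrDigit(mapped.charAt(pos))) {
                pos++;
            }
            tokens.add(new Token(mapped.substring(begin, pos),
                    origStart.get(begin), origEnd.get(pos - 1)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : analyze("c++c++", "c++", "cplusplus$")) {
            System.out.println(t);
        }
    }
}
```

With the replacement "cplusplus$", the trailing "$" acts as a separator for the stand-in tokenizer, while the recorded spans keep both tokens pointing at the original c++ positions. Whether the real MappingCharFilter reports the same end offsets for a replacement that is longer than its source is not something this sketch establishes.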



On Sun, Dec 13, 2009 at 9:38 PM, Weiwei Wang <ww...@gmail.com> wrote:

> Thank you very much, Uwe, I found the problem.
>
>
> 2009/12/13 Uwe Schindler <uw...@thetaphi.de>
>
>> MappingCharFilter definitely preserves the offsets from the original
>> reader.
>> You can verify that for your case with Lucene’s test case
>> TestMappingCharFilter in the source distribution at
>> /src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:
>>
>> public void test2to4() throws Exception {
>>   CharStream cs = new MappingCharFilter(normMap, new StringReader("ll"));
>>   TokenStream ts = new WhitespaceTokenizer(cs);
>>   assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new int[]{2});
>> }
>>
>> So everything is correct there. I also tried this test with
>> StandardTokenizer instead of WhitespaceTokenizer; it works and asserts the
>> correct offsets. You should debug through the incrementToken()/CharFilter
>> calls and verify where your offsets change.
>>
>> I cannot help more.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
>> > Sent: Sunday, December 13, 2009 12:51 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Recover special terms from StandardTokenizer
>> >
>> > LowercaseCharFilter is necessary because the MappingCharFilter needs a
>> > NormalizeCharMap. We lowercase the stream so that we only have to provide
>> > lowercase maps in the NormalizeCharMap, e.g. we provide the single map
>> > (c++ --> cplusplus) instead of both (c++ --> cplusplus) and
>> > (C++ --> cplusplus).
>> >
>> > C++ is only one example we want to fix; in the future we may add more such
>> > special terms.
>> >
>> > The code for LowercaseCharFilter is as follows:
>> > package analysis;
>> >
>> > import java.io.IOException;
>> > import java.io.Reader;
>> >
>> > import org.apache.lucene.analysis.BaseCharFilter;
>> > import org.apache.lucene.analysis.CharReader;
>> > import org.apache.lucene.analysis.CharStream;
>> >
>> >
>> > public class LowercaseCharFilter extends BaseCharFilter
>> > {
>> >
>> >     public LowercaseCharFilter(CharStream in)
>> >     {
>> >         super(in);
>> >     }
>> >
>> >     public LowercaseCharFilter(Reader in)
>> >     {
>> >         super(CharReader.get(in));
>> >     }
>> >
>> >     @Override
>> >     public int read() throws IOException
>> >     {
>> >         return Character.toLowerCase(input.read());
>> >     }
>> >
>> >     @Override
>> >     public int read(char[] cbuf, int off, int len) throws IOException
>> >     {
>> >         int ret = input.read(cbuf, off, len);
>> >         if (ret != -1)
>> >         {
>> >             for (int i = off; i < off + ret; i++)
>> >                 cbuf[i] = Character.toLowerCase(cbuf[i]);
>> >         }
>> >         return ret;
>> >     }
>> > }
>> >
>> >
>> > Currently RosaMappingCharFilter inherits from MappingCharFilter and
>> > changes nothing (I was planning to override addOffCorrectMap to fix my
>> > problem, but it didn't work).
>> >
>> >
>> > 2009/12/13 Uwe Schindler <uw...@thetaphi.de>
>> >
>> > > I think your problem is the LowercaseCharFilter, which does not pass
>> > > correctOffset() to the underlying CharFilter. Does it work better
>> > > without your LowercaseCharFilter (which is redundant, because there is
>> > > already a LowerCaseFilter in the Tokenizer chain)?
>> > >
>> > > As you are only looking for "c++", just add a mapping for "C++" as well
>> > > and you are done. Why lowercase everything because of one char?
>> > >
>> > > And what's RosaMappingCharFilter? A pink one? *g*
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: uwe@thetaphi.de
>> > >
>> > > > -----Original Message-----
>> > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
>> > > > Sent: Sunday, December 13, 2009 12:23 PM
>> > > > To: java-user@lucene.apache.org
>> > > > Subject: Re: Recover special terms from StandardTokenizer
>> > > >
>> > > > Thanks, Uwe.
>> > > > Maybe I was not very clear. My situation is like this:
>> > > > Analyzer:
>> > > >     NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> > > >     RECOVERY_MAP.add("c++","cplusplus$");
>> > > >     CharFilter filter = new LowercaseCharFilter(reader);
>> > > >     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
>> > > >     StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
>> > > >     tokenStream.setMaxTokenLength(maxTokenLength);
>> > > >     TokenStream result = new StandardFilter(tokenStream);
>> > > >     result = getStopFilter(result);
>> > > >     result = new SnowballFilter(result, STEMMER);
>> > > > Analyzing c++c++ returns
>> > > > (0,9)  [cplusplus]
>> > > > (10,19)  [cplusplus]
>> > > > The two numbers in the brackets are offsets.
>> > > >
>> > > > So during searching, when I want to highlight the search keyword c++
>> > > > with the same analyzer, an exception is thrown because the string I
>> > > > stored is c++c++, not cpluspluscplusplus (actually, I should not change
>> > > > the original string when storing it, otherwise it will confuse the
>> > > > users).
>> > > >
>> > > > I hope the analyzer can give a result like this:
>> > > > (0,3) [cplusplus]
>> > > > (3,6) [cplusplus]
>> > > > Then the Highlighter will work fine.
>> > > >
>> > > > So how can I achieve this result?
>> > > >
>> > > > 2009/12/13 Uwe Schindler <uw...@thetaphi.de>
>> > > >
>> > > > > MappingCharFilter preserves the offsets in the stream *before*
>> > > > > filtering. So if you store the original string (without c++
>> > > > > replaced) in a stored field, you can highlight using the given
>> > > > > offsets. The highlighter must again use the same analyzer, or you
>> > > > > can use FastVectorHighlighter.
>> > > > >
>> > > > > -----
>> > > > > Uwe Schindler
>> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > http://www.thetaphi.de
>> > > > > eMail: uwe@thetaphi.de
>> > > > >
>> > > > > > -----Original Message-----
>> > > > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
>> > > > > > Sent: Sunday, December 13, 2009 11:43 AM
>> > > > > > To: java-user@lucene.apache.org
>> > > > > > Subject: Re: Recover special terms from StandardTokenizer
>> > > > > >
>> > > > > > Problem solved. Now another problem comes up.
>> > > > > >
>> > > > > >
>> > > > > > I want to use Highlighter in my system, but the token offsets are
>> > > > > > incorrect after the MappingCharFilter is used.
>> > > > > >
>> > > > > > Koji, do you know how to fix the offset problem?
>> > > > > >
>> > > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I used Luke to check the result and found that only c exists as
>> > > > > > > a term; no cplusplus is found in the index.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang
>> > > > > > > <ww...@gmail.com> wrote:
>> > > > > > >
>> > > > > > >> Thanks, Koji, I followed your advice and changed my analyzer as
>> > > > > > >> shown below:
>> > > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> > > > > > >> RECOVERY_MAP.add("c++","cplusplus$");
>> > > > > > >> CharFilter filter = new LowercaseCharFilter(reader);
>> > > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
>> > > > > > >> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
>> > > > > > >> tokenStream.setMaxTokenLength(maxTokenLength);
>> > > > > > >> TokenStream result = new StandardFilter(tokenStream);
>> > > > > > >> result = new LowerCaseFilter(result);
>> > > > > > >> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
>> > > > > > >> result = new SnowballFilter(result, STEMMER);
>> > > > > > >>
>> > > > > > >> I use the same analyzer on the search side. As you know, this
>> > > > > > >> analyzer tokenizes c++ as cplusplus, so it seems I should be able
>> > > > > > >> to search for c++ with the same analyzer, because the query is
>> > > > > > >> also tokenized as cplusplus.
>> > > > > > >>
>> > > > > > >> I tested it on the string c++c++; however, when I search for c++
>> > > > > > >> on the built index, nothing is returned.
>> > > > > > >>
>> > > > > > >> I do not know what's wrong with my code. Waiting for your reply.
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang
>> > > > > > >> <ww...@gmail.com> wrote:
>> > > > > > >>
>> > > > > > >>> Thanks, Koji
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi
>> > > > > > >>> <ko...@r.email.ne.jp> wrote:
>> > > > > > >>>
>> > > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
>> > > > > > >>>>
>> > > > > > >>>> Koji
>> > > > > > >>>>
>> > > > > > >>>> --
>> > > > > > >>>> http://www.rondhuit.com/en/
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>> Anshum wrote:
>> > > > > > >>>>
>> > > > > > >>>>> How about getting the original token stream and then converting
>> > > > > > >>>>> c++ to cplusplus, or any other such transform? Or perhaps you
>> > > > > > >>>>> might look at using/extending (in the non-Java sense) some other
>> > > > > > >>>>> tokenizer!
>> > > > > > >>>>>
>> > > > > > >>>>> --
>> > > > > > >>>>> Anshum Gupta
>> > > > > > >>>>> Naukri Labs!
>> > > > > > >>>>> http://ai-cafe.blogspot.com
>> > > > > > >>>>>
>> > > > > > >>>>> The facts expressed here belong to everybody, the opinions to me.
>> > > > > > >>>>> The distinction is yours to draw............
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>> ---------------------------------------------------------------------
>> > > > > > >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > > > > > >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> > > > > > >>>>

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Thank you very much, Uwe, I found the problem.


RE: Recover special terms from StandardTokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.

MappingCharFilter definitely preserves the offsets from the original reader.
You can verify that for your case with Lucene’s test case
TestMappingCharFilter in the source distribution at
/src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:

public void test2to4() throws Exception {
  CharStream cs = new MappingCharFilter(normMap, new StringReader("ll"));
  TokenStream ts = new WhitespaceTokenizer(cs);
  assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new int[]{2});
}

So everything is correct there. I also tried this test with StandardTokenizer
instead of WhitespaceTokenizer; it works and asserts the correct offsets.
You should debug through the incrementToken()/CharFilter calls and verify
where your offsets change.
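
When debugging where the offsets change, one simple check is whether each reported (start,end) pair is a valid slice of the stored, unmapped text. The following plain-Java sketch does only that; the class OffsetCheckSketch and its methods are hypothetical helpers, not part of Lucene, and the offset pairs are the ones quoted earlier in this thread for the stored string c++c++.

```java
// Hypothetical helper: NOT a Lucene API, only a bounds check on token offsets.
class OffsetCheckSketch {

    /** True if [start, end) is a valid slice of the stored (original) text. */
    static boolean fitsStoredText(String stored, int start, int end) {
        return 0 <= start && start <= end && end <= stored.length();
    }

    /** Describes what a highlighter would find at the given offsets. */
    static String describe(String stored, int start, int end) {
        if (!fitsStoredText(stored, start, end)) {
            return "(" + start + "," + end + ") is outside the stored text of length "
                    + stored.length();
        }
        return "(" + start + "," + end + ") covers \"" + stored.substring(start, end) + "\"";
    }

    public static void main(String[] args) {
        String stored = "c++c++";
        // Offsets reported in the thread, followed by the offsets that were hoped for.
        int[][] offsets = {{0, 9}, {10, 19}, {0, 3}, {3, 6}};
        for (int[] o : offsets) {
            System.out.println(describe(stored, o[0], o[1]));
        }
    }
}
```

Offsets such as (0,9) and (10,19) fail this check for a six-character stored value, which is consistent with the highlighter exception described in the thread, while (0,3) and (3,6) each cover one c++.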

I cannot help more.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 12:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
> 
> LowercaseCharFilter is necessary because the MappingCharFilter needs a
> NormalizeCharMap. We lowercase the stream so that we only have to provide
> lowercase maps in the NormalizeCharMap, e.g. we provide the single map
> (c++ --> cplusplus) instead of both (c++ --> cplusplus) and
> (C++ --> cplusplus).
>
> C++ is only one example we want to fix; in the future we may add more such
> special terms.
>
> The code for LowercaseCharFilter is as follows:
> package analysis;
> 
> import java.io.IOException;
> import java.io.Reader;
> 
> import org.apache.lucene.analysis.BaseCharFilter;
> import org.apache.lucene.analysis.CharReader;
> import org.apache.lucene.analysis.CharStream;
> 
> 
> public class LowercaseCharFilter extends BaseCharFilter
> {
>
>     public LowercaseCharFilter(CharStream in)
>     {
>         super(in);
>     }
>
>     public LowercaseCharFilter(Reader in)
>     {
>         super(CharReader.get(in));
>     }
>
>     @Override
>     public int read() throws IOException
>     {
>         return Character.toLowerCase(input.read());
>     }
>
>     @Override
>     public int read(char[] cbuf, int off, int len) throws IOException
>     {
>         int ret = input.read(cbuf, off, len);
>         if (ret != -1)
>         {
>             for (int i = off; i < off + ret; i++)
>                 cbuf[i] = Character.toLowerCase(cbuf[i]);
>         }
>         return ret;
>     }
> }
> 
> 
> Currently RosaMappingCharFilter is inherited from MappingCharFilter and
> nothing is changed(i was planning to override addOffCorrectMap to fix my
> problem, but it didn't work)
> 
> 
> 2009/12/13 Uwe Schindler <uw...@thetaphi.de>
> 
> > I think your problem is theLowercaseCharFilter that does not pass
> > correctOffset() to the underying CharFilter. Does it work better without
> > your LowerCaseCharFilter (which is duplicate because there is already a
> > LowerCaseFilter in the Tokenizer chain).
> >
> > As you are only looking for "c++", just also add a mapping for "C++" and
> > you
> > are done, why lowercasing all because of one char?
> >
> > And what's RosaMappingCharFilter? A pink one? *g*
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > Sent: Sunday, December 13, 2009 12:23 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > thanks, Uwe.
> > > Maybe i was not very clear. My situation is like this:
> > > Analyzer:
> > >    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >    RECOVERY_MAP.add("c++","cplusplus$");
> > >     CharFilter filter = new LowercaseCharFilter(reader);
> > >     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
> > >     StandardTokenizer tokenStream = new
> > > StandardTokenizer(Version.LUCENE_30,
> > > filter);
> > >     tokenStream.setMaxTokenLength(maxTokenLength);
> > >     TokenStream result = new StandardFilter(tokenStream);
> > >     result = getStopFilter(result);
> > >     result = new SnowballFilter(result, STEMMER);
> > > Analyze c++c++, return
> > > (0,9)  [cplusplus]
> > > (10,19)  [cplusplus]
> > > the two numbers in th**e brackets are offsets.
> > >
> > > So in the searching process when i want to hight the search keyword
> c++
> > > with
> > > the same analyzer, exception will be thrown because the string i
> stored
> > > are
> > > c++c++ not cpluspluscplusplus(actually, i should not change the
> original
> > > string when storing them, otherwise it will confuse the users).
> > >
> > > I hope the analyzer can give result like this
> > > (0,3) [cplusplus]
> > > (3,6) [cplusplus]
> > > then the Hilighter will works fine.
> > >
> > > So how can I achieve this result?
> > >
> > > 2009/12/13 Uwe Schindler <uw...@thetaphi.de>
> > >
> > > > MappingCharFilter preserves the offsets in the stream *before*
> > > filtering.
> > > > So
> > > > if you store the original string (without c++ replaced) in a stored
> > > field
> > > > you can highlight using the given offstes. The highlighter must use
> > > again
> > > > the same analyzer or use FastVectorHighlighter.
> > > >
> > > > -----
> > > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > > eMail: uwe@thetaphi.de
> > > >
> > > > > -----Original Message-----
> > > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > > > Sent: Sunday, December 13, 2009 11:43 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: Recover special terms from StandardTokenizer
> > > > >
> > > > > Problem solved. Now another problem comes.
> > > > >
> > > > >
> > > > > As I want to use Highlighter in my system, the token offset is
> > > incorrect
> > > > > after the MappingCharFilter is used.
> > > > >
> > > > > Koji, do you known how to fix the offset problem?
> > > > >
> > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang
> <ww...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I use Luke to check the result and find only c exists as a term,
> no
> > > > > > cplusplus found in the index
> > > > > >
> > > > > >
> > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang
> > > > > <ww...@gmail.com>wrote:
> > > > > >
> > > > > >> Thanks, Koji, I followed your advice and change my analyzer as
> > > shown
> > > > > >> below:
> > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > > > >> RECOVERY_MAP.add("c++","cplusplus$");
> > > > > >> CharFilter filter = new LowercaseCharFilter(reader);
> > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > > > >> StandardTokenizer tokenStream = new
> > > > > StandardTokenizer(Version.LUCENE_30,
> > > > > >> filter);
> > > > > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > > > > >> TokenStream result = new StandardFilter(tokenStream);
> > > > > >> result = new LowerCaseFilter(result);
> > > > > >> result = new StopFilter(enableStopPositionIncrements, result,
> > > > stopSet);
> > > > > >> result = new SnowballFilter(result, STEMMER);
> > > > > >>
> > > > > >> I use the same analyzer in the search side. As you know, this
> > > analyzer
> > > > > can
> > > > > >> token c++ as cplusplus, for this reason, it seems I can search
> c++
> > > > with
> > > > > >> the same analyzer because it is also tokenized as cplusplus.
> > > > > >>
> > > > > >> I tested it on as string c++c++, however, when i search c++ on
> the
> > > > > built
> > > > > >> index, nothing is returned.
> > > > > >>
> > > > > >>  I do not know what's wrong with my code. Waiting for your
> replay
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang
> > > > > <ww...@gmail.com>wrote:
> > > > > >>
> > > > > >>> Thanks, Koji
> > > > > >>>
> > > > > >>>
> > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi
> > > > > <ko...@r.email.ne.jp>wrote:
> > > > > >>>
> > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > > > >>>>
> > > > > >>>> Koji
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> http://www.rondhuit.com/en/
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Anshum wrote:
> > > > > >>>>
> > > > > >>>>> How about getting the original token stream and then
> converting
> > > c++
> > > > > to
> > > > > >>>>> cplusplus or anyother such transform. Or perhaps you might
> look
> > > at
> > > > > >>>>> using/extending(in the non java sense) some other tokenized!
> > > > > >>>>>
> > > > > >>>>> --
> > > > > >>>>> Anshum Gupta
> > > > > >>>>> Naukri Labs!
> > > > > >>>>> http://ai-cafe.blogspot.com
> > > > > >>>>>
> > > > > >>>>> The facts expressed here belong to everybody, the opinions
> to
> > > me.
> > > > > The
> > > > > >>>>> distinction is yours to draw............
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <
> > > > ww.wang.cs@gmail.com>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>> Hi, all,
> > > > > >>>>>>    I designed a ftp search engine based on Lucene. I did a
> few
> > > > > >>>>>> modifications to the StandardTokenizer.
> > > > > >>>>>> My problem is:
> > > > > >>>>>>  C++ is tokenized as c from StandardTokenizer and I want to
> > > > recover
> > > > > it
> > > > > >>>>>> from
> > > > > >>>>>> the TokenStream from StandardTokenizer
> > > > > >>>>>>
> > > > > >>>>>> What should I do?
> > > > > >>>>>>
> > > > > >>>>>> --
> > > > > >>>>>> Weiwei Wang
> > > > > >>>>>> Alex Wang
> > > > > >>>>>> 王巍巍
> > > > > >>>>>> Room 403, Mengmin Wei Building
> > > > > >>>>>> Computer Science Department
> > > > > >>>>>> Gulou Campus of Nanjing University
> > > > > >>>>>> Nanjing, P.R.China, 210093
> > > > > >>>>>>
> > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > --------------------------------------------------------------------
> -
> > > > > >>>> To unsubscribe, e-mail: java-user-
> unsubscribe@lucene.apache.org
> > > > > >>>> For additional commands, e-mail:
> > java-user-help@lucene.apache.org
> > > > > >>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Weiwei Wang
> > > > > >>> Alex Wang
> > > > > >>> 王巍巍
> > > > > >>> Room 403, Mengmin Wei Building
> > > > > >>> Computer Science Department
> > > > > >>> Gulou Campus of Nanjing University
> > > > > >>> Nanjing, P.R.China, 210093
> > > > > >>>
> > > > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Weiwei Wang
> > > > > >> Alex Wang
> > > > > >> 王巍巍
> > > > > >> Room 403, Mengmin Wei Building
> > > > > >> Computer Science Department
> > > > > >> Gulou Campus of Nanjing University
> > > > > >> Nanjing, P.R.China, 210093
> > > > > >>
> > > > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Weiwei Wang
> > > > > > Alex Wang
> > > > > > 王巍巍
> > > > > > Room 403, Mengmin Wei Building
> > > > > > Computer Science Department
> > > > > > Gulou Campus of Nanjing University
> > > > > > Nanjing, P.R.China, 210093
> > > > > >
> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Weiwei Wang
> > > > > Alex Wang
> > > > > 王巍巍
> > > > > Room 403, Mengmin Wei Building
> > > > > Computer Science Department
> > > > > Gulou Campus of Nanjing University
> > > > > Nanjing, P.R.China, 210093
> > > > >
> > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >
> > > >
> > > > --------------------------------------------------------------------
> -
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > > --
> > > Weiwei Wang
> > > Alex Wang
> > > 王巍巍
> > > Room 403, Mengmin Wei Building
> > > Computer Science Department
> > > Gulou Campus of Nanjing University
> > > Nanjing, P.R.China, 210093
> > >
> > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> 
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
> 
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

LowercaseCharFilter is necessary because MappingCharFilter requires a
NormalizeCharMap. We lowercase the stream first so that we only need to
provide lowercase mappings in the NormalizeCharMap, e.g. we provide the single
mapping (c++-->cplusplus) instead of both (c++-->cplusplus) and
(C++-->cplusplus).

C++ is only one example we want to fix; in the future we may add more such
special terms.

The code for LowercaseCharFilter is as follows:

package analysis;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.BaseCharFilter;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.CharStream;

/** Lowercases every character of the wrapped stream; the length never changes. */
public class LowercaseCharFilter extends BaseCharFilter
{
    public LowercaseCharFilter(CharStream in)
    {
        super(in);
    }

    public LowercaseCharFilter(Reader in)
    {
        super(CharReader.get(in));
    }

    @Override
    public int read() throws IOException
    {
        int c = input.read();
        // Pass the end-of-stream marker (-1) through unchanged.
        return c == -1 ? -1 : Character.toLowerCase(c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException
    {
        int ret = input.read(cbuf, off, len);
        if (ret != -1)
        {
            // Lowercase only the characters that were actually read.
            for (int i = off; i < off + ret; i++)
                cbuf[i] = Character.toLowerCase(cbuf[i]);
        }
        return ret;
    }
}


Currently RosaMappingCharFilter simply extends MappingCharFilter and changes
nothing (I was planning to override addOffCorrectMap to fix my problem, but
it didn't work).





RE: Recover special terms from StandardTokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.

I think your problem is the LowercaseCharFilter, which does not pass
correctOffset() to the underlying CharFilter. Does it work better without
your LowercaseCharFilter (which is redundant, because there is already a
LowerCaseFilter in the Tokenizer chain)?

As you are only looking for "c++", just add a mapping for "C++" as well and
you are done. Why lowercase everything because of one char?
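
If more special terms are added later, the case variants could be generated
instead of typed by hand. The following is a minimal sketch under that
assumption; CaseVariantMappings and expand are hypothetical names, and a plain
java.util.Map stands in for NormalizeCharMap so the example needs only the
JDK. Each resulting entry would then be registered with
RECOVERY_MAP.add(key, value) as in the analyzer code in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Generates every upper/lower-case spelling of a term (hypothetical helper). */
final class CaseVariantMappings {

    /** Returns a map from each case variant of 'term' to the same replacement. */
    static Map<String, String> expand(String term, String replacement) {
        Map<String, String> out = new LinkedHashMap<>();
        expand(term.toLowerCase(Locale.ROOT).toCharArray(), 0, replacement, out);
        return out;
    }

    private static void expand(char[] chars, int i, String replacement,
                               Map<String, String> out) {
        if (i == chars.length) {
            out.put(new String(chars), replacement);
            return;
        }
        char lower = chars[i];
        // First keep the lowercase character at position i ...
        expand(chars, i + 1, replacement, out);
        // ... then, if the character has a distinct uppercase form, try that too.
        char upper = Character.toUpperCase(lower);
        if (upper != lower) {
            chars[i] = upper;
            expand(chars, i + 1, replacement, out);
            chars[i] = lower; // restore for the caller
        }
    }
}
```

For "c++" this yields just the two keys c++ and C++. The number of variants
doubles with every letter, so this only makes sense for short terms; for
longer ones, lowercasing the stream first may still be the simpler choice.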

And what's RosaMappingCharFilter? A pink one? *g*





Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Thanks, Uwe.
Maybe I was not very clear. My situation is like this:
Analyzer:
    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
    RECOVERY_MAP.add("c++","cplusplus$");
    CharFilter filter = new LowercaseCharFilter(reader);
    filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
    StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = getStopFilter(result);
    result = new SnowballFilter(result, STEMMER);
Analyzing c++c++ returns
(0,9)  [cplusplus]
(10,19)  [cplusplus]
The two numbers in the brackets are offsets.

So during searching, when I want to highlight the search keyword c++ with
the same analyzer, an exception is thrown because the string I stored is
c++c++, not cpluspluscplusplus (actually, I should not change the original
string when storing it, otherwise it would confuse the users).

I hope the analyzer can give a result like this:
(0,3) [cplusplus]
(3,6) [cplusplus]
Then the Highlighter will work fine.
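
The failure can be reproduced without Lucene. The sketch below is only an
illustration with hypothetical names (HighlightOffsetDemo, highlight): it
marks a span of the stored text the way a highlighter would, using plain
String.substring. Offsets that refer to the filtered text run past the end of
the six-character stored string, while offsets that refer to the original
text work.

```java
/** Shows why offsets must refer to the stored text (hypothetical demo class). */
final class HighlightOffsetDemo {

    /** Wraps the span [start, end) of the stored text in <b> tags. */
    static String highlight(String stored, int start, int end) {
        // substring throws StringIndexOutOfBoundsException for offsets
        // beyond the length of the stored text.
        return stored.substring(0, start)
                + "<b>" + stored.substring(start, end) + "</b>"
                + stored.substring(end);
    }
}
```

Calling highlight("c++c++", 3, 6) marks the second c++, whereas
highlight("c++c++", 10, 19) fails with an index-out-of-bounds error, which
matches the symptom described above.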

So how can I achieve this result?

2009/12/13 Uwe Schindler <uw...@thetaphi.de>

> MappingCharFilter preserves the offsets in the stream *before* filtering. So
> if you store the original string (without c++ replaced) in a stored field
> you can highlight using the given offsets. The highlighter must again use
> the same analyzer, or you can use FastVectorHighlighter.

RE: Recover special terms from StandardTokenizer

Posted by Uwe Schindler <uw...@thetaphi.de>.

MappingCharFilter preserves the offsets in the stream *before* filtering. So
if you store the original string (without c++ replaced) in a stored field,
you can highlight using the given offsets. The highlighter must again use
the same analyzer, or you can use FastVectorHighlighter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
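
One way to picture what preserving offsets means is a filter that records,
for every position in the filtered text, the corresponding position in the
original text. The sketch below is a simplified stand-in written in plain
Java, not Lucene's MappingCharFilter implementation; the class
OffsetCorrectionSketch, its methods, and the letter-or-digit tokenizer are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetCorrectionSketch {

    /** Filtered text plus, for each boundary in it, the offset in the original text. */
    static final class Filtered {
        final String text;
        final int[] originalOffset; // length is text.length() + 1

        Filtered(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }

        /** Translates an offset in the filtered text back to the original text. */
        int correctOffset(int filteredOffset) {
            return originalOffset[filteredOffset];
        }
    }

    /** Replaces every occurrence of 'from' with 'to' and records the offset table. */
    static Filtered map(String input, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("'from' must not be empty");
        }
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(from, i)) {
                for (int k = 0; k < to.length(); k++) {
                    // Clamp: offsets inside a long replacement never pass the end of the match.
                    offsets.add(i + Math.min(k, from.length()));
                    out.append(to.charAt(k));
                }
                i += from.length();
            } else {
                offsets.add(i);
                out.append(input.charAt(i));
                i++;
            }
        }
        offsets.add(input.length()); // boundary after the last character
        int[] table = new int[offsets.size()];
        for (int k = 0; k < table.length; k++) {
            table[k] = offsets.get(k);
        }
        return new Filtered(out.toString(), table);
    }

    /** Returns {start, end} pairs, in filtered-text offsets, for each run of letters or digits. */
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && Character.isLetterOrDigit(text.charAt(i));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                tokens.add(new int[] {start, i});
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Filtered f = map("c++c++", "c++", "cplusplus$");
        for (int[] t : tokenize(f.text)) {
            System.out.println(f.text.substring(t[0], t[1])
                    + " raw=(" + t[0] + "," + t[1] + ")"
                    + " corrected=(" + f.correctOffset(t[0]) + "," + f.correctOffset(t[1]) + ")");
        }
        // prints:
        // cplusplus raw=(0,9) corrected=(0,3)
        // cplusplus raw=(10,19) corrected=(3,6)
    }
}
```

In this sketch the raw offsets point into the filtered text, while the
corrected offsets point into the original c++c++ string, which is the text
that would be kept in a stored field.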

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> Sent: Sunday, December 13, 2009 11:43 AM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
> 
> Problem solved, but now another problem has come up.
>
> I want to use Highlighter in my system, but the token offsets are
> incorrect once MappingCharFilter is used.
>
> Koji, do you know how to fix the offset problem?

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Problem solved, but now another problem has come up.

I want to use Highlighter in my system, but the token offsets are incorrect
once MappingCharFilter is used.

Koji, do you know how to fix the offset problem?

On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww...@gmail.com> wrote:

> I used Luke to inspect the index and found that only c exists as a term;
> there is no cplusplus term in the index.

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

I used Luke to inspect the index and found that only c exists as a term;
there is no cplusplus term in the index.

On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww...@gmail.com> wrote:

> Thanks, Koji. I followed your advice and changed my analyzer as shown below:
> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> RECOVERY_MAP.add("c++", "cplusplus$");
> CharFilter filter = new LowercaseCharFilter(reader);
> filter = new MappingCharFilter(RECOVERY_MAP, filter);
> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> tokenStream.setMaxTokenLength(maxTokenLength);
> TokenStream result = new StandardFilter(tokenStream);
> result = new LowerCaseFilter(result);
> result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> result = new SnowballFilter(result, STEMMER);
>
> I use the same analyzer on the search side. Since this analyzer tokenizes
> c++ as cplusplus, a query for c++ should also be tokenized as cplusplus,
> so I expected the search to match.
>
> I tested it on the string c++c++. However, when I search for c++ in the
> resulting index, nothing is returned.
>
> I do not know what is wrong with my code. Waiting for your reply.

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Thanks, Koji. I followed your advice and changed my analyzer as shown below:
// Replace c++ before tokenizing; StandardTokenizer would otherwise reduce it to c.
NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
RECOVERY_MAP.add("c++", "cplusplus$");
CharFilter filter = new LowercaseCharFilter(reader);
filter = new MappingCharFilter(RECOVERY_MAP, filter);
// Tokenize the filtered character stream.
StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
tokenStream.setMaxTokenLength(maxTokenLength);
// Token filters: standard normalization, lowercasing, stop words, stemming.
TokenStream result = new StandardFilter(tokenStream);
result = new LowerCaseFilter(result);
result = new StopFilter(enableStopPositionIncrements, result, stopSet);
result = new SnowballFilter(result, STEMMER);

I use the same analyzer on the search side. Since this analyzer tokenizes
c++ as cplusplus, a query for c++ should also be tokenized as cplusplus, so
I expected the search to match.

I tested it on the string c++c++. However, when I search for c++ in the
resulting index, nothing is returned.

I do not know what is wrong with my code. Waiting for your reply.




On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww...@gmail.com> wrote:

> Thanks, Koji

Re: Recover special terms from StandardTokenizer

Posted by Weiwei Wang <ww...@gmail.com>.

Thanks, Koji

On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <ko...@r.email.ne.jp> wrote:

> MappingCharFilter can be used to convert c++ to cplusplus.
>
> Koji

Re: Recover special terms from StandardTokenizer

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

MappingCharFilter can be used to convert c++ to cplusplus.

Koji

-- 
http://www.rondhuit.com/en/
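
To illustrate the suggestion, here is a minimal plain-Java sketch of why
the mapping has to happen before tokenization. It does not use Lucene:
tokenize is a hypothetical stand-in that keeps only runs of letters and
digits (roughly the behavior that turns c++ into c), and mapSpecialTerms is
a hypothetical stand-in for a character-level mapping step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class MapBeforeTokenizeSketch {

    /** Lowercases the text and splits it into runs of letters or digits. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                // A non-letter, non-digit character such as '+' ends the term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    /** Rewrites special terms into plain words that survive tokenization. */
    static String mapSpecialTerms(String text) {
        return text.toLowerCase(Locale.ROOT).replace("c++", "cplusplus");
    }

    public static void main(String[] args) {
        String input = "Learn C++ fast";
        System.out.println(tokenize(input));                  // [learn, c, fast]
        System.out.println(tokenize(mapSpecialTerms(input))); // [learn, cplusplus, fast]
    }
}
```

Because the same mapping is applied to queries, a search for c++ would be
turned into the term cplusplus as well.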


Anshum wrote:
> How about getting the original token stream and then converting c++ to
> cplusplus, or applying some other such transform? Or perhaps you might look
> at using or extending (in the non-Java sense) some other tokenizer.

Re: Recover special terms from StandardTokenizer

Posted by Anshum <an...@gmail.com>.

How about getting the original token stream and then converting c++ to
cplusplus, or applying some other such transform? Or perhaps you might look
at using or extending (in the non-Java sense) some other tokenizer.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............

On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww...@gmail.com> wrote:

> Hi, all,
>     I designed a ftp search engine based on Lucene. I did a few
> modifications to the StandardTokenizer.
> My problem is:
>  C++ is tokenized as c from StandardTokenizer and I want to recover it from
> the TokenStream from StandardTokenizer
>
> What should I do?