Posted to java-user@lucene.apache.org by Uwe Schindler <uw...@thetaphi.de> on 2011/10/01 00:46:13 UTC

RE: Writing a TokenConcatenateFilter - junk characters appearing on output.

Hi,

The junk is appended here: buffer.append(termAtt.buffer());

I assume you are on Lucene 3.1+, so use buffer.append(termAtt); termAtt
implements CharSequence, so it can be appended to any StringBuilder.
The code you are using appends the whole char array, which may contain
characters after termAtt.length().
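
To illustrate the difference outside Lucene, here is a minimal sketch that
uses a plain char[] as a stand-in for the term buffer. The class name, the
helper names, and the capacity of 14 are assumptions for illustration only;
14 was picked because it reproduces the lengths 14 and 29 seen in the log
output quoted below.

```java
public class BufferAppendDemo {
    // Stand-in for a term buffer: a char[] of the given capacity whose first
    // term.length() chars are valid; the remaining slots stay '\0'.
    // Assumes capacity >= term.length().
    static char[] paddedBuffer(String term, int capacity) {
        char[] buf = new char[capacity];
        term.getChars(0, term.length(), buf, 0);
        return buf;
    }

    // Appends each whole array, padding included (the pattern from the question).
    public static String concatWholeArrays(String[] terms, int capacity) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(paddedBuffer(term, capacity));
        }
        return sb.toString();
    }

    // Appends only the valid prefix of each array.
    public static String concatValidPrefix(String[] terms, int capacity) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(paddedBuffer(term, capacity), 0, term.length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] terms = {"booh", "good"};
        System.out.println(concatWholeArrays(terms, 14).length()); // 29
        System.out.println(concatValidPrefix(terms, 14));          // booh good
    }
}
```

In the filter itself the equivalent would be buffer.append(termAtt), as
suggested above, or buffer.append(termAtt.buffer(), 0, termAtt.length()).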

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Jithin [mailto:jithin1987@gmail.com]
> Sent: Friday, September 30, 2011 11:12 PM
> To: java-user@lucene.apache.org
> Subject: Writing a TokenConcatenateFilter - junk characters appearing on
> output.
> 
> Hi,
> I am trying to write a TokenFilter which just concatenates all the tokens
> in the input TokenStream.
> The issue I am facing is that my filter outputs junk characters in
> addition to the concatenated string. I believe this is caused by
> StringBuilder.
> 
> This is my incrementToken() function
> 
> public boolean incrementToken() throws IOException {
>         //if (!input.incrementToken()) {
>             //return false;
>         //}
>         if (finished) {
>             logger.error("Finished");
>             return false;
>         }
>         logger.error("Starting");
>         StringBuilder buffer = new StringBuilder();
>         int length = 0;
>         while (input.incrementToken()) {
>             logger.error(Integer.toString(buffer.length()));
>             logger.error(buffer.toString());
>             if (0 == length) {
>                 buffer.append(termAtt.buffer());
>                 length += termAtt.length();
>             } else {
>                 buffer.append(" ").append(termAtt.buffer());
>                 length += termAtt.length() + 1;
>             }
> 
>         }
> 
>         logger.error("####### Final");
>         logger.error(Integer.toString(buffer.length()));
>         logger.error(Integer.toString(length));
>         logger.error(buffer.toString());
> 
>         termAtt.setEmpty().append(buffer);
>         offsetAtt.setOffset(0, length);
>         finished = true;
>         return true;
>     }
> 
> 
> Output for input tokens "booh" and "good":
> 
> SEVERE: Starting
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 0
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE:
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 14
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: ####### Final
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 29
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 9
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh good
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: Finished
> 
> 
> And this is how it appears on the Solr analysis page
> (http://localhost:8983/solr/admin/analysis.jsp):
> org.ctown.solr.analysis.CTConcatFilterFactory {luceneMatchVersion=LUCENE_34}
> position 	1
> term text 	booh#0;#0;#0;#0;#0;#0;#0;#0;#0;#0; good#0;#0;#0;#0;#0;#0;#0;#0;#0;#0;
> startOffset 	0
> endOffset 	9
> 
> Kindly help me understand what I am doing wrong and how to fix this.
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3383684.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



Re: Writing a TokenConcatenateFilter - junk characters appearing on output.

Posted by Jithin <ji...@gmail.com>.

Figured out the issue. The finished variable needs to be reset to false
once the current stream is over.

    if (finished) {
        logger.debug("Finished");
        finished = false;
        return false;
    }

Looks like the same filter instance is being reused, which makes sense.
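
A minimal sketch of why reuse needs the flag cleared, using a plain Java
stand-in rather than the real filter. The class ReusableConcatDemo and its
methods are hypothetical and are not Lucene API; they only model a
one-shot concatenator that is handed a new input for each document.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusableConcatDemo {
    private Iterator<String> input;
    private boolean finished;

    // Called before each new document: swaps in the new input and clears
    // the per-stream state so the instance can be reused.
    public void reset(List<String> tokens) {
        input = tokens.iterator();
        finished = false;
    }

    // Returns the concatenated tokens once per document, then null.
    public String next() {
        if (finished) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (input.hasNext()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(input.next());
        }
        finished = true;
        return sb.toString();
    }

    public static void main(String[] args) {
        ReusableConcatDemo filter = new ReusableConcatDemo();
        filter.reset(Arrays.asList("booh", "good"));
        System.out.println(filter.next()); // booh good
        System.out.println(filter.next()); // null
        filter.reset(Arrays.asList("second", "doc"));
        System.out.println(filter.next()); // second doc
    }
}
```

Without the finished = false in reset(), the second document would yield
nothing, which matches the symptom of only the first document being
indexed. In a Lucene TokenFilter, overriding reset() (and calling
super.reset()) is the usual place for clearing such state, as an
alternative to resetting the flag inside incrementToken().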





-- 
Thanks
Jithin Emmanuel


--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384528.html

Re: Writing a TokenConcatenateFilter - junk characters appearing on output.

Posted by Jithin <ji...@gmail.com>.

I meant to say: my analyzer chain now looks like this.

    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.StopWordFilterFactory" ignoreCase="true" words="words.txt" />
        <filter class="org.ctown.solr.analysis.CTConcatFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[-_]" replacement=" " />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
        <tokenizer class="solr.KeywordTokenizerFactory" />
    </analyzer>

But only my first document is getting indexed. Is there any logging I can
enable to see what is going wrong?
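
For reference, a rough sketch of what the two charFilter patterns in this
chain do to raw text, assuming they behave like java.util.regex
replacements applied in order. The class name and the sample input are
made up for illustration; this is plain String.replaceAll, not the Solr
charFilter itself.

```java
public class CharFilterPatternDemo {
    // Applies the two replacements in the same order as the chain:
    // first "-" and "_" become spaces, then every character that is not a
    // letter, decimal digit, combining mark, whitespace, or '+' is dropped.
    public static String clean(String text) {
        String spaced = text.replaceAll("[-_]", " ");
        return spaced.replaceAll("[^\\p{L}\\p{Nd}\\p{Mn}\\p{Mc}\\s+]", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello_World-2011! a+b")); // Hello World 2011 a+b
    }
}
```

Note that inside the character class the + in \s+ is a literal plus sign
rather than a quantifier, so '+' characters survive the second filter;
that may or may not be what was intended.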

--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384419.html

Re: Writing a TokenConcatenateFilter - junk characters appearing on output.

Posted by Jithin <ji...@gmail.com>.

I have added this custom filter at the end of my query. Now only my first
document is getting indexed.

--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384379.html

Re: Writing a TokenConcatenateFilter - junk characters appearing on output.

Posted by Jithin <ji...@gmail.com>.

Thanks a million Uwe. That fixes it.



-- 
Thanks
Jithin Emmanuel


--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384323.html