Posted to java-user@lucene.apache.org by Uwe Schindler <uw...@thetaphi.de> on 2011/10/01 00:46:13 UTC
RE: Writing a TokenConcatenateFilter - junk characters appearing on output.
Hi,
The junk is appended here: buffer.append(termAtt.buffer());
I assume you are on Lucene 3.1+, so use buffer.append(termAtt); termAtt
implements CharSequence, so it can be appended to any StringBuilder.
The code you are using appends the whole char array, which may contain
characters after termAtt.length().
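
To illustrate the difference without pulling in Lucene, here is a minimal
sketch using only the JDK. The class and method names are made up for this
example, and the 14-char array is an assumption chosen to match the buffer
lengths in the log output quoted below; the plain char[] merely stands in for
what termAtt.buffer() returns.

```java
public class TermBufferDemo {
    // Stand-in for termAtt.buffer(): a backing array larger than the term.
    // Slots past the term keep their default value '\u0000'.
    public static char[] backingArray(String term, int capacity) {
        char[] buf = new char[capacity];
        term.getChars(0, term.length(), buf, 0);
        return buf;
    }

    // Mirrors buffer.append(termAtt.buffer()): every slot is appended,
    // including the unused '\u0000' slots after the term.
    public static String appendWholeArray(char[] buf) {
        return new StringBuilder().append(buf).toString();
    }

    // Appends only the first `length` chars, i.e. the part that
    // termAtt.length() would mark as valid.
    public static String appendValidPart(char[] buf, int length) {
        return new StringBuilder().append(buf, 0, length).toString();
    }

    public static void main(String[] args) {
        char[] buf = backingArray("booh", 14);
        System.out.println(appendWholeArray(buf).length()); // prints 14
        System.out.println(appendValidPart(buf, 4));        // prints booh
    }
}
```

Appending termAtt itself (as a CharSequence) has the same effect as the second
method here: only the valid characters end up in the StringBuilder.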
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
> -----Original Message-----
> From: Jithin [mailto:jithin1987@gmail.com]
> Sent: Friday, September 30, 2011 11:12 PM
> To: java-user@lucene.apache.org
> Subject: Writing a TokenConcatenateFilter - junk characters appearing on
> output.
>
> Hi,
> I am trying to write a TokenFilter which just concatenates all the tokens in
> the input TokenStream.
> The issue I am facing is that my filter is outputting certain junk characters
> in addition to the concatenated string. I believe this is caused by
> StringBuilder.
>
> This is my incrementToken() function
>
> public boolean incrementToken() throws IOException {
>     //if (!input.incrementToken()) {
>     //    return false;
>     //}
>     if (finished) {
>         logger.error("Finished");
>         return false;
>     }
>     logger.error("Starting");
>     StringBuilder buffer = new StringBuilder();
>     int length = 0;
>     while (input.incrementToken()) {
>         logger.error(Integer.toString(buffer.length()));
>         logger.error(buffer.toString());
>         if (0 == length) {
>             buffer.append(termAtt.buffer());
>             length += termAtt.length();
>         } else {
>             buffer.append(" ").append(termAtt.buffer());
>             length += termAtt.length() + 1;
>         }
>
>     }
>
>     logger.error("####### Final");
>     logger.error(Integer.toString(buffer.length()));
>     logger.error(Integer.toString(length));
>     logger.error(buffer.toString());
>
>     termAtt.setEmpty().append(buffer);
>     offsetAtt.setOffset(0, length);
>     finished = true;
>     return true;
> }
>
>
> *Output for input tokens booh and good is:*
>
> SEVERE: Starting
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 0
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE:
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 14
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: ####### Final
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 29
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: 9
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: booh good
> Sep 30, 2011 9:02:13 PM org.ctown.solr.analysis.CTConcatFilter incrementToken
> SEVERE: Finished
>
>
> And this is how it appears on the Solr analysis page
> (http://localhost:8983/solr/admin/analysis.jsp):
> org.ctown.solr.analysis.CTConcatFilterFactory {luceneMatchVersion=LUCENE_34}
> position 1
> *term text booh#0;#0;#0;#0;#0;#0;#0;#0;#0;#0; good#0;#0;#0;#0;#0;#0;#0;#0;#0;#0;*
> startOffset 0
> endOffset 9
>
> Kindly help me understand what I am doing wrong and how to fix this.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3383684.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Writing a TokenConcatenateFilter - junk characters appearing on
output.
Posted by Jithin <ji...@gmail.com>.
Figured out the issue. The finished variable needs to be reset to false
once the current stream is over.
if (finished) {
    logger.debug("Finished");
    finished = false;
    return false;
}
Looks like the same filter instance is being reused. Makes sense.
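
A small JDK-only sketch of why a reused instance needs its state cleared. All
names here are hypothetical and this is a simulation, not Lucene code. In
Lucene itself, TokenStream has a reset() method that is called before a stream
is consumed, so overriding it (and calling super.reset()) would be an
alternative place to clear the flag; that variant is an assumption and was not
tried in this thread.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusableConcatDemo {
    private Iterator<String> input;
    private boolean finished = false;

    // Hypothetical counterpart of reset(): called each time the same
    // instance is pointed at new input. Without clearing `finished`,
    // every input after the first would produce no output.
    public void reset(List<String> tokens) {
        input = tokens.iterator();
        finished = false;
    }

    // Returns the space-joined tokens once per input, then null
    // (standing in for incrementToken() returning false).
    public String next() {
        if (finished) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (input.hasNext()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(input.next());
        }
        finished = true;
        return sb.toString();
    }

    public static void main(String[] args) {
        ReusableConcatDemo filter = new ReusableConcatDemo();
        filter.reset(Arrays.asList("booh", "good"));
        System.out.println(filter.next()); // prints booh good
        System.out.println(filter.next()); // prints null
        filter.reset(Arrays.asList("second", "doc"));
        System.out.println(filter.next()); // prints second doc
    }
}
```

If the flag were never cleared, the third call above would return null, which
matches the symptom of only the first document getting indexed.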
--
Thanks
Jithin Emmanuel
--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384528.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Writing a TokenConcatenateFilter - junk characters appearing on
output.
Posted by Jithin <ji...@gmail.com>.
I meant to say: my analyzer chain now looks like this.
<analyzer type="index">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[-_]" replacement=" " />
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.StopWordFilterFactory" ignoreCase="true"
          words="words.txt" />
  <filter class="org.ctown.solr.analysis.CTConcatFilterFactory" />
</analyzer>
<analyzer type="query">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[-_]" replacement=" " />
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[^\p{L}\p{Nd}\p{Mn}\p{Mc}\s+]" replacement="" />
  <tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
But only my first document is getting indexed. Is there any logging I can
enable to see what is going wrong?
--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384419.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Writing a TokenConcatenateFilter - junk characters appearing on
output.
Posted by Jithin <ji...@gmail.com>.
I have added this custom filter at the end of my query. Now only my first
document is getting indexed.
--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384379.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Writing a TokenConcatenateFilter - junk characters appearing on
output.
Posted by Jithin <ji...@gmail.com>.
Thanks a million Uwe. That fixes it.
--
Thanks
Jithin Emmanuel
--
View this message in context: http://lucene.472066.n3.nabble.com/Writing-a-TokenConcatenateFilter-junk-characters-appearing-on-output-tp3383684p3384323.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.