Posted to dev@nutch.apache.org by Bupo Jung <bu...@gmail.com> on 2011/04/13 13:13:00 UTC

chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

I use Nutch for Chinese search. When I enter a query string such as
"可爱的小女孩" (a lovely little girl), the Chinese analyzer turns it into three
query tokens, including 小女孩 and 女孩. When these tokens are used to build the
summary of a result page, a StringIndexOutOfBoundsException is thrown. Here is
the error log:

2010-12-15 12:18:43,505 ERROR searcher.NutchBean - Exception occured while executing search: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:297)
    at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:350)
    at org.apache.nutch.searcher.NutchBean.main(NutchBean.java:410)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:292)
    ... 2 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1937)
    at org.apache.nutch.summary.basic.BasicSummarizer.getSummary(BasicSummarizer.java:188)
    at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:263)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:63)
    at org.apache.nutch.searcher.FetchedSegments$SummaryTask.call(FetchedSegments.java:53)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

This happens because the two query tokens 小女孩 and 女孩 overlap. The failing
code is in:

nutch/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java

line 188:

if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // When two tokens overlap, offset > t.startOffset(),
  // so the next line is where the exception occurs.
  excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
  excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  offset = t.endOffset();
  endToken = Math.min(j + sumContext, tokens.length);
}
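
The failure can be reproduced outside Nutch. The following is only a minimal sketch under stated assumptions: the Tok class, the fragmentBefore helper, and the sample offsets are hypothetical stand-ins, not Nutch or Lucene code, and the English gloss is used in place of the Chinese text. It shows that String.substring throws once the running offset has moved past the start of the next token.

```java
public class OverlapRepro {
    // Hypothetical stand-in for an analyzer token: [start, end) offsets into the text.
    static final class Tok {
        final int start;
        final int end;
        Tok(int start, int end) { this.start = start; this.end = end; }
    }

    // Mirrors the unguarded call: the text between the running offset and
    // the start of the token. Throws when offset > t.start.
    static String fragmentBefore(String text, int offset, Tok t) {
        return text.substring(offset, t.start);
    }

    public static void main(String[] args) {
        String text = "a lovely little girl";
        Tok first = new Tok(9, 20);   // "little girl"
        Tok second = new Tok(16, 20); // "girl", which lies inside the first token

        int offset = 0;
        System.out.println("fragment: '" + fragmentBefore(text, offset, first) + "'");
        offset = first.end; // 20, already past second.start (16)

        try {
            fragmentBefore(text, offset, second);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring(" + offset + ", " + second.start + ") failed");
        }
    }
}
```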


// Changed code to fix the error:
if (highlight.contains(t.term())) {
  excerpt.addToken(t.term());
  // bupo changed the code to fix the Chinese token overlap error, 2010.12.15
  if (offset < t.startOffset()) {
    excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
    excerpt.add(new Highlight(text.substring(t.startOffset(), t.endOffset())));
  } else {
    // Tokens overlap: highlight only the part not covered yet.
    excerpt.add(new Highlight(text.substring(offset, t.endOffset())));
  } // bupo
  offset = t.endOffset();                             // unchanged from the original code
  endToken = Math.min(j + sumContext, tokens.length); // unchanged from the original code
}
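
For anyone adapting this fix, one case may still need attention: a token that ends before the current offset (for example, a short token nested inside a longer one) would make substring(offset, t.endOffset()) fail in the same way. The sketch below is one possible way to guard against that, not the Nutch implementation. The Tok class and the pieces method are hypothetical, tokens are assumed to be sorted by start offset, and highlighted pieces are simply wrapped in brackets.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSafeExcerpt {
    // Hypothetical token: [start, end) character offsets into the text.
    static final class Tok {
        final int start;
        final int end;
        Tok(int start, int end) { this.start = start; this.end = end; }
    }

    // Splits text into plain fragments and highlighted pieces (wrapped in [ ]).
    // Assumes tokens are sorted by start offset.
    static List<String> pieces(String text, List<Tok> tokens) {
        List<String> out = new ArrayList<String>();
        int offset = 0;
        for (Tok t : tokens) {
            if (t.end <= offset) {
                continue; // token is already fully covered by an earlier highlight
            }
            int start = Math.max(offset, t.start); // clamp an overlapping start
            if (start > offset) {
                out.add(text.substring(offset, start)); // plain text before the token
            }
            out.add("[" + text.substring(start, t.end) + "]"); // highlighted part
            offset = t.end;
        }
        if (offset < text.length()) {
            out.add(text.substring(offset)); // trailing plain text
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<Tok>();
        tokens.add(new Tok(9, 20));  // "little girl"
        tokens.add(new Tok(16, 20)); // "girl", nested in the previous token
        System.out.println(pieces("a lovely little girl", tokens));
    }
}
```

Unlike the else branch in the fix above, this sketch skips a fully covered token rather than adding an empty highlight. Which behaviour is preferable depends on how the excerpt is scored and rendered.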

--

Yizhong Zhuang
Beijing University of Posts and Telecommunications
Email:bupo.jung@gmail.com

Re: chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

Posted by Bupo Jung <bu...@gmail.com>.

Thank you for your response.
I have to update my code ^_^


On 2011/4/13 at 7:19, Julien Nioche <li...@gmail.com> wrote:

> Hi,
>
> Nutch has moved away from handling the indexing and search itself and now
> delegates that to SOLR as of versions 1.3 and 2.0 (both forthcoming). The
> issue you described won't be fixed as this part of the code has been
> removed. Users are encouraged to start using 1.3 and use SOLR for the
> indexing and search.
>
> Your comments should be useful to anyone having the same issue with Nutch
> <= 1.2, so thanks for sharing this.
>
> Julien

Re: chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

Posted by Julien Nioche <li...@gmail.com>.

Hi,

Nutch has moved away from handling the indexing and search itself and now
delegates that to SOLR as of versions 1.3 and 2.0 (both forthcoming). The
issue you described won't be fixed as this part of the code has been
removed. Users are encouraged to start using 1.3 and use SOLR for the
indexing and search.

Your comments should be useful to anyone having the same issue with Nutch <=
1.2, so thanks for sharing this.

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com