Posted to dev@nutch.apache.org by ivrokv <iv...@gmail.com> on 2008/05/15 21:51:09 UTC
Bug in NutchAnalysis.java
Hello,
I am not sure whether this belongs on the Nutch-User list instead, but I felt
that the Nutch developers should be aware of this issue.
A patch was submitted in jira -
http://issues.apache.org/jira/browse/NUTCH-479.
I used this patch to fix NutchAnalysis.jj, and OR queries now work in most cases.
However, I have noticed a small bug in the code.
The code breaks when:
1) OR is followed by whitespace and no further search terms, e.g. "patent OR ":
http://mysite.com/search.jsp?lang=en&query=patent+OR+
2) nothing at all follows the OR operator:
http://mysite.com/search.jsp?lang=en&query=patent+OR
3) OR is the only search term:
http://mysite.com/search.jsp?lang=en&query=OR
4) OR is followed only by trailing characters such as +, _ or - (OR+, OR_, OR-):
http://mysite.com/search.jsp?lang=en&query=OR-
http://mysite.com/search.jsp?lang=en&query=OR+
I get this error message from tomcat:
java.io.IOException: Parse exception:
org.apache.nutch.analysis.ParseException: Encountered "<EOF>" at line 1,
column 12.
Was expecting one of:
<WORD> ...
<ACRONYM> ...
<SIGRAM> ...
"\"" ...
<WHITE> ...
":" ...
"/" ...
"." ...
"@" ...
"\'" ...
"+" ...
"-" ...
org.apache.nutch.analysis.NutchAnalysis.parseQuery(NutchAnalysis.java:62)
org.apache.nutch.searcher.Query.parse(Query.java:468)
org.apache.jsp.search_jsp._jspService(search_jsp.java:172)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
If there are any suggestions or pointers on how to fix this, that would be
great.
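One possible workaround, sketched here only as an assumption and not as part of
the NUTCH-479 patch: strip a dangling OR from the raw query string before it is
handed to the parser. The class QuerySanitizer and the method stripDanglingOr
are hypothetical names, not Nutch APIs, and the set of trailing characters
(+, -, _) is taken from the cases listed above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): cleans a query before parsing. */
public final class QuerySanitizer {

    // A whole-word "OR" at the end of the query, optionally followed by
    // whitespace or the characters '+', '-', '_'.
    private static final Pattern DANGLING_OR =
        Pattern.compile("(^|\\s)OR[\\s+\\-_]*$");

    private QuerySanitizer() {}

    /** Removes trailing OR operators that have no right-hand operand. */
    public static String stripDanglingOr(String raw) {
        if (raw == null) {
            return "";
        }
        String query = raw.trim();
        String previous;
        // Repeat so that inputs like "patent OR OR" are fully cleaned.
        do {
            previous = query;
            query = DANGLING_OR.matcher(query).replaceFirst("").trim();
        } while (!query.equals(previous));
        return query;
    }
}
```

The caller would pass the cleaned string to the parser instead of the raw
request parameter. A query that consists only of OR becomes empty, so the
caller still has to decide how to handle an empty query.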
Thanks.
--
View this message in context: http://www.nabble.com/Bug-in-NutchAnalysis.java-tp17261004p17261004.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: Bug in NutchAnalysis.java
Posted by ivrokv <iv...@gmail.com>.
Hi Otis,
I guess I was a little hasty in reporting this as a bug. I was hoping that
there would be some built-in tolerance for such queries.
However, I do notice another issue with OR, and perhaps you can confirm it.
Consider this query: Term1 OR Term2
Assume that the index does not contain Term1 but contains Term2.
One would expect that there will be search hits which contain Term2. I have
noticed that there are no results for this query.
However, if I run the query Term2 OR Term1, search hits containing Term2 are
returned.
Basically what I am saying is that OR is not commutative in the patch.
Has anyone else experienced this issue? I personally hope that this is specific
to my setup and that ORs are otherwise functioning correctly!
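To make the check reproducible, here is a minimal sketch of a commutativity
test. It does not use any Nutch API: the search function is passed in as a
plain java.util.function.Function from query string to a set of hit
identifiers, and the names OrCommutativityCheck, sameHitsEitherOrder and
toySearcher are all hypothetical. The toy searcher only shows the expected
behaviour (the union of the hits for each operand); a real check would wrap
the actual searcher instead.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical harness (not part of Nutch) for checking that OR commutes. */
public final class OrCommutativityCheck {

    private OrCommutativityCheck() {}

    /** Returns true if "a OR b" and "b OR a" yield the same set of hits. */
    public static boolean sameHitsEitherOrder(
            Function<String, Set<String>> search, String a, String b) {
        Set<String> forward = search.apply(a + " OR " + b);
        Set<String> reverse = search.apply(b + " OR " + a);
        return forward.equals(reverse);
    }

    /** Toy reference searcher: unions the postings of every OR operand. */
    public static Function<String, Set<String>> toySearcher(
            Map<String, Set<String>> postings) {
        return query -> {
            Set<String> hits = new HashSet<>();
            for (String term : query.split("\\s+OR\\s+")) {
                hits.addAll(postings.getOrDefault(
                    term.trim(), Collections.<String>emptySet()));
            }
            return hits;
        };
    }
}
```

With an index that contains Term2 but not Term1, the toy searcher returns the
Term2 hits for both orderings, which is the behaviour I would expect from the
patched parser as well.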
ogjunk-nutch wrote:
>
> Hi,
>
> But shouldn't this be the expected behaviour? In each of the examples the
> query really is bad/invalid, uses incorrect syntax.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch