Posted to dev@nutch.apache.org by ivrokv <iv...@gmail.com> on 2008/05/15 21:51:09 UTC

Bug in NutchAnalysis.java

Hello,

I am not sure whether this belongs on the Nutch-User list instead, but I felt
that the Nutch developers should be aware of this issue.

A patch was submitted in JIRA:
http://issues.apache.org/jira/browse/NUTCH-479
I used this patch to fix NutchAnalysis.jj, and OR queries now work in most
cases.

However, I have noticed a small bug in the patched code.
The parser throws an exception in the following cases:

1) OR is followed by whitespace and no other search terms, e.g. "patent OR ":
http://mysite.com/search.jsp?lang=en&query=patent+OR+

2) Nothing at all follows the OR operator:
http://mysite.com/search.jsp?lang=en&query=patent+OR

3) OR is the only search term:
http://mysite.com/search.jsp?lang=en&query=OR

4) OR+, OR_, OR-; basically OR with any trailing character:
http://mysite.com/search.jsp?lang=en&query=OR-
http://mysite.com/search.jsp?lang=en&query=OR+

I get this error message from Tomcat:

java.io.IOException: Parse exception:
org.apache.nutch.analysis.ParseException: Encountered "<EOF>" at line 1, column 12.
Was expecting one of:
    <WORD> ...
    <ACRONYM> ...
    <SIGRAM> ...
    "\"" ...
    <WHITE> ...
    ":" ...
    "/" ...
    "." ...
    "@" ...
    "\'" ...
    "+" ...
    "-" ...
    
	org.apache.nutch.analysis.NutchAnalysis.parseQuery(NutchAnalysis.java:62)
	org.apache.nutch.searcher.Query.parse(Query.java:468)
	org.apache.jsp.search_jsp._jspService(search_jsp.java:172)
	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
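
One possible stopgap, sketched here under assumptions, is to clean the query string before it reaches the parser. The class QuerySanitizer and the method stripDanglingOr are hypothetical names, not part of Nutch or of the NUTCH-479 patch. The sketch only covers an OR with no operand at the start or end of the query (cases 1 to 3, plus "OR+", since a + in a URL query string normally decodes to a space). It does not handle "OR-" or "OR_", and it ignores quoted phrases.

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySanitizer {

    /**
     * Removes OR tokens that have no operand at the start or end of the query.
     * Hypothetical helper for illustration; it knows nothing about the Nutch grammar.
     */
    public static String stripDanglingOr(String query) {
        if (query == null) {
            return "";
        }
        // Split on whitespace and discard empty fragments.
        List<String> tokens = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        // Drop OR tokens at the front: they have no left operand.
        while (!tokens.isEmpty() && tokens.get(0).equals("OR")) {
            tokens.remove(0);
        }
        // Drop OR tokens at the end: they have no right operand.
        while (!tokens.isEmpty() && tokens.get(tokens.size() - 1).equals("OR")) {
            tokens.remove(tokens.size() - 1);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String[] samples = {"patent OR ", "patent OR", "OR", "patent OR pending"};
        for (String q : samples) {
            System.out.println("[" + q + "] -> [" + stripDanglingOr(q) + "]");
        }
    }
}
```

If the cleaned string comes back empty, the caller could treat it as an empty query and skip the parser call altogether.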

If there are any suggestions or pointers on how to fix this, that would be
great.


Thanks.

-- 
View this message in context: http://www.nabble.com/Bug-in-NutchAnalysis.java-tp17261004p17261004.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Bug in NutchAnalysis.java

Posted by ivrokv <iv...@gmail.com>.
Hi Otis,

I guess I was a little hasty in reporting this as a bug. I was hoping that
there would be some built-in tolerance for such queries.
 
However, I do notice another issue with OR, and perhaps you can confirm it.
Consider this query:   Term1 OR Term2
Assume that the index does not contain Term1 but does contain Term2.
One would expect search hits that contain Term2. Instead, I have noticed
that this query returns no results.

However, if I run the query Term2 OR Term1, search hits containing Term2 are
returned.

In other words, OR does not appear to be commutative with the patch applied.
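
To make the expectation concrete, here is a toy sketch of the semantics I would expect from OR. It is not Nutch code: the class OrSemantics, its methods, and the document IDs are all made up for illustration. It treats OR as the union of the hits for each term, so the order of the terms cannot matter.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class OrSemantics {

    // Toy inverted index: term -> IDs of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Records that the document with the given ID contains the given terms. */
    public void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** OR as a set union: a term that is missing from the index contributes no hits. */
    public Set<Integer> or(String left, String right) {
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(postings.getOrDefault(left, Collections.emptySet()));
        hits.addAll(postings.getOrDefault(right, Collections.emptySet()));
        return hits;
    }

    public static void main(String[] args) {
        OrSemantics index = new OrSemantics();
        index.add(1, "Term2");            // Term1 is deliberately absent
        index.add(2, "Term2", "other");

        // Both orderings should print the same set of document IDs.
        System.out.println(index.or("Term1", "Term2"));
        System.out.println(index.or("Term2", "Term1"));
    }
}
```

A check of this shape, run against the patched parser with a term that is known to be missing from the index, would show whether the asymmetry is reproducible.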

Has anyone else experienced this issue? I personally hope that it is specific
to my setup and that OR queries are otherwise functioning correctly!




ogjunk-nutch wrote:
> 
> Hi,
> 
> But shouldn't this be the expected behaviour?  In each of the examples the
> query really is invalid: it uses incorrect syntax.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
>> From: ivrokv <iv...@gmail.com>
>> To: nutch-dev@lucene.apache.org
>> Sent: Thursday, May 15, 2008 3:51:09 PM
>> Subject: Bug in NutchAnalysis.java

-- 
View this message in context: http://www.nabble.com/Bug-in-NutchAnalysis.java-tp17261004p17266549.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.