Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2005/09/16 10:53:14 UTC

Should type: and date: queries work with search.jsp?

Hi,

Should type: and date: queries work with the search.jsp program?
I'm using Nutch 0.7, and crawled the intranet at work. String searches work 
fine, but I want to test out the new features added by John Xing in the 
changelog 
(http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/CHANGES.txt?rev=1.48) for 
0.7.

When I search on something like:

news type:pdf

or

news type:application/pdf

I don't get any results, where I would expect to because all our news docs 
are in pdf format.

Thanks for any help.

Ed.



Re: Should type: and date: queries work with search.jsp?

Posted by Edward Quick <ed...@hotmail.com>.
I had the following set in nutch-site.xml during the crawl:

<property>
  <name>plugin.includes</name>
  
  <value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
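
As a quick sanity check on the setting above, here is a minimal Python sketch (not Nutch code) that tests which plugin directory names the plugin.includes expression would accept. It assumes the expression has to match the whole plugin name, and that this simple pattern behaves the same under Python's re module as under Java's regex engine. The helper is_included and the name some-other-plugin are hypothetical.

```python
import re

# The plugin.includes value from the nutch-site.xml above, joined on one line.
PLUGIN_INCLUDES = (
    r"protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|"
    r"parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|"
    r"query-(basic|site|url|more)"
)


def is_included(plugin_name: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin name matches the expression.

    Assumption: the expression must match the entire name, not a substring.
    """
    return re.fullmatch(pattern, plugin_name) is not None


if __name__ == "__main__":
    # "some-other-plugin" is a made-up name used only to show a non-match.
    for name in ("index-more", "query-more", "parse-pdf", "some-other-plugin"):
        print(name, is_included(name))
```

Under those assumptions, both index-more and query-more are accepted by the expression, so the plugin list itself does not look like the cause.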

Is there anywhere else I need to enable index and query-more? Also, (sorry 
if this is a dumb question) how do I reindex the segments?

Thanks,

Ed.

>Edward Quick wrote:
>>Hi,
>>
>>Should type: and date: queries work with the search.jsp program?
>>I'm using Nutch 0.7, and crawled the intranet at work. String searches 
>>work fine, but I want to test out the new features added by John Xing in 
>>the changelog 
>>(http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/CHANGES.txt?rev=1.48) 
>>for 0.7.
>>
>>When I search on something like:
>>
>>news type:pdf
>>
>>or
>>
>>news type:application/pdf
>>
>>I don't get any results, where I would expect to because all our news docs 
>>are in pdf format.
>
>You probably forgot to enable index-more and query-more plugins. After you 
>do this, you need to re-index your segments.
>
>
>--
>Best regards,
>Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
>[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>___|||__||  \|  ||  |  Embedded Unix, System Integration
>http://www.sigram.com  Contact: info at sigram dot com
>



Re: Should type: and date: queries work with search.jsp?

Posted by Edward Quick <ed...@hotmail.com>.
>
>Thanks for your answer.
>I definitely did enable index-more and query-more because the search 
>results show file type, size, date:
>
>Swap.PDF
>[pdf] (3920 bytes) 2005.9.26 - View as Plain Text
>
>If I just search on 'pdf' alone I get 866 hits, but get no hits with 
>type:pdf.
>
>Perhaps I've just caught the wrong end of the stick here, or should the 
>nutch search.jsp be able to perform lucene type searches as well, for 
>example,
>
>wildcard searches such as te*t
>single character searches such as te?t
>fuzzy searches such as roam~
>title searches such as title:Do it right
>
>and so on....
>
>Appreciate any help.
>
>Thanks,
>
>Ed.
>
>

I found another article about this problem:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00826.html

and if I do

url:http type:msword
url:http type:pdf
url:http date:19000101-20050501

that works fine.
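
The working queries above share one shape: an ordinary url: clause followed by the type: or date: clause. Below is a minimal Python sketch that builds such query strings for pasting into the search box. The function build_filtered_query and its parameters are hypothetical, and the yyyymmdd-yyyymmdd date range format is copied from the example above, not taken from any Nutch documentation.

```python
from datetime import date
from typing import Optional


def build_filtered_query(doc_type: Optional[str] = None,
                         start: Optional[date] = None,
                         end: Optional[date] = None,
                         url_term: str = "http") -> str:
    """Build a query such as 'url:http type:pdf'.

    The leading url: clause mirrors the workaround in this thread; whether
    it is required in other Nutch versions is not something this sketch checks.
    """
    clauses = [f"url:{url_term}"]
    if doc_type:
        clauses.append(f"type:{doc_type}")
    if start and end:
        # Dates are written as yyyymmdd-yyyymmdd, as in the example query.
        clauses.append(f"date:{start:%Y%m%d}-{end:%Y%m%d}")
    return " ".join(clauses)


if __name__ == "__main__":
    print(build_filtered_query(doc_type="pdf"))
    print(build_filtered_query(start=date(1900, 1, 1), end=date(2005, 5, 1)))
```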

I would still like to do the other types of lucene searches above though 
i.e. te*t or te?t. Does anyone have any code to do this?
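
To make the request concrete, here is a minimal Python sketch of what the two patterns are meant to match, following Lucene's wildcard convention where * stands for any run of characters (including none) and ? stands for exactly one character. It only illustrates the matching rules over an in-memory word list and is not Nutch or Lucene code. The names wildcard_to_regex and match_terms and the sample terms are made up.

```python
import re
from typing import Iterable, List


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def match_terms(pattern: str, terms: Iterable[str]) -> List[str]:
    """Return the terms that match the wildcard pattern in full."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


if __name__ == "__main__":
    sample_terms = ["test", "text", "tent", "tet", "toast", "tempest"]
    print(match_terms("te*t", sample_terms))
    print(match_terms("te?t", sample_terms))
```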

Thanks again.

Ed.



Re: Should type: and date: queries work with search.jsp?

Posted by Edward Quick <ed...@hotmail.com>.

Thanks for your answer.
I definitely did enable index-more and query-more because the search results 
show file type, size, date:

Swap.PDF
[pdf] (3920 bytes) 2005.9.26 - View as Plain Text

If I just search on 'pdf' alone I get 866 hits, but get no hits with 
type:pdf.

Perhaps I've just caught the wrong end of the stick here, or should the 
nutch search.jsp be able to perform lucene type searches as well, for 
example,

wildcard searches such as te*t
single character searches such as te?t
fuzzy searches such as roam~
title searches such as title:Do it right

and so on....

Appreciate any help.

Thanks,

Ed.



Re: Should type: and date: queries work with search.jsp?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Edward Quick wrote:
> Hi,
> 
> Should type: and date: queries work with the search.jsp program?
> I'm using Nutch 0.7, and crawled the intranet at work. String searches 
> work fine, but I want to test out the new features added by John Xing in 
> the changelog 
> (http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/CHANGES.txt?rev=1.48) 
> for 0.7.
> 
> When I search on something like:
> 
> news type:pdf
> 
> or
> 
> news type:application/pdf
> 
> I don't get any results, where I would expect to because all our news 
> docs are in pdf format.

You probably forgot to enable index-more and query-more plugins. After 
you do this, you need to re-index your segments.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com