Posted to user@nutch.apache.org by Tomi NA <he...@gmail.com> on 2006/07/19 19:00:34 UTC

missing, but declared functionality

These kinds of queries return no results:

date:19980101-20061231
type:pdf
type:application/pdf

From the release changes documents (0.7-0.7.2), I assumed these would work.
Upon index inspection (using the Luke tool), I see there are no fields
marked "date" or "type" (although I gather type:pdf is interpreted as
url:*.pdf). The fields I have are:
anchor
boost
content
digest
docNo
host
segment
site
title
url

I ran the index process with very little special configuration: some
filetype filtering and the like.
Am I missing something?
The files are served over a samba share: I plan to serve them through
a web server because of security implications of using the file://
protocol. Can the creation and last modification date be retrieved
over http:// at all?

TIA,
t.n.a.
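
On the last question above: HTTP defines a Last-Modified response header,
but no standard header carries a file's creation date, so at most the
modification time can be expected over http://, and only when the server
sends it. A minimal Python sketch, assuming a header value in the usual
HTTP date format. The function name and the sample value are hypothetical,
and the yyyymmdd output merely mirrors the shape of the date: queries above.

```python
from email.utils import parsedate_to_datetime


def last_modified_as_yyyymmdd(header_value):
    """Convert an HTTP Last-Modified header value to a yyyymmdd string.

    Returns None when the header is absent or cannot be parsed.
    """
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    return parsed.strftime("%Y%m%d")


# Hypothetical header value, for illustration only.
sample = "Sun, 30 Oct 2005 14:19:41 GMT"
print(last_modified_as_yyyymmdd(sample))
```

Whether a given crawler actually records this header in its index is a
separate matter that the sketch does not address.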

Re: missing, but declared functionality

Posted by Matthew Holt <mh...@redhat.com>.
I'm having similar problems. Luke says a search of date:20051125 works
and gives me results within Luke. However, when I try the same search in
Nutch, nothing comes back. Does Nutch handle query searches differently?
Or, to rephrase my question better, how should I be searching based on
dates in Nutch (the proper query plugins are enabled)?

Thanks,
Matt
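
One thing worth checking, offered as an assumption rather than a confirmed
diagnosis: the first message in this thread uses the range form
date:19980101-20061231, while the search above uses a single day. If the
date query plugin only accepts the range form (not verified anywhere in
this thread), a single day would have to be written as a range with equal
endpoints. A minimal Python sketch of that rewrite; the helper names and
the accepted input shapes are hypothetical.

```python
import re

# Assumed range shape: eight digits, a hyphen, eight digits.
DATE_RANGE = re.compile(r"(\d{8})-(\d{8})")


def as_date_range(value):
    """Return (start, end), widening a single yyyymmdd day to a range."""
    if re.fullmatch(r"\d{8}", value):
        return value, value
    match = DATE_RANGE.fullmatch(value)
    if match is None:
        raise ValueError("expected yyyymmdd or yyyymmdd-yyyymmdd: " + value)
    return match.group(1), match.group(2)


def to_query(value):
    """Build a date: query clause in the range form."""
    start, end = as_date_range(value)
    return "date:%s-%s" % (start, end)


print(to_query("20051125"))
print(to_query("19980101-20061231"))
```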

Tomi NA wrote:
> Sorry for the long silence and thanks for the help.
> I've found the plugins you mentioned and set up Nutch to use them. The
> result is somewhat confusing, though. For one thing, my date: and
> type: queries still returned no results. Weirder still, when I used Luke
> to inspect the index contents, I saw the new fields. Luke would display
> the top-ranking terms for both the "date" and "type" fields, and a search
> like "date:20051030" would yield dozens of results, but the "string value"
> of the "date" and "type" fields was not available, even though I
> found the documents in question using that exact field as a key.
>
> I'll see what I come up with using 0.8 as I need the .xls and .zip
> support, anyway.
>
> t.n.a.

Re: missing, but declared functionality

Posted by Tomi NA <he...@gmail.com>.
Sorry for the long silence and thanks for the help.
I've found the plugins you mentioned and set up Nutch to use them. The
result is somewhat confusing, though. For one thing, my date: and
type: queries still returned no results. Weirder still, when I used Luke
to inspect the index contents, I saw the new fields. Luke would display
the top-ranking terms for both the "date" and "type" fields, and a search
like "date:20051030" would yield dozens of results, but the "string value"
of the "date" and "type" fields was not available, even though I
found the documents in question using that exact field as a key.

I'll see what I come up with using 0.8 as I need the .xls and .zip
support, anyway.

t.n.a.

On 7/20/06, Teruhiko Kurosaka <Ku...@basistech.com> wrote:
> You'd have to enable index-more and query-more plugins, I believe.

RE: missing, but declared functionality

Posted by Teruhiko Kurosaka <Ku...@basistech.com>.
You'd have to enable index-more and query-more plugins, I believe. 
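
For reference, Nutch selects plugins through the plugin.includes property
in its configuration; to my understanding its value is a regular
expression matched against each plugin id. A minimal Python sketch of how
such a value would pick up the two plugins named above. Both property
values below are illustrative assumptions, not copied from any Nutch
release, and the helper name is hypothetical.

```python
import re


def included_plugins(plugin_includes, plugin_ids):
    """Return the plugin ids that fully match the plugin.includes regex."""
    pattern = re.compile(plugin_includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Illustrative values only (assumptions, not verified defaults).
before = "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)"
after = ("protocol-http|parse-(text|html)|index-(basic|more)"
         "|query-(basic|more|site|url)")

wanted = ["index-basic", "index-more", "query-basic", "query-more"]
print(included_plugins(before, wanted))
print(included_plugins(after, wanted))
```

With the first value only the two "basic" plugins match; the second also
matches index-more and query-more. Since fields are written at indexing
time, documents indexed before the change would presumably need to be
reindexed before "date" and "type" appear.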
