
Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/08/02 22:58:05 UTC

Querying Fields

I am unable to query fields in my index using the method that has been
suggested. I used Luke to examine my index, and the following fields
exist:
anchor, boost, content, contentLength, date, digest, host, lastModified, 
primaryType, segment, site, subType, title, type, url

However, when I search using one of the field names followed by a
colon, an incorrect result is returned. I used Luke to find the top term
in the date field, which is '20060801'. I then searched using the
following query:
date: 20060801

Unfortunately, nothing was returned. The correct plugins are enabled;
here is an excerpt from my nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>


Any ideas? I'm not the only one having this problem; I saw an
earlier mailing list post but couldn't find any resolution. Thanks,

   Matt
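
As a rough sanity check on the configuration above, the sketch below tests a few plugin directory names against the quoted plugin.includes value and against the value Lukas posts later in this thread. It assumes the expression has to match the whole plugin name, as the property description suggests. Python's `re` module stands in for whatever matcher Nutch uses, and the shortlist of names is only illustrative.

```python
import re

# plugin.includes values copied from the thread (Matt's post and Lukas's reply).
MATT = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|"
    "index-(basic|more)|query-(more|site|stemmer|url)|"
    "summary-basic|scoring-opic"
)
LUKAS = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
    "index-(basic|more)|query-(basic|site|url|more)|"
    "summary-basic|scoring-opic"
)


def included(plugin_ids, includes):
    """Return the ids that the whole expression matches (assumed semantics)."""
    pattern = re.compile(includes)
    return [p for p in plugin_ids if pattern.fullmatch(p)]


# Hypothetical shortlist of plugin directory names to check.
CANDIDATES = ["nutch-extensionpoints", "index-more", "query-basic", "query-more"]

print(included(CANDIDATES, MATT))   # ['index-more', 'query-more']
print(included(CANDIDATES, LUKAS))  # all four names
```

Under that assumption, the first value does not match query-basic or nutch-extensionpoints, while the second matches both. Whether this difference explains the empty results is not established in the thread.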

Re: Querying Fields

Posted by Lourival Júnior <ju...@gmail.com>.

OK Lukas, I know what you mean. The community is very important to the
success of a project, especially an open source one. I'm not sure I can
contribute to Nutch right now, because I'm a newbie in this area, but I
will contribute soon. For now, I answer the questions I know something
about. I really appreciate it when you answer our questions, because we
feel motivated, and we'll tell other people that Nutch is not just very
useful for building a web search engine, but the best way to do it.

Regards!

On 8/14/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Lourival,
>
> Definitely you are not alone with this feeling. Nutch is quite an active
> open source project, so some lack of documentation is natural,
> especially since Nutch hasn't reached its 1.0 release. Believe me, I
> have the same problem all the time.
>
> The best way to change this situation is to contribute! The wiki is
> open to anybody, the source code can be downloaded, and if you are a freak
> then you can suggest changes, and if you are a real hacker (meaning you
> are not ashamed to use vi for anything - including writing source code)
> then you can even become a committer. Once you become a committer,
> you will be overloaded with work to the point that you won't be able
> to answer STFW questions on the mailing lists... etc. :-)
>
> Regards,
> Lukas



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Querying Fields

Posted by Lukas Vlcek <lu...@gmail.com>.

Lourival,

Definitely you are not alone with this feeling. Nutch is quite an active
open source project, so some lack of documentation is natural,
especially since Nutch hasn't reached its 1.0 release. Believe me, I
have the same problem all the time.

The best way to change this situation is to contribute! The wiki is
open to anybody, the source code can be downloaded, and if you are a freak
then you can suggest changes, and if you are a real hacker (meaning you
are not ashamed to use vi for anything - including writing source code)
then you can even become a committer. Once you become a committer,
you will be overloaded with work to the point that you won't be able
to answer STFW questions on the mailing lists... etc. :-)

Regards,
Lukas

On 8/11/06, Lourival Júnior <ju...@gmail.com> wrote:
> Yes, I tested the index-more and query-more plugins. They allow you to
> search these fields easily. However, if I had been able to find
> documentation about them, I would not have spent time working out a
> solution.
>
> Thanks a lot!
>
> On 8/11/06, Lukas Vlcek <lu...@gmail.com> wrote:
> >
> > Hi,
> >
> > You need to look into the source to find out what exactly it does. As far
> > as I know it does not add any new field to the index (that should be done
> > via the index-more plugin), but it allows you to query using type:, date:
> > and site:, I think.
> >
> > Lukas
> >
> > On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > > What exactly does the query-more plugin do? I tested it a few minutes
> > > ago and it doesn't add any field to the resulting index. Is it used in
> > > the webapp? Could you give me a clarification about it?
> > >
> > > Thanks!
> > >
> > > On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > If my memory serves me correctly then query-more should work fine with
> > > > Nutch 0.7.2 too.
> > > > And you are right Matthew, you need to use the [type:] or [date:]
> > > > filters in combination with [url:], as you can get an empty result
> > > > set if they are used alone. I do queries like this: [url:http type:pdf]
> > > > and it gives me the result I need.
> > > >
> > > > Lukas
> > > >
> > > > On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > > > > All right! I've done this already. I think you don't understand my
> > > > > question. What I want to do is to query my indexes using something
> > > > > like "filetype:pdf". Version 0.8 already has this feature, but I'm
> > > > > using version 0.7.2 and I want to add this feature manually. But I
> > > > > don't know where I have to edit. Do you know?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Lourival Junior
> > > > >
> > > > > On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > To allow more formats to be indexed you need to modify nutch-site.xml
> > > > > > and update/add the plugin.includes property (see nutch-default.xml for
> > > > > > default settings). The following is what I have in nutch-site.xml:
> > > > > >
> > > > > > <property>
> > > > > >   <name>plugin.includes</name>
> > > > > >   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> > > > > > </property>
> > > > > >
> > > > > > [parse-*] is used to parse various formats; [query-more] allows you to
> > > > > > use the [type:] filter in Nutch queries.
> > > > > >
> > > > > > Regards,
> > > > > > Lukas
> > > > > >
> > > > > > On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > > > > > > Hi Lukas and everybody!
> > > > > > >
> > > > > > > Do you know which file in Nutch 0.7.2 I should edit to add a field
> > > > > > > to my index (e.g. file type: PDF, Word or HTML)?
> > > > > > >
> > > > > > > On 8/8/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I am not sure if I can give you any useful hint, but the following
> > > > > > > > is what once worked for me.
> > > > > > > > Example of query: url:http date:20060801
> > > > > > > >
> > > > > > > > The date: and type: options can be used in combination with url:
> > > > > > > > The filter url:http should select all documents (unless you allowed
> > > > > > > > the file or ftp protocols). A plain date or type filter selects
> > > > > > > > nothing if it is used alone.
> > > > > > > >
> > > > > > > > And be sure you don't introduce any space between the filter name
> > > > > > > > and its value ([date: 20060801] is not the same as [date:20060801])
> > > > > > > >
> > > > > > > > Lukas
> > > > > > > >
> > > > > > > > On 8/8/06, Matthew Holt <mh...@redhat.com> wrote:
> > > > > > > > > Howie,
> > > > > > > > >    I inspected my index using Luke and 20060801 shows up several
> > > > > > > > > times in the index. I'm unable to query pretty much any field.
> > > > > > > > > Several people seem to be having the same problem. Does anyone
> > > > > > > > > know what's going on?
> > > > > > > > >
> > > > > > > > > This is one of the last things I have to resolve to have Nutch
> > > > > > > > > deployed successfully at my organization. Unfortunately, Friday is
> > > > > > > > > my last day. Can anyone offer any assistance?
> > > > > > > > > Thanks,
> > > > > > > > >   Matt
> > > > > > > > >
> > > > > > > > > Howie Wang wrote:
> > > > > > > > > > I think that I have problems querying for numbers and
> > > > > > > > > > words with digits in them. Now that I think of it, is it
> > > > > > > > > > possible it has something to do with the stemming in
> > > > > > > > > > either the query filter or indexing? In either case, I would
> > > > > > > > > > print out the text that is being indexed and the phrases
> > > > > > > > > > added to the query. You could also use Luke to inspect
> > > > > > > > > > your index and see whether 20060801 shows up anywhere.
> > > > > > > > > >
> > > > > > > > > > Howie
> > > > > > > > > >
> > > > > > > > > > >> I tried looking for a page that had the date 20060801 and the
> > > > > > > > > > >> text "test" in the page. I tried the following:
> > > > > > > > > >>
> > > > > > > > > >> date: 20060801 test
> > > > > > > > > >>
> > > > > > > > > >> and
> > > > > > > > > >>
> > > > > > > > > >> date 20060721-20060803 test
> > > > > > > > > >>
> > > > > > > > > >> Neither worked, any ideas??
> > > > > > > > > >>
> > > > > > > > > >> Matt
> > > > > > > > > >>
> > > > > > > > > >> Matthew Holt wrote:
> > > > > > > > > >>> Thanks Jake,
> > > > > > > > > > >>>   However, it seems to me that it makes the most sense for a
> > > > > > > > > > >>> query to return all pages that match the query, instead of
> > > > > > > > > > >>> acting as a content filter. However, I know it's something easy
> > > > > > > > > > >>> to suggest when you're not the one who has to implement it, so
> > > > > > > > > > >>> it's just a suggestion.
> > > > > > > > > >>>
> > > > > > > > > >>> Matt
> > > > > > > > > >>>
> > > > > > > > > >>> Vanderdray, Jacob wrote:
> > > > > > > > > > >>>> Try querying with both the date and something you'd expect to
> > > > > > > > > > >>>> find in the content.  The field query filter is just a filter.
> > > > > > > > > > >>>> It only restricts your results to things that match the basic
> > > > > > > > > > >>>> query and have the contents you require in the field.  So if
> > > > > > > > > > >>>> you query for "date:2006080 text" you'll be searching for
> > > > > > > > > > >>>> documents that contain "text" in one of the default query
> > > > > > > > > > >>>> fields and have the value 2006080 in the date field.  Leaving
> > > > > > > > > > >>>> out text in that example would essentially be asking for
> > > > > > > > > > >>>> nothing in the default fields and 2006080 in the date field,
> > > > > > > > > > >>>> which is why it doesn't return any results.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Hope that helps,
> > > > > > > > > >>>> Jake.
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> -----Original Message-----
> > > > > > > > > >>>> From: Matthew Holt [mailto:mholt@redhat.com]
> > > > > > > > > >>>> Sent: Wed 8/2/2006 4:58 PM
> > > > > > > > > >>>> To: nutch-user@lucene.apache.org
> > > > > > > > > >>>> Subject: Querying Fields
> > > > > > > > > > >>>>  I am unable to query fields in my index in the method that has
> > > > > > > > > > >>>> been suggested. I used Luke to examine my index and the
> > > > > > > > > > >>>> following field types exist:
> > > > > > > > > > >>>> anchor, boost, content, contentLength, date, digest, host,
> > > > > > > > > > >>>> lastModified, primaryType, segment, site, subType, title, type,
> > > > > > > > > > >>>> url
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> However, when I do a search using one of the fields, followed
> > > > > > > > > > >>>> by a colon, an incorrect result is returned. I used Luke to find
> > > > > > > > > > >>>> the top term in the date field which is '20060801'. I then
> > > > > > > > > > >>>> searched using the following query:
> > > > > > > > > > >>>> date: 20060801
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Unfortunately, nothing was returned. The correct plugins are
> > > > > > > > > > >>>> enabled, here is an excerpt from my nutch-site.xml:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> <property>
> > > > > > > > > > >>>>   <name>plugin.includes</name>
> > > > > > > > > > >>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > > > > > > > > > >>>>   <description>Regular expression naming plugin directory names to
> > > > > > > > > > >>>>   include.  Any plugin not matching this expression is excluded.
> > > > > > > > > > >>>>   In any case you need at least include the nutch-extensionpoints plugin. By
> > > > > > > > > > >>>>   default Nutch includes crawling just HTML and plain text via HTTP,
> > > > > > > > > > >>>>   and basic indexing and search plugins.
> > > > > > > > > > >>>>   </description>
> > > > > > > > > > >>>> </property>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Any ideas? I'm not the only one having the same problem, I saw
> > > > > > > > > > >>>> an earlier mailing list post but couldn't find any resolve...
> > > > > > > > > > >>>> Thanks,
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>    Matt
>
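
The two points quoted above (pair a field filter with another clause, and leave no space after the colon) can be illustrated with a toy tokenizer. This is only a simplified model written for illustration, not Nutch's query parser; the function name `split_clauses` and its whitespace-splitting rule are assumptions.

```python
def split_clauses(query):
    """Split a query on whitespace into (field, value) pairs.

    A token shaped like name:value becomes a field clause. Any other
    token is kept whole as a default-field term (field is None).
    """
    clauses = []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            clauses.append((name, value))
        else:
            clauses.append((None, token))
    return clauses


# No space after the colon: both tokens are field clauses.
print(split_clauses("url:http date:20060801"))
# A space after the colon: "date:" has no value attached, and
# 20060801 becomes an ordinary default-field term.
print(split_clauses("date: 20060801"))
```

In this model, [date: 20060801] never produces a clause on the date field at all, which is consistent with the advice that it is not the same query as [date:20060801].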

Re: Querying Fields

Posted by Lourival Júnior <ju...@gmail.com>.

Yes, I tested the index-more and query-more plugins. They allow you to
search these fields easily. However, if I had been able to find
documentation about them, I would not have spent time working out a
solution.

Thanks a lot!

On 8/11/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Hi,
>
> You need to look into the source to find out what exactly it does. As far
> as I know it does not add any new field to the index (that should be done
> via the index-more plugin), but it allows you to query using type:, date:
> and site:, I think.
>
> Lukas
>
> On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > What exactly does the query-more plugin do? I tested it a few minutes
> > ago and it doesn't add any field to the resulting index. Is it used in
> > the webapp? Could you give me a clarification about it?
> >
> > Thanks!
> >
> > On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > If my memory serves me correctly then query-more should work fine with
> > > Nutch 0.7.2 too.
> > > And you are right Matthew, you need to use the [type:] or [date:]
> > > filters in combination with [url:], as you can get an empty result
> > > set if they are used alone. I do queries like this: [url:http type:pdf]
> > > and it gives me the result I need.
> > >
> > > Lukas
> > >
> > > On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > > > All right! I've done this already. I think you don't understand my
> > > > question. What I want to do is to query my indexes using something
> > > > like "filetype:pdf". Version 0.8 already has this feature, but I'm
> > > > using version 0.7.2 and I want to add this feature manually. But I
> > > > don't know where I have to edit. Do you know?
> > > >
> > > > Regards,
> > > >
> > > > Lourival Junior
> > > >
> > > > On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > To allow more formats to be indexed you need to modify nutch-site.xml
> > > > > and update/add the plugin.includes property (see nutch-default.xml for
> > > > > default settings). The following is what I have in nutch-site.xml:
> > > > >
> > > > > <property>
> > > > >   <name>plugin.includes</name>
> > > > >   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> > > > > </property>
> > > > >
> > > > > [parse-*] is used to parse various formats; [query-more] allows you to
> > > > > use the [type:] filter in Nutch queries.
> > > > >
> > > > > Regards,
> > > > > Lukas
> > > > >
> > > > > On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> > > > > > Hi Lukas and everybody!
> > > > > >
> > > > > > Do you know which file in Nutch 0.7.2 I should edit to add a field
> > > > > > to my index (e.g. file type: PDF, Word or HTML)?
> > > > > >
> > > > > > On 8/8/06, Lukas Vlcek <lu...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am not sure if I can give you any useful hint, but the following
> > > > > > > is what once worked for me.
> > > > > > > Example of query: url:http date:20060801
> > > > > > >
> > > > > > > The date: and type: options can be used in combination with url:
> > > > > > > The filter url:http should select all documents (unless you allowed
> > > > > > > the file or ftp protocols). A plain date or type filter selects
> > > > > > > nothing if it is used alone.
> > > > > > >
> > > > > > > And be sure you don't introduce any space between the filter name
> > > > > > > and its value ([date: 20060801] is not the same as [date:20060801])
> > > > > > >
> > > > > > > Lukas
> > > > > > >
> > > > > > > On 8/8/06, Matthew Holt <mh...@redhat.com> wrote:
> > > > > > > > Howie,
> > > > > > > >    I inspected my index using Luke and 20060801 shows up several
> > > > > > > > times in the index. I'm unable to query pretty much any field.
> > > > > > > > Several people seem to be having the same problem. Does anyone
> > > > > > > > know what's going on?
> > > > > > > >
> > > > > > > > This is one of the last things I have to resolve to have Nutch
> > > > > > > > deployed successfully at my organization. Unfortunately, Friday is
> > > > > > > > my last day. Can anyone offer any assistance?
> > > > > > > > Thanks,
> > > > > > > >   Matt
> > > > > > > >
> > > > > > > > Howie Wang wrote:
> > > > > > > > > I think that I have problems querying for numbers and
> > > > > > > > > words with digits in them. Now that I think of it, is it
> > > > > > > > > possible it has something to do with the stemming in
> > > > > > > > > either the query filter or indexing? In either case, I would
> > > > > > > > > print out the text that is being indexed and the phrases
> > > > > > > > > added to the query. You could also use Luke to inspect
> > > > > > > > > your index and see whether 20060801 shows up anywhere.
> > > > > > > > >
> > > > > > > > > Howie
> > > > > > > > >
> > > > > > > > >> I tried looked for a page that had the date 20060801 and
> the
> > > text
> > > > > > > > >> "test" in the page. I tried the following:
> > > > > > > > >>
> > > > > > > > >> date: 20060801 test
> > > > > > > > >>
> > > > > > > > >> and
> > > > > > > > >>
> > > > > > > > >> date 20060721-20060803 test
> > > > > > > > >>
> > > > > > > > >> Neither worked, any ideas??
> > > > > > > > >>
> > > > > > > > >> Matt
> > > > > > > > >>
> > > > > > > > >> Matthew Holt wrote:
> > > > > > > > >>> Thanks Jake,
> > > > > > > > >>>   However, it seems to me that it makes most sense that
> a
> > > query
> > > > > > > > >>> should return all pages that match the query, instead of
> > > acting
> > > > > as a
> > > > > > > > >>> content filter. However, I know its something easy to
> > > suggest
> > > > > when
> > > > > > > > >>> you're not having to implement it, so just a suggestion.
> > > > > > > > >>>
> > > > > > > > >>> Matt
> > > > > > > > >>>
> > > > > > > > >>> Vanderdray, Jacob wrote:
> > > > > > > > >>>> Try querying with both the date and something you'd
> expect
> > > to
> > > > > find
> > > > > > > > >>>> in the content.  The field query filter is just a
> > > filter.  It
> > > > > only
> > > > > > > > >>>> restricts your results to things that match the basic
> query
> > > and
> > > > > has
> > > > > > > > >>>> the contents you require in the field.  So if you query
> for
> > > > > > > > >>>> "date:2006080 text" you'll be searching for documents
> that
> > > > > contain
> > > > > > > > >>>> "text" in one of the default query fields and has the
> value
> > > > > 2006080
> > > > > > > > >>>> in the date field.  Leaving out text in that example
> would
> > > > > > > > >>>> essentially be asking for nothing in the default fields
> and
> > > > > 2006080
> > > > > > > > >>>> in the date field which is why it doesn't return any
> > > results.
> > > > > > > > >>>>
> > > > > > > > >>>> Hope that helps,
> > > > > > > > >>>> Jake.
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> -----Original Message-----
> > > > > > > > >>>> From: Matthew Holt [mailto:mholt@redhat.com]
> > > > > > > > >>>> Sent: Wed 8/2/2006 4:58 PM
> > > > > > > > >>>> To: nutch-user@lucene.apache.org
> > > > > > > > >>>> Subject: Querying Fields
> > > > > > > > >>>>  I am unable to query fields in my index in the method
> that
> > > has
> > > > > > > > >>>> been suggested. I used Luke to examine my index and the
> > > > > following
> > > > > > > > >>>> field types exist:
> > > > > > > > >>>> anchor, boost, content, contentLength, date, digest,
> host,
> > > > > > > > >>>> lastModified, primaryType, segment, site, subType,
> title,
> > > type,
> > > > > url
> > > > > > > > >>>>
> > > > > > > > >>>> However, when I do a search using one of the fields,
> > > followed
> > > > > by a
> > > > > > > > >>>> colon, an incorrect result is returned. I used Luke to
> find
> > > the
> > > > > top
> > > > > > > > >>>> term in the date field which is '20060801'. I then
> searched
> > > > > using
> > > > > > > > >>>> the following query:
> > > > > > > > >>>> date: 20060801
> > > > > > > > >>>>
> > > > > > > > >>>> Unfortunately, nothing was returned. The correct
> plugins
> > > are
> > > > > > > > >>>> enabled, here is an excerpt from my nutch-site.xml:
> > > > > > > > >>>>
> > > > > > > > >>>> <property>
> > > > > > > > >>>>   <name>plugin.includes</name>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > >
> > > > >
> > >
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>   <description>Regular expression naming plugin
> directory
> > > names
> > > > > to
> > > > > > > > >>>>   include.  Any plugin not matching this expression is
> > > > > excluded.
> > > > > > > > >>>>   In any case you need at least include the
> > > > > nutch-extensionpoints
> > > > > > > > >>>> plugin. By
> > > > > > > > >>>>   default Nutch includes crawling just HTML and plain
> text
> > > via
> > > > > > > HTTP,
> > > > > > > > >>>>   and basic indexing and search plugins.
> > > > > > > > >>>>   </description>
> > > > > > > > >>>> </property>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> Any ideas? I'm not the only one having the same
> problem, I
> > > saw
> > > > > an
> > > > > > > > >>>> earlier mailing list post but couldn't find any
> resolve...
> > > > > Thanks,
> > > > > > > > >>>>
> > > > > > > > >>>>    Matt
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Lourival Junior
> > > > > > Universidade Federal do Pará
> > > > > > Curso de Bacharelado em Sistemas de Informação
> > > > > > http://www.ufpa.br/cbsi
> > > > > > Msn: junior_ufpa@hotmail.com
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Lourival Junior
> > > > Universidade Federal do Pará
> > > > Curso de Bacharelado em Sistemas de Informação
> > > > http://www.ufpa.br/cbsi
> > > > Msn: junior_ufpa@hotmail.com
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Lourival Junior
> > Universidade Federal do Pará
> > Curso de Bacharelado em Sistemas de Informação
> > http://www.ufpa.br/cbsi
> > Msn: junior_ufpa@hotmail.com
> >
> >
>



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Querying Fields

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

You need to look into the source to find out exactly what it does. As far
as I know, it does not add any new field to the index (that is done by
the index-more plugin), but I think it lets you query using type:, date:
and site:.

Lukas
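
A minimal sketch for checking which plugins a given plugin.includes value turns on. The two expressions are copied from the nutch-site.xml excerpts quoted in this thread. The sketch assumes that a plugin is enabled only when the expression matches its whole directory name; that is an assumption about Nutch's behaviour, not something confirmed in this thread. The helper name `enabled` is made up for illustration.

```python
import re

# plugin.includes values copied from the two nutch-site.xml excerpts in this thread.
MATT_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|"
    "index-(basic|more)|query-(more|site|stemmer|url)|"
    "summary-basic|scoring-opic"
)
LUKAS_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
    "index-(basic|more)|query-(basic|site|url|more)|"
    "summary-basic|scoring-opic"
)

# Plugin directory names worth checking for field queries.
PLUGINS = ["index-more", "query-more", "query-basic", "nutch-extensionpoints"]


def enabled(includes, plugin_ids):
    """Map each plugin id to True if the expression matches the whole id.

    Whole-name matching is an assumption about how Nutch applies the expression.
    """
    pattern = re.compile(includes)
    return {pid: pattern.fullmatch(pid) is not None for pid in plugin_ids}


for label, value in (("Matt", MATT_INCLUDES), ("Lukas", LUKAS_INCLUDES)):
    print(label, enabled(value, PLUGINS))
```

Under that assumption, both values enable index-more and query-more, while only the second one matches query-basic and nutch-extensionpoints. Whether that difference has anything to do with the empty results is not established in this thread.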

On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> What exactly does the query-more plugin do? I tested it a few minutes ago and
> it doesn't add any field to the resulting index. Is it used in the webapp? Could
> you clarify this for me?
>
> Thanks!

Re: Querying Fields

Posted by Lourival Júnior <ju...@gmail.com>.

What exactly does the query-more plugin do? I tested it a few minutes ago and
it doesn't add any field to the resulting index. Is it used in the webapp? Could
you clarify this for me?

Thanks!

On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Hi,
>
> If my memory serves me correctly, query-more should work fine with
> Nutch 0.7.2 too.
> And you are right, Matthew: you need to use the [type:] or [date:]
> filters in combination with [url:], because they can return an empty
> result set when used on their own. I do queries like this:
> [url:http type:pdf] and it gives me the results I need.
>
> Lukas



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Querying Fields

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

If my memory serves me correctly, query-more should work fine with
Nutch 0.7.2 too.
And you are right, Matthew: you need to use the [type:] or [date:]
filters in combination with [url:], because they can return an empty
result set when used on their own. I do queries like this:
[url:http type:pdf] and it gives me the results I need.

Lukas
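
The advice above can be sketched as a small query-string builder. This is a hypothetical helper, not part of Nutch: the function name `build_query`, its parameters and the `url:http` fallback are assumptions taken from the workaround described in this thread. It writes each filter as field:value with no space after the colon, and adds `url:http` when the query would otherwise consist of filters only.

```python
def build_query(filters, terms=(), fallback=("url", "http")):
    """Join free-text terms and field:value filters into one query string.

    Hypothetical helper, not a Nutch API. `fallback` is added only when there
    are no free-text terms and its field is not already among the filters.
    """
    clauses = list(terms)
    fields = dict(filters)
    if not clauses and fallback[0] not in fields:
        # Put the fallback clause first, e.g. "url:http date:20060801".
        fields = {fallback[0]: fallback[1], **fields}
    for field, value in fields.items():
        value = str(value).strip()
        if not value or any(ch.isspace() for ch in value):
            raise ValueError("filter value must be a single token: %r" % value)
        # No space between the filter name and its value.
        clauses.append(f"{field}:{value}")
    return " ".join(clauses)


print(build_query({"date": "20060801"}))
print(build_query({"type": "pdf"}))
print(build_query({"date": " 20060801"}, terms=["test"]))
```

Stripping the value guards against the [date: 20060801] form mentioned earlier in the thread, which is not the same query as [date:20060801].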

On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> All right! I've done this already. I think you didn't understand my question.
> What I want to do is query my index using something like
> "filetype:pdf". Version 0.8 already has this feature, but I'm using
> version 0.7.2 and want to add it manually. I don't know which file
> I have to edit, though. Do you know?
>
> Regards,
>
> Lourival Junior

Re: Querying Fields

Posted by Lourival Júnior <ju...@gmail.com>.

All right! I've done this already. I think you didn't understand my question.
What I want to do is query my indexes using something like
"filetype:pdf". Version 0.8 already has this feature, but I'm using
version 0.7.2 and I want to add the feature manually. I don't know
which files I have to edit. Do you know?

Regards,

Lourival Junior

On 8/9/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Hi,
>
> To allow more formats to be indexed you need to modify nutch-site.xml
> and update/add plugin.includes property (see nutch-default.xml for
> default settings). The following is what I have in nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>
> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> </property>
>
> [parse-*] is used to parse various formats, [query-more] allows you to
> use [type:] filter in nutch queries.
>
> Regards,
> Lukas



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Querying Fields

Posted by Matthew Holt <mh...@redhat.com>.

Ignore the last email. I ended up doing the same as Benjamin Higgins. 
Works great, use his email for reference if you are trying to accomplish 
the same thing.
Matt

Matthew Holt wrote:
> Thanks for the reply. I've added the plugins you suggested. However, 
> some of the plugins need to be modified to search for fields such as 
> date (see previous email from Benjamin Higgins). I am currently 
> modifying the query-basic DateQueryFilter.java so one is allowed to 
> add query.date.boost to the nutch-site.xml to enable the date field 
> search.
>
> I'll try and post my results, or commit them.
> Matt

Re: Querying Fields

Posted by Matthew Holt <mh...@redhat.com>.

Thanks for the reply. I've added the plugins you suggested. However, 
some of the plugins need to be modified to search for fields such as 
date (see previous email from Benjamin Higgins). I am currently 
modifying the query-basic DateQueryFilter.java so one is allowed to add 
query.date.boost to the nutch-site.xml to enable the date field search.

I'll try and post my results, or commit them.
Matt

Lukas Vlcek wrote:
> Hi,
>
> To allow more formats to be indexed you need to modify nutch-site.xml
> and update/add plugin.includes property (see nutch-default.xml for
> default settings). The following is what I have in nutch-site.xml:
>
> <property>
>  <name>plugin.includes</name>
> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value> 
>
> </property>
>
> [parse-*] is used to parse various formats, [query-more] allows you to
> use [type:] filter in nutch queries.
>
> Regards,
> Lukas

Re: Querying Fields

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

To allow more formats to be indexed you need to modify nutch-site.xml
and update or add the plugin.includes property (see nutch-default.xml for
the default settings). The following is what I have in nutch-site.xml:

<property>
  <name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
</property>

[parse-*] is used to parse various formats; [query-more] allows you to
use the [type:] filter in Nutch queries.
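
Since plugin.includes is described as a regular expression over plugin
directory names, a quick way to see which plugins a given value would
select is to test names against it. The following is only a minimal
sketch: the class name PluginIncludesCheck and the helper included() are
made up for illustration, and it assumes the whole directory name has to
match the expression, which may differ in detail from what Nutch does
internally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // Returns true when the whole plugin directory name matches the expression.
    // Assumption: full-string matching; Nutch's own plugin loader may differ.
    public static boolean included(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The value from the nutch-site.xml shown above.
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex|"
            + "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
            + "index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic";
        // Print, for a few plugin directory names, whether the expression selects them.
        for (String id : Arrays.asList("query-more", "query-basic", "parse-pdf", "parse-oo")) {
            System.out.println(id + " -> " + included(includes, id));
        }
    }
}
```

Running the same check against your own plugin.includes value shows
whether a plugin such as query-basic or query-more is actually selected
by it.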

Regards,
Lukas

On 8/9/06, Lourival Júnior <ju...@gmail.com> wrote:
> Hi Lukas and everybody!
>
> Do you know which file in Nutch 0.7.2 I should edit to add a field to my
> index (e.g. file type: PDF, Word or HTML)?

Re: Querying Fields

Posted by Lourival Júnior <ju...@gmail.com>.

Hi Lukas and everybody!

Do you know which file in Nutch 0.7.2 I should edit to add a field to my
index (e.g. file type: PDF, Word or HTML)?

On 8/8/06, Lukas Vlcek <lu...@gmail.com> wrote:
>
> Hi,
>
> I am not sure if I can give you any useful hint, but the following is
> what once worked for me.
> Example of query: url:http date:20060801
>
> The date: and type: options can be used in combination with url:
> The filter url:http should select all documents (unless you allowed the file
> or ftp protocols). Plain date or type filters select nothing if they are
> used alone.
>
> And be sure you don't introduce any space between the filter name and its
> value ([date: 20060801] is not the same as [date:20060801]).
>
> Lukas



-- 
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: junior_ufpa@hotmail.com

Re: Querying Fields

Posted by Lukas Vlcek <lu...@gmail.com>.

Hi,

I am not sure if I can give you any useful hint, but the following is
what once worked for me.
Example of query: url:http date:20060801

The date: and type: options can be used in combination with url:
The filter url:http should select all documents (unless you allowed the
file or ftp protocols). A plain date or type filter selects nothing if it
is used alone.

And be sure you don't introduce any space between the filter name and its
value ([date: 20060801] is not the same as [date:20060801]).
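
To illustrate why the space matters, here is a minimal sketch that assumes
a plain whitespace tokenizer. It is only a toy model, not Nutch's actual
query parser, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

final class FieldClauseSketch {
    /** Splits a query on whitespace and labels each token as a field clause or a plain term. */
    static List<String> describe(String query) {
        List<String> out = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            // A field clause needs text on both sides of the colon.
            if (colon > 0 && colon < token.length() - 1) {
                out.add("field[" + token.substring(0, colon) + "]=" + token.substring(colon + 1));
            } else {
                out.add("term=" + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("url:http date:20060801")); // [field[url]=http, field[date]=20060801]
        System.out.println(describe("date: 20060801"));         // [term=date:, term=20060801]
    }
}
```

In this model the second query never produces a date clause at all: the
field name and the value end up as two unrelated terms.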

Lukas

On 8/8/06, Matthew Holt <mh...@redhat.com> wrote:
> Howie,
>    I inspected my index using Luke and 20060801 shows up several times
> in the index. I'm unable to query pretty much any field. Several people
> seem to be having the same problem. Does anyone know what's going on?
>
> This is one of the last things I have to resolve to have Nutch deployed
> successfully at my organization. Unfortunately, Friday is my last day.
> Can anyone offer any assistance??
> Thanks,
>   Matt

Re: Querying Fields

Posted by Howie Wang <ho...@hotmail.com>.

If it's showing up using Luke, the indexing filter is probably fine.
You can try putting print statements into the query-filter. Print
out both the input query and the output query, and see if
the numbers are being filtered out somewhere.

You might want to see what's happening in Query.java in the
parse method also.
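
A minimal sketch of that kind of tracing, assuming the query can be treated
as a plain string passing through a rewriting step. The names here are
hypothetical and this is not the Nutch QueryFilter API; in the real plugin
you would print the query objects instead:

```java
import java.util.function.UnaryOperator;

final class QueryTraceSketch {
    /** Wraps a query-rewriting step so its input and output are printed to stderr. */
    static UnaryOperator<String> traced(String name, UnaryOperator<String> step) {
        return input -> {
            String output = step.apply(input);
            System.err.println(name + ": in=[" + input + "] out=[" + output + "]");
            return output;
        };
    }

    public static void main(String[] args) {
        // Hypothetical faulty step that silently drops purely numeric tokens.
        UnaryOperator<String> dropsNumbers = q -> q.replaceAll("\\b\\d+\\b", "").trim();
        traced("stemmer", dropsNumbers).apply("test 20060801");
    }
}
```

If the number is present in the input line but missing from the output
line, that step is the one filtering it out.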

Howie

>Howie,
>   I inspected my index using Luke and 20060801 shows up several times in 
>the index. I'm unable to query pretty much any field. Several people seem 
>to be having the same problem. Does anyone know what's going on?
>
>This is one of the last things I have to resolve to have Nutch deployed 
>successfully at my organization. Unfortunately, Friday is my last day. Can 
>anyone offer any assistance??
>Thanks,
>  Matt

Re: Querying Fields

Posted by Matthew Holt <mh...@redhat.com>.

Howie,
   I inspected my index using Luke and 20060801 shows up several times 
in the index. I'm unable to query pretty much any field. Several people 
seem to be having the same problem. Does anyone know what's going on?

This is one of the last things I have to resolve to have Nutch deployed 
successfully at my organization. Unfortunately, Friday is my last day. 
Can anyone offer any assistance??
Thanks,
  Matt

Howie Wang wrote:
> I think that I have problems querying for numbers and
> words with digits in them. Now that I think of it, is it
> possible it has something to do with the stemming in
> either the query filter or indexing? In either case, I would
> print out the text that is being indexed and the phrases
> added to the query. You could also use Luke to inspect
> your index and see whether 20060801 shows up anywhere.
>
> Howie

Re: Querying Fields

Posted by Howie Wang <ho...@hotmail.com>.

I think that I have problems querying for numbers and
words with digits in them. Now that I think of it, is it
possible it has something to do with the stemming in
either the query filter or indexing? In either case, I would
print out the text that is being indexed and the phrases
added to the query. You could also use Luke to inspect
your index and see whether 20060801 shows up anywhere.

Howie

>I tried looking for a page that had the date 20060801 and the text "test" in
>the page. I tried the following:
>
>date: 20060801 test
>
>and
>
>date 20060721-20060803 test
>
>Neither worked, any ideas??
>
>Matt

Re: Querying Fields

Posted by Matthew Holt <mh...@redhat.com>.

I tried looking for a page that had the date 20060801 and the text "test"
in the page. I tried the following:

date: 20060801 test

and

date 20060721-20060803 test

Neither worked, any ideas??

Matt

Matthew Holt wrote:
> Thanks Jake,
>   However, it seems to me that it makes most sense that a query should 
> return all pages that match the query, instead of acting as a content 
> filter. However, I know it's something easy to suggest when you're not
> having to implement it, so just a suggestion.
>
> Matt

Re: Querying Fields

Posted by Matthew Holt <mh...@redhat.com>.

Thanks Jake,
   However, it seems to me that it makes most sense that a query should 
return all pages that match the query, instead of acting as a content 
filter. However, I know it's something easy to suggest when you're not
having to implement it, so just a suggestion.

Matt

Vanderdray, Jacob wrote:
> Try querying with both the date and something you'd expect to find in the content.  The field query filter is just a filter.  It only restricts your results to things that match the basic query and have the contents you require in the field.  So if you query for "date:2006080 text" you'll be searching for documents that contain "text" in one of the default query fields and have the value 2006080 in the date field.  Leaving out text in that example would essentially be asking for nothing in the default fields and 2006080 in the date field, which is why it doesn't return any results.
>
> Hope that helps,
> Jake.

RE: Querying Fields

Posted by "Vanderdray, Jacob" <JV...@aarp.org>.

Try querying with both the date and something you'd expect to find in the content.  The field query filter is just a filter.  It only restricts your results to things that match the basic query and have the contents you require in the field.  So if you query for "date:2006080 text" you'll be searching for documents that contain "text" in one of the default query fields and have the value 2006080 in the date field.  Leaving out text in that example would essentially be asking for nothing in the default fields and 2006080 in the date field, which is why it doesn't return any results.
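
The behaviour described above can be sketched with a toy in-memory model.
This is only an illustration under the assumption that the date clause acts
purely as a restriction on the hits of the basic query; it is not Nutch
code, and the class, method, and field layout are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FieldFilterSketch {
    /** Returns indexes of docs whose content contains term, restricted by an optional date. */
    static List<Integer> search(List<Map<String, String>> docs, String term, String date) {
        List<Integer> hits = new ArrayList<>();
        if (term == null) {
            return hits; // nothing asked of the default fields, so there is nothing to restrict
        }
        for (int i = 0; i < docs.size(); i++) {
            Map<String, String> doc = docs.get(i);
            boolean matchesTerm = doc.getOrDefault("content", "").contains(term);
            boolean passesFilter = date == null || date.equals(doc.get("date"));
            if (matchesTerm && passesFilter) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("content", "a test page", "date", "20060801"),
            Map.of("content", "another test", "date", "20060715"));
        System.out.println(search(docs, "test", "20060801")); // [0]
        System.out.println(search(docs, null, "20060801"));   // []
    }
}
```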

Hope that helps,
Jake.

-----Original Message-----
From: Matthew Holt [mailto:mholt@redhat.com]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields

I am unable to query fields in my index in the method that has been 
suggested. I used Luke to examine my index and the following field types 
exist:
anchor, boost, content, contentLength, date, digest, host, lastModified, 
primaryType, segment, site, subType, title, type, url

However, when I do a search using one of the fields, followed by a 
colon, an incorrect result is returned. I used Luke to find the top term 
in the date field which is '20060801'. I then searched using the 
following query:
date: 20060801

Unfortunately, nothing was returned. The correct plugins are enabled, 
here is an excerpt from my nutch-site.xml:

<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

Any ideas? I'm not the only one having the same problem, I saw an 
earlier mailing list post but couldn't find any resolve... Thanks,

   Matt

Re: Querying Fields

Posted by Benjamin Higgins <bh...@gmail.com>.

I had to edit DateQueryFilter.java (in
src/plugin/query-more/src/java/org/apache/nutch/searcher/more/DateQueryFilter.java)
before a query consisting of only a date clause would return results.

The relevant line is:

rangeQuery.setBoost(0.0f);                  // trigger filterization

I changed 0.0f to 1.0f
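
As a rough illustration of why that one constant matters, here is a toy
model. It assumes, based on the "trigger filterization" comment, that a
clause with a boost of zero is treated as a filter and that a query made up
only of filters returns nothing. This is not Nutch or Lucene code, and all
names are hypothetical:

```java
import java.util.List;

final class BoostSketch {
    /** A query clause reduced to the three properties this sketch needs. */
    static final class Clause {
        final String field;
        final String value;
        final float boost;

        Clause(String field, String value, float boost) {
            this.field = field;
            this.value = value;
            this.boost = boost;
        }
    }

    /** Assumption: a zero boost marks the clause as a filter rather than a scoring clause. */
    static boolean isFilter(Clause c) {
        return c.boost == 0.0f;
    }

    /** A query can return hits on its own only if at least one clause is not a filter. */
    static boolean returnsResultsAlone(List<Clause> clauses) {
        for (Clause c : clauses) {
            if (!isFilter(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Clause asFilter = new Clause("date", "20060801", 0.0f);
        Clause asScoring = new Clause("date", "20060801", 1.0f);
        System.out.println(returnsResultsAlone(List.of(asFilter)));  // false
        System.out.println(returnsResultsAlone(List.of(asScoring))); // true
    }
}
```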

More generally, I learned that it doesn't matter whether a query works in
Lucene; there has to be support for it somewhere in the Nutch query code.

I made the same change to TypeQueryFilter.java.

I also added a TitleQueryFilter since I found that there wasn't even any
code for it.  All I did was take URLQueryFilter.java and replace
super("url"); with super("title");

HTH.

Ben

On 8/2/06, Matthew Holt <mh...@redhat.com> wrote:
>
> I am unable to query fields in my index in the method that has been
> suggested. I used Luke to examine my index and the following field types
> exist:
> anchor, boost, content, contentLength, date, digest, host, lastModified,
> primaryType, segment, site, subType, title, type, url
>
> However, when I do a search using one of the fields, followed by a
> colon, an incorrect result is returned. I used Luke to find the top term
> in the date field which is '20060801'. I then searched using the
> following query:
> date: 20060801
>
> Unfortunately, nothing was returned. The correct plugins are enabled,
> here is an excerpt from my nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
> By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
>
> Any ideas? I'm not the only one having the same problem, I saw an
> earlier mailing list post but couldn't find any resolve... Thanks,
>
>    Matt
>
>
>