Posted to dev@nutch.apache.org by "Dalton, Jeffery" <jd...@globalspec.com> on 2005/10/13 15:01:54 UTC
NutchAnalysis -- Distinguishing between quoted clauses (phrases) and unquoted clauses (individual terms) after parsing
Looking back on this post, perhaps it might be better suited to the
developer list...
Thoughts anyone?
- Jeff
________________________________
From: Dalton, Jeffery
Sent: Wednesday, October 12, 2005 4:01 PM
To: nutch-user@lucene.apache.org
Subject: NutchAnalysis -- Distinguishing between quoted clauses
(phrases) and unquoted clauses (individual terms) after parsing
Greetings,
I am using Nutch 0.7.1 and am using the NutchAnalysis parser -->
org.apache.nutch.analysis.NutchAnalysis. I am trying to understand the
parser in NutchAnalysis and I am slightly confused. I will illustrate the
problem with an example:
I want to parse the query string --> "motor" motor
What I expect is that the org.apache.nutch.searcher.Query returned from
calling NutchAnalysis.parseQuery(queryString) on the above string should
contain two Clauses -- one Phrase and one Term. I expect the first one
to be a phrase because it is quoted and the second one to be a term
because it is not quoted. However, this does not happen. Both clauses
are Phrases.
Digging deeper I see that every clause, regardless of whether it is
processed by "phrase(..)" or "compound(..)" is added as
query.addProhibitedPhrase(array,field) or
query.addRequiredPhrase(array,field). The unquoted motor term is
processed as an "implicit phrase" and added as a Phrase.
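The behaviour described above can be illustrated with a toy Java model. This is only a sketch, not the Nutch source: ToyQuery, ToyClause and parseMotorExample are hypothetical stand-ins, and the model simply assumes that the quoted and the unquoted path both store their clause through the same phrase-adding call, which is why isPhrase() cannot tell them apart.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the behaviour described in the post; NOT the Nutch classes.
final class ToyQuery {

    static final class ToyClause {
        final String[] terms;
        final boolean phrase; // models "the stored termOrPhrase member is a phrase"

        ToyClause(String[] terms, boolean phrase) {
            this.terms = terms;
            this.phrase = phrase;
        }

        boolean isPhrase() {
            return phrase;
        }
    }

    private final List<ToyClause> clauses = new ArrayList<>();

    // In this model every clause, quoted or not, is stored as a phrase.
    void addRequiredPhrase(String[] terms) {
        clauses.add(new ToyClause(terms, true));
    }

    List<ToyClause> getClauses() {
        return clauses;
    }

    // Models the result of parsing the query string: "motor" motor
    static ToyQuery parseMotorExample() {
        ToyQuery q = new ToyQuery();
        q.addRequiredPhrase(new String[] {"motor"}); // quoted: explicit phrase
        q.addRequiredPhrase(new String[] {"motor"}); // unquoted: implicit phrase
        return q;
    }
}
```

After parsing, both clauses in this model hold the same terms and answer isPhrase() identically, so the information about quoting is lost.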
I am trying to do some post-parsing query work -- query expansion --
where I need to know whether a Clause is an explicit phrase (quoted) or
an implicit phrase (not). In my case, I don't want to expand terms
present in explicit phrases, but I do want to expand terms that are not
quoted. To do this I think I need to know whether the clause was an
explicit or an implicit phrase. How can I determine this after
parsing? Right now this is impossible because isPhrase() -- which
examines the termOrPhrase member type -- predictably always returns
true.
Am I understanding the distinction between implicit and explicit phrases
correctly? Why does the parser not use addRequiredTerm(array[0],field)
for the non-quoted motor term? This seems like it would fix the
problem. Why are unquoted terms considered implicit phrases even though
they may be the only term in the query and not a phrase at all? Is
what I am doing reasonable? Is there a better way?
Any guidance would be greatly appreciated.
Thanks,
- Jeff
Re: NutchAnalysis -- Distinguishing between quoted clauses (phrases)
and unquoted clauses (individual terms) after parsing
Posted by Doug Cutting <cu...@nutch.org>.
Dalton, Jeffery wrote:
> I am trying to do some post-parsing query work -- query expansion --
> where I need to know whether a Clause is an explicit phrase (quoted) or
> an implicit phrase (not). In my case, I don't want to expand terms
> present in explicit phrases, but I do want to expand terms that are not
> quoted. To do this I think I need to know whether the clause was an
> explicit or an implicit phrase. How can I determine this after
> parsing?
Perhaps we should add a Query.Phrase.isImplicit() method that
NutchAnalysis.jj sets when the compound() rule matches? If that
suffices, please submit a patch.
Doug
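A rough sketch of how the suggested isImplicit() flag might be used for query expansion follows. Everything in it is an assumption made for illustration: ImplicitPhraseSketch, its nested Phrase class, the expansionsFor helper and the synonym map are hypothetical and are not part of Nutch; in a real patch the flag would presumably live on Query.Phrase and be set from NutchAnalysis.jj as proposed above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the suggestion above; none of these types are Nutch classes.
final class ImplicitPhraseSketch {

    static final class Phrase {
        final List<String> terms;
        final boolean implicit; // assumed to be set by the parser for unquoted clauses

        Phrase(List<String> terms, boolean implicit) {
            this.terms = terms;
            this.implicit = implicit;
        }

        boolean isImplicit() {
            return implicit;
        }
    }

    // Collect expansion candidates for terms of implicit (unquoted) phrases only;
    // terms inside explicit (quoted) phrases are skipped.
    static List<String> expansionsFor(List<Phrase> clauses,
                                      Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (Phrase p : clauses) {
            if (!p.isImplicit()) {
                continue; // quoted phrase: leave untouched
            }
            for (String term : p.terms) {
                out.addAll(synonyms.getOrDefault(term, Collections.<String>emptyList()));
            }
        }
        return out;
    }
}
```

How the returned candidates are merged back into the query (as optional or required clauses) is left open here; the sketch only shows that a per-clause flag is enough to decide which terms to expand.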