Posted to dev@nutch.apache.org by "Dalton, Jeffery" <jd...@globalspec.com> on 2005/10/13 15:01:54 UTC

NutchAnalysis -- Distinguishing between quoted clauses (phrases) and unquoted clauses (individual terms) after parsing

Looking back on this post, perhaps it might be better suited to the
developer list...
 
Thoughts anyone?
 
- Jeff

________________________________

From: Dalton, Jeffery 
Sent: Wednesday, October 12, 2005 4:01 PM
To: nutch-user@lucene.apache.org
Subject: NutchAnalysis -- Distinguishing between quoted clauses
(phrases) and unquoted clauses (individual terms) after parsing


Greetings,
 
I am using Nutch 0.7.1 and its NutchAnalysis parser -->
org.apache.nutch.analysis.NutchAnalysis.  I am trying to understand the
parser and I am slightly confused.  I will illustrate the
problem with an example:
 
I want to parse the query string --> "motor" motor 
 
What I expect is that the org.apache.nutch.searcher.Query returned from
calling NutchAnalysis.parseQuery(queryString) on the above string should
contain two Clauses -- one Phrase and one Term.  I expect the first one
to be a phrase because it is quoted and the second one to be a term
because it is not quoted.  However, this does not happen.  Both clauses
are Phrases.
 
Digging deeper, I see that every clause, regardless of whether it is
processed by "phrase(..)" or "compound(..)", is added via
query.addProhibitedPhrase(array,field) or
query.addRequiredPhrase(array,field).  The unquoted motor term is
processed as an "implicit phrase" and added as a Phrase.
 
I am trying to do some post-parsing query work -- query expansion --
where I need to know whether a Clause is an explicit phrase (quoted) or
an implicit phrase (not).  In my case, I don't want to expand terms
present in explicit phrases, but I do want to expand terms that are not
quoted.  To do this I think I need to know whether the clause was an
explicit or an implicit phrase.  How can I determine this after
parsing?  Right now this is impossible because isPhrase() -- which
examines the termOrPhrase member type -- predictably always returns
true.
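
To make the problem concrete, here is a toy model of the behaviour described above. It is only a sketch: ToyClause, ToyQuery and ImplicitPhraseProblem are hypothetical stand-ins written for this illustration, not the real org.apache.nutch.searcher.Query classes, and parseAsDescribed() simply hard-codes what the parser is reported to do for the query "motor" motor.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a parsed clause; NOT the real Query.Clause.
final class ToyClause {
    // Holds a String for a single term or a String[] for a phrase.
    private final Object termOrPhrase;

    ToyClause(Object termOrPhrase) {
        this.termOrPhrase = termOrPhrase;
    }

    // Decides purely from the stored type, as the post describes.
    boolean isPhrase() {
        return termOrPhrase instanceof String[];
    }
}

// Toy stand-in for the query object the parser fills in.
final class ToyQuery {
    final List<ToyClause> clauses = new ArrayList<>();

    void addRequiredTerm(String term) {
        clauses.add(new ToyClause(term));
    }

    void addRequiredPhrase(String[] terms) {
        clauses.add(new ToyClause(terms));
    }
}

final class ImplicitPhraseProblem {
    // Hard-coded imitation of the reported parse of: "motor" motor
    static ToyQuery parseAsDescribed() {
        ToyQuery q = new ToyQuery();
        q.addRequiredPhrase(new String[] {"motor"}); // quoted: explicit phrase
        q.addRequiredPhrase(new String[] {"motor"}); // unquoted: implicit phrase
        return q;
    }

    public static void main(String[] args) {
        // Prints "isPhrase = true" for both clauses.
        for (ToyClause c : parseAsDescribed().clauses) {
            System.out.println("isPhrase = " + c.isPhrase());
        }
    }
}
```

Because both clauses are stored as one-element arrays, nothing that survives parsing records whether the user typed the quotes.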
 
Am I understanding the distinction between implicit and explicit phrases
correctly?  Why does the parser not use addRequiredTerm(array[0],field)
for the non-quoted motor term?  This seems like it would fix the
problem.  Why are unquoted terms considered implicit phrases even though
they may be the only term in the query and not a phrase at all?  Is
what I am doing reasonable?  Is there a better way?
 
Any guidance would be greatly appreciated.
 
Thanks,
 
- Jeff
 
 

Re: NutchAnalysis -- Distinguishing between quoted clauses (phrases) and unquoted clauses (individual terms) after parsing

Posted by Doug Cutting <cu...@nutch.org>.
Dalton, Jeffery wrote:
> I am trying to do some post-parsing query work -- query expansion --
> where I need to know whether a Clause is an explicit phrase (quoted) or
> an implicit phrase (not).  In my case, I don't want to expand terms
> present in explicit phrases, but I do want to expand terms that are not
> quoted.  To do this I think I need to know whether the clause was an
> explicit or an implicit phrase.  How can I determine this after
> parsing?

Perhaps we should add a Query.Phrase.isImplicit() method that 
NutchAnalysis.jj sets when the compound() rule matches?  If that 
suffices, please submit a patch.

Doug
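
As a rough illustration of the isImplicit() idea, the sketch below shows how query expansion might consult such a flag. Everything here is an assumption made for the example: SketchPhrase and ExpansionSketch are hypothetical classes, not the real Query.Phrase, and the synonym map stands in for whatever expansion source is actually used. In a real patch the flag would be set from NutchAnalysis.jj when the compound() rule matches; here it is just passed to the constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical phrase carrying the proposed flag; NOT the real Query.Phrase.
final class SketchPhrase {
    private final String[] terms;
    private final boolean implicit;

    SketchPhrase(String[] terms, boolean implicit) {
        this.terms = terms.clone();
        this.implicit = implicit;
    }

    String[] getTerms() {
        return terms.clone();
    }

    // True when the phrase came from unquoted input.
    boolean isImplicit() {
        return implicit;
    }
}

final class ExpansionSketch {
    // Explicit (quoted) phrases are returned unchanged; for implicit ones,
    // each term is followed by its synonyms, if the map has any.
    static List<String> expand(SketchPhrase phrase, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String term : phrase.getTerms()) {
            out.add(term);
            if (phrase.isImplicit()) {
                List<String> extra = synonyms.get(term);
                if (extra != null) {
                    out.addAll(extra);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("motor", Arrays.asList("engine"));

        SketchPhrase quoted = new SketchPhrase(new String[] {"motor"}, false);
        SketchPhrase unquoted = new SketchPhrase(new String[] {"motor"}, true);

        System.out.println(expand(quoted, synonyms));   // [motor]
        System.out.println(expand(unquoted, synonyms)); // [motor, engine]
    }
}
```

Keeping the flag on the phrase itself means existing callers of isPhrase() would not need to change; only code that cares about quoting would have to look at it.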