Posted to user@nutch.apache.org by "Dalton, Jeffery" <jd...@globalspec.com> on 2005/10/12 22:01:10 UTC

NutchAnalysis -- Distinguishing between quoted clauses (phrases) and unquoted clauses (individual terms) after parsing

Greetings,
 
I am using Nutch 0.7.1 and its NutchAnalysis parser -->
org.apache.nutch.analysis.NutchAnalysis.  I am trying to understand the
parser and I am slightly confused.  I will illustrate the problem with
an example:
 
I want to parse the query string --> "motor" motor 
 
What I expect is that the org.apache.nutch.searcher.Query returned from
calling NutchAnalysis.parseQuery(queryString) on the above string should
contain two Clauses -- one Phrase and one Term.  I expect the first one
to be a phrase because it is quoted and the second one to be a term
because it is not quoted.  However, this does not happen.  Both clauses
are Phrases.
 
Digging deeper, I see that every clause, regardless of whether it is
processed by "phrase(..)" or "compound(..)", is added via
query.addProhibitedPhrase(array,field) or
query.addRequiredPhrase(array,field).  The unquoted motor term is
processed as an "implicit phrase" and added as a Phrase.
 
I am trying to do some post-parsing query work -- query expansion --
where I need to know whether a Clause is an explicit phrase (quoted) or
an implicit phrase (not quoted).  In my case, I don't want to expand
terms present in explicit phrases, but I do want to expand terms that
are not quoted.  To do this I think I need to know whether the clause
was an explicit or an implicit phrase.  How can I determine this after
parsing?  I have had some difficulty because isPhrase() -- which
examines the type of the termOrPhrase member -- predictably always
returns true.
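 
One workaround I can imagine is to scan the raw query string myself,
before handing it to the parser, and record which tokens were quoted.
The following is only a rough sketch under several assumptions: the
class QuoteScan and its methods are hypothetical names of mine and are
not part of Nutch, and the scan ignores "+"/"-" operators and field
prefixes, so it does not reproduce the NutchAnalysis grammar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): splits a raw query into quoted and unquoted parts. */
public class QuoteScan {

    /** Raw text of quoted phrases and unquoted terms, each in order of appearance. */
    public static final class Result {
        public final List<String> quoted = new ArrayList<>();
        public final List<String> unquoted = new ArrayList<>();
    }

    public static Result scan(String query) {
        Result result = new Result();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                // A quote character closes whatever token was in progress.
                flush(current, inQuotes ? result.quoted : result.unquoted);
                inQuotes = !inQuotes;
            } else if (!inQuotes && Character.isWhitespace(c)) {
                // Outside quotes, whitespace separates individual terms.
                flush(current, result.unquoted);
            } else {
                current.append(c);
            }
        }
        // Assumption: an unterminated quote is still treated as a quoted phrase.
        flush(current, inQuotes ? result.quoted : result.unquoted);
        return result;
    }

    /** Moves the buffered text into the target list if it is non-empty, then clears the buffer. */
    private static void flush(StringBuilder buffer, List<String> target) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            target.add(text);
        }
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        Result r = scan("\"motor\" motor");
        System.out.println("quoted=" + r.quoted + " unquoted=" + r.unquoted);
    }
}
```

With something like this I could expand only the entries in "unquoted",
but it duplicates work the parser already does, which is why I would
prefer to get the information from the parsed Query itself.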
 
Am I understanding the distinction between implicit and explicit phrases
correctly?  Why does the parser not use addRequiredTerm(array[0],field)
for the non-quoted motor term?  This seems like it would fix the
problem.  Why are unquoted terms considered implicit phrases even though
they may be the only term in the query and not a phrase at all?  Is
what I am doing reasonable?  Is there a better way?
 
Any guidance would be greatly appreciated.
 
Thanks,
 
- Jeff