Posted to dev@nutch.apache.org by Chris Mattmann <ch...@jpl.nasa.gov> on 2005/10/16 19:53:03 UTC

developing a parse-/index-/query- plugin set

Hi Folks,

 

  I was wondering if anybody could give me some advice on what I'm doing
wrong in the following situation. 

 

I am trying to fetch and search some bioinformatics data with specific data
elements that I want to parse out, index, and search on. For instance, for
each page of data I fetch, I would like to store things like PROTOCOL_ID
and CONTACT_EMAIL.

To go about this, I first wrote a parse-specimen plugin to extract the
specific metadata elements I wanted to index. I have tested and verified
that this part of the process is working: I instrumented the code with
LOG.log calls and confirmed that, after the page content is fetched, the
metadata is added to the Properties object that is sent back with the
ParseImpl.

Next, I wrote an index-specimen plugin that takes the reconstructed parse
data (as all indexing plugins do), gets out the specific properties that I
captured during the parse, adds them to the Lucene document, and returns
the document. I have verified that this portion of the process is working
as well: again using LOG.log calls, I confirmed that the fields are added
to the Document object, which is then returned.

At that point I thought I could just deploy and start up the Nutch web app
and run queries like "PROTOCOL_ID:36.0" and
"CONTACT_EMAIL:chris.mattmann@jpl.nasa.gov", and that, since the metadata
was stored in the index, the hits would come back. However, I found out
that this wasn't the case. After some snooping around, it seems that for
such a query to work, one also needs to write a query-xxx plugin that
declares its support for the specific indexed fields that you want to
search on. I've been trying to do this for the last day and a half, and
for the life of me, I can't get the thing working. Could someone give me
some help or suggestions on how to do this?

To write the query-specimen plugin that I have now (which doesn't work), I
used the query-more plugin as a model. To test whether I could at least get
the PROTOCOL_ID and CONTACT_EMAIL queries working, I wrote two classes,
ProtocolIDQueryFilter and ContactEmailQueryFilter, which simply extend
RawFieldQueryFilter and pass "PROTOCOL_ID" and "CONTACT_EMAIL",
respectively, to its constructor; this is what I saw in the query-more
plugin. Additionally, in the plugin.xml file for the query-specimen
plugin, I've declared that my plugin supports those two raw fields, in the
following fashion:

 

 

   <extension id="gov.nasa.jpl.edrn.nutch.searcher.specimen"
              name="Specimen Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="ProtocolIDQueryFilter"
                      class="gov.nasa.jpl.edrn.nutch.searcher.specimen.ProtocolIDQueryFilter"
                      raw-fields="PROTOCOL_ID"/>
   </extension>

   <extension id="gov.nasa.jpl.edrn.nutch.searcher.specimen"
              name="Specimen Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="ContactEmailQueryFilter"
                      class="gov.nasa.jpl.edrn.nutch.searcher.specimen.ContactEmailQueryFilter"
                      raw-fields="CONTACT_EMAIL "/>
   </extension>
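
For reference, the two filter classes described before the plugin.xml
excerpt can be sketched roughly as follows. This is only a minimal sketch:
so that it compiles on its own, it uses a small stand-in for
org.apache.nutch.searcher.RawFieldQueryFilter that models nothing but a
field-name constructor, and the getField() accessor is a hypothetical helper
added for illustration. In the real plugin the classes would extend the
Nutch class directly and live in the
gov.nasa.jpl.edrn.nutch.searcher.specimen package.

```java
// Stand-in for org.apache.nutch.searcher.RawFieldQueryFilter so that this
// sketch compiles without Nutch on the classpath. Assumption: only the
// field-name constructor is modeled here.
abstract class RawFieldQueryFilter {
    private final String field;

    protected RawFieldQueryFilter(String field) {
        this.field = field;
    }

    // Hypothetical accessor, added only so the field name can be inspected.
    public String getField() {
        return field;
    }
}

// Names PROTOCOL_ID as the raw field this filter is meant to cover.
class ProtocolIDQueryFilter extends RawFieldQueryFilter {
    public ProtocolIDQueryFilter() {
        super("PROTOCOL_ID");
    }
}

// Names CONTACT_EMAIL as the raw field this filter is meant to cover.
class ContactEmailQueryFilter extends RawFieldQueryFilter {
    public ContactEmailQueryFilter() {
        super("CONTACT_EMAIL");
    }
}
```

The field name passed to the constructor is presumably expected to match the
raw-fields attribute declared for the same class in plugin.xml.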

   

 

However, after rebuilding the Nutch webapp with the query-specimen plugin
enabled (I have verified via the log files that it is actually enabled),
queries such as "PROTOCOL_ID:36.0" still don't work. I've verified that the
fields were indexed correctly and that 36.0 is actually a valid value for
PROTOCOL_ID: when I run a regular query that I know returns hits (I've only
indexed 3 documents so far) and click on the "explain" link, it shows all
the fields that I wanted to query on (such as PROTOCOL_ID and
CONTACT_EMAIL) along with the value of each field, such as
PROTOCOL_ID = 36.0. So, now I'm stuck. I can't get the queries to work, and
if anyone can help me with this, I would be really appreciative.

One more thing: it turns out that a lot of my fields hold numeric-like
values, such as 36.0, 2.0, etc. However, I indexed them as Field.Text()
values in the Lucene document. I've never done this before, so if that was
the wrong thing to do, might that be the problem? Here is the snippet of
code in my index-specimen plugin where I index the fields:

 

public Document filter(Document doc, Parse parse, FetcherOutput fo)
        throws IndexingException {

    // get the parse metadata
    Properties metadata = parse.getData().getMetadata();

    for (int i = 0; i < edrnCDES.length; i++) {
        String key = edrnCDES[i];
        String val = (String) metadata.get(key);
        if (val != null) {
            LOG.log(Level.INFO, "SpecimenIndexer:adding [" + key + "=>" + val + "]");
            doc.add(Field.Text(key, val));
        }
    }

    return doc;
}

 

 

"edrnCDES" is an array of the field names I want to index, such as
"PROTOCOL_ID" and "CONTACT_EMAIL". So, does the fact that some of these
fields hold numerical values make a difference, even though I'm indexing
them as text? One thing I do know is that queries on non-numerical fields,
e.g., CONTACT_EMAIL, aren't working either, so I suspect that the numerical
value issue isn't what's causing my problem.
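
One way to picture the Field.Text() question: in Lucene 1.x,
Field.Text(String, String) creates a tokenized field, so the terms that end
up in the index may differ from the original string. The sketch below is a
toy inverted index, not Lucene, and its tokenizer (lowercase, then split on
non-alphanumeric characters) is purely an assumption for illustration; it
is not claimed to match the analyzer Nutch actually uses. It only shows why
an exact-term lookup for a value like 36.0 can miss once the value has been
tokenized. Whether this is what is happening in the plugin is unconfirmed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index (NOT Lucene): maps a term to the ids of the documents
// that contain it.
class ToyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Index the value as a single, untokenized term.
    void addWhole(int docId, String value) {
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    // Index the value through a hypothetical analyzer: lowercase it and
    // split it on anything that is not a letter or a digit.
    void addTokenized(int docId, String value) {
        for (String token : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
            }
        }
    }

    // Exact-term lookup: the query string is used as-is, with no analysis.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }
}
```

With this toy tokenizer, a value added through addTokenized is stored as
several shorter terms, so a lookup for the full original string finds
nothing, while the same value added through addWhole is found. An email
address would be split in the same way, which would be consistent with
CONTACT_EMAIL failing as well.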

 

If anyone can provide any help on this, again, I would appreciate it. Thanks
a lot!

 

Cheers,

  Chris

 

 

 

 

______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.