Posted to solr-user@lucene.apache.org by Pooja Verlani <po...@gmail.com> on 2009/12/02 08:20:45 UTC

Hierarchical xml

Hi,
I want to index an xml like following:

<officer>
<name>John</name>
<dob>1979-29-17T28:14:48Z</dob>
<collegeGroup>
        <college>
               <name>ABC College</name>
               <year>1998</year>
         </college>
         <college>
               <name>PQRS College</name>
               <year>2001</year>
         </college>
          <college>
               <name>XYZ College</name>
               <year>2003</year>
         </college>
</collegeGroup>
</officer>

I can't work out what the schema should look like.
Also, if I flatten the XML and make the college name and year multivalued
fields, like this:
<college_name>ABC College, PQRS College, XYZ College</college_name>
<college_year>1998,2001,2003</college_year>

then I lose the correspondence between ABC College and the year
1998.

If anyone knows an efficient way around this, please share.
Thanks in advance.

Regards,
Pooja

Re: Hierarchical xml

Posted by Age Jan Kuperus <Ag...@wur.nl>.
Pooja Verlani wrote:
> Hi,
> I want to index an xml like following:
> 
> <officer>
> <name>John</name>
> <dob>1979-29-17T28:14:48Z</dob>
> <collegeGroup>
>         <college>
>                <name>ABC College</name>
>                <year>1998</year>
>          </college>
>          <college>
>                <name>PQRS College</name>
>                <year>2001</year>
>          </college>
>           <college>
>                <name>XYZ College</name>
>                <year>2003</year>
>          </college>
> </collegeGroup>
> </officer>
> 

At Wageningen UR Library our data is completely XML based. We are in the process of replacing
Oracle Text with Solr as our back-end engine.

This is how we would do it at Wageningen UR Library (I actually ran your example through a
minimally modified version of the XSLT we use for the transformation):

_id_ is the unique id derived from the outer element (we actually combine it with an
attribute here).
_data_ is a stored-only field that reproduces the complete record in escaped form (or CDATA, which
is identical at the Solr level), because Solr doesn't accept XML as field data.
All other fields whose names do not end in _s are text fields, representing all full and partial
paths to the data.
The _s fields are string fields, copying the same data for faceting, sorting and (facet) filtering.


<?xml version="1.0" encoding="utf-8"?>
<add>
   <doc>
     <field name="_id_">officer/</field>
     <field name="_data_">&lt;officer&gt;
&lt;name&gt;John&lt;/name&gt;
&lt;dob&gt;1979-29-17T28:14:48Z&lt;/dob&gt;
&lt;collegeGroup&gt;
         &lt;college&gt;
                &lt;name&gt;ABC College&lt;/name&gt;
                &lt;year&gt;1998&lt;/year&gt;
          &lt;/college&gt;
          &lt;college&gt;
                &lt;name&gt;PQRS College&lt;/name&gt;
                &lt;year&gt;2001&lt;/year&gt;
          &lt;/college&gt;
           &lt;college&gt;
                &lt;name&gt;XYZ College&lt;/name&gt;
                &lt;year&gt;2003&lt;/year&gt;
          &lt;/college&gt;
&lt;/collegeGroup&gt;
&lt;/officer&gt;</field>
     <field name="officer">John 1979-29-17T28:14:48Z ABC College 1998 PQRS College 2001 XYZ College 2003</field>
     <field name="name">John</field>
     <field name="name_s">John</field>
     <field name="officer/name">John</field>
     <field name="officer/name_s">John</field>
     <field name="dob">1979-29-17T28:14:48Z</field>
     <field name="dob_s">1979-29-17T28:14:48Z</field>
     <field name="officer/dob">1979-29-17T28:14:48Z</field>
     <field name="officer/dob_s">1979-29-17T28:14:48Z</field>
     <field name="collegeGroup">ABC College 1998 PQRS College 2001 XYZ College 2003</field>
     <field name="collegeGroup_s">ABC College 1998 PQRS College 2001 XYZ College 2003</field>
     <field name="officer/collegeGroup">ABC College 1998 PQRS College 2001 XYZ College 2003</field>
     <field name="officer/collegeGroup_s">ABC College 1998 PQRS College 2001 XYZ College 2003</field>
     <field name="college">ABC College 1998</field>
     <field name="college_s">ABC College 1998</field>
     <field name="collegeGroup/college">ABC College 1998</field>
     <field name="collegeGroup/college_s">ABC College 1998</field>
     <field name="officer/collegeGroup/college">ABC College 1998</field>
     <field name="officer/collegeGroup/college_s">ABC College 1998</field>
     <field name="name">ABC College</field>
     <field name="name_s">ABC College</field>
     <field name="college/name">ABC College</field>
     <field name="college/name_s">ABC College</field>
     <field name="collegeGroup/college/name">ABC College</field>
     <field name="collegeGroup/college/name_s">ABC College</field>
     <field name="officer/collegeGroup/college/name">ABC College</field>
     <field name="officer/collegeGroup/college/name_s">ABC College</field>
     <field name="year">1998</field>
     <field name="year_s">1998</field>
     <field name="college/year">1998</field>
     <field name="college/year_s">1998</field>
     <field name="collegeGroup/college/year">1998</field>
     <field name="collegeGroup/college/year_s">1998</field>
     <field name="officer/collegeGroup/college/year">1998</field>
     <field name="officer/collegeGroup/college/year_s">1998</field>
     <field name="college">PQRS College 2001</field>
     <field name="college_s">PQRS College 2001</field>
     <field name="collegeGroup/college">PQRS College 2001</field>
     <field name="collegeGroup/college_s">PQRS College 2001</field>
     <field name="officer/collegeGroup/college">PQRS College 2001</field>
     <field name="officer/collegeGroup/college_s">PQRS College 2001</field>
     <field name="name">PQRS College</field>
     <field name="name_s">PQRS College</field>
     <field name="college/name">PQRS College</field>
     <field name="college/name_s">PQRS College</field>
     <field name="collegeGroup/college/name">PQRS College</field>
     <field name="collegeGroup/college/name_s">PQRS College</field>
     <field name="officer/collegeGroup/college/name">PQRS College</field>
     <field name="officer/collegeGroup/college/name_s">PQRS College</field>
     <field name="year">2001</field>
     <field name="year_s">2001</field>
     <field name="college/year">2001</field>
     <field name="college/year_s">2001</field>
     <field name="collegeGroup/college/year">2001</field>
     <field name="collegeGroup/college/year_s">2001</field>
     <field name="officer/collegeGroup/college/year">2001</field>
     <field name="officer/collegeGroup/college/year_s">2001</field>
     <field name="college">XYZ College 2003</field>
     <field name="college_s">XYZ College 2003</field>
     <field name="collegeGroup/college">XYZ College 2003</field>
     <field name="collegeGroup/college_s">XYZ College 2003</field>
     <field name="officer/collegeGroup/college">XYZ College 2003</field>
     <field name="officer/collegeGroup/college_s">XYZ College 2003</field>
     <field name="name">XYZ College</field>
     <field name="name_s">XYZ College</field>
     <field name="college/name">XYZ College</field>
     <field name="college/name_s">XYZ College</field>
     <field name="collegeGroup/college/name">XYZ College</field>
     <field name="collegeGroup/college/name_s">XYZ College</field>
     <field name="officer/collegeGroup/college/name">XYZ College</field>
     <field name="officer/collegeGroup/college/name_s">XYZ College</field>
     <field name="year">2003</field>
     <field name="year_s">2003</field>
     <field name="college/year">2003</field>
     <field name="college/year_s">2003</field>
     <field name="collegeGroup/college/year">2003</field>
     <field name="collegeGroup/college/year_s">2003</field>
     <field name="officer/collegeGroup/college/year">2003</field>
     <field name="officer/collegeGroup/college/year_s">2003</field>
   </doc>
</add>
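
For readers who want to experiment with this layout without an XSLT pipeline, the following is a minimal Python sketch of the same idea: every element contributes one text field and one _s field for each full and partial path that ends at that element. It is not the Wageningen UR stylesheet. The function names (element_text, path_fields) are made up for illustration, and unlike the listing above it also emits an _s copy for the root element and does not produce the _id_ or _data_ fields.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """Collapse all text below elem into one space-separated string."""
    return " ".join(" ".join(elem.itertext()).split())


def path_fields(elem, ancestors=()):
    """Yield (field_name, value) pairs for elem and all of its descendants.

    Each element is emitted once per suffix of its path, shortest first,
    e.g. name, college/name, collegeGroup/college/name, ...
    """
    path = ancestors + (elem.tag,)
    value = element_text(elem)
    for start in range(len(path) - 1, -1, -1):
        name = "/".join(path[start:])
        yield name, value          # text field (assumed analysed/tokenised)
        yield name + "_s", value   # string copy for faceting and sorting
    for child in elem:
        yield from path_fields(child, path)


# Sample record taken from the original question.
record = """<officer>
<name>John</name>
<dob>1979-29-17T28:14:48Z</dob>
<collegeGroup>
  <college><name>ABC College</name><year>1998</year></college>
  <college><name>PQRS College</name><year>2001</year></college>
  <college><name>XYZ College</name><year>2003</year></college>
</collegeGroup>
</officer>"""

fields = list(path_fields(ET.fromstring(record)))

if __name__ == "__main__":
    for name, value in fields:
        print(f'<field name="{name}">{value}</field>')
```

Note that the college and college_s fields keep each college name next to its year ("ABC College 1998"), which is what preserves the name/year correspondence that is lost when the two are flattened into separate multivalued fields.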


Age Jan Kuperus


Re: Hierarchical xml

Posted by Sascha Szott <sz...@zib.de>.
Pooja,

Have a look at Solr's DataImportHandler. Its XPathEntityProcessor [1] should
suit your needs.

Best,
Sascha

[1] http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
