Posted to solr-user@lucene.apache.org by Fergus McMenemie <fe...@twig.me.uk> on 2009/01/19 11:41:57 UTC

Re: getting DIH to read my XML files: solved

Shalin, thanks for the pointer.

The following data-config.xml worked. The trick was realising
that EVERY entity tag needs to have its own dataSource. I guess
I had been assuming that it was implicit for certain processors.

The whole thing is confusing in that there are both the dataSource
element(s), which are to all intents and purposes required, and an
optional dataSource attribute on the entity element. If the entity's
dataSource attribute is missing, it seems to default to one of the
defined ones, unless you are using FileListEntityProcessor, where you
have to state explicitly that you are not using a dataSource.

As a newbie, I think the lesson learnt is to name every dataSource
element I define and to reference a named dataSource from every
entity element I add, except for FileListEntityProcessor, where
it has to be set to null.

   <dataConfig>
     <document>
       <dataSource name="myfilereader" type="FileDataSource"/>
       <!-- Outer entity: lists the files; it reads no data source itself -->
       <entity name="jcurrent"
               processor="FileListEntityProcessor"
               dataSource="null"
               fileName=".*xml"
               newerThan="'NOW-1000DAYS'"
               recursive="true"
               rootEntity="false"
               baseDir="/Volumes/spare/ts/j/groups">
         <!-- Inner entity: parses each listed file via the named data source -->
         <entity name="x"
                 processor="XPathEntityProcessor"
                 dataSource="myfilereader"
                 url="${jcurrent.fileAbsolutePath}"
                 stream="false"
                 forEach="/record"
                 transformer="DateFormatTransformer">
           <field column="title"     xpath="/record/title"/>
           <field column="subject"   xpath="/record/metadata/subject[@qualifier='fullTitle']"/>
           <field column="text"      xpath="/para"/>
           <field column="pubname"   xpath="/record/metadata/subject[@qualifier='publication']"/>
           <field column="pubabrev"  xpath="/record/metadata/subject[@qualifier='pubAbbrev']"/>
           <field column="pubdate"   xpath="/record/metadata/date[@qualifier='pubDate']"/>
         </entity>
       </entity>
     </document>
   </dataConfig>


Regards Fergus.
-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: getting DIH to read my XML files: solved

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Hi Fergus,

The idea here is that if you do not give a name to your data source, then it
is the 'default' data source and gets used automatically by all entities. If
you decide to give a name to the data source, then it should be specified
for each entity.

Even when you have multiple data sources, you can decide to give a name to
only one of them. The un-named one will be used for all entities which do
not have a dataSource attribute.
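
To illustrate, here is a minimal, untested sketch of that rule. The
data source name "archive", its basePath, the "y" entity and its url
are made up for the example; the rest is borrowed from your config,
with the field list cut down to one column.

   <dataConfig>
     <!-- Un-named: the default for entities with no dataSource attribute -->
     <dataSource type="FileDataSource"/>
     <!-- Named: only used by entities that ask for it by name -->
     <dataSource name="archive" type="FileDataSource"
                 basePath="/path/to/archive"/>
     <document>
       <entity name="jcurrent"
               processor="FileListEntityProcessor"
               dataSource="null"
               fileName=".*xml"
               recursive="true"
               rootEntity="false"
               baseDir="/Volumes/spare/ts/j/groups">
         <!-- No dataSource attribute: falls back to the un-named one -->
         <entity name="x"
                 processor="XPathEntityProcessor"
                 url="${jcurrent.fileAbsolutePath}"
                 forEach="/record">
           <field column="title" xpath="/record/title"/>
         </entity>
       </entity>
       <!-- Hypothetical second entity that picks the named data source -->
       <entity name="y"
               processor="XPathEntityProcessor"
               dataSource="archive"
               url="index.xml"
               forEach="/record">
         <field column="title" xpath="/record/title"/>
       </entity>
     </document>
   </dataConfig>

In this sketch only "y" needs a dataSource attribute. If you gave the
first data source a name as well, "x" would have to reference it
explicitly, which is what your working config does.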

Hope that helps.

-- 
Regards,
Shalin Shekhar Mangar.