Posted to solr-user@lucene.apache.org by Fergus McMenemie <fe...@twig.me.uk> on 2009/02/16 10:52:35 UTC

DIH transformers

Hello.

I have been beating my head around the data-config.xml listed
at the end of this message. It breaks in a few different ways.

  1) I have bodged TemplateTransformer to allow it to return
     when one of the variables is undefined. This ensures my
     uniqueKey is always defined. But thinking more on
     Noble's comments, there is use in having it work both ways,
     i.e. leaving the column undefined or replacing the variable
     with "". I still like my idea about using the default
     value of a Solr field from schema.xml, but I can't figure
     out how/where best to implement it.

  2) Having used TemplateTransformer to assign a value to an
     entity column, that column cannot be used in other
     TemplateTransformer operations. In my project I am
     attempting to reuse "x.fileWebPath". To fix this, the
     last line of transformRow() in TemplateTransformer.java
     needs to be replaced with the following, which as well as
     'putting' the templated string in 'row' also saves it
     into the 'resolver'.

     **originally**
      row.put(column, resolver.replaceTokens(expr));
      }

     **new**
      String columnName = map.get(DataImporter.COLUMN);
      expr = resolver.replaceTokens(expr);
      row.put(columnName, expr);
      // also expose the new value to later templates in this row
      resolverMapCopy.put(columnName, expr);
      }

     As an aside, I think I ran into the issues covered by
     SOLR-993. It took a while to figure out I could not add
     a single columnname/value to the resolver. I had instead
     to add to the map that was already stored within the
     resolver.

  3) No entity column names can be used within RegexTransformer.
     I guess all the stuff that was added to TemplateTransformer
     to allow column names to be used in templates needs to be
     re-added into RegexTransformer. I am doing that now... but am
     confused by the fragment of code which copies from resolverMap
     into resolverMapCopy. As best I can see, resolverMap is always
     empty; but I am barely able to follow the code! Can somebody
     explain when/why resolverMap would be populated?

     Also, I begin to understand comments made by Noble in
     SOLR-1001 about resolving "entity attributes in
     ContextImpl.getEntityAttribute" and I guess Shalin was
     right as well. However, it also seems wrong that at the
     top of every transformer we are going to repeat the
     same code to load the resolver with information about the
     entity.

  4) In that I am reusing template output within other templates,
     the order of execution becomes important. Can I assume that
     the explicitly listed columns in an entity are processed by
     the various transformers in the order they appear within
     data-config.xml? I *think* that the list of columns within
     an entity as returned by getAllEntityFields() is actually
     an ArrayList, which I think is order dependent. Is this
     correct?

  5) Should I raise this as a single JIRA issue?

  6) Having played with this stuff, I was going to add a bit
     more to the wiki highlighting some of the possibilities
     and issues with transformers. But want to check with the 
     list first!
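
To make points 2 and 4 concrete, here is a minimal standalone sketch of the behaviour I am after. It is not the DIH source: the class ChainedTemplates and its methods are made up for illustration, and the real TemplateTransformer and VariableResolver work differently in detail. Fields are processed in declaration order, and each templated value is published under "entity.column" in a per-row copy of the scope, so later templates can use it while the outer scope is left alone.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a template transformer; not the DIH class.
class ChainedTemplates {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with its value from scope; unknown names become "".
    static String replaceTokens(String template, Map<String, String> scope) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = scope.containsKey(m.group(1)) ? scope.get(m.group(1)) : "";
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // fields maps column -> template and is iterated in insertion order.
    static Map<String, String> transformRow(String entity,
                                            LinkedHashMap<String, String> fields,
                                            Map<String, String> outerScope) {
        // Work on a per-row copy so the caller's scope is never modified.
        Map<String, String> scope = new HashMap<String, String>(outerScope);
        Map<String, String> row = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            String value = replaceTokens(f.getValue(), scope);
            row.put(f.getKey(), value);
            // Make the new column visible to the fields that follow it.
            scope.put(entity + "." + f.getKey(), value);
        }
        return row;
    }
}
```

With this arrangement a field can only see columns declared before it, which is why the ordering question in point 4 matters.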


   <dataConfig>
     <dataSource name="myfilereader" type="FileDataSource"/>
     <document>
       <entity name="jc"
               processor="FileListEntityProcessor"
               fileName="^.*\.xml$"
               newerThan="'NOW-1000DAYS'"
               recursive="true"
               rootEntity="false"
               dataSource="null"
               baseDir="/Volumes/spare/ts/solr/content">
         <entity name="x"
                 dataSource="myfilereader"
                 processor="XPathEntityProcessor"
                 url="${jc.fileAbsolutePath}"
                 rootEntity="true"
                 stream="false"
                 forEach="/record | /record/mediaBlock"
                 transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">

           <field column="fileAbsolutePath"  template="${jc.fileAbsolutePath}" />
           <field column="fileWebPath"       regex="${x.test}(.*)" replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
           <field column="title"             xpath="/record/title" />
           <field column="para1" name="para" xpath="/record/sect1/para" />
           <field column="para2" name="para" xpath="/record/list/listitem/para" />
           <field column="pubdate"           xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd" />

           <field column="vurl"              xpath="/record/mediaBlock/mediaObject/@vurl" />
           <field column="imgSrcArticle"     template="${dataimporter.request.fordinstalldir}" />
           <field column="imgCpation"        xpath="/record/mediaBlock/caption" />

           <field column="test"              template="${dataimporter.request.contentinstalldir}" />
           <!-- **problem is that vurl is just a fragment of the info needed to access the picture. -->
           <field column="imgWebPathICON"    regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
           <field column="imgWebPathFULL"    regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}.jpg"  sourceColName="fileWebPath"/>
           <field column="vdkvgwkey"         template="${jc.fileAbsolutePath}#${x.vurl}" />
         </entity>
       </entity>
     </document>
   </dataConfig>

Regards Fergus.

-- 

===============================================================
Fergus McMenemie               Email:fergus@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================

Re: DIH transformers - sect 2

Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>>
>>  2) Having used TemplateTransformer to assign a value to an
>>     entity column, that column cannot be used in other
>>     TemplateTransformer operations. In my project I am
>>     attempting to reuse "x.fileWebPath". To fix this, the
>>     last line of transformRow() in TemplateTransformer.java
>>     needs to be replaced with the following, which as well as
>>     'putting' the templated string in 'row' also saves it
>>     into the 'resolver'.
>>
>>     **originally**
>>      row.put(column, resolver.replaceTokens(expr));
>>      }
>>
>>     **new**
>>      String columnName = map.get(DataImporter.COLUMN);
>>      expr=resolver.replaceTokens(expr);
>>      row.put(columnName, expr);
>>      resolverMapCopy.put(columnName, expr);
>>      }
>
>Isn't it better to write a custom transformer to achieve this? I did
>not want a standard component to change the state of the
>VariableResolver.
>
>I am not sure what is the best way.
>

Noble, (Good to have email working :-)

Hmm not sure why this requires a custom transformer. Why is this not 
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the 
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the 
VariableResolver.

Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?
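
To illustrate what I mean by "temporarily", here is a small standalone sketch. ScopedResolver and RowScope are hypothetical names invented for this example, and the real VariableResolver API may well differ. The row is registered under the entity name only while the transformers run and is removed again in a finally block, so the resolver ends up in the state it started in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver; the real VariableResolver API may differ.
class ScopedResolver {
    private final Map<String, Map<String, String>> namespaces =
            new HashMap<String, Map<String, String>>();

    void addNamespace(String name, Map<String, String> values) {
        namespaces.put(name, values); // the map is stored live, not copied
    }

    void removeNamespace(String name) {
        namespaces.remove(name);
    }

    // Resolve "namespace.key"; returns null if either part is unknown.
    String resolve(String dotted) {
        int dot = dotted.indexOf('.');
        if (dot < 0) {
            return null;
        }
        Map<String, String> ns = namespaces.get(dotted.substring(0, dot));
        return ns == null ? null : ns.get(dotted.substring(dot + 1));
    }
}

class RowScope {
    // Expose the row as ${entity.column} only for the duration of the work.
    static void withRow(ScopedResolver resolver, String entity,
                        Map<String, String> row, Runnable work) {
        resolver.addNamespace(entity, row);
        try {
            work.run();
        } finally {
            resolver.removeNamespace(entity);
        }
    }
}
```

Because the row map itself is registered, any column a transformer puts into the row during the work is resolvable straight away, and nothing is left behind afterwards.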

Fergus.

Re: DIH transformers

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
> Hello.
>
> I have been beating my head around the data-config.xml listed
> at the end of this message. It breaks in a few different ways.
>
>  1) I have bodged TemplateTransformer to allow it to return
>     when one of the variables is undefined. This ensures my
>     uniqueKey is always defined. But thinking more on
>     Noble's comments, there is use in having it work both ways,
>     i.e. leaving the column undefined or replacing the variable
>     with "". I still like my idea about using the default
>     value of a Solr field from schema.xml, but I can't figure
>     out how/where best to implement it.
When a value is missing from the template, we may end up constructing
a partial string, which may not be desired. If we leave the column
out, Solr will automatically put in the default value, and the problem
should be solved. In case you wish to know the default value from
schema.xml, you can get it from the API:

List<Map<String, String>> fields = context.getAllEntityFields();
String defval = fields.get(0).get("defaultvalue");  // first field of the entity
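
The two behaviours under discussion (skip the column when a variable is undefined, or substitute an empty string and risk a partial value) can be sketched as follows. This is a standalone illustration and not the TemplateTransformer source; the class MissingVariablePolicy and both method names are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper contrasting two ways to treat undefined variables.
class MissingVariablePolicy {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Returns null if any variable is undefined, so the caller can leave
    // the column out and let the schema default apply.
    static String resolveOrSkip(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                return null;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Substitutes "" for undefined variables; may yield a partial string.
    static String resolveOrBlank(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For a uniqueKey template such as "${jc.fileAbsolutePath}#${x.vurl}", the second method would produce a key ending in "#" when vurl is absent, which is the partial-string case described above.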
>
>  2) Having used TemplateTransformer to assign a value to an
>     entity column, that column cannot be used in other
>     TemplateTransformer operations. In my project I am
>     attempting to reuse "x.fileWebPath". To fix this, the
>     last line of transformRow() in TemplateTransformer.java
>     needs to be replaced with the following, which as well as
>     'putting' the templated string in 'row' also saves it
>     into the 'resolver'.
>
>     **originally**
>      row.put(column, resolver.replaceTokens(expr));
>      }
>
>     **new**
>      String columnName = map.get(DataImporter.COLUMN);
>      expr=resolver.replaceTokens(expr);
>      row.put(columnName, expr);
>      resolverMapCopy.put(columnName, expr);
>      }

Isn't it better to write a custom transformer to achieve this? I did
not want a standard component to change the state of the
VariableResolver.

I am not sure what is the best way.

>
>     As an aside, I think I ran into the issues covered by
>     SOLR-993. It took a while to figure out I could not add
>     a single columnname/value to the resolver. I had instead
>     to add to the map that was already stored within the
>     resolver.
>
>  3) No entity column names can be used within RegexTransformer.
>     I guess all the stuff that was added to TemplateTransformer
>     to allow column names to be used in templates needs to be
>     re-added into RegexTransformer. I am doing that now... but am
>     confused by the fragment of code which copies from resolverMap
>     into resolverMapCopy. As best I can see, resolverMap is always
>     empty; but I am barely able to follow the code! Can somebody
>     explain when/why resolverMap would be populated?

The behavior is this: the expression ${currentEntity.colName} does
not work automatically, because the row is not added to the
VariableResolver. TemplateTransformer has a hack to make it work.

We can think about modifying this behavior.
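
One detail worth keeping in mind if RegexTransformer gains the same ability: in a Java regex replacement string both $1 and (on Java 7 and later) ${name} already have a meaning, so any ${entity.column} tokens have to be expanded before the string reaches replaceAll. The sketch below is a standalone illustration under that assumption, not the RegexTransformer source; the class RegexWithVariables and its methods are hypothetical. Resolved values are escaped so they are treated literally in both the pattern and the replacement.

```java
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; shows token expansion ahead of regex replacement.
class RegexWithVariables {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Expand ${name} tokens; each value is escaped for its target context.
    // Group references such as $1 are not touched because they lack braces.
    static String expand(String text, Map<String, String> vars,
                         UnaryOperator<String> escape) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.containsKey(m.group(1)) ? vars.get(m.group(1)) : "";
            // quoteReplacement keeps the escaped value literal while appending.
            m.appendReplacement(out, Matcher.quoteReplacement(escape.apply(value)));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String apply(String source, String regex, String replaceWith,
                        Map<String, String> vars) {
        String pattern = expand(regex, vars, Pattern::quote);
        String replacement = expand(replaceWith, vars, Matcher::quoteReplacement);
        return source.replaceAll(pattern, replacement);
    }
}
```

Applied to the config in this thread, the imgWebPathICON field would first have ${x.vurl} filled in and only then have $1 interpreted by the regex engine.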
>
>     Also, I begin to understand comments made by Noble in
>     SOLR-1001 about resolving "entity attributes in
>     ContextImpl.getEntityAttribute" and I guess Shalin was
>     right as well. However, it also seems wrong that at the
>     top of every transformer we are going to repeat the
>     same code to load the resolver with information about the
>     entity.
>
>  4) In that I am reusing template output within other templates,
>     the order of execution becomes important. Can I assume that
>     the explicitly listed columns in an entity are processed by
>     the various transformers in the order they appear within
>     data-config.xml? I *think* that the list of columns within
>     an entity as returned by getAllEntityFields() is actually
>     an ArrayList, which I think is order dependent. Is this
>     correct?

Yes, it is correct.
>
>  5) Should I raise this as a single JIRA issue?
Do not add ONE issue for all. If they are logically connected, put all
of them into one. If not, split them into as many issues as necessary.
>
>  6) Having played with this stuff, I was going to add a bit
>     more to the wiki highlighting some of the possibilities
>     and issues with transformers. But want to check with the
>     list first!

-- 
--Noble Paul