Posted to solr-user@lucene.apache.org by Fergus McMenemie <fe...@twig.me.uk> on 2009/02/16 10:52:35 UTC
DIH transformers
Hello.
I have been beating my head around the data-config.xml listed
at the end of this message. It breaks in a few different ways.
1) I have bodged TemplateTransformer to allow it to return
when one of the variables is undefined. This ensures my
uniqueKey is always defined. But thinking more on
Noble's comments, there is use in having it work both ways,
i.e. leaving the column undefined or replacing the variable
with "". I still like my idea about using the default
value of a solr field from schema.xml, but I can't figure
out how/where best to implement it.
2) Having used TemplateTransformer to assign a value to an
entity column, that column cannot be used in other
TemplateTransformer operations. In my project I am
attempting to reuse "x.fileWebPath". To fix this, the
last line of transformRow() in TemplateTransformer.java
needs to be replaced with the following, which as well as
'putting' the templated string in 'row' also saves it
into the 'resolver'.
**originally**
row.put(column, resolver.replaceTokens(expr));
}
**new**
String columnName = map.get(DataImporter.COLUMN);
expr=resolver.replaceTokens(expr);
row.put(columnName, expr);
resolverMapCopy.put(columnName, expr);
}
As an aside, I think I ran into the issues covered by
SOLR-993. It took a while to figure out that I could not add
a single columnname/value to the resolver. I had instead
to add to the map that was already stored within the
resolver.
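The effect of that change can be sketched without any Solr classes. The following is only an illustration under stated assumptions: TemplateChainSketch, replaceTokens() and transformRow() are hypothetical stand-ins, not the DIH TemplateTransformer or VariableResolver, and an undefined variable is replaced with "" here purely to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for template resolution; these are not the Solr classes. */
final class TemplateChainSketch {
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value from vars; unknown names become "" (an assumption). */
    static String replaceTokens(String template, Map<String, String> vars) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Applies templates in declaration order; each result is visible to later templates. */
    static Map<String, String> transformRow(String entity,
                                            Map<String, String> templates,
                                            Map<String, String> vars) {
        Map<String, String> row = new LinkedHashMap<>();
        Map<String, String> scope = new LinkedHashMap<>(vars); // copy, the caller's map is untouched
        for (Map.Entry<String, String> e : templates.entrySet()) {
            String value = replaceTokens(e.getValue(), scope);
            row.put(e.getKey(), value);
            scope.put(entity + "." + e.getKey(), value); // the extra step: remember the result
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("jc.fileAbsolutePath", "/content/a.xml");
        Map<String, String> templates = new LinkedHashMap<>();
        templates.put("fileAbsolutePath", "${jc.fileAbsolutePath}");
        templates.put("vdkvgwkey", "${x.fileAbsolutePath}#1");
        System.out.println(transformRow("x", templates, vars));
    }
}
```

Without the scope.put() line the second template would see no value for ${x.fileAbsolutePath}, which is the behaviour described in point 2.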
3) No entity column names can be used within RegexTransformer.
I guess all the stuff that was added to TemplateTransformer
to allow column names to be used in templates needs to be re-added
into RegexTransformer. I am doing that now... but am confused
by the fragment of code which copies from resolverMap into
resolverMapCopy. As best I can see resolverMap is always
empty; but I am barely able to follow the code! Can somebody
explain when/why resolverMap would be populated?
Also, I begin to understand comments made by Noble in
SOLR-1001 about resolving "entity attributes in
ContextImpl.getEntityAttribute" and I guess Shalin was
right as well. However, it also seems wrong that at the
top of every transformer we are going to repeat the
same code to load the resolver with information about the
entity.
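What point 3 asks of RegexTransformer can be sketched in isolation: resolve the variable first, then build the pattern. This is a minimal sketch, not the Solr code. RegexWithVariableSketch and rewrite() are hypothetical names, the resolved value is passed in directly, and quoting it with Pattern.quote() is a design choice of this example so that path characters such as '.' are matched literally.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of a regex whose prefix comes from a resolved variable. */
final class RegexWithVariableSketch {

    /**
     * resolvedPrefix stands for the already-resolved value of a variable such as ${x.test};
     * replaceWith uses the usual $1 group syntax, as in replaceWith="/ford$1".
     */
    static String rewrite(String source, String resolvedPrefix, String replaceWith) {
        // Quote the resolved value so it is treated as literal text, not as regex syntax.
        Pattern p = Pattern.compile(Pattern.quote(resolvedPrefix) + "(.*)");
        return p.matcher(source).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        String webPath = rewrite("/Volumes/spare/ts/solr/content/a/b.xml",
                                 "/Volumes/spare/ts/solr/content",
                                 "/ford$1");
        System.out.println(webPath);
    }
}
```

If the variable is meant to carry regex syntax of its own, the Pattern.quote() call would have to go, at the cost of surprises whenever a path contains metacharacters.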
4) In that I am reusing template output within other templates,
the order of execution becomes important. Can I assume that
the explicitly listed columns in an entity are processed by
the various transformers in the order they appear within
data-config.xml? I *think* that the list of columns within
an entity as returned by getAllEntityFields() is actually
an ArrayList, which I think is order dependent. Is this
correct?
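The ordering question can be made concrete with a small model. It rests on two assumptions that should be checked against the DIH source: that each transformer makes one complete pass over the field list, and that transformers run in the order named in the transformer attribute. FieldOrderSketch and processingOrder() are hypothetical names; only the fact that an ArrayList iterates in insertion order is certain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of "transformers in attribute order, fields in declaration order". */
final class FieldOrderSketch {

    /** Each field is {columnName, transformerThatHandlesIt}; the result lists the visiting order. */
    static List<String> processingOrder(List<String> transformers, List<String[]> fields) {
        List<String> order = new ArrayList<>();
        for (String transformer : transformers) {      // outer loop: transformer attribute order
            for (String[] field : fields) {            // inner loop: declaration order (ArrayList)
                if (field[1].equals(transformer)) {
                    order.add(transformer + ":" + field[0]);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String[]> fields = new ArrayList<>();
        fields.add(new String[] {"fileAbsolutePath", "TemplateTransformer"});
        fields.add(new String[] {"fileWebPath", "RegexTransformer"});
        fields.add(new String[] {"test", "TemplateTransformer"});
        System.out.println(processingOrder(
                Arrays.asList("TemplateTransformer", "RegexTransformer"), fields));
    }
}
```

Under those assumptions the "test" template is evaluated before the "fileWebPath" regex even though it is declared later, because the whole TemplateTransformer pass comes first.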
5) Should I raise this as a single JIRA issue?
6) Having played with this stuff, I was going to add a bit
more to the wiki highlighting some of the possibilities
and issues with transformers. But want to check with the
list first!
<dataConfig>
  <dataSource name="myfilereader" type="FileDataSource"/>
  <document>
    <entity name="jc"
            processor="FileListEntityProcessor"
            fileName="^.*\.xml$"
            newerThan="'NOW-1000DAYS'"
            recursive="true"
            rootEntity="false"
            dataSource="null"
            baseDir="/Volumes/spare/ts/solr/content">
      <entity name="x"
              dataSource="myfilereader"
              processor="XPathEntityProcessor"
              url="${jc.fileAbsolutePath}"
              rootEntity="true"
              stream="false"
              forEach="/record | /record/mediaBlock"
              transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
        <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
        <field column="fileWebPath" regex="${x.test}(.*)" replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
        <field column="title" xpath="/record/title" />
        <field column="para1" name="para" xpath="/record/sect1/para" />
        <field column="para2" name="para" xpath="/record/list/listitem/para" />
        <field column="pubdate" xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd" />
        <field column="vurl" xpath="/record/mediaBlock/mediaObject/@vurl" />
        <field column="imgSrcArticle" template="${dataimporter.request.fordinstalldir}" />
        <field column="imgCpation" xpath="/record/mediaBlock/caption" />
        <field column="test" template="${dataimporter.request.contentinstalldir}" />
        <!-- **problem is that vurl is just a fragment of the info needed to access the picture. -->
        <field column="imgWebPathICON" regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
        <field column="imgWebPathFULL" regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/>
        <field column="vdkvgwkey" template="${jc.fileAbsolutePath}#${x.vurl}" />
      </entity>
    </entity>
  </document>
</dataConfig>
Regards Fergus.
--
===============================================================
Fergus McMenemie Email:fergus@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===============================================================
Re: DIH transformers - sect 2
Posted by Fergus McMenemie <fe...@twig.me.uk>.
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
>>
>> 2) Having used TemplateTransformer to assign a value to an
>> entity column, that column cannot be used in other
>> TemplateTransformer operations. In my project I am
>> attempting to reuse "x.fileWebPath". To fix this, the
>> last line of transformRow() in TemplateTransformer.java
>> needs to be replaced with the following, which as well as
>> 'putting' the templated string in 'row' also saves it
>> into the 'resolver'.
>>
>> **originally**
>> row.put(column, resolver.replaceTokens(expr));
>> }
>>
>> **new**
>> String columnName = map.get(DataImporter.COLUMN);
>> expr=resolver.replaceTokens(expr);
>> row.put(columnName, expr);
>> resolverMapCopy.put(columnName, expr);
>> }
>
>Isn't it better to write a custom transformer to achieve this? I did
>not want a standard component to change the state of the
>VariableResolver.
>
>I am not sure what the best way is.
>
Noble, (Good to have email working :-)
Hmm not sure why this requires a custom transformer. Why is this not
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the
VariableResolver.
Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?
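The "temporarily available" idea can be sketched as a scoped registration that is undone in a finally block. This is a minimal sketch under stated assumptions: ScopedResolverSketch, resolve() and withRow() are hypothetical names and do not mirror the real VariableResolver API. It only shows that a row can be exposed under the entity name for the duration of one operation without leaving any permanent change behind.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical resolver that exposes a row under an entity name for a limited scope. */
final class ScopedResolverSketch {
    private final Map<String, Map<String, Object>> namespaces = new HashMap<>();

    /** Resolves names of the form "entity.column"; returns null when nothing is registered. */
    Object resolve(String name) {
        int dot = name.indexOf('.');
        if (dot < 0) {
            return null;
        }
        Map<String, Object> ns = namespaces.get(name.substring(0, dot));
        return ns == null ? null : ns.get(name.substring(dot + 1));
    }

    /** Makes row visible as ${entity.*} while body runs, then restores the previous state. */
    <T> T withRow(String entity, Map<String, Object> row, Supplier<T> body) {
        Map<String, Object> previous = namespaces.put(entity, row);
        try {
            return body.get();
        } finally {
            if (previous == null) {
                namespaces.remove(entity);       // nothing was registered before: remove again
            } else {
                namespaces.put(entity, previous); // put back whatever was there
            }
        }
    }

    public static void main(String[] args) {
        ScopedResolverSketch resolver = new ScopedResolverSketch();
        Map<String, Object> row = new HashMap<>();
        row.put("vurl", "img01");
        System.out.println(resolver.withRow("x", row, () -> resolver.resolve("x.vurl")));
        System.out.println(resolver.resolve("x.vurl")); // the row is no longer visible here
    }
}
```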
Fergus.
Re: DIH transformers
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie <fe...@twig.me.uk> wrote:
> Hello.
>
> I have been beating my head around the data-config.xml listed
> at the end of this message. It breaks in a few different ways.
>
> 1) I have bodged TemplateTransformer to allow it to return
> when one of the variables is undefined. This ensures my
> uniqueKey is always defined. But thinking more on
> Noble's comments, there is use in having it work both ways,
> i.e. leaving the column undefined or replacing the variable
> with "". I still like my idea about using the default
> value of a solr field from schema.xml, but I can't figure
> out how/where best to implement it.
When a value is missing from the template, we may end up
constructing a partial string, which may not be desired. If we leave
the column empty, Solr automatically puts in the default value,
which should solve the problem. In case you wish to know the
default value from schema.xml, you can get it from the API:
List<Map<String, String>> fields = context.getAllEntityFields();
String defval = fields.get(0).get("defaultvalue");
>
> 2) Having used TemplateTransformer to assign a value to an
> entity column, that column cannot be used in other
> TemplateTransformer operations. In my project I am
> attempting to reuse "x.fileWebPath". To fix this, the
> last line of transformRow() in TemplateTransformer.java
> needs to be replaced with the following, which as well as
> 'putting' the templated string in 'row' also saves it
> into the 'resolver'.
>
> **originally**
> row.put(column, resolver.replaceTokens(expr));
> }
>
> **new**
> String columnName = map.get(DataImporter.COLUMN);
> expr=resolver.replaceTokens(expr);
> row.put(columnName, expr);
> resolverMapCopy.put(columnName, expr);
> }
Isn't it better to write a custom transformer to achieve this? I did
not want a standard component to change the state of the
VariableResolver.
I am not sure what the best way is.
>
> As an aside, I think I ran into the issues covered by
> SOLR-993. It took a while to figure out that I could not add
> a single columnname/value to the resolver. I had instead
> to add to the map that was already stored within the
> resolver.
>
> 3) No entity column names can be used within RegexTransformer.
> I guess all the stuff that was added to TemplateTransformer
> to allow column names to be used in templates needs to be re-added
> into RegexTransformer. I am doing that now... but am confused
> by the fragment of code which copies from resolverMap into
> resolverMapCopy. As best I can see resolverMap is always
> empty; but I am barely able to follow the code! Can somebody
> explain when/why resolverMap would be populated?
The behavior is like this: the expression ${currentEntity.colName}
does not work automatically, because the row is not added to the
VariableResolver. TemplateTransformer has hacked things to make it
work.
We can think of modifying this behavior.
>
> Also, I begin to understand comments made by Noble in
> SOLR-1001 about resolving "entity attributes in
> ContextImpl.getEntityAttribute" and I guess Shalin was
> right as well. However, it also seems wrong that at the
> top of every transformer we are going to repeat the
> same code to load the resolver with information about the
> entity.
>
> 4) In that I am reusing template output within other templates,
> the order of execution becomes important. Can I assume that
> the explicitly listed columns in an entity are processed by
> the various transformers in the order they appear within
> data-config.xml? I *think* that the list of columns within
> an entity as returned by getAllEntityFields() is actually
> an ArrayList, which I think is order dependent. Is this
> correct?
It is correct.
>
> 5) Should I raise this as a single JIRA issue?
Do not raise ONE issue for all. If they are logically connected, put all
of them into one. If not, split them into as many issues as possible.
>
> 6) Having played with this stuff, I was going to add a bit
> more to the wiki highlighting some of the possibilities
> and issues with transformers. But want to check with the
> list first!
--
--Noble Paul