Posted to solr-user@lucene.apache.org by Tricia Williams <wi...@gmail.com> on 2010/03/12 22:53:51 UTC
DIH: using variables in nested entities
Hi All,
The DataImportHandler is the most fantastic thing that has recently
come to Solr. Thank you.
I'm noticing that when I use variables in nested entities, the
resolved value is wrapped in square brackets. For example, ${x.url}
used in the "tika" entity below resolves to
[http://publicdomain.ca/content/Sample.pdf] (note the square brackets),
so I get this error in my log:
> SEVERE: Exception thrown while getting data
> java.net.MalformedURLException: no protocol: [http://publicdomain.ca/content/Sample.pdf]
>     at java.net.URL.<init>(URL.java:567)
>     at java.net.URL.<init>(URL.java:464)
>     at java.net.URL.<init>(URL.java:413)
>     at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:78)
>     at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:98)
I encountered this previously when I tried to concatenate fields
from different entities into one field. I worked around that by
gathering the fields with an XSL stylesheet. Not being able to resolve
the URL for Tika is a little more problematic.
*Is this a bug? If not, how do I remove the brackets so that I can
use my variable as it was meant?*
<dataConfig>
  <dataSource type="BinURLDataSource" name="bin"/>
  <dataSource type="FileDataSource" name="fileReader"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/home/pgwillia/content" dataSource="null" fileName=".*xml" rootEntity="false">
      <entity name="x" processor="XPathEntityProcessor" dataSource="fileReader" transformer="TemplateTransformer,RegexTransformer" forEach="/RDF/Description" url="${f.fileAbsolutePath}">
        ...
        <field column="url" xpath="/RDF/Description/identifier" regex="http://privatedomain:8080/content/" replaceWith="http://publicdomain.ca/content/"/>
        <entity name="tika" processor="TikaEntityProcessor" url="${x.url}" dataSource="bin" format="text">
          <field column="fulltext" name="text"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
Many thanks,
Tricia
Re: DIH: using variables in nested entities
Posted by Tricia Williams <wi...@gmail.com>.
For anyone interested, I think my issue was caused by having specified
the url field as a multivalued field. I wasn't able to create a test
case that reproduced the problem, so this is a guess based on gradual
fiddling with my configs.
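
To illustrate why a multivalued field could produce the brackets, here is a minimal Java sketch. It assumes (this is not confirmed against the DIH source) that the variable resolver simply stringifies whatever value the field holds, so a one-element list renders with brackets. BracketDemo, render and urlError are hypothetical names used only for this illustration; they are not part of Solr.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Solr or the DataImportHandler.
class BracketDemo {

    // Stand-in for a resolver that stringifies whatever value a variable holds
    // (an assumption about the mechanism, not the actual DIH code).
    static String render(Object value) {
        return String.valueOf(value);
    }

    // Returns the exception message if spec cannot be parsed as a URL, else null.
    static String urlError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String single = "http://publicdomain.ca/content/Sample.pdf";
        List<String> multi = Arrays.asList(single); // a multivalued field with one value

        System.out.println(render(single));          // the plain URL
        System.out.println(render(multi));           // the same URL wrapped in [ ]
        System.out.println(urlError(render(single))); // null: parses fine
        System.out.println(urlError(render(multi)));  // a "no protocol" message
    }
}
```

If this is what happens, the URL itself is fine and only its string form is wrong, which would match the "no protocol" error in the log above.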
My concern is no longer pressing, but I do have a couple of questions
for the devs to think about:
1. How should a multivalued field be treated in a child entity? The
use case would be the one I presented where I intend url to be
multivalued. I'm thinking a for-each type construct should apply.
2. How should a multivalued field be formatted or custom formatted if
you intend to use the content of a field in another field,
possibly nested?
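
As a rough sketch of the two behaviours asked about above, the Java below shows a for-each style expansion of a value that may be single or multivalued, and a join with a caller-chosen delimiter. MultiValueDemo, asValues and join are hypothetical helpers written for this illustration; they do not describe how DIH currently behaves or any existing Solr API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helpers; not part of Solr or the DataImportHandler.
class MultiValueDemo {

    // Normalises a field value to a list of strings so that a child entity
    // could be run once per value (the "for-each" idea from question 1).
    static List<String> asValues(Object value) {
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof Collection) {
            List<String> out = new ArrayList<>();
            for (Object item : (Collection<?>) value) {
                out.add(String.valueOf(item));
            }
            return out;
        }
        return Collections.singletonList(String.valueOf(value));
    }

    // Formats a possibly multivalued field with a caller-chosen delimiter
    // instead of the default bracketed list form (question 2).
    static String join(Object value, String delimiter) {
        return String.join(delimiter, asValues(value));
    }

    public static void main(String[] args) {
        Object urls = Arrays.asList(
                "http://publicdomain.ca/content/A.pdf",
                "http://publicdomain.ca/content/B.pdf");

        // For-each: each value would be handed to the child entity separately.
        for (String url : asValues(urls)) {
            System.out.println("would fetch: " + url);
        }

        // Custom formatting: one string with an explicit separator.
        System.out.println(join(urls, "; "));
    }
}
```

With either approach, a field holding a single value passes through unchanged, so single-valued configs would not be affected.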