Posted to solr-user@lucene.apache.org by Tricia Williams <wi...@gmail.com> on 2010/03/12 22:53:51 UTC

DIH: using variables in nested entities

Hi All,

    The DataImportHandler is the most fantastic thing that has recently 
come to Solr.  Thank you.

    I'm noticing that when I use variables in nested entities, square 
brackets are wrapped around the variable value.  For example, ${x.url} 
used in the "tika" entity below resolves as 
[http://publicdomain.ca/content/Sample.pdf] (note the square brackets), 
so I get this error in my log:

> SEVERE: Exception thrown while getting data
> java.net.MalformedURLException: no protocol: [http://publicdomain.ca/content/Sample.pdf]
>         at java.net.URL.<init>(URL.java:567)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:78)
>         at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
>         at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:98)

    I encountered this previously when I tried to concatenate fields 
from different entities into one field.  I worked around it by 
gathering the fields with XSL.  Not being able to resolve the URL for 
Tika is more problematic.

    *Is this a bug?  If not, how do I remove the brackets so that I can 
use my variable as it was meant?*

<dataConfig>
    <dataSource type="BinURLDataSource" name="bin"/>
    <dataSource type="FileDataSource" name="fileReader"/>
    <document>
        <entity name="f" processor="FileListEntityProcessor" baseDir="/home/pgwillia/content" dataSource="null" fileName=".*xml" rootEntity="false">
            <entity name="x" processor="XPathEntityProcessor" dataSource="fileReader" transformer="TemplateTransformer,RegexTransformer" forEach="/RDF/Description" url="${f.fileAbsolutePath}">
                ...
                <field column="url" xpath="/RDF/Description/identifier" regex="http://privatedomain:8080/content/" replaceWith="http://publicdomain.ca/content/"/>
                <entity name="tika" processor="TikaEntityProcessor" url="${x.url}" dataSource="bin" format="text">
                    <field column="fulltext" name="text"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>


Many thanks,
Tricia

Re: DIH: using variables in nested entities

Posted by Tricia Williams <wi...@gmail.com>.
For anyone interested, I think my issue was caused by having specified 
the url field as a multivalued field.  I wasn't able to create a test 
case that reproduced the problem, so this is a guess based on gradual 
fiddling with my configs.
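
To illustrate why a multivalued field could produce the bracketed value, here is a minimal Java sketch.  It assumes (I have not confirmed this in the DIH source) that the variable's value is held as a java.util.List and turned into text with its default toString().  The class and method names are made up for the illustration and are not part of Solr.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class BracketedUrlDemo {

    // Assumption: a multivalued field is held as a List and rendered
    // with its default toString(), which wraps the elements in [ ].
    static String render(Object value) {
        return String.valueOf(value);
    }

    // Returns "ok" if java.net.URL accepts the string, otherwise the
    // exception message.
    static String tryParse(String spec) {
        try {
            new URL(spec);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> multiValued =
                Arrays.asList("http://publicdomain.ca/content/Sample.pdf");

        String resolved = render(multiValued);
        System.out.println(resolved);
        // prints: [http://publicdomain.ca/content/Sample.pdf]

        System.out.println(tryParse(resolved));
        // prints: no protocol: [http://publicdomain.ca/content/Sample.pdf]

        System.out.println(tryParse(multiValued.get(0)));
        // prints: ok
    }
}
```

The second line of output matches the exception message in my log, which is consistent with the guess but does not prove it.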

My concern is no longer pressing, but I do have a couple of questions 
for the devs to think about:

   1. How should a multivalued field be treated in a child entity?  The
      use case would be the one I presented where I intend url to be
      multivalued.  I'm thinking a for-each type construct should apply.
   2. How should a multivalued field be formatted or custom formatted if
      you intend to use the content of a field in another field,
      possibly nested?
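
As a rough sketch of the for-each idea in question 1, the following plain Java shows one way a multivalued value could be expanded so that a child entity runs once per value.  This is only an illustration of the behaviour I have in mind, not how DIH works today; expand() is a hypothetical helper, and A.pdf and B.pdf are made-up file names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class MultiValueExpansion {

    // Hypothetical helper (not a DIH API): turn a single-valued or
    // multivalued field value into a list of individual strings.
    static List<String> expand(Object value) {
        List<String> out = new ArrayList<>();
        if (value instanceof Collection) {
            for (Object item : (Collection<?>) value) {
                out.add(String.valueOf(item));
            }
        } else if (value != null) {
            out.add(String.valueOf(value));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up values standing in for a multivalued url field.
        Object urls = Arrays.asList(
                "http://publicdomain.ca/content/A.pdf",
                "http://publicdomain.ca/content/B.pdf");

        // The child entity would be invoked once per expanded value.
        for (String url : expand(urls)) {
            System.out.println("child entity would fetch: " + url);
        }
    }
}
```

With this approach a single-valued field behaves exactly as before, since it expands to a list of one.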


