Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2008/12/14 02:48:42 UTC

DIH - duplicate xpaths on HttpDataSource

I'm trying to index a blog with DIH, and have this:

             <field column="id" xpath="/rss/channel/item/link"/>
             <field column="url" xpath="/rss/channel/item/link"/>

If I comment out the url <field> line it all works fine, but if I put  
it in, no documents get indexed.  Is there an issue with using the  
same xpath twice?   Or something else I'm missing?

This is using Solr trunk.

Thanks,
	Erik


Re: DIH - duplicate xpaths on HttpDataSource

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Thanks again, Noble.  All is working fine for me now.

	Erik


On Dec 14, 2008, at 10:33 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:
>> Also, I want to add in date transformation like in the example above
>> commented out.  How would I use both the TemplateTransformer and the
>> DateFormatTransformer,
> Transformers are chained: transformer="a,b,c" (comma separated)
> applies a first, then b, and then c.
>
>> since I'm specifying the transformer on the entity
>> rather than the field it is used on?  I don't yet understand why the
>> transformer is entity-based rather than per-field.
>
> A row is emitted by an entity, and a field is one of the values in that
> row.  A field declaration is optional because it can be implicit if it
> is present in the schema (but for XPathEntityProcessor it is not optional).
> Moreover, there is no component in DIH which corresponds to a Field
> where this can be attached.
> Hence, the transformers are declared on an entity.
>
>>
>> Thanks for your help, Noble.
>>
>>       Erik
>>

Re: DIH - duplicate xpaths on HttpDataSource

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Sun, Dec 14, 2008 at 8:43 PM, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
> On Dec 14, 2008, at 6:15 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>
>> one xpath can be mapped to only one name.
>
> That's a bizarre restriction.  A shame.

hmmm....
Understandable. XPathRecordReader is a custom, minimal implementation
of streaming XPath. My aim was to implement something that handles
XPath fast and does the job. XPathEntityProcessor relies on
it.

XPathRecordReader is far from feature complete.
>
>> If you want to copy one to another use something like a
>> TemplateTransformer
>
> Thanks! I got it to work like this:
>
>        <entity name="blog" pk="id" url="http://server/rss/"
>                processor="XPathEntityProcessor" forEach="/rss/channel/item"
>                transformer="TemplateTransformer">
>            <field column="id" xpath="/rss/channel/item/link"/>
>            <field column="url" template="${blog.id}"/>
>            <field column="title" xpath="/rss/channel/item/title"/>
>            <field column="description" xpath="/rss/channel/item/description"/>
>            <!-- No pubDate either <field column="date" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/> -->
>        </entity>
>
> A couple of questions... how would I add a static valued field in?
>  Something like:
>
>   <field column="source" value="blog"/>
Use TemplateTransformer again:
<field column="source" template="blog"/>
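
As a minimal sketch of how that fits into the entity quoted above (the
attributes are the ones from this thread; the explanatory comments are
added here for illustration), the static field sits alongside the other
fields and relies on TemplateTransformer being listed on the entity:

```xml
<entity name="blog" pk="id" url="http://server/rss/"
        processor="XPathEntityProcessor" forEach="/rss/channel/item"
        transformer="TemplateTransformer">
    <field column="id" xpath="/rss/channel/item/link"/>
    <!-- Copy of id, resolved for each row by TemplateTransformer -->
    <field column="url" template="${blog.id}"/>
    <!-- No variables in the template, so every row gets the literal value "blog" -->
    <field column="source" template="blog"/>
</entity>
```
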
>
> where all documents from this entity come in with a source="blog" literally.
>
> Also, I want to add in date transformation like in the example above
> commented out.  How would I use both the TemplateTransformer and the
> DateFormatTransformer,
Transformers are chained: transformer="a,b,c" (comma separated)
applies a first, then b, and then c.
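
For the two transformers asked about, a sketch might look like the
following. It assumes the feed actually provides pubDate in the format
matched by the pattern from the commented-out line. Each transformer only
acts on fields carrying its own attribute (template for
TemplateTransformer, dateTimeFormat for DateFormatTransformer), so one
entity-level list can serve several fields:

```xml
<entity name="blog" pk="id" url="http://server/rss/"
        processor="XPathEntityProcessor" forEach="/rss/channel/item"
        transformer="TemplateTransformer,DateFormatTransformer">
    <field column="id" xpath="/rss/channel/item/link"/>
    <!-- Handled by TemplateTransformer (first in the chain) -->
    <field column="url" template="${blog.id}"/>
    <!-- Handled by DateFormatTransformer (second in the chain);
         assumes pubDate is present in the feed -->
    <field column="date" xpath="/rss/channel/item/pubDate"
           dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/>
</entity>
```
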

> since I'm specifying the transformer on the entity
> rather than the field it is used on?  I don't yet understand why the
> transformer is entity-based rather than per-field.

A row is emitted by an entity, and a field is one of the values in that
row.  A field declaration is optional because it can be implicit if it
is present in the schema (but for XPathEntityProcessor it is not optional).
Moreover, there is no component in DIH which corresponds to a Field
where this can be attached.
Hence, the transformers are declared on an entity.

>
> Thanks for your help, Noble.
>
>        Erik
>
>
>> On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
>> <er...@ehatchersolutions.com> wrote:
>>>
>>> I'm trying to index a blog with DIH, and have this:
>>>
>>>          <field column="id" xpath="/rss/channel/item/link"/>
>>>          <field column="url" xpath="/rss/channel/item/link"/>
>>>
>>> If I comment out the url <field> line it all works fine, but if I put it
>>> in,
>>> no documents get indexed.  Is there an issue with using the same xpath
>>> twice?   Or something else I'm missing?
>>>
>>> This is using Solr trunk.
>>>
>>> Thanks,
>>>      Erik
>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul

Re: DIH - duplicate xpaths on HttpDataSource

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 14, 2008, at 6:15 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:
> one xpath can be mapped to only one name.

That's a bizarre restriction.  A shame.

> If you want to copy one to another use something like a  
> TemplateTransformer

Thanks! I got it to work like this:

         <entity name="blog" pk="id" url="http://server/rss/"
                 processor="XPathEntityProcessor" forEach="/rss/channel/item"
                 transformer="TemplateTransformer">
             <field column="id" xpath="/rss/channel/item/link"/>
             <field column="url" template="${blog.id}"/>
             <field column="title" xpath="/rss/channel/item/title"/>
             <field column="description" xpath="/rss/channel/item/description"/>
             <!-- No pubDate either <field column="date" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/> -->
         </entity>

A couple of questions... How would I add a static-valued field?
Something like:

    <field column="source" value="blog"/>

where all documents from this entity come in with a source="blog"  
literally.

Also, I want to add in date transformation like in the example above  
commented out.  How would I use both the TemplateTransformer and the  
DateFormatTransformer, since I'm specifying the transformer on the  
entity rather than the field it is used on?  I don't yet understand  
why the transformer is entity-based rather than per-field.

Thanks for your help, Noble.

	Erik


> On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
> <er...@ehatchersolutions.com> wrote:
>> I'm trying to index a blog with DIH, and have this:
>>
>>           <field column="id" xpath="/rss/channel/item/link"/>
>>           <field column="url" xpath="/rss/channel/item/link"/>
>>
>> If I comment out the url <field> line it all works fine, but if I  
>> put it in,
>> no documents get indexed.  Is there an issue with using the same  
>> xpath
>> twice?   Or something else I'm missing?
>>
>> This is using Solr trunk.
>>
>> Thanks,
>>       Erik
>>
>>
>
>
>
> -- 
> --Noble Paul


Re: DIH - duplicate xpaths on HttpDataSource

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
One xpath can be mapped to only one name.
If you want to copy one to another, use something like a TemplateTransformer.





On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
> I'm trying to index a blog with DIH, and have this:
>
>            <field column="id" xpath="/rss/channel/item/link"/>
>            <field column="url" xpath="/rss/channel/item/link"/>
>
> If I comment out the url <field> line it all works fine, but if I put it in,
> no documents get indexed.  Is there an issue with using the same xpath
> twice?   Or something else I'm missing?
>
> This is using Solr trunk.
>
> Thanks,
>        Erik
>
>



-- 
--Noble Paul