Posted to solr-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2008/12/14 02:48:42 UTC
DIH - duplicate xpaths on HttpDataSource
I'm trying to index a blog with DIH, and have this:
<field column="id" xpath="/rss/channel/item/link"/>
<field column="url" xpath="/rss/channel/item/link"/>
If I comment out the url <field> line it all works fine, but if I put
it in, no documents get indexed. Is there an issue with using the
same xpath twice? Or something else I'm missing?
This is using Solr trunk.
Thanks,
Erik
Re: DIH - duplicate xpaths on HttpDataSource
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Thanks again, Noble. All is working fine for me now.
Erik
On Dec 14, 2008, at 10:33 AM, Noble Paul നോബിള് नोब्ळ् wrote:
>> Also, I want to add in date transformation like in the example above
>> commented out. How would I use both the TemplateTransformer and the
>> DateFormatTransformer,
> Transformers are chained: transformer="a,b,c" (comma-separated)
> applies a first, then b, then c.
>
>> since I'm specifying the transformer on the entity
>> rather than the field it is used on? I don't yet understand why the
>> transformer is entity-based rather than per-field.
>
> A row is emitted by an entity, and a field is one of the values in that
> row. A field declaration is optional because it can be implicit if it
> is present in the schema (but for XPathEntityProcessor it is not optional).
> Moreover, there is no component in DIH that corresponds to a field
> and to which a transformer could be attached.
> Hence, the transformers are declared on the entity.
>
>>
>> Thanks for your help, Noble.
>>
>> Erik
>>
Re: DIH - duplicate xpaths on HttpDataSource
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
On Sun, Dec 14, 2008 at 8:43 PM, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
> On Dec 14, 2008, at 6:15 AM, Noble Paul നോബിള് नोब्ळ् wrote:
>>
>> one xpath can be mapped to only one name.
>
> That's a bizarre restriction. A shame.
hmmm....
Understandable. XPathRecordReader is a custom, minimal implementation
of streaming XPath. My aim was to implement something that handles
XPath fast and does the job. XPathEntityProcessor relies on
it.
XPathRecordReader is far from feature-complete.
>
>> If you want to copy one to another use something like a
>> TemplateTransformer
>
> Thanks! I got it to work like this:
>
> <entity name="blog" pk="id" url="http://server/rss/"
>         processor="XPathEntityProcessor" forEach="/rss/channel/item"
>         transformer="TemplateTransformer">
>   <field column="id" xpath="/rss/channel/item/link"/>
>   <field column="url" template="${blog.id}"/>
>   <field column="title" xpath="/rss/channel/item/title"/>
>   <field column="description" xpath="/rss/channel/item/description"/>
>   <!-- No pubDate either <field column="date"
>        xpath="/rss/channel/item/pubDate"
>        dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/> -->
> </entity>
>
> A couple of questions... how would I add a static valued field in?
> Something like:
>
> <field column="source" value="blog"/>
Use TemplateTransformer again:
<field column="source" template="blog"/>
>
> where all documents from this entity come in with a source="blog" literally.
>
> Also, I want to add in date transformation like in the example above
> commented out. How would I use both the TemplateTransformer and the
> DateFormatTransformer,
Transformers are chained: transformer="a,b,c" (comma-separated)
applies a first, then b, then c.
> since I'm specifying the transformer on the entity
> rather than the field it is used on? I don't yet understand why the
> transformer is entity-based rather than per-field.
A row is emitted by an entity, and a field is one of the values in that
row. A field declaration is optional because it can be implicit if it
is present in the schema (but for XPathEntityProcessor it is not optional).
Moreover, there is no component in DIH that corresponds to a field
and to which a transformer could be attached.
Hence, the transformers are declared on the entity.
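For illustration, a chained setup might look like the sketch below, which extends the entity quoted earlier in this message. It is only a sketch under assumptions: it assumes the feed actually provides a pubDate element (the one discussed here apparently does not) and that the target schema defines date and source fields. The order of the two transformers is arbitrary in this case, since each acts on different fields.

```xml
<!-- Sketch only: assumes the feed has pubDate and the schema defines "date" and "source" -->
<entity name="blog" pk="id" url="http://server/rss/"
        processor="XPathEntityProcessor" forEach="/rss/channel/item"
        transformer="TemplateTransformer,DateFormatTransformer">
  <field column="id" xpath="/rss/channel/item/link"/>
  <!-- TemplateTransformer: copy the id value into url -->
  <field column="url" template="${blog.id}"/>
  <!-- TemplateTransformer: same literal value for every document from this entity -->
  <field column="source" template="blog"/>
  <!-- DateFormatTransformer: parse the pubDate string using the given pattern -->
  <field column="date" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/>
</entity>
```

Each transformer only touches the fields that carry its own attribute (template or dateTimeFormat), which is how a single entity-level declaration can serve several fields.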
>
> Thanks for your help, Noble.
>
> Erik
>
>> On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
>> <er...@ehatchersolutions.com> wrote:
>>>
>>> I'm trying to index a blog with DIH, and have this:
>>>
>>> <field column="id" xpath="/rss/channel/item/link"/>
>>> <field column="url" xpath="/rss/channel/item/link"/>
>>>
>>> If I comment out the url <field> line it all works fine, but if I put it
>>> in, no documents get indexed. Is there an issue with using the same xpath
>>> twice? Or something else I'm missing?
>>>
>>> This is using Solr trunk.
>>>
>>> Thanks,
>>> Erik
>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>
--
--Noble Paul
Re: DIH - duplicate xpaths on HttpDataSource
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 14, 2008, at 6:15 AM, Noble Paul നോബിള് नोब्ळ् wrote:
> one xpath can be mapped to only one name.
That's a bizarre restriction. A shame.
> If you want to copy one to another use something like a
> TemplateTransformer
Thanks! I got it to work like this:
<entity name="blog" pk="id" url="http://server/rss/"
        processor="XPathEntityProcessor" forEach="/rss/channel/item"
        transformer="TemplateTransformer">
  <field column="id" xpath="/rss/channel/item/link"/>
  <field column="url" template="${blog.id}"/>
  <field column="title" xpath="/rss/channel/item/title"/>
  <field column="description" xpath="/rss/channel/item/description"/>
  <!-- No pubDate either <field column="date"
       xpath="/rss/channel/item/pubDate"
       dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z"/> -->
</entity>
A couple of questions... how would I add a static valued field in?
Something like:
<field column="source" value="blog"/>
where all documents from this entity come in with a source="blog"
literally.
Also, I want to add in date transformation like in the example above
commented out. How would I use both the TemplateTransformer and the
DateFormatTransformer, since I'm specifying the transformer on the
entity rather than the field it is used on? I don't yet understand
why the transformer is entity-based rather than per-field.
Thanks for your help, Noble.
Erik
> On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
> <er...@ehatchersolutions.com> wrote:
>> I'm trying to index a blog with DIH, and have this:
>>
>> <field column="id" xpath="/rss/channel/item/link"/>
>> <field column="url" xpath="/rss/channel/item/link"/>
>>
>> If I comment out the url <field> line it all works fine, but if I
>> put it in, no documents get indexed. Is there an issue with using the
>> same xpath twice? Or something else I'm missing?
>>
>> This is using Solr trunk.
>>
>> Thanks,
>> Erik
>>
>>
>
>
>
> --
> --Noble Paul
Re: DIH - duplicate xpaths on HttpDataSource
Posted by Noble Paul നോബിള് नोब्ळ् <no...@gmail.com>.
One xpath can be mapped to only one name.
If you want to copy one to another, use something like a TemplateTransformer.
On Sun, Dec 14, 2008 at 7:18 AM, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
> I'm trying to index a blog with DIH, and have this:
>
> <field column="id" xpath="/rss/channel/item/link"/>
> <field column="url" xpath="/rss/channel/item/link"/>
>
> If I comment out the url <field> line it all works fine, but if I put it in,
> no documents get indexed. Is there an issue with using the same xpath
> twice? Or something else I'm missing?
>
> This is using Solr trunk.
>
> Thanks,
> Erik
>
>
--
--Noble Paul