Posted to solr-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/09/17 05:27:37 UTC
[DIH] URLDataSource and fetching a link
Many RSS feeds contain a <link> to some full article. How can I have
the DIH get the RSS feed and then have it go and fetch the content at
the link?
Thanks,
Grant
Re: [DIH] URLDataSource and fetching a link
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
  <!-- Sub-entity: fetch the page behind each item's link and strip its HTML -->
  <entity name="x" url="${nytSportsFeed.link}"
          processor="PlainTextEntityProcessor"
          dataSource="rss"
          transformer="HTMLStripTransformer">
    <field column="plainText" name="body" stripHTML="true" />
  </entity>
</entity>
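The entity above refers to a data source named "rss" that is not shown in the thread. A minimal sketch of the surrounding data-config, assuming the source is a URLDataSource (as the subject line suggests); the encoding and timeout values are illustrative assumptions, not settings from the original config:

```xml
<dataConfig>
  <!-- Assumed declaration for the "rss" data source used by both entities.
       encoding, connectionTimeout and readTimeout are illustrative values. -->
  <dataSource name="rss"
              type="URLDataSource"
              encoding="UTF-8"
              connectionTimeout="5000"
              readTimeout="10000" />
  <document>
    <!-- The nytSportsFeed entity, with its nested "x" entity, goes here. -->
  </document>
</dataConfig>
```

The nested entity writes its output to a field named "body", so the schema would also need a matching field for the fetched text to be indexed.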
On Tue, Oct 20, 2009 at 6:13 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Finally getting back to this...
>
> On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള് नोब्ळ् wrote:
>
> 2009/9/17 Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>:
>>
>>> it is possible to have a sub entity which has XPathEntityProcessor
>>> which can use the link as the url
>>>
>>
>> This may not be a good solution.
>>
>> But you can use the $hasMore and $nextUrl options of
>> XPathEntityProcessor to recursively loop if there are more links
>>
>
> Is there an example of this somewhere? The DIH Wiki refers to it, but I
> don't see an example of it.
>
> I have:
> <entity name="nytSportsFeed"
> pk="link"
> url="http://feeds1.nytimes.com/nyt/rss/Sports"
> processor="XPathEntityProcessor"
> forEach="/rss/channel | /rss/channel/item"
> dataSource="rss"
> transformer="RegexTransformer,DateFormatTransformer">
> <field column="source" xpath="/rss/channel/title"
> commonField="true" />
> <field column="source-link"
> xpath="/rss/channel/link" commonField="true" />
> <field column="title"
> xpath="/rss/channel/item/title" />
> <field column="id" xpath="/rss/channel/item/guid" />
> <field column="link" xpath="/rss/channel/item/link"
> />
> <!-- Use the RegexTransformer to strip out ads -->
> <field column="description"
> xpath="/rss/channel/item/description" regex="&lt;a.*?&lt;/a&gt;"
> replaceWith=""/>
> <field column="category"
> xpath="/rss/channel/item/category" />
> <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
> <field column="pubDate" xpath="/rss/channel/item/pubDate"
> dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
> </entity>
>
> And I want to take the value from the link column and go get the contents
> of that link and index them into a "body" field.
>
> I'm not sure how to link in the sub-entity.
>
> Thanks,
> Grant
>
>
>
>
>>> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org>
>>> wrote:
>>>
>>>> Many RSS feeds contain a <link> to some full article. How can I have
>>>> the
>>>> DIH get the RSS feed and then have it go and fetch the content at the
>>>> link?
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>>
>
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: [DIH] URLDataSource and fetching a link
Posted by Grant Ingersoll <gs...@apache.org>.
Finally getting back to this...
On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള്
नोब्ळ् wrote:
> 2009/9/17 Noble Paul നോബിള് नोब्ळ्
> <no...@corp.aol.com>:
>> it is possible to have a sub entity which has XPathEntityProcessor
>> which can use the link as the url
>
> This may not be a good solution.
>
> But you can use the $hasMore and $nextUrl options of
> XPathEntityProcessor to recursively loop if there are more links
Is there an example of this somewhere? The DIH Wiki refers to it, but
I don't see an example of it.
I have:
<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
</entity>
And I want to take the value from the link column and go get the
contents of that link and index them into a "body" field.
I'm not sure how to link in the sub-entity.
Thanks,
Grant
>>
>> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll
>> <gs...@apache.org> wrote:
>>> Many RSS feeds contain a <link> to some full article. How can I
>>> have the
>>> DIH get the RSS feed and then have it go and fetch the content at
>>> the link?
>>>
>>> Thanks,
>>> Grant
>>>
Re: [DIH] URLDataSource and fetching a link
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
2009/9/17 Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>:
> it is possible to have a sub entity which has XPathEntityProcessor
> which can use the link as the url
This may not be a good solution.
But you can use the $hasMore and $nextUrl options of
XPathEntityProcessor to recursively loop if there are more links
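For illustration, here is an untested sketch of how $hasMore and $nextUrl might be set from a transformer. It assumes a paginated XML source that exposes the address of its next page; the feed layout (/feed/entry, /feed/nextPage), the example.com URL, the entity name and the followNext function are all hypothetical and do not come from the NYT feed discussed in this thread:

```xml
<dataConfig>
  <dataSource name="feed" type="URLDataSource" encoding="UTF-8" />
  <script><![CDATA[
    // Hypothetical helper: when the row carries a next-page URL, ask
    // XPathEntityProcessor to make another request to that URL.
    function followNext(row) {
      var next = row.get('nextPage');
      if (next != null) {
        row.put('$hasMore', 'true');
        row.put('$nextUrl', next);
      }
      return row;
    }
  ]]></script>
  <document>
    <!-- Assumed feed layout: <feed><nextPage/><entry/>...</feed> -->
    <entity name="pagedFeed"
            url="http://example.com/feed"
            processor="XPathEntityProcessor"
            forEach="/feed/entry"
            dataSource="feed"
            transformer="script:followNext">
      <field column="id" xpath="/feed/entry/id" />
      <field column="title" xpath="/feed/entry/title" />
      <!-- commonField copies the value onto every entry row -->
      <field column="nextPage" xpath="/feed/nextPage" commonField="true" />
    </entity>
  </document>
</dataConfig>
```

This pattern pages through one source of the same shape. It does not by itself fetch the article behind each item's link, which is what a nested entity is for.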
>
> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Many RSS feeds contain a <link> to some full article. How can I have the
>> DIH get the RSS feed and then have it go and fetch the content at the link?
>>
>> Thanks,
>> Grant
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: [DIH] URLDataSource and fetching a link
Posted by Noble Paul നോബിള് नोब्ळ् <no...@corp.aol.com>.
It is possible to have a sub-entity which uses XPathEntityProcessor
and takes the link as its url.
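A rough sketch of that idea, assuming the linked pages are well-formed XHTML (XPathEntityProcessor parses XML, so ordinary tag-soup HTML would likely fail). The outer entity name "feed", the inner entity name "article" and the "body" column are illustrative assumptions:

```xml
<entity name="feed"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel/item"
        dataSource="rss">
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Runs once per feed row; the url is resolved from that row's link column -->
  <entity name="article"
          url="${feed.link}"
          processor="XPathEntityProcessor"
          forEach="/html/body"
          dataSource="rss">
    <!-- flatten="true" collects the text under all child tags into one value -->
    <field column="body" xpath="/html/body" flatten="true" />
  </entity>
</entity>
```

The inner url is re-evaluated for every outer row, so each item triggers one extra HTTP request.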
On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Many RSS feeds contain a <link> to some full article. How can I have the
> DIH get the RSS feed and then have it go and fetch the content at the link?
>
> Thanks,
> Grant
>
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: [DIH] URLDataSource and fetching a link
Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 16, 2009, at 9:13 PM, Walter Underwood wrote:
> I would use the RSS feed (hopefully in Atom format) as a source of
> links, then use a regular web spider to fetch the content.
>
> I seriously doubt that DIH is up to the task of general fetching
> from the Wild Wild Web. That is a dirty and difficult job and DIH is
> designed for cooperating data stores.
>
This is just for a quick demo thing, not production.
Re: [DIH] URLDataSource and fetching a link
Posted by Walter Underwood <wu...@wunderwood.org>.
I would use the RSS feed (hopefully in Atom format) as a source of
links, then use a regular web spider to fetch the content.
I seriously doubt that DIH is up to the task of general fetching from
the Wild Wild Web. That is a dirty and difficult job and DIH is
designed for cooperating data stores.
wunder
On Sep 16, 2009, at 8:27 PM, Grant Ingersoll wrote:
> Many RSS feeds contain a <link> to some full article. How can I
> have the DIH get the RSS feed and then have it go and fetch the
> content at the link?
>
> Thanks,
> Grant