Posted to solr-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/09/17 05:27:37 UTC

[DIH] URLDataSource and fetching a link

Many RSS feeds contain a <link> to some full article.  How can I have  
the DIH get the RSS feed and then have it go and fetch the content at  
the link?

Thanks,
Grant

Re: [DIH] URLDataSource and fetching a link

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
  <!-- Sub-entity: fetch the page at each item's link and index it as "body" -->
  <entity name="x"
          url="${nytSportsFeed.link}"
          processor="PlainTextEntityProcessor"
          dataSource="rss"
          transformer="HTMLStripTransformer">
    <field column="plainText" name="body" stripHTML="true" />
  </entity>
</entity>
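
A minimal sketch of how an entity like this could sit in a complete data-config.xml. The dataSource declaration is an assumption (the thread only shows dataSource="rss" being referenced), as are the encoding and timeout values and the optional onError setting; the field list is abbreviated.

```xml
<dataConfig>
  <!-- Assumed declaration of the "rss" data source; timeouts are in milliseconds -->
  <dataSource name="rss" type="URLDataSource" encoding="UTF-8"
              connectionTimeout="5000" readTimeout="10000" />
  <document>
    <entity name="nytSportsFeed"
            pk="link"
            url="http://feeds1.nytimes.com/nyt/rss/Sports"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            dataSource="rss">
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <!-- Runs once per feed row; the URL is resolved from that row's link column.
           onError="skip" is optional and assumes a DIH version that supports it. -->
      <entity name="x"
              url="${nytSportsFeed.link}"
              processor="PlainTextEntityProcessor"
              dataSource="rss"
              transformer="HTMLStripTransformer"
              onError="skip">
        <field column="plainText" name="body" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

PlainTextEntityProcessor reads the whole response into a single plainText column, which is why the sub-entity needs only one field.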



On Tue, Oct 20, 2009 at 6:13 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Finally getting back to this...
>
> On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>  2009/9/17 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
>>
>>> it is possible to have a sub entity which has XPathEntityProcessor
>>> which can use the link as the url
>>>
>>
>> This may not be a good solution.
>>
>> But you can use the $hasMore and $nextUrl options of
>> XPathEntityProcessor to recursively loop if there are more links
>>
>
> Is there an example of this somewhere?  The DIH Wiki refers to it, but I
> don't see an example of it.
>
> I have:
> <entity name="nytSportsFeed"
>         pk="link"
>         url="http://feeds1.nytimes.com/nyt/rss/Sports"
>         processor="XPathEntityProcessor"
>         forEach="/rss/channel | /rss/channel/item"
>         dataSource="rss"
>         transformer="RegexTransformer,DateFormatTransformer">
>   <field column="source" xpath="/rss/channel/title" commonField="true" />
>   <field column="source-link" xpath="/rss/channel/link" commonField="true" />
>   <field column="title" xpath="/rss/channel/item/title" />
>   <field column="id" xpath="/rss/channel/item/guid" />
>   <field column="link" xpath="/rss/channel/item/link" />
>   <!-- Use the RegexTransformer to strip out ads -->
>   <field column="description" xpath="/rss/channel/item/description"
>          regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
>   <field column="category" xpath="/rss/channel/item/category" />
>   <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
>   <field column="pubDate" xpath="/rss/channel/item/pubDate"
>          dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
> </entity>
>
> And I want to take the value from the link column and go get the contents
> of that link and index them into a "body" field.
>
> I'm not sure how to link in the sub-entity.
>
> Thanks,
> Grant
>
>
>
>
>>> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org>
>>> wrote:
>>>
>>>> Many RSS feeds contain a <link> to some full article.  How can I have
>>>> the
>>>> DIH get the RSS feed and then have it go and fetch the content at the
>>>> link?
>>>>
>>>> Thanks,
>>>> Grant
>>>>
>>>>
>
>


-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: [DIH] URLDataSource and fetching a link

Posted by Grant Ingersoll <gs...@apache.org>.

Finally getting back to this...

On Sep 17, 2009, at 12:28 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

> 2009/9/17 Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>:
>> it is possible to have a sub entity which has XPathEntityProcessor
>> which can use the link as the url
>
> This may not be a good solution.
>
> But you can use the $hasMore and $nextUrl options of
> XPathEntityProcessor to recursively loop if there are more links

Is there an example of this somewhere?  The DIH Wiki refers to it, but  
I don't see an example of it.

I have:
<entity name="nytSportsFeed"
        pk="link"
        url="http://feeds1.nytimes.com/nyt/rss/Sports"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        dataSource="rss"
        transformer="RegexTransformer,DateFormatTransformer">
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="id" xpath="/rss/channel/item/guid" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- Use the RegexTransformer to strip out ads -->
  <field column="description" xpath="/rss/channel/item/description"
         regex="&lt;a.*?&lt;/a&gt;" replaceWith="" />
  <field column="category" xpath="/rss/channel/item/category" />
  <!-- 'Sun, 18 May 2008 11:23:11 +0000' -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
</entity>

And I want to take the value from the link column and go get the  
contents of that link and index them into a "body" field.
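
For the "body" field to be indexed, the Solr schema also needs matching field definitions. A rough sketch of the relevant schema.xml entries, assuming the field type names from the stock example schema ("string", "text", "date"); the indexed/stored choices are illustrative, not taken from this thread.

```xml
<!-- Hypothetical schema.xml fragment; type names assume the stock example schema -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="link" type="string" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="description" type="text" indexed="true" stored="true" />
  <field name="category" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="pubDate" type="date" indexed="true" stored="true" />
  <field name="source" type="string" indexed="true" stored="true" />
  <field name="source-link" type="string" indexed="false" stored="true" />
  <!-- Full article text fetched from the item's link -->
  <field name="body" type="text" indexed="true" stored="true" />
</fields>
```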

I'm not sure how to link in the sub-entity.

Thanks,
Grant


>>
>> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll  
>> <gs...@apache.org> wrote:
>>> Many RSS feeds contain a <link> to some full article.  How can I  
>>> have the
>>> DIH get the RSS feed and then have it go and fetch the content at  
>>> the link?
>>>
>>> Thanks,
>>> Grant
>>>

Re: [DIH] URLDataSource and fetching a link

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

2009/9/17 Noble Paul നോബിള്‍  नोब्ळ् <no...@corp.aol.com>:
> it is possible to have a sub entity which has XPathEntityProcessor
> which can use the link as the url

This may not be a good solution.

But you can use the $hasMore and $nextUrl options of
XPathEntityProcessor to recursively loop if there are more links
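
A rough, untested sketch of what that could look like. $hasMore and $nextUrl are special column names that XPathEntityProcessor looks for in each row; everything else here is hypothetical: the entity name, the example.com URL, and a feed layout with a <results> root whose <hasMore> and <next> children appear before the <item> elements.

```xml
<entity name="pagedFeed"
        url="http://example.com/feed?page=1"
        processor="XPathEntityProcessor"
        forEach="/results | /results/item"
        dataSource="rss">
  <!-- A value of "true" asks the processor to fetch another page
       once the current one is exhausted -->
  <field column="$hasMore" xpath="/results/hasMore" commonField="true" />
  <!-- URL of the next page to fetch -->
  <field column="$nextUrl" xpath="/results/next" commonField="true" />
  <field column="id" xpath="/results/item/id" />
  <field column="title" xpath="/results/item/title" />
</entity>
```

Paging would stop at the first page that does not yield a $hasMore value of "true".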
>
> On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> Many RSS feeds contain a <link> to some full article.  How can I have the
>> DIH get the RSS feed and then have it go and fetch the content at the link?
>>
>> Thanks,
>> Grant
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: [DIH] URLDataSource and fetching a link

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.

It is possible to have a sub-entity with an XPathEntityProcessor
that uses the link as the url.

On Thu, Sep 17, 2009 at 8:57 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Many RSS feeds contain a <link> to some full article.  How can I have the
> DIH get the RSS feed and then have it go and fetch the content at the link?
>
> Thanks,
> Grant
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: [DIH] URLDataSource and fetching a link

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 16, 2009, at 9:13 PM, Walter Underwood wrote:

> I would use the RSS feed (hopefully in Atom format) as a source of  
> links, then use a regular web spider to fetch the content.
>
> I seriously doubt that DIH is up to the task of general fetching  
> from the Wild Wild Web. That is a dirty and difficult job and DIH is  
> designed for cooperating data stores.
>

This is just for a quick demo thing, not production.

Re: [DIH] URLDataSource and fetching a link

Posted by Walter Underwood <wu...@wunderwood.org>.

I would use the RSS feed (hopefully in Atom format) as a source of  
links, then use a regular web spider to fetch the content.

I seriously doubt that DIH is up to the task of general fetching from  
the Wild Wild Web. That is a dirty and difficult job and DIH is  
designed for cooperating data stores.

wunder

On Sep 16, 2009, at 8:27 PM, Grant Ingersoll wrote:

> Many RSS feeds contain a <link> to some full article.  How can I  
> have the DIH get the RSS feed and then have it go and fetch the  
> content at the link?
>
> Thanks,
> Grant