Posted to solr-user@lucene.apache.org by Matteo Moci <mo...@gmail.com> on 2010/11/11 14:21:26 UTC

index just new articles from rss feeds - Data Import Request Handler

Hello,
I'd like to use solr to index some documents coming from an rss feed,
like the example at [1], but it seems that the configuration used
there is just for a one-time indexing, trying to get all the articles
exposed in the rss feed of the website.

Is it possible to manage and index just the new articles coming from
the rss source?

I found that delta-import might be useful but, from what I understand,
it only updates the index with documents that have been modified since
the last indexing run. That is obviously useful, but I'd like to index
just the new articles coming from an RSS feed.

Is this something Solr manages automatically, or do I have to deal with
it separately? Maybe a full import with the &clean=false parameter?
Are there any solutions that you would suggest?
Maybe store the article feeds in a table like [2] and have a module
that periodically sends each row to Solr for indexing?

Thanks,
Matteo

[1] http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example
[2] http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS

Re: index just new articles from rss feeds - Data Import Request Handler

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Nov 11, 2010 at 8:21 AM, Matteo Moci <mo...@gmail.com> wrote:
> Hello,
> I'd like to use solr to index some documents coming from an rss feed,
> like the example at [1], but it seems that the configuration used
> there is just for a one-time indexing, trying to get all the articles
> exposed in the rss feed of the website.
>
> Is it possible to manage and index just the new articles coming from
> the rss source?
>

Each item in an RSS feed has a publishing date which you can use to
ingest only the new articles.
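
As a rough illustration of filtering on the publishing date outside of
Solr, here is a minimal Python sketch. It assumes an RSS 2.0 feed whose
items carry a pubDate element, and that you keep the time of the last
successful indexing run somewhere yourself. The names new_items,
last_indexed and SAMPLE_RSS are made up for this example and are not
part of Solr or the DataImportHandler.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, used only to exercise the function below.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Old article</title><link>http://example.com/old</link>
    <pubDate>Wed, 10 Nov 2010 09:00:00 GMT</pubDate></item>
  <item><title>New article</title><link>http://example.com/new</link>
    <pubDate>Thu, 11 Nov 2010 12:00:00 GMT</pubDate></item>
</channel></rss>"""


def new_items(rss_xml, last_indexed):
    """Return the items published strictly after last_indexed.

    last_indexed must be a timezone-aware datetime.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        if raw_date is None:
            continue  # no date, so we cannot tell whether the item is new
        try:
            published = parsedate_to_datetime(raw_date)
        except (TypeError, ValueError):
            continue  # unparseable date: skip the item rather than fail
        if published.tzinfo is None:
            # Assumption: treat dates without a usable zone as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if published > last_indexed:
            fresh.append({
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "published": published,
            })
    return fresh


last_run = datetime(2010, 11, 11, 0, 0, tzinfo=timezone.utc)
titles = [doc["title"] for doc in new_items(SAMPLE_RSS, last_run)]
```

Only the documents returned by new_items would then be sent to Solr.
Using the item link as the unique key is one way to make a re-sent
article overwrite its earlier copy instead of duplicating it, but that
depends on how your schema is set up.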

> I found that maybe the delta-import can be useful but, from what I understand,
> the delta-import is used to just update the index with contents of
> documents that have been modified since the last indexing:
> this is obviously useful, but I'd like to index just the new articles
> coming from an rss feed.
>
> Is it something managed automatically by solr or I have to deal with
> it in a separate way? Maybe a full import with &clean=false
> parameters?
> Are there any solutions that you would suggest?
> Maybe storing the article feeds in a table like [2] and have a module
> that periodically sends each row to solr for indexing it?
>

The RSS import example is more of a proof of concept that it can be
done; it may not be the best way to do it, though. Storing the article
feeds in a table is essential if you have more than one. You can use a
parent entity for the table and a child entity to make the actual HTTP
calls to each RSS feed. Be sure to use onError="continue" so that a bad
RSS feed does not stop the whole process. It will probably work fine for
a handful of feeds, but if you are looking to develop a large feed
ingestion system, I'd suggest looking into alternate methods.
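
If you do go with an external module, the same "one bad feed must not
stop the run" idea can be sketched in Python as below. This is only an
illustration of the control flow, not how the DataImportHandler
implements onError="continue". The names ingest_all, fetch and index
are hypothetical: fetch stands for whatever downloads and parses one
feed, and index for whatever posts the resulting documents to Solr.

```python
def ingest_all(feed_urls, fetch, index):
    """Process every feed, recording failures instead of aborting.

    fetch(url) -> list of documents for one feed (may raise on error)
    index(docs) -> sends the documents to the search index
    Returns a list of (url, error message) pairs for feeds that failed.
    """
    failed = []
    for url in feed_urls:
        try:
            docs = fetch(url)
        except Exception as exc:
            # A broken feed is noted and skipped; the loop carries on.
            failed.append((url, str(exc)))
            continue
        if docs:
            index(docs)
    return failed


# Tiny self-contained demonstration with stand-in callables.
def demo_fetch(url):
    if "broken" in url:
        raise ValueError("malformed feed")
    return [{"link": url + "/article-1"}]


indexed = []
failures = ingest_all(
    ["http://example.com/a", "http://example.com/broken", "http://example.com/b"],
    demo_fetch,
    indexed.extend,
)
```

In this demonstration the feed list is a literal; in the setup described
above it would be read from the table of feeds.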

-- 
Regards,
Shalin Shekhar Mangar.