Posted to solr-user@lucene.apache.org by Shaun Barriball <sb...@yahoo.co.uk> on 2011/11/06 22:22:38 UTC

Aggregated indexing of updating RSS feeds

Hi all,

We've successfully setup Solr 3.4.0 to parse and import multiple news RSS feeds (based on the slashdot example on http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
The objective is for Solr to index ALL news items published on this feed (ever) - not just the current contents of the feed. I've read that the delta import is not supported for XML imports. I've therefore tried to use "command=full-impor&clean=false". 

But still the number of Documents Processed seems to be stuck at a fixed number of items looking at the Stats and the 'numFound' result for a generic '*:*' search. New items are being added to the feeds all the time (and old ones dropping off).

Is it possible for Solr to incrementally build an index of a live RSS feed which is changing but retain the index of its archive?

All help appreciated.
Shaun

Re: Aggregated indexing of updating RSS feeds

Posted by sbarriba <sb...@yahoo.co.uk>.
All,
Can anyone advise how to stop the "deleteAll" event during a full import? 

As discussed above, using clean=false with Solr 3.4 still seems to trigger
a delete of all previously imported data. I want to aggregate the results
of multiple imports.

Thanks in advance.
S


Re: Aggregated indexing of updating RSS feeds

Posted by Michael Kuhlmann <ku...@solarier.de>.
On 17.11.2011 11:53, sbarriba wrote:
> The 'params' logging pointer was what I needed. So for reference, it's not a
> good idea to use a 'wget' command directly in a crontab.
> I was using:
>
> wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

:))

I think the shell treated the ampersand as an instruction to put the wget
command into the background.

You could put the full URL in quotes, or escape the ampersand with a
backslash. Then it should work as well.
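
To illustrate (a sketch using the URL from this thread):

# unquoted: the shell splits the line at each '&', so wget is backgrounded
# with only 'command=full-import', and 'rows=5000' / 'clean=false' are run
# as separate (failing) shell commands:
wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

# quoted, the full query string reaches Solr:
wget "http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false"

# escaping each ampersand works too:
wget http://localhost/solr/myfeed?command=full-import\&rows=5000\&clean=false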

-Kuli

Re: Aggregated indexing of updating RSS feeds

Posted by sbarriba <sb...@yahoo.co.uk>.
Thanks Chris.

(Bell rings)

The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

...but moving this into a separate shell script, wrapping the URL in quotes
and calling that resolved the issue.
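
For anyone hitting the same issue, a minimal version of the wrapper script
(the script name, path and wget flags here are illustrative; -q -O /dev/null
just discards the status page that wget fetches):

#!/bin/sh
# myfeed-import.sh -- run a DIH full-import without cleaning the index.
# Quoting the URL stops the shell treating '&' as the background operator.
wget -q -O /dev/null "http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false"

...with the crontab entry calling the script rather than wget directly, e.g.:

*/5 * * * * /opt/solr/bin/myfeed-import.sh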

Thanks very much.


Re: Aggregated indexing of updating RSS feeds

Posted by Chris Hostetter <ho...@fucit.org>.
: ..but the request I'm making is..
: /solr/myfeed?command=full-import&rows=5000&clean=false
: 
: ..note the clean=false.

I see it, but i also see this in the logs you provided...

: INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
: QTime=8

...which means someone somewhere is executing full-import w/o using 
clean=false.  

are you absolutely certain that you are executing the request you think 
you are?  can you find a request in your logs that includes clean=false?

if it's not you and your code -- it is coming from somewhere, and that's 
what's causing DIH to trigger a deleteAll...
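
One quick way to check (assuming your servlet container logs to a single
file -- adjust the path and filename for your setup):

# count requests to the DIH handler that actually carried clean=false:
grep "path=/myfeed" catalina.out | grep -c "clean=false"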

: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
: doFullImport
: INFO: Starting Full Import
: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
: readIndexerProperties
: INFO: Read myfeed.properties
: 10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
: INFO: [] REMOVING ALL DOCUMENTS FROM INDEX



-Hoss

Re: Aggregated indexing of updating RSS feeds

Posted by sbarriba <sb...@yahoo.co.uk>.
All,
Can anyone advise how to stop the "deleteAll" event during a full import?

I'm still unable to determine why repeated full imports seem to delete the
existing index. After investigation, the logs confirm this - see "REMOVING
ALL DOCUMENTS FROM INDEX" below.

..but the request I'm making is..
/solr/myfeed?command=full-import&rows=5000&clean=false

..note the clean=false.

All help appreciated.
Shaun


INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
QTime=8
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read myfeed.properties
10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
10-Nov-2011 05:40:05 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=description:one+direction&rows=10&version=2.2}
hits=0 status=0 QTime=1
10-Nov-2011 05:40:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=id:*23327977*&rows=10&version=2.2} hits=0
status=0 QTime=1
10-Nov-2011 05:40:08 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2000
       
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x3,version=1319402557686,generation=2487,filenames=[_3u3.tii,
segments_1x3, _3u3.frq, _3u3.prx, _3u3.nrm, _3u3.fnm, _3u3.fdx, _3u3.tis,
_3u3.fdt]
       
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x4,version=1319402557691,generation=2488,filenames=[_3u5.nrm,
_3u5.fnm, _3u5.fdx, segments_1x4, _3u5.tis, _3u5.prx, _3u5.frq, _3u5.tii,
_3u5.fdt]


Re: Aggregated indexing of updating RSS feeds

Posted by sbarriba <sb...@yahoo.co.uk>.
Hi Hoss,
Thanks for the quick response.

RE point 1) I'd mistyped (sorry) the incremental URL I'm using for updates.
Essentially every 5 minutes the system is making an HTTP call to...

http://localhost/solr/myfeed?clean=false&command=full-import&rows=5000

..which when accessed returns the following showing 0 deleted.

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">/opt/solr/myfeed/data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">33</str>
<str name="Total Rows Fetched">594</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-11-08 14:11:30</str>
<str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0
documents.</str>
<str name="Committed">2011-11-08 14:11:31</str>
<str name="Optimized">2011-11-08 14:11:31</str>
<str name="Total Documents Processed">594</str>
<str name="Time taken ">0:0:6.492</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to
change in the future.</str>
</response>

...but a search always returns between 550 and 600 rows. There should be
thousands (as this is parsing 30+ active feeds).

My request handler is intended to be basic:

<requestHandler name="/myfeed"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/opt/solr/myfeed/data-config.xml</str>
  </lst>
</requestHandler>

I have not customised the solrconfig.xml beyond the above.

My data config is using:

<dataConfig>
        <dataSource type="HttpDataSource" />
        <document>
...

Should I be using the HttpDataSource?
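
For context, the entity definition broadly follows the slashdot example from
the wiki; a trimmed sketch (the feed URL and field names here are
illustrative, not my exact config):

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="myfeed"
            pk="link"
            url="http://example.com/news/rss.xml"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="description" xpath="/rss/channel/item/description" />
    </entity>
  </document>
</dataConfig>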

Any other thoughts?
Regards,
Shaun


Chris Hostetter-3 wrote:
> 
> : We've successfully setup Solr 3.4.0 to parse and import multiple news 
> : RSS feeds (based on the slashdot example on 
> : http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
> 
> : The objective is for Solr to index ALL news items published on this feed 
> : (ever) - not just the current contents of the feed. I've read that the 
> : delta import is not supported for XML imports. I've therefore tried to 
> : use "command=full-impor&clean=false". 
> 
> 1) note your typo, should be "full-import"
> 
> : But still the number of Documents Processed seems to be stuck at a fixed 
> : number of items looking at the Stats and the 'numFound' result for a 
> : generic '*:*' search. New items are being added to the feeds all the 
> : time (and old ones dropping off).
> 
> "Documents Processed" after each full import should be whatever the number 
> of items in the current feed is -- it's the number processed in that 
> import, no total number processed in all time.
> 
> if you specify clean=false no documents should be deleted.  I just tested 
> this using the slashdot example with Solr 3.4 and could not reproduce the 
> problem you described.  I loaded the following URL...
> 
> http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
> 
> ...then waited a while for the feed to change, and then loaded that URL 
> again.  The number of documents (returned by a *:* query) increased after 
> the second run.
> 
> 
> -Hoss
> 



Re: Aggregated indexing of updating RSS feeds

Posted by Chris Hostetter <ho...@fucit.org>.
: We've successfully setup Solr 3.4.0 to parse and import multiple news 
: RSS feeds (based on the slashdot example on 
: http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.

: The objective is for Solr to index ALL news items published on this feed 
: (ever) - not just the current contents of the feed. I've read that the 
: delta import is not supported for XML imports. I've therefore tried to 
: use "command=full-impor&clean=false". 

1) note your typo, should be "full-import"

: But still the number of Documents Processed seems to be stuck at a fixed 
: number of items looking at the Stats and the 'numFound' result for a 
: generic '*:*' search. New items are being added to the feeds all the 
: time (and old ones dropping off).

"Documents Processed" after each full import should be whatever the number 
of items in the current feed is -- it's the number processed in that 
import, no total number processed in all time.

if you specify clean=false no documents should be deleted.  I just tested 
this using the slashdot example with Solr 3.4 and could not reproduce the 
problem you described.  I loaded the following URL...

http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import

...then waited a while for the feed to change, and then loaded that URL 
again.  The number of documents (returned by a *:* query) increased after 
the second run.
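
If you want to script the same check, something along these lines (a sketch;
it assumes the example rss core on port 8983):

# run an import without cleaning, wait for the feed to change, run it again...
wget -q -O - "http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import"

# ...then compare numFound from a match-all query after each run:
wget -q -O - "http://localhost:8983/solr/rss/select?q=*:*&rows=0"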


-Hoss

Re: Aggregated indexing of updating RSS feeds

Posted by sbarriba <sb...@yahoo.co.uk>.
Thanks Nagendra, I'll take a look.

So a question for you et al.: will Solr, in its default installation, ALWAYS
delete content for an entity prior to doing a full import?
Can you not simply build up an index incrementally from multiple imports
(from XML)? I read elsewhere that the 'clean' parameter was intended to
control this.

Regards,
Shaun


Re: Aggregated indexing of updating RSS feeds

Posted by Fred Zimmerman <zi...@gmail.com>.
Any options that do not require adding new software?

On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya
<nnagarajayya@transaxtions.com> wrote:

> Shaun:
>
> You should try the NRT (near real time) support available with Solr with
> RankingAlgorithm here. You should be able to add docs in real time and also
> query them in real time. If DIH does not retain the old index, you may be
> able to convert the RSS fields to an XML format as needed by Solr and update
> the docs (make sure there is a unique id)
>
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
> http://solr-ra.tgels.org
>
> Regards,
>
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
>
> On 11/6/2011 1:22 PM, Shaun Barriball wrote:
>
>> Hi all,
>>
>> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS
>> feeds (based on the slashdot example on
>> http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
>> The objective is for Solr to index ALL news items published on this feed
>> (ever) - not just the current contents of the feed. I've read that the
>> delta import is not supported for XML imports. I've therefore tried to use
>> "command=full-impor&clean=**false".
>> But still the number of Documents Processed seems to be stuck at a fixed
>> number of items looking at the Stats and the 'numFound' result for a
>> generic '*:*' search. New items are being added to the feeds all the time
>> (and old ones dropping off).
>>
>> Is it possible for Solr to incrementally build an index of a live RSS
>> feed which is changing but retain the index of its archive?
>>
>> All help appreciated.
>> Shaun
>>
>
>

Re: Aggregated indexing of updating RSS feeds

Posted by Nagendra Nagarajayya <nn...@transaxtions.com>.
Shaun:

You should try the NRT (near real time) support available with Solr with 
RankingAlgorithm here. You should be able to add docs in real time and also 
query them in real time. If DIH does not retain the old index, you may be 
able to convert the RSS fields to an XML format as needed by Solr and update 
the docs (make sure there is a unique id)

http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 11/6/2011 1:22 PM, Shaun Barriball wrote:
> Hi all,
>
> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS feeds (based on the slashdot example on http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
> The objective is for Solr to index ALL news items published on this feed (ever) - not just the current contents of the feed. I've read that the delta import is not supported for XML imports. I've therefore tried to use "command=full-impor&clean=false". 
>
> But still the number of Documents Processed seems to be stuck at a fixed number of items looking at the Stats and the 'numFound' result for a generic '*:*' search. New items are being added to the feeds all the time (and old ones dropping off).
>
> Is it possible for Solr to incrementally build an index of a live RSS feed which is changing but retain the index of its archive?
>
> All help appreciated.
> Shaun