Posted to solr-user@lucene.apache.org by Shaun Barriball <sb...@yahoo.co.uk> on 2011/11/06 22:22:38 UTC
Aggregated indexing of updating RSS feeds
Hi all,
We've successfully set up Solr 3.4.0 to parse and import multiple news RSS feeds (based on the slashdot example on http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
The objective is for Solr to index ALL news items published on this feed (ever) - not just the current contents of the feed. I've read that the delta import is not supported for XML imports. I've therefore tried to use "command=full-impor&clean=false".
But still the number of Documents Processed seems to be stuck at a fixed number of items looking at the Stats and the 'numFound' result for a generic '*:*' search. New items are being added to the feeds all the time (and old ones dropping off).
Is it possible for Solr to incrementally build an index of a live RSS feed which is changing but retain the index of its archive?
All help appreciated.
Shaun
Re: Aggregated indexing of updating RSS feeds
Posted by sbarriba <sb...@yahoo.co.uk>.
All,
Can anyone advise how to stop the "deleteAll" event during a full import?
As discussed above using clean=false with Solr 3.4 still seems to trigger a
delete of all previous imported data. I want to aggregate the results of
multiple imports.
Thanks in advance.
S
--
View this message in context: http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3512260.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Aggregated indexing of updating RSS feeds
Posted by Michael Kuhlmann <ku...@solarier.de>.
Am 17.11.2011 11:53, schrieb sbarriba:
> The 'params' logging pointer was what I needed. So for reference, it's not a
> good idea to use a 'wget' command directly in a crontab.
> I was using:
>
> wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false
:))
I think the shell treated the ampersand as an instruction to put the wget
command into the background.
You could put the full URL into quotes, or escape each ampersand with a
backslash. Then it should work as well.
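To see the effect concretely: an unquoted '&' ends the command and backgrounds it, so everything after the first '&' never reaches wget. A minimal sketch (host and handler names are the ones from this thread, not a recommendation):

```shell
# Unquoted, the shell would split this URL at each '&', backgrounding
# wget and treating rows=5000 / clean=false as separate (bogus) commands.
# Quoting the whole URL -- or escaping each '&' as '\&' -- keeps it intact:
URL='http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false'
echo "$URL"
```

The same quoting applies inside a crontab entry or a wrapper script.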
-Kuli
Re: Aggregated indexing of updating RSS feeds
Posted by sbarriba <sb...@yahoo.co.uk>.
Thanks Chris.
(Bell rings)
The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:
wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false
...but moving this into a separate shell script, wrapping the URL in quotes
and calling that resolved the issue.
Thanks very much.
Re: Aggregated indexing of updating RSS feeds
Posted by Chris Hostetter <ho...@fucit.org>.
: ..but the request I'm making is..
: /solr/myfeed?command=full-import&rows=5000&clean=false
:
: ..note the clean=false.
I see it, but i also see this in the logs you provided...
: INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
: QTime=8
...which means someone somewhere is executing full-import w/o using
clean=false.
are you absolutely certain that you are executing the request you think
you are? can you find a request in your logs that includes clean=false?
if it's not you and your code -- it is coming from somewhere, and that's
what's causing DIH to trigger a deleteAll...
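One way to check, as suggested above, is to filter the request log for full-import entries and see whether clean=false ever appears in the logged params. A sketch using sample log lines in the format shown in this thread (rather than a real log file):

```shell
# Two sample request-log lines (illustrative); only the second carries
# clean=false, so only it survives the filter. Against a real install
# you would pipe the servlet container's log through the same grep.
printf '%s\n' \
  'INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0' \
  'INFO: [] webapp=/solr path=/myfeed params={clean=false&command=full-import} status=0' \
  | grep 'clean=false'
```

If no logged request carries clean=false, something else (a cron job, a monitoring probe, a stray browser tab) is issuing the bare full-import.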
: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
: doFullImport
: INFO: Starting Full Import
: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
: readIndexerProperties
: INFO: Read myfeed.properties
: 10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
: INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
-Hoss
Re: Aggregated indexing of updating RSS feeds
Posted by sbarriba <sb...@yahoo.co.uk>.
All,
Can anyone advise how to stop the "deleteAll" event during a full import?
I'm still unable to determine why repeated full imports seem to delete
previously indexed documents. The logs confirm this -- see "REMOVING ALL
DOCUMENTS FROM INDEX" below.
..but the request I'm making is..
/solr/myfeed?command=full-import&rows=5000&clean=false
..note the clean=false.
All help appreciated.
Shaun
INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
QTime=8
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read myfeed.properties
10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
10-Nov-2011 05:40:05 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=description:one+direction&rows=10&version=2.2}
hits=0 status=0 QTime=1
10-Nov-2011 05:40:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=id:*23327977*&rows=10&version=2.2} hits=0
status=0 QTime=1
10-Nov-2011 05:40:08 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2000
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x3,version=1319402557686,generation=2487,filenames=[_3u3.tii,
segments_1x3, _3u3.frq, _3u3.prx, _3u3.nrm, _3u3.fnm, _3u3.fdx, _3u3.tis,
_3u3.fdt]
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x4,version=1319402557691,generation=2488,filenames=[_3u5.nrm,
_3u5.fnm, _3u5.fdx, segments_1x4, _3u5.tis, _3u5.prx, _3u5.frq, _3u5.tii,
_3u5.fdt]
Re: Aggregated indexing of updating RSS feeds
Posted by sbarriba <sb...@yahoo.co.uk>.
Hi Hoss,
Thanks for the quick response.
RE point 1) I'd mistyped (sorry) the incremental URL I'm using for updates.
Essentially every 5 minutes the system is making an HTTP call to...
http://localhost/solr/myfeed?clean=false&command=full-import&rows=5000
...which, when accessed, returns the following (showing 0 deleted):
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">/opt/solr/myfeed/data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">33</str>
<str name="Total Rows Fetched">594</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-11-08 14:11:30</str>
<str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0
documents.</str>
<str name="Committed">2011-11-08 14:11:31</str>
<str name="Optimized">2011-11-08 14:11:31</str>
<str name="Total Documents Processed">594</str>
<str name="Time taken ">0:0:6.492</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>
...but a search always returns between 550 and 600 rows. There should be
thousands (as this is parsing 30+ active feeds).
My request handler is intended to be basic:
<requestHandler name="/myfeed"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/opt/solr/myfeed/data-config.xml</str>
</lst>
</requestHandler>
I have not customised the solrconfig.xml beyond the above.
My data config is using:
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
...
Should I be using the HttpDataSource?
Any other thoughts?
Regards,
Shaun
Chris Hostetter-3 wrote:
>
> : We've successfully setup Solr 3.4.0 to parse and import multiple news
> : RSS feeds (based on the slashdot example on
> : http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
>
> : The objective is for Solr to index ALL news items published on this feed
> : (ever) - not just the current contents of the feed. I've read that the
> : delta import is not supported for XML imports. I've therefore tried to
> : use "command=full-impor&clean=false".
>
> 1) note your typo, should be "full-import"
>
> : But still the number of Documents Processed seems to be stuck at a fixed
> : number of items looking at the Stats and the 'numFound' result for a
> : generic '*:*' search. New items are being added to the feeds all the
> : time (and old ones dropping off).
>
> "Documents Processed" after each full import should be whatever the number
> of items in the current feed is -- it's the number processed in that
> import, not the total number processed over all time.
>
> if you specify clean=false no documents should be deleted. I just tested
> this using the slashdot example with Solr 3.4 and could not reproduce the
> problem you described. I loaded the following URL...
>
> http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
>
> ...then waited a while for the feed to change, and then loaded that URL
> again. The number of documents (returned by a *:* query) increased after
> the second run.
>
>
> -Hoss
>
Re: Aggregated indexing of updating RSS feeds
Posted by Chris Hostetter <ho...@fucit.org>.
: We've successfully setup Solr 3.4.0 to parse and import multiple news
: RSS feeds (based on the slashdot example on
: http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
: The objective is for Solr to index ALL news items published on this feed
: (ever) - not just the current contents of the feed. I've read that the
: delta import is not supported for XML imports. I've therefore tried to
: use "command=full-impor&clean=false".
1) note your typo, should be "full-import"
: But still the number of Documents Processed seems to be stuck at a fixed
: number of items looking at the Stats and the 'numFound' result for a
: generic '*:*' search. New items are being added to the feeds all the
: time (and old ones dropping off).
"Documents Processed" after each full import should be whatever the number
of items in the current feed is -- it's the number processed in that
import, not the total number processed over all time.
if you specify clean=false no documents should be deleted. I just tested
this using the slashdot example with Solr 3.4 and could not reproduce the
problem you described. I loaded the following URL...
http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
...then waited a while for the feed to change, and then loaded that URL
again. The number of documents (returned by a *:* query) increased after
the second run.
-Hoss
Re: Aggregated indexing of updating RSS feeds
Posted by sbarriba <sb...@yahoo.co.uk>.
Thanks Nagendra, I'll take a look.
So a question for you et al: will Solr, in its default installation, ALWAYS
delete content for an entity prior to doing a full import?
Can you really not build up an index incrementally from multiple imports
(from XML)? I read elsewhere that the 'clean' parameter was intended to
control this.
Regards,
Shaun
Re: Aggregated indexing of updating RSS feeds
Posted by Fred Zimmerman <zi...@gmail.com>.
Any options that do not require adding new software?
On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya <
nnagarajayya@transaxtions.com> wrote:
> Shaun:
>
> You should try NRT available with Solr with RankingAlgorithm here. You
> should be able to add docs in real time and also query them in real time.
> If DIH does not retain the old index, you may be able to convert the RSS
> fields to an XML format as needed by Solr and update the docs (make sure
> there is a unique id)
>
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
> http://solr-ra.tgels.org
>
> Regards,
>
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
>
> On 11/6/2011 1:22 PM, Shaun Barriball wrote:
>
>> Hi all,
>>
>> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS
>> feeds (based on the slashdot example on
>> http://wiki.apache.org/solr/DataImportHandler) using
>> the HttpDataSource.
>> The objective is for Solr to index ALL news items published on this feed
>> (ever) - not just the current contents of the feed. I've read that the
>> delta import is not supported for XML imports. I've therefore tried to use
>> "command=full-impor&clean=false".
>> But still the number of Documents Processed seems to be stuck at a fixed
>> number of items looking at the Stats and the 'numFound' result for a
>> generic '*:*' search. New items are being added to the feeds all the time
>> (and old ones dropping off).
>>
>> Is it possible for Solr to incrementally build an index of a live RSS
>> feed which is changing but retain the index of its archive?
>>
>> All help appreciated.
>> Shaun
>>
>
>
Re: Aggregated indexing of updating RSS feeds
Posted by Nagendra Nagarajayya <nn...@transaxtions.com>.
Shaun:
You should try NRT available with Solr with RankingAlgorithm here. You
should be able to add docs in real time and also query them in real
time. If DIH does not retain the old index, you may be able to convert
the RSS fields to an XML format as needed by Solr and update the docs
(make sure there is a unique id)
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org
Regards,
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
On 11/6/2011 1:22 PM, Shaun Barriball wrote:
> Hi all,
>
> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS feeds (based on the slashdot example on http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
> The objective is for Solr to index ALL news items published on this feed (ever) - not just the current contents of the feed. I've read that the delta import is not supported for XML imports. I've therefore tried to use "command=full-impor&clean=false".
>
> But still the number of Documents Processed seems to be stuck at a fixed number of items looking at the Stats and the 'numFound' result for a generic '*:*' search. New items are being added to the feeds all the time (and old ones dropping off).
>
> Is it possible for Solr to incrementally build an index of a live RSS feed which is changing but retain the index of its archive?
>
> All help appreciated.
> Shaun