Posted to solr-user@lucene.apache.org by btucker <bt...@mintel.com> on 2011/01/21 11:27:25 UTC

Delta Import occasionally missing records.

Hello

We've just started using Solr to provide search functionality for our
application, with the DataImportHandler performing a delta-import every minute,
fired by crontab. This works great; however, it occasionally misses records
that are added to the database while the delta-import is running.

Our data-config.xml has the following queries in its root entity:

query="SELECT id, date_published, date_created, publish_flag FROM Item WHERE
id > 0
                                                                                        
AND record_type_id=0
                                                                                        
ORDER BY id DESC"
preImportDeleteQuery="SELECT item_id AS Id FROM
gnpd_production.item_deletions"
deletedPkQuery="SELECT item_id AS id FROM gnpd_production.item_deletions
WHERE deletion_date >=
                                             
SUBDATE('${dataimporter.last_index_time}', INTERVAL 5 MINUTE)"
deltaImportQuery="SELECT id, date_published, date_created, publish_flag FROM
Item WHERE id > 0
                                                                                                   
AND record_type_id=0
                                                                                                   
AND id=${dataimporter.delta.id}
                                                                                                   
ORDER BY id DESC"
deltaQuery="SELECT id, date_published, date_created, publish_flag FROM Item
WHERE id > 0
                                                                                             
AND record_type_id=0
                                                                                             
AND sys_time_stamp >=
                                                                                   
SUBDATE('${dataimporter.last_index_time}', INTERVAL 1 MINUTE) ORDER BY id
DESC">

I think the problem I'm having comes from the way Solr stores the
last_index_time in conf/dataimport.properties, as stated on the wiki:

"When delta-import command is executed, it reads the start time stored in
conf/dataimport.properties. It uses that timestamp to run delta queries and
after completion, updates the timestamp in conf/dataimport.properties."

To me this indicates that any record with a timestamp between when the
dataimport starts and when it ends will be missed, because last_index_time is
set to the time the import completes.

This doesn't seem quite right to me. I would have expected last_index_time to
refer to when the dataimport was last STARTED, so that there are no gaps in the
time range covered.

I changed the deltaQuery in our config to include the SUBDATE(..., INTERVAL 1
MINUTE) expression to alleviate this, but that only covers the case where the
delta-import takes less than a minute.

Any ideas as to how this can be overcome, other than increasing the INTERVAL to
something larger?

Regards

Barry Tucker

Re: Delta Import occasionally missing records.

Posted by Lance Norskog <go...@gmail.com>.
The SolrEntityProcessor would be a top-level entity. You would do a
query like this: &sort=timestamp desc&rows=1&fl=timestamp. This gives
you one data item: the timestamp of the last item added to the index.
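
Spelled out as a full request (assuming the field is simply called timestamp
and a default single-core setup; adjust host and core to taste):

http://localhost:8983/solr/select?q=*:*&sort=timestamp+desc&rows=1&fl=timestamp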

With this, the JDBC sub-entity would create a query that chooses all
rows with a timestamp >= this latest timestamp. It will not be easy to
put this together, but it is possible :)
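
Something like this is the shape I have in mind. Treat the SolrEntityProcessor
attribute names as assumptions to check against whichever SOLR-1499 patch you
apply (it also has to return the single newest document, i.e. sorted by
timestamp descending), and the entity/dataSource names are made up:

<document>
  <!-- Top-level entity: asks the existing index for the newest indexed
       timestamp (SOLR-1499 / SolrEntityProcessor; attribute names assumed). -->
  <entity name="latest" processor="SolrEntityProcessor"
          url="http://localhost:8983/solr"
          query="*:*" rows="1" fl="timestamp">
    <!-- JDBC sub-entity: re-selects every row stamped at or after the newest
         timestamp already indexed, via DIH's ${parentEntity.field} variables.
         The value may need dataimporter.functions.formatDate(...) to match
         MySQL's expected date format. -->
    <entity name="item" dataSource="db"
            query="SELECT id, date_published, date_created, publish_flag
                   FROM Item
                   WHERE record_type_id = 0
                     AND sys_time_stamp >= '${latest.timestamp}'"/>
  </entity>
</document>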

Good luck!

Lance

On Mon, Jan 24, 2011 at 2:04 AM, btucker <bt...@mintel.com> wrote:
>
> Thank you for your response.
>
> In what way is 'timestamp' not perfect?
>
> I've looked into the SolrEntityProcessor and added a timestamp field to our
> index.
> However, I'm struggling to work out a query to get the max value of the
> timestamp field, and I'm not sure whether the SolrEntityProcessor entity
> should appear before the root entity or wrap around it.



-- 
Lance Norskog
goksron@gmail.com

Re: Delta Import occasionally missing records.

Posted by btucker <bt...@mintel.com>.
Thank you for your response.

In what way is 'timestamp' not perfect?

I've looked into the SolrEntityProcessor and added a timestamp field to our
index.
However, I'm struggling to work out a query to get the max value of the
timestamp field, and I'm not sure whether the SolrEntityProcessor entity should
appear before the root entity or wrap around it.

On 22 January 2011 07:24, Lance Norskog-2 [via Lucene] wrote:

> The timestamp thing is not perfect. You can instead do a search
> against Solr and find the latest timestamp in the index. SOLR-1499
> allows you to search against Solr in the DataImportHandler.



Re: Delta Import occasionally missing records.

Posted by Lance Norskog <go...@gmail.com>.
The timestamp thing is not perfect. You can instead do a search
against Solr and find the latest timestamp in the index. SOLR-1499
allows you to search against Solr in the DataImportHandler.




-- 
Lance Norskog
goksron@gmail.com