Posted to solr-dev@lucene.apache.org by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com> on 2009/08/20 06:40:11 UTC

Re: DataImportHandler - very slow delta import

We have refrained from putting any intelligence into DIH for
constructing queries. It is not wise to add something that is
useful to some users but breaks in a lot of other cases. That is a
support nightmare. Our intent is to help users construct the query
themselves so that there are few surprises.
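
As a rough illustration of a user-constructed query, the workaround quoted
further down (a separate entity directly under <document> whose "query"
returns only the changed rows, run as a full-import) might look something
like the sketch below. The entity name "agency_changes" is hypothetical, the
"..." placeholders stand for the parts elided in the original configuration,
and the sketch assumes that the main query reads from the same publicagency
table as the deltaQuery and that ${dataimporter.last_index_time} is resolved
during a full-import on the Solr version in use.

```xml
<!-- Sketch only, not a tested configuration.
     "agency_changes" is a hypothetical entity name.
     Assumes the full query selects from publicagency, like the deltaQuery. -->
<document>
  <entity name="agency_changes" transformer="ClobTransformer" pk="oid"
          query="select archwaypublic.getSolrIdentifier(oid, 'agency') as oid,
                        oid as realoid,
                        archwaypublic.getSolrIdentifier(oid, 'agency') as id,
                        code, name, ...
                 from publicagency with (nolock)
                 where modifiedtime > '${dataimporter.last_index_time}'">
    <!-- field mappings elided, as in the original configuration -->
    ...
  </entity>
</document>
```

Such an entity would presumably be run with something like
command=full-import&entity=agency_changes&clean=false, where clean=false is
meant to keep the existing index from being wiped first. Because the changed
rows come back from a single query, this avoids the one-db-call-per-row
pattern of delta-import.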

On Thu, Aug 20, 2009 at 3:24 AM, Matthew
Painter<Ma...@archives.govt.nz> wrote:
> Fair enough; I was wondering if that was the reason. Although, wouldn't the vast majority of delta queries be covered by standard 'in' clause syntax? e.g.
>
> select * from myTable where id in (1,2,3,...)
>
> I'm just wondering whether that could be adopted as the standard behaviour, with the simpleton approach available as an option for more complex queries. I can see though that this may well be fraught with peril!
>
> Anyway, thanks for your help - it's much appreciated.
>
> M
>
>
>
> -----Original Message-----
> From: noble.paul@gmail.com [mailto:noble.paul@gmail.com] On Behalf Of Noble Paul നോബിള്‍ नोब्ळ्
> Sent: Wednesday, 19 August 2009 5:52 p.m.
> To: Matthew Painter
> Subject: Re: DataImportHandler - very slow delta import
>
> On Wed, Aug 19, 2009 at 3:15 AM, Matthew Painter<Ma...@archives.govt.nz> wrote:
>> Thanks; that confirms my observed behaviour.
>>
>> However, why would the delta query have to make a single db call per changed row? For simple delta queries like mine below, batching a chunk of rows at a time from the database seems quite doable. Or are there less-trivial situations where batching wouldn't work?
>
> The problem is that DIH cannot create intelligent queries, but users can. So DIH goes with the simpleton approach:
>
> for each row returned by deltaQuery, run the deltaImportQuery.
>
>>
>> Does the deletedPkQuery suffer from the same performance issues? The problem in our specific instance is that often we're removing and modifying thousands of rows in one hit so I may have to adopt a different approach. I'm not comfortable using Solr 1.4 in a production environment yet, so unfortunately the nice new features in the DataImportHandler aren't an option.
>
> deletedPkQuery has no such problem because it is run only once.
>>
>> I'll try your suggested solution soon.
>>
>> M
>>
>>
>>
>> -----Original Message-----
>> From: noble.paul@gmail.com [mailto:noble.paul@gmail.com] On Behalf Of Noble Paul നോബിള്‍ नोब्ळ्
>> Sent: Tuesday, 18 August 2009 5:11 p.m.
>> To: solr-user@lucene.apache.org
>> Subject: Re: DataImportHandler - very slow delta import
>>
>> Delta imports are likely to be far slower than full imports
>> because they make one db call per changed row. If you can write the
>> "query" in such a way that it returns only the changed rows, then write
>> a separate entity (directly under <document>) and just run a
>> full-import with that entity only.
>>
>> On Tue, Aug 18, 2009 at 6:32 AM, Matthew
>> Painter<Ma...@archives.govt.nz> wrote:
>>> Hi,
>>>
>>> We are using Solr's DataImportHandler to populate the Solr index from
>>> a SQL Server database of nearly 4,000,000 rows. Whereas the
>>> population itself is very fast (around 1000 rows per second), the
>>> delta import is only processing around one row a second.
>>>
>>> Is this a known performance issue? We are using Solr 1.3.
>>>
>>> For reference, the abridged entity configuration (cuts indicated by
>>> '...') is below:
>>>
>>>  <entity name="id" transformer="ClobTransformer" pk="oid"
>>>            query="select archwaypublic.getSolrIdentifier(oid,
>>> 'agency') as oid, oid as realoid,
>>> archwaypublic.getSolrIdentifier(oid, 'agency') as id, code, name, ..."
>>>   deltaQuery="select oid from publicagency with (nolock) where
>>> modifiedtime > '${dataimporter.last_index_time}'"
>>>   deletedPkQuery="select archwaypublic.getSolrIdentifier(entityoid,
>>> 'agency') as oid from pendingsolrdeletions with (nolock) where
>>> entitytype='agency'">
>>>
>>> ...
>>> </entity>
>>>
>>> Thanks,
>>> Matt
>>>
>>> This e-mail message and any attachments are CONFIDENTIAL to the addressee(s) and may also be LEGALLY PRIVILEGED.  If you are not the intended addressee, please do not use, disclose, copy or distribute the message or the information it contains.  Instead, please notify me as soon as possible and delete the e-mail, including any attachments.  Thank you.
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com