Posted to solr-user@lucene.apache.org by Ephraim Ofir <Ep...@icq.com> on 2010/08/30 14:07:14 UTC

DIH - deleting documents, high performance (delta) imports, and passing parameters

After wasting a few days navigating the somewhat uncharted and murky
waters of DIH, I thought I'd share my insights with the community to
save other newbies some time, so here goes...

First off, this is not to say DIH is bad. I think it's great and it
works really well for my uses, but it has a few undocumented quirks
which cost me a lot of time.



Deleting documents - several options:
1. Use the deletedPkQuery attribute in a delta-import (see the sketch
after the next paragraph). You'll need a DB query which generates the
IDs to be deleted (something like: SELECT id FROM yourTable WHERE
deletedFlag = 1). Make sure that you have a pk in your entity and that
it's the same one returned by your query (in this case - pk="id").
2. Add the $deleteDocById or $deleteDocByQuery special command to your
full/delta import. This one is a bit tricky, see comment below**.
3. Use preImportDeleteQuery/postImportDeleteQuery in your full/delta
import (contrary to what the wiki says, this works for delta-import as
well as full-import).

Any one of these can be used separately from your import; you can put
them in a separate entity and do a full/delta import just on that entity
if that's what you want.
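
A minimal sketch of option 1, reusing the table/column names from the
example above (all other attributes are elided):
<entity name="my_entity"
        pk="id"
        query="SELECT id, ... FROM yourTable"
        deletedPkQuery="SELECT id FROM yourTable WHERE deletedFlag = 1"
        ... >
  <field ... />
</entity>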



High performance imports with sub entities:
DIH's sub-entity architecture is very easy to understand and makes a lot
of sense, but it performs a sub-query for each row in the root entity,
which is not practical for high volumes. I opted for a solution I found
in the Solr book by Packt (excellent book BTW) which involves pushing
multi-valued data into a single field with a separator, which is then
split by DIH with a RegexTransformer. This way Solr issues only one
query to the DB and the DB does all the heavy lifting. I actually
implemented my query as a stored procedure so it can be optimized by the
DBA and by the DB, and be kept separate from the Solr config. The
following (MySQL) query concatenates 3 lang_code fields from the main
table into one field and multiple emails from a secondary table into
another field:
SELECT u.id,
       u.name,
       IF((u.lang_code1 IS NULL AND u.lang_code2 IS NULL AND u.lang_code3 IS NULL),
          NULL,
          CONVERT(CONCAT_WS('|', u.lang_code1, u.lang_code2, u.lang_code3) USING ascii)
         ) AS multi_lang_codes,
       GROUP_CONCAT(e.email SEPARATOR '|') AS multiple_emails
FROM users_tb u
LEFT JOIN emails_tb e ON u.id = e.id
GROUP BY u.id

The entity in data-config.xml looks something like (note that the
column attributes must match the aliases returned by the query):
<entity name="my_entity"
        query="call get_solr_full();"
        transformer="RegexTransformer">
  <field name="email" column="multiple_emails" splitBy="\|" />
  <field name="lang_code" column="multi_lang_codes" splitBy="\|" />
</entity>
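
For the split values to actually come out as multiple values, the
target fields should be declared multiValued in schema.xml. A minimal
sketch (the field types here are assumptions, adjust to your schema):
<field name="email" type="string" indexed="true" stored="true" multiValued="true" />
<field name="lang_code" type="string" indexed="true" stored="true" multiValued="true" />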

High performance delta imports:
DIH's delta-import architecture suffers from the same problem as above:
it performs one query to create a list of IDs which need to be updated
and then performs one query to update each ID, which is not practical
for high volumes of data. I was fervently looking for a way to do this
in a single simple query which would basically be the full import query
with an added "WHERE last_updated > ${dataimporter.last_index_time}"
clause. The closest thing I found was how to do a delta-import using
full-import (DIH FAQ). I fiddled around with it a bit until I finally
realized that you can actually do exactly what I wanted very simply -
you just need to put a dummy query in the deltaQuery (you have to have
a query there which returns one row for each time you want the
deltaImportQuery to run - once in my case) and put whatever query you
want in the deltaImportQuery. You could even use the deltaQuery to get
some parameter from the DB to use with the deltaImportQuery instead of
using the dataimporter's timestamp (I saw a lot of questions concerning
time differences between the Solr host and the DB, or other methods of
determining the delta, which could be solved this way). I have no need
for this, so my entity in data-config.xml looks something like:
<entity name="my_entity"
        pk="id"
        deltaQuery="SELECT 1 AS dummy;"
        deltaImportQuery="call get_solr_delta('${dataimporter.last_index_time}');"
        ... >
  <field ... />
</entity>
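
If you do want the DB-driven variant mentioned above, columns returned
by the deltaQuery are available to the deltaImportQuery as
${dataimporter.delta.<column>}. A sketch, assuming a hypothetical
import_log table that records the last successful import time:
<entity name="my_entity"
        pk="id"
        deltaQuery="SELECT last_import_time FROM import_log;"
        deltaImportQuery="call get_solr_delta('${dataimporter.delta.last_import_time}');"
        ... >
  <field ... />
</entity>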



Passing "parameters" to DIH config:
I have multiple Solr shards in my setup, and wanted to reuse as many
config files as possible, the problem is that data-config.xml doesn't
seem to support system property substitution like solrconfig.xml does
(at least not in 1.4.1, I think I saw something about that in JIRA
somewhere). I found a workaround for this by using the property
substitution in solrconfig.xml and "passing it as a parameter" to DIH.
Here's an excerpt from my solrconfig.xml using shardID and master
properties (the master property here is used to make sure you can't use
DIH on slaves):

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler"
                enable="${master:true}">
  <lst name="defaults">
    <str name="config">data-config.xml</str>

    <!-- passing shardID so it can be used in data-config.xml -->
    <str name="shardID">${shardID}</str>
  </lst>
</requestHandler>

Now, shardID can be used in data-config.xml as
${dataimporter.request.shardID}. Example:
<entity name="my_entity"
        query="call get_solr_full_${dataimporter.request.shardID}();"
        ... >
  <field ... />
</entity>
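
The shardID and master properties themselves still have to be defined
per core, either as JVM system properties (e.g. -DshardID=3) or in the
core's solrcore.properties. A sketch of the latter, with assumed
values:
# solrcore.properties for shard 3
shardID=3
# disable DIH on slaves
master=false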



** $deleteDocById and $deleteDocByQuery:
I wanted my delta import to delete documents in the same query it uses
for import - initially because I was using full-import to do
delta-import and there's no support for deletedPkQuery in full-import
(SOLR-1168), and once I got it to work I saw no reason to change it
when I switched to real delta-import.
I had a lot of trouble getting $deleteDocById to work due to lack of
documentation... I ended up having to look at the code in order to get
it to work. I finally realized that $deleteDocById has 2 bugs:

1. Not sure it's a bug, but it looks like one to me - if the query
returns any values other than $deleteDocById for a row you want
deleted, it deletes the row but also re-adds it with the rest of the
data, so in effect the row isn't deleted. In order to work around this
issue, you have to either make sure no data other than
$deleteDocById=<id> exists in rows to be deleted, or add $skipDoc='true'
(which I think is a little counter-intuitive, but was the better choice
in my case). My query looks something like:
SELECT u.id,
       u.name,
       ...
       IF(u.delete_flag > 0, u.id, NULL) AS $deleteDocById,
       IF(u.delete_flag > 0, 'true', NULL) AS $skipDoc
FROM users_tb u

2. $deleteDocById doesn't update the statistics of deleted documents.
This has two downsides: the obvious one is that you don't know if/how
many documents were deleted; the not-so-obvious one is that if your
import contains only deleted items, it won't be committed automatically
by DIH and you'll have to commit it manually.
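
A minimal sketch of such a manual commit, assuming the standard update
handler on the default port (adjust host/port/core path to your setup):
curl 'http://localhost:8983/solr/update?commit=true'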



That's all I have so far. I'm planning to add the $deleteDocById bugs
(and a patch) to JIRA as soon as I get a chance. Hope this helps
somebody, and I'm open to any suggestions,
Ephraim Ofir

Re: DIH - deleting documents, high performance (delta) imports, and passing parameters

Posted by Tommy Chheng <to...@gmail.com>.
Thanks for the section on "Passing "parameters" to DIH config".

I'm going to try the parameter passing to allow the DIH to index
different DBs based on the system environment (local dev machine or
production machine).

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com

