Posted to solr-user@lucene.apache.org by janne mattila <ja...@gmail.com> on 2012/03/27 16:24:36 UTC

dataImportHandler: delta query fetching data, not just ids?

It seems that delta import works in two steps: the first query fetches
the ids of the modified entries, then a second query fetches the actual
data for each of them.

        <entity name="item" pk="ID"
                query="select * from item"
                deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'"
                deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select description as features from feature where item_id='${item.ID}'">
            </entity>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" pk="ID"
                        query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                </entity>
            </entity>
        </entity>

I am aware that there's a workaround:
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

But still, to clarify, and to make sure I have up-to-date info on how Solr works:

1. Is it possible to fetch the modified data with a single SQL query
using deltaImportQuery, as in:

deltaImportQuery="select * from item where last_modified &gt;
'${dataimporter.last_index_time}'"?

2. If not - what's the reason delta import is implemented like it is?
Why split it in two queries? I would think having a single delta query
that fetches the data would be kind of an "obvious" design unless
there's something that calls for 2 separate queries...?

RE: dataImportHandler: delta query fetching data, not just ids?

Posted by "Dyer, James" <Ja...@ingrambook.com>.
You can also use $deleteDocById. If you also use $skipDoc, you can sometimes get the deletes on the same entity with a "command=full-import&clean=false" delta.  This may or may not be more convenient than what you're doing already.  See http://wiki.apache.org/solr/DataImportHandler#Special_Commands
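
As a rough illustration of the idea (not a config taken from this thread), the sketch below assumes the item table has a soft-delete flag column named "deleted"; that column is hypothetical. Rows flagged as deleted emit their id as $deleteDocById and set $skipDoc so they are not re-added. How to quote a column alias containing "$" depends on your database, so treat the quoting here as an assumption as well.

```xml
<!-- Sketch only: the "deleted" column is a hypothetical soft-delete flag. -->
<entity name="item" pk="ID"
        query="select i.*,
                      case when i.deleted = 1 then i.ID end as &quot;$deleteDocById&quot;,
                      case when i.deleted = 1 then 'true' else 'false' end as &quot;$skipDoc&quot;
               from item i
               where '${dataimporter.request.clean}' != 'false'
                  or i.last_modified &gt; '${dataimporter.last_index_time}'">
</entity>
```

Run with command=full-import&clean=false, this would pick up updates and deletes in one pass, provided deleted rows stay in the table and their last_modified is bumped when the flag is set.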

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: janne mattila [mailto:jannepostilistat@gmail.com] 
Sent: Thursday, March 29, 2012 12:45 AM
To: solr-user@lucene.apache.org
Subject: Re: dataImportHandler: delta query fetching data, not just ids?

> I'm not sure why deltas were implemented this way.  Possibly it was designed to behave like some of our object-to-relational libraries?  In any case, there are 2 ways to do deltas and you just have to take your pick based on what will work best for your situation.  I wouldn't consider the "command=full-import&clean=false" method a workaround but just a different way to tackle the same problem.

Yeah, I find the delta-update strategy a little strange as well.

Problem with command=full-import&clean=false is that you can't handle
removes nicely using that. If you use the actual delta-import and
deletedPkQuery for that, you run into problems with last_index_time
and miss either modifications or deletes.

I'm handling that by creating a different entity config for updates
(using command=full-import&clean=false) and deletes (using
command=delta-import) but it ends up being much dirtier than it should
be.

Re: dataImportHandler: delta query fetching data, not just ids?

Posted by janne mattila <ja...@gmail.com>.
> I'm not sure why deltas were implemented this way.  Possibly it was designed to behave like some of our object-to-relational libraries?  In any case, there are 2 ways to do deltas and you just have to take your pick based on what will work best for your situation.  I wouldn't consider the "command=full-import&clean=false" method a workaround but just a different way to tackle the same problem.

Yeah, I find the delta-update strategy a little strange as well.

Problem with command=full-import&clean=false is that you can't handle
removes nicely using that. If you use the actual delta-import and
deletedPkQuery for that, you run into problems with last_index_time
and miss either modifications or deletes.

I'm handling that by creating a different entity config for updates
(using command=full-import&clean=false) and deletes (using
command=delta-import) but it ends up being much dirtier than it should
be.

RE: dataImportHandler: delta query fetching data, not just ids?

Posted by "Dyer, James" <Ja...@ingrambook.com>.
Janne,

You're correct on how the delta import works.  You specify 3 queries:

- deletedPkQuery = query should return all "id"s (only) of items that were deleted since the last run.
- deltaQuery = query should return all "id"s (only) of items that were added/updated since the last run.
- deltaImportQuery = query should return full data for ONE row with "where id='${dih.delta.id}'".  
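
For reference, a minimal sketch of one entity carrying all three queries, based on the config in the original question. The deleted_item table and its deleted_at column are assumptions made for illustration, since the thread does not say how deletes are tracked; the other queries mirror the posted config.

```xml
<!-- Sketch only: deleted_item / deleted_at are hypothetical names. -->
<entity name="item" pk="ID"
        query="select * from item"
        deletedPkQuery="select id from deleted_item where deleted_at &gt; '${dataimporter.last_index_time}'"
        deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'">
</entity>
```

A config like this is the one used with command=delta-import.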

When DIH runs, it executes the first 2 queries and puts all of the returned ids in memory in a Set or something.  Then it does N selects on the deltaImportQuery, executing the query once per id.  This is maybe a good way to do it if you're doing very frequent deltas and you only expect a small number of changed documents per run, but I personally use the alternate way (as you noted):  http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
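
The alternate way looks roughly like the sketch below, adapted from the wiki page above to the item table in the original question. Treat it as an illustration under those assumptions rather than a drop-in config: the single "query" doubles as the delta query by checking the "clean" request parameter.

```xml
<!-- Sketch only: one query serves both full imports and deltas. -->
<entity name="item" pk="ID"
        query="select * from item
               where '${dataimporter.request.clean}' != 'false'
                  or last_modified &gt; '${dataimporter.last_index_time}'">
</entity>
```

With command=full-import&clean=false the first condition is false, so only rows modified since the last run are fetched, all in one select. With a plain full-import the first condition is true and every row is selected.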

I'm not sure why deltas were implemented this way.  Possibly it was designed to behave like some of our object-to-relational libraries?  In any case, there are 2 ways to do deltas and you just have to take your pick based on what will work best for your situation.  I wouldn't consider the "command=full-import&clean=false" method a workaround but just a different way to tackle the same problem.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


Re: dataImportHandler: delta query fetching data, not just ids?

Posted by janne mattila <ja...@gmail.com>.
How did it work before the SOLR-811 update? I don't understand. Did it
fetch delta data with two queries (1. gets ids, 2. gets data for each
id) or did it fetch all delta data with a single query?

On Tue, Mar 27, 2012 at 5:45 PM, Ahmet Arslan <io...@yahoo.com> wrote:
>> 2. If not - what's the reason delta import is implemented
>> like it is?
>> Why split it in two queries? I would think having a single
>> delta query
>> that fetches the data would be kind of an "obvious" design
>> unless
>> there's something that calls for 2 separate queries...?
>
> I think this is it? https://issues.apache.org/jira/browse/SOLR-811

Re: dataImportHandler: delta query fetching data, not just ids?

Posted by Ahmet Arslan <io...@yahoo.com>.
> 2. If not - what's the reason delta import is implemented
> like it is?
> Why split it in two queries? I would think having a single
> delta query
> that fetches the data would be kind of an "obvious" design
> unless
> there's something that calls for 2 separate queries...?

I think this is it? https://issues.apache.org/jira/browse/SOLR-811