Posted to solr-user@lucene.apache.org by Giovanni De Stefano <gi...@gmail.com> on 2009/03/17 16:18:03 UTC

Solr: delta-import, help needed

Hello all,

I have a table TEST in an Oracle DB with the following columns: URI
(varchar), CONTENT (varchar), CREATION_TIME (date).

The primary key both in the DB and Solr is URI.

Here is my data-config.xml:

<dataConfig>
  <dataSource
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@localhost:1521/XE"
    user="username"
    password="password"
  />
  <document name="Test">
    <entity
        name="test_item"
        pk="URI"
        query="select URI,CONTENT from TEST"
        deltaQuery="select URI,CONTENT from TEST where
                    TO_CHAR(CREATION_TIME,'YYYY-MM-DD HH:MI:SS') >
                    '${dataimporter.last_index_time}'"
    >
      <field column="URI" name="uri"/>
      <field column="CONTENT" name="content"/>
    </entity>
  </document>
</dataConfig>

The problem is that whenever I perform a delta-import, the index keeps growing
as if new documents were being added. In other words, I am not able to UPDATE
an existing document or REMOVE a document that is no longer in the DB.

What am I missing? How should I specify my deltaQuery?

Thanks a lot in advance!

Giovanni
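
One thing worth checking in the deltaQuery above: in Oracle the HH format element is a 12-hour clock, while ${dataimporter.last_index_time} is written in 24-hour form (yyyy-MM-dd HH:mm:ss), so comparing the two as strings can misorder afternoon timestamps. The following is only a sketch of an entity that compares real dates instead. It assumes a DIH version that supports deltaImportQuery (Solr 1.4 or later); deltaImportQuery and ${dataimporter.delta.URI} are not part of the original config.

```xml
<!-- Sketch only, not the poster's config.
     deltaQuery returns just the primary keys of rows changed since the last run;
     deltaImportQuery then fetches each changed row by its key.
     HH24 is used so the Oracle side matches DIH's 24-hour timestamp. -->
<entity
    name="test_item"
    pk="URI"
    query="select URI,CONTENT from TEST"
    deltaQuery="select URI from TEST
                where CREATION_TIME >
                      TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"
    deltaImportQuery="select URI,CONTENT from TEST
                      where URI = '${dataimporter.delta.URI}'">
  <field column="URI" name="uri"/>
  <field column="CONTENT" name="content"/>
</entity>
```

Note that a row is only picked up if CREATION_TIME changes when the row is updated, which matches what is reported later in the thread.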

Re: Solr: delta-import, help needed

Posted by Giovanni De Stefano <gi...@gmail.com>.
Hello Paul,

thank you for your feedback. I will ask for an expiration date to be added to
the DB and run a process that updates the index accordingly.

Cheers,
Giovanni


On 3/18/09, Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com> wrote:
>
> It is not possible to query Solr from DIH to find out which items
> have been deleted.
>
> You must maintain the IDs of deleted rows in the DB, or just flag the
> rows as deleted.
>
> --Noble

Re: Solr: delta-import, help needed

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
It is not possible to query Solr from DIH to find out which items
have been deleted.

You must maintain the IDs of deleted rows in the DB, or just flag the
rows as deleted.

--Noble



On Wed, Mar 18, 2009 at 2:46 PM, Giovanni De Stefano
<gi...@gmail.com> wrote:
> Hello Paul,
>
> thank you for your reply.
>
> The UPDATE in fact works fine: I only had to update the CREATION_TIME on the
> DB :-)
>
> Regarding the deletedPkQuery, I understand it has to return the primary keys
> that should be removed from the index (because they have been removed from
> the DB) but I don't have any "deleted" flag on the DB.
>
> Basically the deletedPkQuery should be something like "select URI
> *from_the_current_index* where URI is not in (select URI from TEST)"
>
> That is, it would return the primary keys that are currently in the index
> but no longer in the DB. Is this possible?
>
> I am no DB expert...so ANY tip is very welcome!
>
> Thanks,
> Giovanni



-- 
--Noble Paul
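
A sketch of the "flag them as deleted" approach described above. It assumes a hypothetical DELETED column ('Y'/'N') on TEST, which is not in the original table, and that rows are flagged rather than physically removed. The deltaQuery and deltaImportQuery attributes are likewise assumptions (deltaImportQuery needs Solr 1.4 or later).

```xml
<!-- Sketch only: DELETED is a hypothetical CHAR(1) column, not part of the
     original schema. deletedPkQuery returns the primary keys that DIH should
     remove from the index during a delta-import. -->
<entity
    name="test_item"
    pk="URI"
    query="select URI,CONTENT from TEST where DELETED = 'N'"
    deltaQuery="select URI from TEST
                where DELETED = 'N'
                  and CREATION_TIME >
                      TO_DATE('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"
    deltaImportQuery="select URI,CONTENT from TEST
                      where URI = '${dataimporter.delta.URI}'"
    deletedPkQuery="select URI from TEST where DELETED = 'Y'">
  <field column="URI" name="uri"/>
  <field column="CONTENT" name="content"/>
</entity>
```

The same shape should work with a separate table of deleted IDs: only the deletedPkQuery would change to select from that table.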

Re: Solr: delta-import, help needed

Posted by Giovanni De Stefano <gi...@gmail.com>.
Hello Paul,

thank you for your reply.

The UPDATE in fact works fine: I only had to update the CREATION_TIME on the
DB :-)

Regarding the deletedPkQuery, I understand it has to return the primary keys
that should be removed from the index (because they have been removed from
the DB) but I don't have any "deleted" flag on the DB.

Basically the deletedPkQuery should be something like "select URI
*from_the_current_index* where URI is not in (select URI from TEST)"

That is, it would return the primary keys that are currently in the index
but no longer in the DB. Is this possible?

I am no DB expert...so ANY tip is very welcome!

Thanks,
Giovanni


On 3/18/09, Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com> wrote:
>
> Are you sure your schema.xml has a <uniqueKey> field? It is needed to
> UPDATE docs.
>
> To remove deleted docs, you must have a deletedPkQuery attribute in the
> root entity.

Re: Solr: delta-import, help needed

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
Are you sure your schema.xml has a <uniqueKey> field? It is needed to
UPDATE docs.

To remove deleted docs, you must have a deletedPkQuery attribute in the
root entity.


-- 
--Noble Paul
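
For reference, the <uniqueKey> check mentioned above amounts to something like the following schema.xml fragment. Only the field names uri and content come from the data-config.xml in this thread; the field types and attributes shown are assumptions.

```xml
<!-- schema.xml fragment (sketch). With uri declared as the uniqueKey,
     re-importing a row with the same URI replaces the existing document
     instead of adding a second one. -->
<fields>
  <field name="uri" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text" indexed="true" stored="true"/>
</fields>

<uniqueKey>uri</uniqueKey>
```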