Posted to solr-user@lucene.apache.org by subinalex <al...@gmail.com> on 2017/04/04 13:40:31 UTC

Implementing DIH - Using a non-datetime change tracking column to Identify delta

Hi Experts,

Can we use a non-datetime column to identify delta rows in the deltaQuery of a
DIH configuration?
For example, in the deltaQuery below,

  deltaQuery="select ID from category where last_modified &gt;
'${dih.last_index_time}'"


the delta rows are picked when the last_modified datetime is greater than
the last index time.

I want to pick the deltas if a column value differs from the corresponding
column value in Solr.

 deltaQuery="select ID from category where md5hashcode &lt;&gt;
'indexedmd5hashcode'"


Can we implement this?

Re: Implementing DIH - Using a non-datetime change tracking column to Identify delta

Posted by subinalex <al...@gmail.com>.
Thanks, Shawn! :-)
I'll try this out.




Re: Implementing DIH - Using a non-datetime change tracking column to Identify delta

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/4/2017 7:40 AM, subinalex wrote:
> Can we use a non-datetime column to identify delta rows in deltaQuery for
> DIH configuration.
> Like for example in the below deltaQuery ,
>
>   deltaQuery="select ID from category where last_modified &gt;
> '${dih.last_index_time}'"
>
> the delta rows are picked when the last_modified datetime is greater than
> last index time.
>
> I want to pick the deltas if a column value differs from the corresponding
> column value in solr.
>
>  deltaQuery="select ID from category where md5hashcode &lt;&gt;
> 'indexedmd5hashcode'"

The only piece of information that DIH saves internally when it starts
an import is the current timestamp.

You can still do what you want, but you will need to be responsible for
keeping track of the information necessary to determine what's new in
your own program.  Solr will not do it for you.
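
A minimal sketch of that kind of external tracking, in Python and under stated assumptions: the program has already loaded one ID-to-hash map from the database and another from the index (how they are fetched is not shown), and it only needs to compare them. The function name find_delta and the sample data are hypothetical; nothing here is part of DIH or Solr.

```python
def find_delta(source_hashes, indexed_hashes):
    """Compare per-row hashes and return (changed_ids, deleted_ids).

    source_hashes:  {row_id: hash} as currently stored in the database
    indexed_hashes: {row_id: hash} as last written to the index
    """
    # A row is "changed" if it is new or its hash differs from the indexed one.
    changed = sorted(
        row_id
        for row_id, digest in source_hashes.items()
        if indexed_hashes.get(row_id) != digest
    )
    # A row is "deleted" if it is indexed but no longer present in the source.
    deleted = sorted(
        row_id for row_id in indexed_hashes if row_id not in source_hashes
    )
    return changed, deleted


# Hypothetical sample data: row 2 was modified, row 3 is new, row 4 is gone.
source = {1: "aaa", 2: "bbb", 3: "ccc"}
indexed = {1: "aaa", 2: "old", 4: "ddd"}
changed_ids, deleted_ids = find_delta(source, indexed)
print(changed_ids)  # [2, 3]
print(deleted_ids)  # [4]
```

The resulting ID lists could then be passed to an import request or to a separate indexing program; that step depends on the setup and is left out here.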

When you start an import, you can provide any arbitrary information with
URL parameters on the request that starts the import.  Here's my full
<entity> config for DIH from one of my Solr cores showing how to use
these parameters:

    <entity name="dataView" pk="did"
      query="
        SELECT * FROM ${dih.request.dataView}
        WHERE (
          (
            did &gt; ${dih.request.minDid}
            AND did &lt;= ${dih.request.maxDid}
          )
          ${dih.request.extraWhere}
        ) AND (crc32(did) % ${dih.request.numShards})
          IN (${dih.request.modVal})
        "
      deltaImportQuery="
        SELECT * FROM ${dih.request.dataView}
        WHERE (
          (
            did &gt; ${dih.request.minDid}
            AND did &lt;= ${dih.request.maxDid}
          )
          ${dih.request.extraWhere}
        ) AND (crc32(did) % ${dih.request.numShards})
          IN (${dih.request.modVal})
        "
      deltaQuery="SELECT 1 AS did"
    >
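
The last predicate in both queries, (crc32(did) % numShards) IN (modVal), assigns each ID to one bucket per shard. The Python sketch below illustrates the idea only. It assumes the database's crc32() is the standard CRC-32 computed over the decimal string form of the ID, which is an assumption about the database and not something stated in the config. The function names are hypothetical.

```python
import zlib


def shard_for(did, num_shards):
    """Bucket number for an ID.

    Assumption: the database applies standard CRC-32 to the decimal
    string form of the ID, which is what zlib.crc32 computes here.
    """
    return zlib.crc32(str(did).encode("ascii")) % num_shards


def select_ids(ids, num_shards, mod_vals):
    """Keep only the IDs whose bucket is listed in mod_vals."""
    return [did for did in ids if shard_for(did, num_shards) in mod_vals]


# With two shards, modVal=0 and modVal=1 split the IDs without overlap.
ids = list(range(1, 21))
shard0 = select_ids(ids, num_shards=2, mod_vals={0})
shard1 = select_ids(ids, num_shards=2, mod_vals={1})
print(sorted(shard0 + shard1) == ids)  # True
```

Passing several values in modVal, as the IN (...) clause allows, would let one core take more than one bucket.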

I am specifying many of the parts of the SQL query from URL parameters. 
For example, I will include a "dataView" parameter to choose at import
time what view or table will be queried.  The other parameters control
what ID values will be returned.
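
As a rough illustration of such a request, the Python sketch below builds an import URL that carries those parameters. The parameter names come from the config above; everything else is assumed for the example: the host, the core name "mycore", the handler path /dataimport (it depends on how the handler is registered in solrconfig.xml), and all of the values. The helper name build_import_url is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_import_url(solr_base, core, command, **request_params):
    """Build a DIH request URL.

    Every extra keyword argument becomes a URL parameter, which the
    config can then read as ${dih.request.<name>}.
    """
    query = {"command": command, **request_params}
    # Assumption: the handler is registered at /dataimport for this core.
    return f"{solr_base}/{core}/dataimport?{urlencode(query)}"


# Hypothetical values for the parameters used in the entity config.
url = build_import_url(
    "http://localhost:8983/solr",
    "mycore",
    "full-import",
    dataView="category_view",
    minDid=0,
    maxDid=50000,
    extraWhere="",
    numShards=2,
    modVal="0",
)
print(url)

# Decode the query string again to check what Solr would receive.
print(parse_qs(urlsplit(url).query, keep_blank_values=True))
```

Because urlencode escapes the values, a non-empty extraWhere such as a SQL fragment with spaces is transmitted safely in the URL. The fragment is still substituted into the SQL as-is, so it should only ever come from a trusted program.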

The query and deltaImportQuery attributes are identical.  At one time,
all my indexing was done with DIH, so I used these parameters to limit
what was done by the delta-import runs.  Currently, DIH is only used for
full rebuilds; I have a SolrJ program for incremental changes.

Thanks,
Shawn