Posted to users@nifi.apache.org by Faisal Durrani <te...@gmail.com> on 2018/06/11 01:43:13 UTC

Row counts - data lake

Is there a recommended way to ensure that the row counts from tables in the
source (Oracle) are consistent with those of the target tables in HBase
(the data lake)? We are using NiFi, which receives the GoldenGate messages
and then, using different processors, stores the transactions in HBase, so
essentially the tables in HBase should be in sync with the tables in Oracle
at all times. I am interested in knowing how teams ensure and prove this.
Do they take row counts from source and target every day, match them, and
say that everything is in sync? I used the counter option in NiFi, which
maintained a count of the records received for each table, but I guess that
is not an optimal way to do it.

Re: Row counts - data lake

Posted by Faisal Durrani <te...@gmail.com>.
Hi Boris,
The way we are exposing the data to the end users is through a Hive view
on top of the HBase tables. I guess a HiveQL script that does a
select count(*) on every table should do the job. I'm pretty sure it will
be slow as well, and there will be a lot of manual work when reconciling it
with the output from the Oracle side, but I don't see any other option.
Maybe we'll do this once a month or fortnightly, and we'll trust the error
notifications from NiFi to establish whether everything is working fine or
not. I agree with your comment that the counts won't match. We are using
Kafka for the data stream coming from GG, and I was wondering if there is a
way to tell how many messages have been read by the consumer and how many
are still waiting to be read. That would let us at least justify the
difference between GG and HBase, but I guess that is a question for
another topic.
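
A minimal sketch of such a HiveQL count script, assuming hypothetical view
names under a "lake" database (one SELECT per Hive view over an HBase
table would be generated in practice):

  -- counts every Hive view over the HBase tables in a single query
  SELECT 'ORDERS'    AS table_name, COUNT(*) AS hbase_row_cnt FROM lake.orders_view
  UNION ALL
  SELECT 'CUSTOMERS' AS table_name, COUNT(*) AS hbase_row_cnt FROM lake.customers_view;

As for the consumer question, the stock Kafka tooling reports, per
partition, the consumer group's committed offset and the log-end offset;
the LAG column is the number of messages not yet read. A sketch, assuming
the ConsumeKafka processors commit under a group id such as
"nifi-gg-consumer" (made up here):

  kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
    --describe --group nifi-gg-consumer
  # LAG = log-end offset - current offset, i.e. messages waiting to be read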


On Tue, Jun 12, 2018 at 9:59 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> I would be curious to hear how you end up doing it, Faisal. In my
> experience, taking row counts from HBase tables was painfully slow, and
> this was one of the reasons we decided to move to Apache Kudu. We tried
> five different ways of taking row counts with HBase, and it was still
> painfully slow.
>
> Another issue you will encounter is that the counts will never match,
> since GG delivers changes in near real time. I heard about one creative
> approach that uses the Oracle Flashback feature to compare counts at a
> precise moment in time.
>
> Boris
>
> On Mon, Jun 11, 2018 at 10:36 PM Faisal Durrani <te...@gmail.com>
> wrote:
>
>> Hi Andrew,
>>
>> Thank you for your suggestion. We are using the timestamp property of the
>> PutHBase processor to enforce the ordering. This timestamp is extracted
>> from the GoldenGate message as metadata. I agree with your approach of
>> creating an End of Day file, which seems to be the most logical way of
>> doing the reconciliation.
>>
>> Thank you for the help.
>> Faisal
>>
>> On Mon, Jun 11, 2018 at 6:07 PM Andrew Psaltis <ps...@gmail.com>
>> wrote:
>>
>>> Hi Faisal,
>>> OK, so a single partition in Kafka, then. I assume that you are using
>>> the EnforceOrder processor [1] or another means to ensure the records
>>> are in order, so that updates happen after inserts and so forth. In that
>>> case, assuming everything is in order, the simplest approach is just to
>>> do a count at the end of the day, whatever that means for your business.
>>> However, oftentimes a simple count is not good enough, as you actually
>>> want to know that the tables reconcile. Often, in this case, there is a
>>> way to produce an End of Day file or record for the source system. This
>>> will usually contain things like the record count, along with sums of
>>> various columns, and possibly the query that is used to produce the sums
>>> or another validation. With this, you can then ensure that the tables do
>>> reconcile.
>>>
>>> Do you have the ability to create an "End of Day" file or something
>>> along those lines that you can use?
>>>
>>>
>>> [1]
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.EnforceOrder/index.html
>>>
>>> Thanks,
>>> Andrew
>>>
>>>
>>>
>>> On Mon, Jun 11, 2018 at 11:11 AM Faisal Durrani <te...@gmail.com>
>>> wrote:
>>>
>>>> Hi Andrew,
>>>> We are receiving the GoldenGate transactions from Kafka, which are
>>>> consumed in NiFi through the ConsumeKafka processor. Our data flow then
>>>> reduces the GoldenGate JSON message and sends the data to the target
>>>> table in HBase using the PutHBaseJSON processor.
>>>>
>>>> Thanks,
>>>> Faisal
>>>>
>>>> On Mon, Jun 11, 2018 at 12:01 PM Andrew Psaltis <
>>>> psaltis.andrew@gmail.com> wrote:
>>>>
>>>>> Hi Faisal,
>>>>> There are various ways this can be handled, but it is going to depend
>>>>> on how you are receiving data from Oracle via GoldenGate. Are you using
>>>>> the HBase Handler, the HDFS Handler, a Flat File, Kafka, or another
>>>>> means?
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Is there a recommended way to ensure that the row counts from tables
>>>>>> in the source (Oracle) are consistent with those of the target tables
>>>>>> in HBase (the data lake)? We are using NiFi, which receives the
>>>>>> GoldenGate messages and then, using different processors, stores the
>>>>>> transactions in HBase, so essentially the tables in HBase should be in
>>>>>> sync with the tables in Oracle at all times. I am interested in knowing
>>>>>> how teams ensure and prove this. Do they take row counts from source
>>>>>> and target every day, match them, and say that everything is in sync?
>>>>>> I used the counter option in NiFi, which maintained a count of the
>>>>>> records received for each table, but I guess that is not an optimal
>>>>>> way to do it.
>>>>>>
>>>>>

Re: Row counts - data lake

Posted by Boris Tyukin <bo...@boristyukin.com>.
I would be curious to hear how you end up doing it, Faisal. In my
experience, taking row counts from HBase tables was painfully slow, and
this was one of the reasons we decided to move to Apache Kudu. We tried
five different ways of taking row counts with HBase, and it was still
painfully slow.

Another issue you will encounter is that the counts will never match,
since GG delivers changes in near real time. I heard about one creative
approach that uses the Oracle Flashback feature to compare counts at a
precise moment in time.

Boris
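
A hedged sketch of that Flashback idea: an Oracle flashback query can count
the source table as of a fixed timestamp (or SCN), so the HBase count taken
for the same cut-off is compared against a stable number instead of a
moving target. The table name and timestamp below are made up:

  -- Oracle (source) row count as of a fixed point in time
  SELECT COUNT(*) AS src_row_cnt
  FROM   sales.orders
  AS OF  TIMESTAMP TO_TIMESTAMP('2018-06-11 00:00:00', 'YYYY-MM-DD HH24:MI:SS');

On the HBase side, one of the stock (and, as noted, slow) ways to count is
the bundled MapReduce RowCounter job, e.g.:

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'lake:orders'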

On Mon, Jun 11, 2018 at 10:36 PM Faisal Durrani <te...@gmail.com> wrote:

> Hi Andrew,
>
> Thank you for your suggestion. We are using the timestamp property of the
> PutHBase processor to enforce the ordering. This timestamp is extracted
> from the GoldenGate message as metadata. I agree with your approach of
> creating an End of Day file, which seems to be the most logical way of
> doing the reconciliation.
>
> Thank you for the help.
> Faisal
>
> On Mon, Jun 11, 2018 at 6:07 PM Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Faisal,
>> OK, so a single partition in Kafka, then. I assume that you are using
>> the EnforceOrder processor [1] or another means to ensure the records
>> are in order, so that updates happen after inserts and so forth. In that
>> case, assuming everything is in order, the simplest approach is just to
>> do a count at the end of the day, whatever that means for your business.
>> However, oftentimes a simple count is not good enough, as you actually
>> want to know that the tables reconcile. Often, in this case, there is a
>> way to produce an End of Day file or record for the source system. This
>> will usually contain things like the record count, along with sums of
>> various columns, and possibly the query that is used to produce the sums
>> or another validation. With this, you can then ensure that the tables do
>> reconcile.
>>
>> Do you have the ability to create an "End of Day" file or something along
>> those lines that you can use?
>>
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.EnforceOrder/index.html
>>
>> Thanks,
>> Andrew
>>
>>
>>
>> On Mon, Jun 11, 2018 at 11:11 AM Faisal Durrani <te...@gmail.com>
>> wrote:
>>
>>> Hi Andrew,
>>> We are receiving the GoldenGate transactions from Kafka, which are
>>> consumed in NiFi through the ConsumeKafka processor. Our data flow then
>>> reduces the GoldenGate JSON message and sends the data to the target
>>> table in HBase using the PutHBaseJSON processor.
>>>
>>> Thanks,
>>> Faisal
>>>
>>> On Mon, Jun 11, 2018 at 12:01 PM Andrew Psaltis <
>>> psaltis.andrew@gmail.com> wrote:
>>>
>>>> Hi Faisal,
>>>> There are various ways this can be handled, but it is going to depend
>>>> on how you are receiving data from Oracle via GoldenGate. Are you using
>>>> the HBase Handler, the HDFS Handler, a Flat File, Kafka, or another
>>>> means?
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>> On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com>
>>>> wrote:
>>>>
>>>>> Is there a recommended way to ensure that the row counts from tables
>>>>> in the source (Oracle) are consistent with those of the target tables
>>>>> in HBase (the data lake)? We are using NiFi, which receives the
>>>>> GoldenGate messages and then, using different processors, stores the
>>>>> transactions in HBase, so essentially the tables in HBase should be in
>>>>> sync with the tables in Oracle at all times. I am interested in knowing
>>>>> how teams ensure and prove this. Do they take row counts from source
>>>>> and target every day, match them, and say that everything is in sync?
>>>>> I used the counter option in NiFi, which maintained a count of the
>>>>> records received for each table, but I guess that is not an optimal
>>>>> way to do it.
>>>>>
>>>>

Re: Row counts - data lake

Posted by Faisal Durrani <te...@gmail.com>.
Hi Andrew,

Thank you for your suggestion. We are using the timestamp property of the
PutHBase processor to enforce the ordering. This timestamp is extracted
from the GoldenGate message as metadata. I agree with your approach of
creating an End of Day file, which seems to be the most logical way of
doing the reconciliation.

Thank you for the help.
Faisal
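
For anyone setting up the same flow: assuming the GoldenGate operation
timestamp has been copied into a FlowFile attribute (the attribute name and
date format below are made up and must match the actual op_ts string), the
timestamp property on the PutHBase processor can be filled with an
Expression Language expression that converts it to epoch milliseconds:

  ${gg.op_ts:toDate("yyyy-MM-dd HH:mm:ss.SSS"):toNumber()}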

On Mon, Jun 11, 2018 at 6:07 PM Andrew Psaltis <ps...@gmail.com>
wrote:

> Hi Faisal,
> OK, so a single partition in Kafka, then. I assume that you are using
> the EnforceOrder processor [1] or another means to ensure the records
> are in order, so that updates happen after inserts and so forth. In that
> case, assuming everything is in order, the simplest approach is just to
> do a count at the end of the day, whatever that means for your business.
> However, oftentimes a simple count is not good enough, as you actually
> want to know that the tables reconcile. Often, in this case, there is a
> way to produce an End of Day file or record for the source system. This
> will usually contain things like the record count, along with sums of
> various columns, and possibly the query that is used to produce the sums
> or another validation. With this, you can then ensure that the tables do
> reconcile.
>
> Do you have the ability to create an "End of Day" file or something along
> those lines that you can use?
>
>
> [1]
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.EnforceOrder/index.html
>
> Thanks,
> Andrew
>
>
>
> On Mon, Jun 11, 2018 at 11:11 AM Faisal Durrani <te...@gmail.com>
> wrote:
>
>> Hi Andrew,
>> We are receiving the GoldenGate transactions from Kafka, which are
>> consumed in NiFi through the ConsumeKafka processor. Our data flow then
>> reduces the GoldenGate JSON message and sends the data to the target
>> table in HBase using the PutHBaseJSON processor.
>>
>> Thanks,
>> Faisal
>>
>> On Mon, Jun 11, 2018 at 12:01 PM Andrew Psaltis <ps...@gmail.com>
>> wrote:
>>
>>> Hi Faisal,
>>> There are various ways this can be handled, but it is going to depend
>>> on how you are receiving data from Oracle via GoldenGate. Are you using
>>> the HBase Handler, the HDFS Handler, a Flat File, Kafka, or another
>>> means?
>>>
>>> Thanks,
>>> Andrew
>>>
>>> On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com>
>>> wrote:
>>>
>>>> Is there a recommended way to ensure that the row counts from tables
>>>> in the source (Oracle) are consistent with those of the target tables
>>>> in HBase (the data lake)? We are using NiFi, which receives the
>>>> GoldenGate messages and then, using different processors, stores the
>>>> transactions in HBase, so essentially the tables in HBase should be in
>>>> sync with the tables in Oracle at all times. I am interested in knowing
>>>> how teams ensure and prove this. Do they take row counts from source
>>>> and target every day, match them, and say that everything is in sync?
>>>> I used the counter option in NiFi, which maintained a count of the
>>>> records received for each table, but I guess that is not an optimal
>>>> way to do it.
>>>>
>>>

Re: Row counts - data lake

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Faisal,
OK, so a single partition in Kafka, then. I assume that you are using
the EnforceOrder processor [1] or another means to ensure the records
are in order, so that updates happen after inserts and so forth. In that
case, assuming everything is in order, the simplest approach is just to
do a count at the end of the day, whatever that means for your business.
However, oftentimes a simple count is not good enough, as you actually
want to know that the tables reconcile. Often, in this case, there is a
way to produce an End of Day file or record for the source system. This
will usually contain things like the record count, along with sums of
various columns, and possibly the query that is used to produce the sums
or another validation. With this, you can then ensure that the tables do
reconcile.

Do you have the ability to create an "End of Day" file or something along
those lines that you can use?


[1]
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.EnforceOrder/index.html

Thanks,
Andrew
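
An illustrative pair of control-total queries of the kind such an End of
Day record would be built from, one against the source and one against the
Hive view over the HBase table (table and column names here are made up);
both the row count and the column sum are expected to match for the
cut-off:

  -- source (Oracle)
  SELECT COUNT(*)          AS row_cnt,
         SUM(order_amount) AS amount_sum
  FROM   sales.orders;

  -- target (Hive view over the HBase table)
  SELECT COUNT(*)          AS row_cnt,
         SUM(order_amount) AS amount_sum
  FROM   lake.orders_view;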



On Mon, Jun 11, 2018 at 11:11 AM Faisal Durrani <te...@gmail.com> wrote:

> Hi Andrew,
> We are receiving the GoldenGate transactions from Kafka, which are
> consumed in NiFi through the ConsumeKafka processor. Our data flow then
> reduces the GoldenGate JSON message and sends the data to the target
> table in HBase using the PutHBaseJSON processor.
>
> Thanks,
> Faisal
>
> On Mon, Jun 11, 2018 at 12:01 PM Andrew Psaltis <ps...@gmail.com>
> wrote:
>
>> Hi Faisal,
>> There are various ways this can be handled, but it is going to depend
>> on how you are receiving data from Oracle via GoldenGate. Are you using
>> the HBase Handler, the HDFS Handler, a Flat File, Kafka, or another
>> means?
>>
>> Thanks,
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com>
>> wrote:
>>
>>> Is there a recommended way to ensure that the row counts from tables
>>> in the source (Oracle) are consistent with those of the target tables
>>> in HBase (the data lake)? We are using NiFi, which receives the
>>> GoldenGate messages and then, using different processors, stores the
>>> transactions in HBase, so essentially the tables in HBase should be in
>>> sync with the tables in Oracle at all times. I am interested in knowing
>>> how teams ensure and prove this. Do they take row counts from source
>>> and target every day, match them, and say that everything is in sync?
>>> I used the counter option in NiFi, which maintained a count of the
>>> records received for each table, but I guess that is not an optimal
>>> way to do it.
>>>
>>

Re: Row counts - data lake

Posted by Faisal Durrani <te...@gmail.com>.
Hi Andrew,
We are receiving the GoldenGate transactions from Kafka, which are
consumed in NiFi through the ConsumeKafka processor. Our data flow then
reduces the GoldenGate JSON message and sends the data to the target
table in HBase using the PutHBaseJSON processor.

Thanks,
Faisal
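
For context, a GoldenGate Kafka handler operation message looks roughly
like the record below, depending on how the handler is configured (the
table, columns, and values here are invented); the flow described above
strips this envelope and hands the key plus the "after" image to
PutHBaseJSON:

  {
    "table": "SALES.ORDERS",
    "op_type": "U",
    "op_ts": "2018-06-11 01:43:13.000000",
    "current_ts": "2018-06-11T01:43:14.120000",
    "pos": "00000000020030391582",
    "after": {
      "ORDER_ID": 1042,
      "ORDER_AMOUNT": 99.5,
      "STATUS": "SHIPPED"
    }
  }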

On Mon, Jun 11, 2018 at 12:01 PM Andrew Psaltis <ps...@gmail.com>
wrote:

> Hi Faisal,
> There are various ways this can be handled, but it is going to depend
> on how you are receiving data from Oracle via GoldenGate. Are you using
> the HBase Handler, the HDFS Handler, a Flat File, Kafka, or another
> means?
>
> Thanks,
> Andrew
>
> On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com>
> wrote:
>
>> Is there a recommended way to ensure that the row counts from tables
>> in the source (Oracle) are consistent with those of the target tables
>> in HBase (the data lake)? We are using NiFi, which receives the
>> GoldenGate messages and then, using different processors, stores the
>> transactions in HBase, so essentially the tables in HBase should be in
>> sync with the tables in Oracle at all times. I am interested in knowing
>> how teams ensure and prove this. Do they take row counts from source
>> and target every day, match them, and say that everything is in sync?
>> I used the counter option in NiFi, which maintained a count of the
>> records received for each table, but I guess that is not an optimal
>> way to do it.
>>
>

Re: Row counts - data lake

Posted by Andrew Psaltis <ps...@gmail.com>.
Hi Faisal,
There are various ways this can be handled, but it is going to depend on
how you are receiving data from Oracle via GoldenGate. Are you using the
HBase Handler, the HDFS Handler, a Flat File, Kafka, or another means?

Thanks,
Andrew

On Mon, Jun 11, 2018 at 9:43 AM Faisal Durrani <te...@gmail.com> wrote:

> Is there a recommended way to ensure that the row counts from tables
> in the source (Oracle) are consistent with those of the target tables
> in HBase (the data lake)? We are using NiFi, which receives the
> GoldenGate messages and then, using different processors, stores the
> transactions in HBase, so essentially the tables in HBase should be in
> sync with the tables in Oracle at all times. I am interested in knowing
> how teams ensure and prove this. Do they take row counts from source
> and target every day, match them, and say that everything is in sync?
> I used the counter option in NiFi, which maintained a count of the
> records received for each table, but I guess that is not an optimal
> way to do it.
>