Posted to users@griffin.apache.org by Vikram Jain <vi...@enquero.com> on 2019/02/14 12:31:14 UTC

Griffin Job picking up already processed data

Hi,
I have a Hive table partitioned by date and hour. Data arrives in the table every hour or two. I am using Griffin to perform certain profiling and accuracy checks on the data.
However, I want only the new data accumulated since the last job run to be processed. The job is scheduled to run every hour.
Right now, Griffin is picking up all the data present in the Hive table (the new data accumulated in the past hour plus the past data already processed by previous Griffin job runs). I believe there should be some configuration while creating a measure and job to avoid this scenario and process only the data acquired in the last hour. I have tried various permutations and combinations but have not been successful.
Can someone please tell me the list of steps and configurations in the UI that I need to follow in order to achieve the desired result?
Any help is much appreciated.

Regards,
Vikram

Re: Griffin Job picking up already processed data

Posted by Nick Sokolov <ch...@gmail.com>.
Hello Vikram,

Can you please clarify -- is the question about how to run the job only on
data in some time interval (for example, from one hour ago up to now), or
about how to guarantee that the job reads only records that are new since
the last DQ job run (even if the job ran twice in the same hour)? If it is
the former, you can refer to the Partition Configuration section
<https://github.com/apache/griffin/blob/master/griffin-doc/ui/user-guide.md>
of the user guide, or the "where" field example in the API guide
<https://github.com/apache/griffin/blob/master/griffin-doc/service/api-guide.md#add-measure>.
If it is the latter, there is no mechanism in Griffin to track which
records have or have not been processed (at least in batch mode); doing
something like that would require custom code.
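
For the first case, the Hive data connector of a measure can carry a
"where" clause that restricts each run to the partitions matching the
job's trigger time. A minimal sketch follows -- the database, table name,
and time-placeholder syntax here are illustrative, so check the
add-measure example in the API guide for the exact fields your Griffin
version expects:

```json
{
  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "type": "HIVE",
          "version": "1.2",
          "config": {
            "database": "default",
            "table.name": "my_table",
            "where": "dt=#YYYYMMdd# AND hour=#HH#"
          }
        }
      ]
    }
  ]
}
```

With a clause like that, an hourly job would only scan the single
date/hour partition corresponding to its scheduled trigger time instead
of the whole table.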
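
As a rough illustration of what such custom code could look like, the
sketch below keeps a small state file of already-processed (date, hour)
partitions and filters a partition list down to the unseen ones before a
run is submitted. This is not part of Griffin; the partition tuple shape
and the state-file name are assumptions for the example:

```python
# Hypothetical helper: remember which Hive partitions a DQ run has
# already covered, so repeated runs skip previously processed data.
import json
from pathlib import Path

STATE_FILE = Path("processed_partitions.json")


def load_processed():
    """Read the set of (dt, hour) tuples recorded by earlier runs."""
    if STATE_FILE.exists():
        return set(tuple(p) for p in json.loads(STATE_FILE.read_text()))
    return set()


def save_processed(processed):
    """Persist the processed-partition set back to the state file."""
    STATE_FILE.write_text(json.dumps(sorted(processed)))


def new_partitions(all_partitions):
    """Return only the partitions not seen by any previous run."""
    processed = load_processed()
    return [p for p in all_partitions if tuple(p) not in processed]


def mark_processed(partitions):
    """Record partitions as done once the DQ job has succeeded on them."""
    processed = load_processed()
    processed.update(tuple(p) for p in partitions)
    save_processed(processed)
```

A scheduler wrapper would call new_partitions() with the table's current
partition list, trigger the Griffin job only for the result, and call
mark_processed() after the job succeeds.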
