Posted to dev@airflow.apache.org by Tobias Feldhaus <To...@localsearch.ch> on 2017/09/27 15:46:54 UTC

Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Hi,


I am tracing a bug in one of our data pipelines and have narrowed it down to a small number of events missing from a table (using Airflow 1.8.2).
When I ran the query that Airflow executes myself, interactively, I saw the missing entry. When Airflow executed the same query and wrote the results to a partitioned table in BQ, the entry was missing from the destination table.
I’ve now tried different scenarios several times, and the only explanation I can come up with is that using partitioned tables _might_ not be fully supported, or that there is some weird bug in the bigquery-python implementation.

When I delete the table, recreate it, and reload the complete date range with Airflow, the data is still missing. When reloading a single day, it is also missing. A Python script I created to execute the exact same query works as expected.
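For reference, a day-partitioned destination is addressed with a partition decorator in the BigQuery insert-job request body. A minimal sketch of such a body (project, dataset, and table names are hypothetical placeholders, and whether the operator builds exactly this is an assumption):

```python
# Sketch of a BigQuery jobs.insert request body that writes query results
# into a single day's partition. Appending "$YYYYMMDD" to the table id
# addresses exactly that partition of a day-partitioned table.
def build_query_job(sql, project, dataset, table, partition_date):
    """Return an insert-job body targeting one partition (date as YYYYMMDD)."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    # "daily$20170824" == the 2017-08-24 partition of "daily"
                    "tableId": "%s$%s" % (table, partition_date),
                },
                # Replace only the addressed partition when rerunning a day
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }

job = build_query_job("SELECT 1", "my-project", "events", "daily", "20170824")
```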

Any advice on how to track this down further? Is this a known issue?

Best,
Tobias



Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
I’ve tried clearing the DAG via the CLI with “airflow clear” – without any
effect. The DAG is still not executed for the mentioned days when the
“automatic” backfilling kicks in; there is simply no DAG run scheduled.

Only after I cleared the DAG via the CLI, disabled it in the web interface, and
ran “airflow backfill” was the month finally scheduled and processed correctly.
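For the record, the rough shape of the commands (the DAG id “my_bq_dag” is a placeholder, flags as in the 1.8 CLI):

```shell
# Wipe the task instances for the range, then re-run the month explicitly.
airflow clear -s 2017-08-01 -e 2017-08-31 my_bq_dag
airflow backfill -s 2017-08-01 -e 2017-08-31 my_bq_dag
```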

As this seems to be somehow related to the date (yesterday, different days were
missing than today – maybe caused by the scheduler?), I was thinking about
opening an issue for this, but I don’t yet have enough information about
what could cause this behaviour.



Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
I think I found the issue. I was rerunning everything again and found that
the respective date was now there, but another date was missing. After some
investigation I stumbled upon this:

Airflow simply didn’t process some days of the month (August) that I was
reprocessing. Yesterday it didn’t process August 24th, and now it was
missing August 17th and 18th!

[Screenshot of the Airflow interface showing the run for 2017-08-16; the runs
for the 17th and 18th are missing and the 19th is the next one:
https://puu.sh/xKRKH/0cc9bc01d6.png]

[Screenshot of the Airflow interface showing the run for 2017-08-19:
https://puu.sh/xKRL9/0ac26fb476.png]

What could be the reason for this? Did the clearing command via the
web interface maybe fail? Why are the days no longer shown in the web
interface at all?




Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
I am also skeptical, but I want to be sure – the next thing I would do is step through with a debugger to see if the query gets altered in any way before it’s sent out. Is it possible to step through with pdb when triggering via “airflow run”?
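(One way that should work, sketched with hypothetical dag/task ids: “airflow test” executes the task in the current process rather than through the scheduler, which makes pdb stepping practical.)

```shell
# Run the operator in-process under pdb, so breakpoints fire before the
# query is sent. Dag id, task id, and execution date are placeholders.
python -m pdb "$(command -v airflow)" test my_bq_dag load_events 2017-08-24
```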




Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Chris Riccomini <cr...@apache.org>.
I am highly skeptical that it's the library.


Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
This was exactly my point. Before I dig deeper, I want to build a very minimal PythonOperator that uses the new library, as I am currently comparing apples with oranges (same query, same data, different libraries). It really puzzles me, though, how a different library could yield different (read as: partially missing) data when its job is just to execute a query, not to pull and transform it.
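A library-agnostic sketch of that check: run the same query through both clients, collect the rows, and diff the row sets. The tuple row format below is an assumption for illustration.

```python
# Diff the results returned by two BigQuery client libraries for the
# same query: any row in the reference but not the candidate is suspect.
def missing_rows(reference, candidate):
    """Return rows present in the reference result but absent from the candidate."""
    return sorted(set(reference) - set(candidate))

# Toy illustration: one event absent from the candidate result set.
ref = [("2017-08-24", "evt-1"), ("2017-08-24", "evt-2")]
cand = [("2017-08-24", "evt-1")]
diff = missing_rows(ref, cand)
```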




Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Chris Riccomini <cr...@apache.org>.
Interesting. Just saw:

https://github.com/google/google-api-python-client

> This client library is supported but in maintenance mode only. We are
fixing necessary bugs and adding essential features to ensure this library
continues to meet your needs for accessing Google APIs. Non-critical issues
will be closed. Any issue may be reopened if it is causing ongoing problems.

Looks like we might want to migrate at some point. It'll be a big change.
<https://github.com/google/google-api-python-client#about>

On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini <cr...@apache.org>
wrote:

> AFAIK, google-api-python-client is not in maintenance mode. In fact, I
> believe the idiomatic Python library (google-cloud-python) is built on top
> of google-api-python-client. I have spoken with several Google Cloud PMs
> who have pointed me at google-api-python-client as the canonical library to
> use, and the one that receives updates for new products first (before
> google-cloud-python).
>
> On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
> Tobias.Feldhaus@localsearch.ch> wrote:
>
>> Sounds like a possible explanation; however, to avoid hitting this
>> problem I’ve deleted all the tables before rerunning. I think it might
>> have to do with the library. Airflow uses google-api-python-client, which
>> is in maintenance mode, and Google suggests switching to
>> google-cloud-python. I will write a PythonOperator DAG tomorrow and check
>> DAG against DAG to see if the library could be the problem.
>>
>> On 27.09.2017, 19:15, "Chris Riccomini" <cr...@apache.org> wrote:
>>
>>     Is it possible that you were getting a cache hit with the BQ operator?
>>
    >>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
>>
>>     The operator does not currently expose this flag, and I couldn't find
>>     whether the cache defaults to on or off for insert-job API.
>>
>>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
>>     Tobias.Feldhaus@localsearch.ch> wrote:
>>
>>     > I’ve created a table with only the missing value in the exact same
>>     > partition, and then it’s going through. Could it be that the volume
>> of the
>>     > data plays a role or the client libraries maybe?
>>     >
>>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <
>> Tobias.Feldhaus@localsearch.ch>
>>     > wrote:
>>     >
>>     >     Hi,
>>     >
>>     >
>>     >     I am tracing a bug in one of our data pipelines and I narrowed
>> it down
>>     > to some small number of events not being in a table (using Airflow
>> 1.8.2).
>>     >     After running the query myself that airflow executed
>> interactively, I
>>     > saw the missing entry. When airflow executed the same query, and
>> writes the
>>     > results to a partitioned table in BQ it was missing in that
>> destination
>>     > table.
>>     >     I’ve tried different scenarios now several times and the only
>>     > explanation or difference I can come up with, is that airflow
>> _might_ be
>>     > that using partitioned tables is not fully supported or there is
>> some weird
>>     > bug in the bigquery-python implementation.
>>     >
>>     >     When deleting the table and recreating it and reloading the
>> complete
>>     > date with airflow the data is still missing. When reloading a
>> single day,
>>     > it is also missing. I’ve created a python script to execute the
>> exact same
>>     > query and it works as expected.
>>     >
>>     >     Any advice how to track this down further? Is this a known
>> issue?
>>     >
>>     >     Best,
>>     >     Tobias
>>     >
>>     >
>>     >
>>     >
>>     >
>>
>>
>>
>

Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Chris Riccomini <cr...@apache.org>.
AFAIK, google-api-python-client is not in maintenance mode. In fact, I
believe the idiomatic Python library (google-cloud-python) is built on top
of google-api-python-client. I have spoken with several Google cloud PMs who have pointed me
at google-api-python-client as the canonical library to use, and the one
that receives updates for new products first (before google-cloud-python).

On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
Tobias.Feldhaus@localsearch.ch> wrote:

> Sounds like a possible solution, however to avoid hitting this problem
> I’ve deleted all the tables before rerunning stuff. I think it might have
> to do with the library. Airflow uses google-api-python-client which is in
> maintenance mode and Google suggests switching to google-cloud-python. I
> will write a PythonOperator DAG tomorrow and will check DAG against DAG
> then to see if the library could be the problem.
>
> On 27.09.2017, 19:15, "Chris Riccomini" <cr...@apache.org> wrote:
>
>     Is it possible that you were getting a cache hit with the BQ operator?
>
>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
>
>     The operator does not currently expose this flag, and I couldn't find
>     whether the cache defaults to on or off for insert-job API.
>
>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
>     Tobias.Feldhaus@localsearch.ch> wrote:
>
>     > I’ve created a table with only the missing value in the exact same
>     > partition, and then it’s going through. Could it be that the volume
> of the
>     > data plays a role or the client libraries maybe?
>     >
>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <
> Tobias.Feldhaus@localsearch.ch>
>     > wrote:
>     >
>     >     Hi,
>     >
>     >
>     >     I am tracing a bug in one of our data pipelines and I narrowed
> it down
>     > to some small number of events not being in a table (using Airflow
> 1.8.2).
>     >     After running the query myself that airflow executed
> interactively, I
>     > saw the missing entry. When airflow executed the same query, and
> writes the
>     > results to a partitioned table in BQ it was missing in that
> destination
>     > table.
>     >     I’ve tried different scenarios now several times and the only
>     > explanation or difference I can come up with, is that airflow
> _might_ be
>     > that using partitioned tables is not fully supported or there is
> some weird
>     > bug in the bigquery-python implementation.
>     >
>     >     When deleting the table and recreating it and reloading the
> complete
>     > date with airflow the data is still missing. When reloading a single
> day,
>     > it is also missing. I’ve created a python script to execute the
> exact same
>     > query and it works as expected.
>     >
>     >     Any advice how to track this down further? Is this a known issue?
>     >
>     >     Best,
>     >     Tobias
>     >
>     >
>     >
>     >
>     >
>
>
>

Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
Sounds like a possible explanation; however, to rule that out I had already deleted all the tables before rerunning. I think it might have to do with the client library: Airflow uses google-api-python-client, which is in maintenance mode, and Google suggests switching to google-cloud-python. I will write a PythonOperator DAG tomorrow and compare the two DAGs to see whether the library is the problem.
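
A minimal, library-agnostic sketch of that comparison: run the same query via both paths (the BigQueryOperator path and a PythonOperator using another client), collect a key column from each result set, and report rows that only one side returned. The row shape and the `event_id` key here are hypothetical.

```python
# Diff two result sets by a key column to surface rows missing on one side.
def diff_result_keys(rows_a, rows_b, key="event_id"):
    keys_a = {row[key] for row in rows_a}
    keys_b = {row[key] for row in rows_b}
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
    }

# Illustrative data: the script run returns one row the operator run lacks.
airflow_rows = [{"event_id": 1}, {"event_id": 2}]
script_rows = [{"event_id": 1}, {"event_id": 2}, {"event_id": 3}]
print(diff_result_keys(airflow_rows, script_rows))
# → {'only_in_a': [], 'only_in_b': [3]}
```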

On 27.09.2017, 19:15, "Chris Riccomini" <cr...@apache.org> wrote:

    Is it possible that you were getting a cache hit with the BQ operator?
    
    https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
    
    The operator does not currently expose this flag, and I couldn't find
    whether the cache defaults to on or off for insert-job API.
    
    On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
    Tobias.Feldhaus@localsearch.ch> wrote:
    
    > I’ve created a table with only the missing value in the exact same
    > partition, and then it’s going through. Could it be that the volume of the
    > data plays a role or the client libraries maybe?
    >
    > On 27.09.2017, 17:46, "Tobias Feldhaus" <To...@localsearch.ch>
    > wrote:
    >
    >     Hi,
    >
    >
    >     I am tracing a bug in one of our data pipelines and I narrowed it down
    > to some small number of events not being in a table (using Airflow 1.8.2).
    >     After running the query myself that airflow executed interactively, I
    > saw the missing entry. When airflow executed the same query, and writes the
    > results to a partitioned table in BQ it was missing in that destination
    > table.
    >     I’ve tried different scenarios now several times and the only
    > explanation or difference I can come up with, is that airflow _might_ be
    > that using partitioned tables is not fully supported or there is some weird
    > bug in the bigquery-python implementation.
    >
    >     When deleting the table and recreating it and reloading the complete
    > date with airflow the data is still missing. When reloading a single day,
    > it is also missing. I’ve created a python script to execute the exact same
    > query and it works as expected.
    >
    >     Any advice how to track this down further? Is this a known issue?
    >
    >     Best,
    >     Tobias
    >
    >
    >
    >
    >
    


Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Chris Riccomini <cr...@apache.org>.
Is it possible that you were getting a cache hit with the BQ operator?

https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api

The operator does not currently expose this flag, and I couldn't find
whether the cache defaults to on or off for the insert-job API.
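
For reference, a sketch (not the actual BigQueryOperator code) of where that flag lives: the BigQuery jobs.insert request body carries a configuration.query.useQueryCache field, so a hook that sets it explicitly can rule cached results in or out. The query and table names below are hypothetical.

```python
# Build a jobs.insert 'configuration' payload with caching pinned off.
def build_query_config(bql, destination_table, use_cache=False):
    return {
        "query": {
            "query": bql,
            "destinationTable": destination_table,
            "writeDisposition": "WRITE_TRUNCATE",
            # If omitted, BigQuery decides; setting it makes runs deterministic.
            "useQueryCache": use_cache,
        }
    }

cfg = build_query_config(
    "SELECT * FROM events",  # hypothetical query
    {"projectId": "p", "datasetId": "d", "tableId": "t"},
)
```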

On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
Tobias.Feldhaus@localsearch.ch> wrote:

> I’ve created a table with only the missing value in the exact same
> partition, and then it’s going through. Could it be that the volume of the
> data plays a role or the client libraries maybe?
>
> On 27.09.2017, 17:46, "Tobias Feldhaus" <To...@localsearch.ch>
> wrote:
>
>     Hi,
>
>
>     I am tracing a bug in one of our data pipelines and I narrowed it down
> to some small number of events not being in a table (using Airflow 1.8.2).
>     After running the query myself that airflow executed interactively, I
> saw the missing entry. When airflow executed the same query, and writes the
> results to a partitioned table in BQ it was missing in that destination
> table.
>     I’ve tried different scenarios now several times and the only
> explanation or difference I can come up with, is that airflow _might_ be
> that using partitioned tables is not fully supported or there is some weird
> bug in the bigquery-python implementation.
>
>     When deleting the table and recreating it and reloading the complete
> date with airflow the data is still missing. When reloading a single day,
> it is also missing. I’ve created a python script to execute the exact same
> query and it works as expected.
>
>     Any advice how to track this down further? Is this a known issue?
>
>     Best,
>     Tobias
>
>
>
>
>

Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table

Posted by Tobias Feldhaus <To...@localsearch.ch>.
I’ve created a table containing only the missing value in the exact same partition, and that one goes through. Could it be that the volume of the data plays a role, or maybe the client libraries?

On 27.09.2017, 17:46, "Tobias Feldhaus" <To...@localsearch.ch> wrote:

    Hi,
    
    
    I am tracing a bug in one of our data pipelines, and I have narrowed it down to a small number of events missing from a table (using Airflow 1.8.2).
    After interactively running the same query that Airflow had executed, I saw the missing entry. But when Airflow executed that query and wrote the results to a partitioned table in BigQuery, the entry was missing from the destination table.
    I have tried different scenarios several times now, and the only explanation I can come up with is that partitioned tables _might_ not be fully supported, or that there is some weird bug in the BigQuery Python implementation.
    
    When I delete and recreate the table and reload the complete date range with Airflow, the data is still missing. When reloading a single day, it is also missing. A Python script I wrote to execute the exact same query works as expected.
    
    Any advice how to track this down further? Is this a known issue?
    
    Best,
    Tobias
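
For context on how per-day writes to a partitioned destination typically work: BigQuery accepts a "$YYYYMMDD" partition decorator on the destination table name, so each daily run can target exactly one partition. A small sketch, with hypothetical dataset and table names:

```python
from datetime import date

# Append a BigQuery partition decorator for the given day.
def partition_destination(dataset_table, execution_date):
    return "{}${}".format(dataset_table, execution_date.strftime("%Y%m%d"))

dest = partition_destination("mydataset.events", date(2017, 9, 27))
print(dest)  # → mydataset.events$20170927
```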