Posted to issues@nifi.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/06 19:37:20 UTC
[jira] [Commented] (NIFI-2712) Database Fetch processors' max-value columns don't work as expected

    [ https://issues.apache.org/jira/browse/NIFI-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468322#comment-15468322 ] 

ASF GitHub Bot commented on NIFI-2712:
--------------------------------------

Github user jtstorck commented on the issue:

    https://github.com/apache/nifi/pull/976
  
    +1 on this PR.
    
    Based on the scope of the code changes and the unit testing that has been introduced, this addresses the use case of querying tables based on a hierarchy of columns that takes a primary column and partition columns into consideration.
    
    I did not run this against a live database, so the committer may want to do that for a sanity check.  The unit tests pass and look like they cover the hierarchical partitioned query use cases, and they use Derby as the database.
    
    There is one known issue that could occur with partitioning: if new data arrives with a new value for a partition column before all the data in the previous partition has been retrieved, the remaining data in the previous partition would not be fetched. According to @mattyb149, this is an edge case.  In any case, I think it can be avoided by designing the flow to account for it.


> Database Fetch processors' max-value columns don't work as expected
> -------------------------------------------------------------------
>
>                 Key: NIFI-2712
>                 URL: https://issues.apache.org/jira/browse/NIFI-2712
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
>
> Currently, for QueryDatabaseTable and GenerateTableFetch, the user can enter any number of maximum-value columns, which are used to generate a SQL query that will fetch all records whose values are greater than the last-observed maximum values for those columns.
> However, this makes multiple max-value columns not very useful, since they will both have to increase in lockstep or records will be lost/skipped. In such a case, using one or the other (but not both) would suffice, making multiple max-value columns useless.
> The more likely use case is that there are multiple columns whose values are strictly increasing, but at different rates. This is common with very large tables that have, for example, a "date_created" timestamp column and a "bucket number" column that strictly increases once a day. Queries for a day's worth of data are more efficient if they can be filtered on "bucket" (in this case), then on timestamp. However, the generated SQL queries would have to reflect that "bucket" may remain the same while the timestamp is increasing, but once the bucket value has increased, only the (new) timestamps for that bucket should be fetched.
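
One way to read the behavior described above is as a lexicographic comparison over the ordered max-value columns. The sketch below is only an illustration of that reading, not the code from the pull request: the build_where helper, the events table, and the sample values are all hypothetical, and it assumes the column names are trusted identifiers (they are interpolated into the SQL, while the values are bound as parameters).

```python
import sqlite3


def build_where(columns, last_max):
    """Build a lexicographic 'greater than' predicate over ordered columns.

    columns  -- max-value column names, most significant first (hypothetical)
    last_max -- last-observed maximum for each column, in the same order
    Returns (sql_fragment, params) using '?' placeholders.
    """
    if len(columns) != len(last_max):
        raise ValueError("columns and last_max must have the same length")
    clauses, params = [], []
    for i, col in enumerate(columns):
        # Earlier columns are pinned to their last maximum; column i must exceed its own.
        parts = [f"{c} = ?" for c in columns[:i]] + [f"{col} > ?"]
        clauses.append("(" + " AND ".join(parts) + ")")
        params.extend(last_max[: i + 1])
    return " OR ".join(clauses), params


# Hypothetical table: "bucket" increases once a day, "date_created" within a bucket.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (bucket INTEGER, date_created TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2016-09-05 09:00"),  # already fetched
        (1, "2016-09-05 10:00"),  # already fetched (the last-observed maximum)
        (1, "2016-09-05 11:00"),  # new: same bucket, later timestamp
        (2, "2016-09-06 00:30"),  # new: later bucket
    ],
)

where, params = build_where(["bucket", "date_created"], [1, "2016-09-05 10:00"])
rows = conn.execute(
    f"SELECT bucket, date_created FROM events WHERE {where} "
    "ORDER BY bucket, date_created",
    params,
).fetchall()
print(rows)  # [(1, '2016-09-05 11:00'), (2, '2016-09-06 00:30')]
```

Under this sketch, once the stored maximum for "bucket" has advanced to 2, a row that arrives late with bucket 1 no longer matches the predicate, which appears to be the kind of partitioning edge case mentioned in the review comment.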



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)