Posted to issues@hawq.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/11/17 23:13:11 UTC

[jira] [Commented] (HAWQ-56) Non-deterministic header results with "HEADER" option from external table

    [ https://issues.apache.org/jira/browse/HAWQ-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009684#comment-15009684 ] 

ASF GitHub Bot commented on HAWQ-56:
------------------------------------

GitHub user hornn opened a pull request:

    https://github.com/apache/incubator-hawq/pull/121

    HAWQ-56. Add format option to external protocol struct, and disallow …

    …HEADER format option to be used in PXF tables

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hornn/incubator-hawq HAWQ-56

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #121
    
----
commit 97896f6b48735065c9996e64161f6f18ca9222e8
Author: Noa Horn <nh...@pivotal.io>
Date:   2015-11-17T22:11:49Z

    HAWQ-56. Add format option to external protocol struct, and disallow HEADER format option to be used in PXF tables

----


> Non-deterministic header results with "HEADER" option from external table
> -------------------------------------------------------------------------
>
>                 Key: HAWQ-56
>                 URL: https://issues.apache.org/jira/browse/HAWQ-56
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: PXF
>            Reporter: Goden Yao
>            Assignee: Noa Horn
>            Priority: Critical
>
> *Repro Steps*
> External table definition
> {code:sql}
> drop external table if exists testtbl;
> create external table testtbl(i text, j text)
> location ('pxf://nakaphdns/tmp/testdata/*?Fragmenter=com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter&Accessor=com.pivotal.pxf.plugins.hdfs.TextFileAccessor&Resolver=com.pivotal.pxf.plugins.hdfs.TextResolver')
> format 'TEXT' (delimiter ',' header);
> select * from testtbl;
> {code}
> Example with 4 segment servers and 4 test files with headers in HDFS:
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i | j
> ---+---
>  3 | c
>  2 | b
>  1 | a
>  4 | d
> (4 rows)
> {code}
> With 5 test files:
> {code:sql}
> gpadmin=# select * from testtbl ;
>  i  |   j
> ----+-------
>  5  | e
>  2  | b
>  ID | Value
>  3  | c
>  1  | a
>  ID | VALUE
>  4  | d
> (7 rows)
> {code}
> *Analysis*
> When the HEADER option is used, the header line is removed only once per segment.
> As a result, the output varies with the number of segments/fragments scanned. If the number of files is greater than the number of segments, the header row of some files is returned as data.
> If the number of files is less than or equal to the number of segments, the retrieved data is correct. Hence the non-deterministic results.
> The reason for this behavior is that header-line handling is done by the external protocol code (fileam.c), which checks whether the header_line flag is true and, if so, skips the first line and sets the flag to false. This code calls the custom protocol code (in our case PXF) to fetch the next tuples, so it does not know whether the tuples come from the same resource or not.
> From what I can see, the gpfdist protocol solves this by letting the custom protocol code handle the header itself and mark the flag as false for the external protocol infrastructure in fileam.c.
> *Proposed Solution*
> Add a check to the PXF validator function (a function that is called as part of external table creation).
> This check will error out if the HEADER option is used in a PXF table.
> Currently the validator function API only exposes the table's URLs (the LOCATION part of the table definition), not the format options. Adding this check therefore requires an API change in the external protocol.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)