Posted to issues@orc.apache.org by "chong (Jira)" <ji...@apache.org> on 2022/01/10 02:03:00 UTC

[jira] [Updated] (ORC-1083) Failed to proune when converting Hybrid calendar to Proleptic calendar

     [ https://issues.apache.org/jira/browse/ORC-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chong updated ORC-1083:
-----------------------
    Description: 
The ORC file has only one date column and one row, written in the hybrid (Julian/Gregorian) calendar: 1582-10-03.

Pruning fails for the filter "c1 = 1582-10-03" when converting the hybrid calendar to the proleptic calendar. The date "1582-10-03" in the hybrid calendar is "1582-09-23" or "1582-10-13" in the proleptic calendar; I'm not sure which one, but it is clearly different from the hybrid-calendar date. The query should return an empty result when filtering with "c1 = 1582-10-03".
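
The day shift between the two calendars can be checked with the JDK alone. The following is a minimal sketch, assuming that java.util.GregorianCalendar (Julian before its default cutover of 1582-10-15) stands in for the hybrid calendar and java.time.LocalDate for the proleptic Gregorian calendar. The class and method names are hypothetical and this is not ORC code.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDayNumbers {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  /** Days since 1970-01-01 for a date label read in the hybrid Julian/Gregorian calendar. */
  public static long hybridEpochDay(int year, int month, int day) {
    // GregorianCalendar is a hybrid calendar: Julian rules before its
    // default cutover (1582-10-15), Gregorian rules from then on.
    GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    cal.clear();                    // zero all fields, including time of day
    cal.set(year, month - 1, day);  // Calendar months are zero-based
    return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
  }

  /** Days since 1970-01-01 for a date label read in the proleptic Gregorian calendar. */
  public static long prolepticEpochDay(int year, int month, int day) {
    return LocalDate.of(year, month, day).toEpochDay();
  }

  public static void main(String[] args) {
    long hybrid = hybridEpochDay(1582, 10, 3);
    long proleptic = prolepticEpochDay(1582, 10, 3);
    System.out.println("hybrid 1582-10-03    -> epoch day " + hybrid);
    System.out.println("proleptic 1582-10-03 -> epoch day " + proleptic);
    // The same physical day as hybrid 1582-10-03, labelled in the proleptic calendar
    System.out.println("hybrid day as proleptic date: " + LocalDate.ofEpochDay(hybrid));
  }
}
```

On a standard JDK this gives epoch day -141429 for the hybrid reading and -141439 for the proleptic reading, and it labels the hybrid day as 1582-10-13. Under these assumptions, the physical day written as 1582-10-03 in the hybrid calendar is 1582-10-13 (not 1582-09-23) in the proleptic calendar, while -141439 is the day number of the label 1582-10-03 read as a proleptic date.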

 

"pickRowGroups" fails to prune when "orc.proleptic.gregorian" is set to "true". This occurs on version 1.6.11+; version 1.5.10 behaves correctly.
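
To illustrate why the calendar matters for pruning, here is a minimal sketch of a min/max check for an equality predicate. It is a hypothetical helper, not ORC's pickRowGroups, and it makes no claim about which comparison the reader actually performs. The day numbers are the epoch days of the label 1582-10-03 in each calendar.

```java
public class StatsPruningSketch {
  /** An equality predicate can only match if the literal lies inside [min, max]. */
  public static boolean mayContain(long minDay, long maxDay, long literalDay) {
    return minDay <= literalDay && literalDay <= maxDay;
  }

  public static void main(String[] args) {
    long hybridDay = -141429L;    // label 1582-10-03 counted in the hybrid calendar
    long prolepticDay = -141439L; // label 1582-10-03 counted in the proleptic calendar

    // Statistics and literal expressed in the same day numbering: the row group is kept.
    System.out.println(mayContain(prolepticDay, prolepticDay, prolepticDay));
    // Statistics in hybrid numbering, literal in proleptic numbering: the row group is skipped.
    System.out.println(mayContain(hybridDay, hybridDay, prolepticDay));
  }
}
```

The point of the sketch is only that the outcome depends on whether the statistics and the filter literal are expressed in the same calendar before they are compared.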

 

*The Orc file was attached: read-hybrid-as-proleptic.orc*


{code:java}
$ java -jar orc-tools-1.7.0-uber.jar meta read-hybrid-as-proleptic.orc 
Processing data file read-hybrid-as-proleptic.orc [length: 246]
Structure for read-hybrid-as-proleptic.orc
File Version: 0.12 with ORC_14 by ORC Java 1.6.11
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<c1:date>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03
Stripes:
  Stripe: offset: 3 data: 8 rows: 1 tail: 35 index: 37
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 26
    Stream: column 1 section DATA start: 40 length 8
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
File length: 246 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
  org.apache.spark.version=3.2.0
________________________________________________________________________________________________________________________
(base) [chong@chong-pc tools]$ java -jar orc-tools-1.7.0-uber.jar data read-hybrid-as-proleptic.orc 
Processing data file read-hybrid-as-proleptic.orc [length: 246]
{"c1":"1582-10-03"}
________________________________________________________________________________________________________________________
 
{code}
 

*Code to reproduce this:*
{code:java}

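import java.sql.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.DateColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class Orc1083Repro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // convert to proleptic calendar
    conf.set("orc.proleptic.gregorian", "true");
    Reader reader = OrcFile.createReader(new Path("<path to read-hybrid-as-proleptic.orc>"),
        OrcFile.readerOptions(conf));
    System.out.println("File schema: " + reader.getSchema());
    System.out.println("File row count: " + reader.getNumberOfRows());
    Date dateForFilter = Date.valueOf("1582-10-03");
    System.out.println("Filter is c1 == " + dateForFilter);
    RecordReader rowIterator = reader.rows(
        reader.options()
            .searchArgument(SearchArgumentFactory.newBuilder()
                .equals("c1", PredicateLeaf.Type.DATE, dateForFilter)
                .build(), new String[]{"c1"}) // predicate pushdown
    );
    // Read the row data
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    DateColumnVector x = (DateColumnVector) batch.cols[0];
    System.out.println("find:");
    while (rowIterator.nextBatch(batch)) {
      for (int row = 0; row < batch.size; ++row) {
        // A repeating vector stores its single value at index 0
        int xRow = x.isRepeating ? 0 : row;
        System.out.println("c1: " + (x.noNulls || !x.isNull[xRow] ?
            x.vector[xRow] : null));
      }
    }
    rowIterator.close();
  }
}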
{code}
 

*Comparison between 1.5.10 and 1.6.11*

For ORC version 1.5.10:
find:
(nothing is printed; this is correct)

For ORC version 1.6.11:
find:
c1: -141439

*Other information*

Please try toggling conf.set("orc.proleptic.gregorian", "true") on 1.5.10 and 1.6.11 to see the difference.

The date "1582-10-03" in the hybrid calendar is "1582-09-23" or "1582-10-13" in the proleptic calendar; which one is correct?

This was found on Spark 3.2.0; Spark 3.0.1 behaves correctly.

 


> Failed to proune when converting Hybrid calendar to Proleptic calendar
> ----------------------------------------------------------------------
>
>                 Key: ORC-1083
>                 URL: https://issues.apache.org/jira/browse/ORC-1083
>             Project: ORC
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.6.11
>            Reporter: chong
>            Priority: Major
>         Attachments: read-hybrid-as-proleptic.orc
>


