Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/19 09:07:42 UTC

[GitHub] [iceberg] chenjunjiedada opened a new issue #1215: FlinkParquetWriter should build writer with schema visitor

chenjunjiedada opened a new issue #1215:
URL: https://github.com/apache/iceberg/issues/1215


   Currently, `FlinkParquetWriters` extends `BaseParquetWriters`, which is built for Iceberg's generic records, but a Flink application cannot guarantee that the types inside a `Row` can be transformed to Iceberg types the way `RandomDataGenerator` does. According to the Flink [docs](https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/types.html), the types in `Row` are not compatible with Iceberg types. So, instead of extending `BaseParquetWriters`, it should build the writer with a Flink schema visitor, as `SparkParquetWriters` does. @openinx @JingsongLi @rdblue, what do you think?
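   For illustration, a rough sketch of what that could look like: dispatch on Flink's `LogicalTypeRoot` instead of on Java classes, the way `SparkParquetWriters` dispatches on Spark types. `ParquetValueWriter` is Iceberg's real interface, but the per-type factories below (`strings`, `timestamps`, `decimals`) are hypothetical stubs, not the actual API:
   
   ```java
   import org.apache.flink.table.types.logical.LogicalType;
   import org.apache.iceberg.parquet.ParquetValueWriter;
   import org.apache.parquet.schema.PrimitiveType;
   
   // Sketch of a Flink-aware writer builder: the ParquetValueWriter is chosen from
   // the Flink logical type, so StringData/TimestampData/DecimalData can be written
   // directly instead of expecting String, LocalDateTime, or BigDecimal.
   class FlinkWriteBuilder {
     static ParquetValueWriter<?> primitiveWriter(LogicalType flinkType, PrimitiveType parquetType) {
       switch (flinkType.getTypeRoot()) {
         case CHAR:
         case VARCHAR:
           return strings(parquetType);     // would consume StringData
         case TIMESTAMP_WITHOUT_TIME_ZONE:
           return timestamps(parquetType);  // would consume TimestampData
         case DECIMAL:
           return decimals(parquetType);    // would consume DecimalData
         default:
           throw new UnsupportedOperationException("Unsupported type: " + flinkType);
       }
     }
   
     // Hypothetical stubs standing in for real per-type writer implementations.
     private static ParquetValueWriter<?> strings(PrimitiveType t) { throw new UnsupportedOperationException(); }
     private static ParquetValueWriter<?> timestamps(PrimitiveType t) { throw new UnsupportedOperationException(); }
     private static ParquetValueWriter<?> decimals(PrimitiveType t) { throw new UnsupportedOperationException(); }
   }
   ```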


[GitHub] [iceberg] rdblue commented on issue #1215: FlinkParquetWriter should build writer with schema visitor

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1215:
URL: https://github.com/apache/iceberg/issues/1215#issuecomment-661994888


   Thanks for the clarification on the types that Flink uses.
   
   I agree that we should create readers and writers that work directly with these types, instead of trying to make the Iceberg generics work.
   
   > I wonder if we can only implement Flink internal data structure conversions
   
   This is what we do for Spark. We try to work only with `InternalRow`, and add conversions where we need them, for example to test against the public `Row` we get back from some interfaces.
   
   I would prefer to have just one data model implementation for Flink's internal representation (`RowData`). If we need to, we can have one for the external `Row` as well, but if Flink can already convert from `RowData` to `Row`, I would like to try to use those conversions instead.


[GitHub] [iceberg] openinx commented on issue #1215: FlinkParquetWriter should build writer with schema visitor

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #1215:
URL: https://github.com/apache/iceberg/issues/1215#issuecomment-660777984


   @chenjunjiedada +1, a Flink data type does not map strictly to a single Java data type, so it's better to create separate readers and writers to handle the differences. Supporting both `Row` and `RowData` in the Flink data stream is also necessary (as @JingsongLi said) because Flink 1.11 has changed its table row data type to `RowData`. I've prepared a patch in my own repo that abstracts an `IcebergStreamWriter` so that we can accept a generic data type in the stream ([commit](https://github.com/openinx/incubator-iceberg/commit/78fb6fe9b37f31a7eac14368e501535345953a92#diff-2e349ce4ab68643aa0f311250e745edfR49)). I will update the pull request https://github.com/apache/iceberg/pull/1145 once its dependency PR https://github.com/apache/iceberg/pull/1213 gets merged. Thanks.
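   For reference, the rough shape of such a generic writer operator. This is only a sketch: `TaskWriter` and the Flink operator classes are real APIs, but the `TaskWriterFactory` interface here is a hypothetical helper added to make the snippet self-contained, not necessarily what the linked patch does:
   
   ```java
   import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
   import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
   import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.io.TaskWriter;
   
   // Sketch: the operator is generic over the input type T, so the same class can
   // consume Row, RowData, or any other representation for which a writer exists.
   class IcebergStreamWriter<T> extends AbstractStreamOperator<DataFile>
       implements OneInputStreamOperator<T, DataFile> {
   
     private final TaskWriterFactory<T> writerFactory; // hypothetical factory
     private transient TaskWriter<T> writer;
   
     IcebergStreamWriter(TaskWriterFactory<T> writerFactory) {
       this.writerFactory = writerFactory;
     }
   
     @Override
     public void open() {
       this.writer = writerFactory.create();
     }
   
     @Override
     public void processElement(StreamRecord<T> element) throws Exception {
       writer.write(element.getValue());
     }
   }
   
   // Hypothetical factory interface, only here to make the sketch compile.
   interface TaskWriterFactory<T> {
     TaskWriter<T> create();
   }
   ```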


[GitHub] [iceberg] JingsongLi commented on issue #1215: FlinkParquetWriter should build writer with schema visitor

Posted by GitBox <gi...@apache.org>.
JingsongLi commented on issue #1215:
URL: https://github.com/apache/iceberg/issues/1215#issuecomment-662196003


   > we can have one for the external Row as well, but if Flink can already convert from RowData to Row, I would like to try to use those conversions instead.
   
   Yes, Flink's `DataStructureConverters.getConverter(DataType)` can be used, although it is not a public API in Flink.
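   For anyone following along, a minimal sketch of using that converter. It lives in `org.apache.flink.table.data.conversion` and is marked `@Internal` in Flink 1.11, so the signatures may change between releases; the two-field row type is made up for illustration:
   
   ```java
   import org.apache.flink.table.api.DataTypes;
   import org.apache.flink.table.data.GenericRowData;
   import org.apache.flink.table.data.RowData;
   import org.apache.flink.table.data.StringData;
   import org.apache.flink.table.data.conversion.DataStructureConverter;
   import org.apache.flink.table.data.conversion.DataStructureConverters;
   import org.apache.flink.table.types.DataType;
   import org.apache.flink.types.Row;
   
   public class RowDataConversionExample {
     public static void main(String[] args) {
       // The converter is derived from the declared row type.
       DataType dataType = DataTypes.ROW(
           DataTypes.FIELD("id", DataTypes.BIGINT()),
           DataTypes.FIELD("name", DataTypes.STRING()));
   
       DataStructureConverter<Object, Object> converter =
           DataStructureConverters.getConverter(dataType);
       converter.open(RowDataConversionExample.class.getClassLoader());
   
       // Internal -> external: (long, StringData) becomes (Long, String).
       RowData internal = GenericRowData.of(1L, StringData.fromString("iceberg"));
       Row external = (Row) converter.toExternal(internal);
   
       // External -> internal is the reverse direction.
       RowData roundTripped = (RowData) converter.toInternal(external);
     }
   }
   ```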


[GitHub] [iceberg] JingsongLi commented on issue #1215: FlinkParquetWriter should build writer with schema visitor

Posted by GitBox <gi...@apache.org>.
JingsongLi commented on issue #1215:
URL: https://github.com/apache/iceberg/issues/1215#issuecomment-660768444


   +1. We should deal with these differences, and we can consider rewriting some of the `ParquetValueWriter`s. The same applies to the readers, and to Avro and ORC as well.
   
   This workload is not small. I wonder if we can implement only the Flink internal data structure conversions, and just provide a converter from Flink `Row` to `RowData`.
   
   Here is the Flink Table/SQL data structure mapping; the internal data structures are more efficient, while the external ones are more commonly used in UDFs.
   
   ```
    * +--------------------------------+-----------------------------------------+-------------------+
    * | SQL Data Types                 | Internal Data Structures (RowData)      |  External (Row)   |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | BOOLEAN                        | boolean                                 |  boolean          |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | CHAR / VARCHAR / STRING        | {@link StringData}                      |  String           |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | BINARY / VARBINARY / BYTES     | byte[]                                  |  byte[]           |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | DECIMAL                        | {@link DecimalData}                     |  BigDecimal       |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | TINYINT                        | byte                                    |  byte             |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | SMALLINT                       | short                                   |  short            |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | INT                            | int                                     |  int              |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | BIGINT                         | long                                    |  long             |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | FLOAT                          | float                                   |  float            |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | DOUBLE                         | double                                  |  double           |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | DATE                           | int (number of days since epoch)        |  LocalDate        |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | TIME                           | int (number of milliseconds of the day) |  LocalTime        |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | TIMESTAMP                      | {@link TimestampData}                   |  LocalDateTime    |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | TIMESTAMP WITH LOCAL TIME ZONE | {@link TimestampData}                   |  Instant          |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | INTERVAL YEAR TO MONTH         | int (number of months)                  |  Period           |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | INTERVAL DAY TO SECOND         | long (number of milliseconds)           |  Duration         |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | ROW / structured types         | {@link RowData}                         |  Row              |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | ARRAY                          | {@link ArrayData}                       |  T[]              |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | MAP / MULTISET                 | {@link MapData}                         |  Map              |
    * +--------------------------------+-----------------------------------------+-------------------+
    * | RAW                            | {@link RawValueData}                    |  T                |
    * +--------------------------------+-----------------------------------------+-------------------+
   ```
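   To make the difference concrete, a short sketch of reading the same fields from each structure (real Flink 1.11 accessors; the three-field layout of id, name, and date is made up for illustration):
   
   ```java
   import java.time.LocalDate;
   import org.apache.flink.table.data.RowData;
   import org.apache.flink.types.Row;
   
   public class FieldAccessExample {
     // Internal access: positional, type-specific getters; strings arrive as
     // StringData and dates as plain ints (days since epoch).
     static void readInternal(RowData rowData) {
       long id = rowData.getLong(0);
       String name = rowData.getString(1).toString(); // StringData -> String
       int daysSinceEpoch = rowData.getInt(2);        // DATE is an int internally
     }
   
     // External access: generic getField(); values are already user-facing classes.
     static void readExternal(Row row) {
       Long id = (Long) row.getField(0);
       String name = (String) row.getField(1);
       LocalDate date = (LocalDate) row.getField(2);  // DATE is LocalDate externally
     }
   }
   ```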


[GitHub] [iceberg] chenjunjiedada closed issue #1215: FlinkParquetWriter should build writer with schema visitor

Posted by GitBox <gi...@apache.org>.
chenjunjiedada closed issue #1215:
URL: https://github.com/apache/iceberg/issues/1215


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org