Posted to issues@arrow.apache.org by "jayshrivastava (via GitHub)" <gi...@apache.org> on 2023/03/28 00:44:35 UTC

[GitHub] [arrow] jayshrivastava opened a new issue, #34751: [Go] parquet: how do you implement and read custom logical types?

jayshrivastava opened a new issue, #34751:
URL: https://github.com/apache/arrow/issues/34751

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   What is the correct way to implement a custom logical type? I'm having trouble reading my custom logical type from parquet files. I'm looking to decode the physical type differently depending on the logical type.
   
   I've tried something like the following so far:
   
   My logical type:
   ```
   type geometryLogicalType struct {
   	schema.StringLogicalType
   }
   
   func (geometryLogicalType) String() string {
   	return "geometry"
   }
   
   func (geometryLogicalType) Equals(rhs schema.LogicalType) bool {
   	_, ok := rhs.(geometryLogicalType)
   	return ok
   }
   ```
   
   My schema for this type:
   ```
   result.node, err = schema.NewPrimitiveNodeLogical("column1",
     optional, geometryLogicalType{}, parquet.Types.ByteArray, -1, -1)
   ```
   
   Reader:
   ```
   reader, err := file.NewParquetReader(f)
   	for rg := 0; rg < reader.NumRowGroups(); rg++ {
   		rgr := reader.RowGroup(rg)
   
   		for colIdx := 0; colIdx < numCols; colIdx++ {
   			col, err := rgr.Column(colIdx)
   			switch col.Type() {
   			case parquet.Types.ByteArray:
   			
   				switch typ := col.Descriptor().LogicalType().(type) {
   				case *schema.DecimalLogicalType:
   					...
   				case schema.StringLogicalType:
   					...
   				case geometryLogicalType:
   					... // does not work
   				default:
   					panic(errors.Newf("unimplemented logical type %s", typ))
   				}
   			}
   		}
   	}
   	require.NoError(t, reader.Close())
   ```
   
   
   
   
   
   
   ### Component(s)
   
   Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jayshrivastava commented on issue #34751: [Go] parquet: how do you implement and read custom logical types?

Posted by "jayshrivastava (via GitHub)" <gi...@apache.org>.
jayshrivastava commented on issue #34751:
URL: https://github.com/apache/arrow/issues/34751#issuecomment-1494368444

   [WithWriteMetadata](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/file#WithWriteMetadata) is perfect. Thanks for the help!




[GitHub] [arrow] zeroshade commented on issue #34751: [Go] parquet: how do you implement and read custom logical types?

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #34751:
URL: https://github.com/apache/arrow/issues/34751#issuecomment-1493131912

   > I think including KV pairs was the answer I was looking for. Does this library support reading and writing arbitrary KV metadata? I don't see any way to do this with parquet readers / writers.
   
   If you look at the [`file.NewParquetWriter`](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/file#NewParquetWriter) function, you can add an arbitrary number of `WriteOption`s when creating the writer. One of the options is [`WithWriteMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/file#WithWriteMetadata), which allows you to provide the key value metadata to write for this file.  The metadata can be manipulated via [`metadata.KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/metadata#KeyValueMetadata).
   
   When reading a file, you can use the [`MetaData`](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/file#Reader.MetaData) method of the reader to retrieve the file-level metadata; the [`KeyValueMetadata`](https://pkg.go.dev/github.com/apache/arrow/go/v11@v11.0.0/parquet/metadata#FileMetaData.KeyValueMetadata) method on the `FileMetaData` object then returns those key-value pairs from the file.
   
   > Say I have strings and timestamps (uint64) stored in memory. This library only supports timestamps written as int64. To work around this, I was considering writing them as strings using the string logical type. The problem is that a reader (which reads the parquet file back into memory) will not know whether to interpret a string as a string or timestamp because the logical type is the same.
   
   This is because the Parquet specification states that timestamps should be written as an `int64` column with a timestamp logical type. In fact, there is no physical `uint64` type in Parquet; unsigned types are a "logical" type annotated on a column. That said, it's pretty trivial (and likely more performant than converting them to strings) to just convert your `[]uint64` timestamps into a `[]int64` to write them out, right? But if you really want to convert them to strings, you can use the metadata functions I mentioned above for writing and reading the metadata.
   
   > Say that I want to store a protobuf column which will physically be stored as bytes. A reader will not know how to decode the bytes from the file unless there is some logical type / metadata which indicates the logical type of the column.
   
   Right, this can also be achieved with the key value metadata specified at write time and then read back as long as you communicate ahead of time what the Key is that a consumer should be reading.
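
To make the "agree on the key ahead of time" idea concrete, the following is a small sketch of how a reader might dispatch on such metadata. It deliberately uses a plain `map[string]string` as a stand-in for the key-value pairs read back from the file, so it does not show the library's API. The `custom_type.` key prefix, the type names, and the functions `typeKey` and `decodeByteArray` are all assumptions made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// typeKey returns the metadata key under which a column's custom type is
// recorded. The "custom_type." prefix is a made-up convention that writer
// and reader would have to agree on.
func typeKey(column string) string {
	return "custom_type." + column
}

// decodeByteArray interprets one byte-array value according to the custom
// type recorded for its column. kv stands in for the key-value pairs read
// back from the file's metadata.
func decodeByteArray(kv map[string]string, column string, raw []byte) (interface{}, error) {
	typ, ok := kv[typeKey(column)]
	if !ok {
		// No annotation for this column: treat the value as an ordinary string.
		return string(raw), nil
	}
	switch typ {
	case "timestamp":
		// In this sketch the writer stored timestamps as decimal strings.
		ts, err := strconv.ParseUint(string(raw), 10, 64)
		if err != nil {
			return nil, err
		}
		return ts, nil
	case "protobuf":
		// Pass the bytes through unchanged; a real reader would unmarshal them.
		return raw, nil
	default:
		return nil, fmt.Errorf("unimplemented custom type %q for column %q", typ, column)
	}
}

func main() {
	kv := map[string]string{typeKey("created_at"): "timestamp"}
	v, err := decodeByteArray(kv, "created_at", []byte("1679964275"))
	fmt.Println(v, err)
}
```

Keying the annotation by column name keeps the physical and logical types in the file untouched, so readers that do not know the convention still see ordinary string or byte-array columns.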




[GitHub] [arrow] jayshrivastava closed issue #34751: [Go] parquet: how do you implement and read custom logical types?

Posted by "jayshrivastava (via GitHub)" <gi...@apache.org>.
jayshrivastava closed issue #34751: [Go] parquet: how do you implement and read custom logical types?
URL: https://github.com/apache/arrow/issues/34751




[GitHub] [arrow] jayshrivastava commented on issue #34751: [Go] parquet: how do you implement and read custom logical types?

Posted by "jayshrivastava (via GitHub)" <gi...@apache.org>.
jayshrivastava commented on issue #34751:
URL: https://github.com/apache/arrow/issues/34751#issuecomment-1490695701

   I think including KV pairs was the answer I was looking for. Does this library support reading and writing arbitrary KV metadata? I don't see any way to do this with parquet readers / writers.
   
   Here's an example of what I'm trying to do:
   Say I have strings and timestamps (uint64) stored in memory. This library only supports timestamps written as int64. To work around this, I was considering writing them as strings using the string logical type. The problem is that a reader (which reads the parquet file back into memory) will not know whether to interpret a string as a string or timestamp because the logical type is the same.
   
   Another example:
   Say that I want to store a protobuf column which will physically be stored as bytes. A reader will not know how to decode the bytes from the file unless there is some logical type / metadata which indicates the logical type of the column.




[GitHub] [arrow] zeroshade commented on issue #34751: [Go] parquet: how do you implement and read custom logical types?

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on issue #34751:
URL: https://github.com/apache/arrow/issues/34751#issuecomment-1488866862

   As far as I'm aware, the Parquet spec doesn't have a setup for "custom logical types". The best bet that I can think of would likely be to include metadata in the Parquet schema (metadata is just key - value pairs) that you can check for. Then you can check for that metadata to determine that it's your custom type and process accordingly.
   
   I guess my question is why you need the custom logical type in the first place. What is the structure of the data? 

