Posted to github@arrow.apache.org by "minyoung (via GitHub)" <gi...@apache.org> on 2023/02/01 01:09:48 UTC

[GitHub] [arrow] minyoung opened a new pull request, #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

minyoung opened a new pull request, #33965:
URL: https://github.com/apache/arrow/pull/33965

   
   ### Rationale for this change
   
   Handle writing the `array.LargeString` and `array.LargeBinary` data types. This allows parquet files to contain more than 2 GiB of binary data in a single column chunk.
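
   As a rough usage sketch of the resulting behavior (not code from this PR's diff; the module path/version `v11` and the output file name are assumptions), writing a `LargeString` column might look like:

   ```go
   package main

   import (
   	"os"

   	"github.com/apache/arrow/go/v11/arrow"
   	"github.com/apache/arrow/go/v11/arrow/array"
   	"github.com/apache/arrow/go/v11/arrow/memory"
   	"github.com/apache/arrow/go/v11/parquet"
   	"github.com/apache/arrow/go/v11/parquet/pqarrow"
   )

   func main() {
   	schema := arrow.NewSchema([]arrow.Field{
   		{Name: "data", Type: arrow.BinaryTypes.LargeString},
   	}, nil)

   	// LargeString uses int64 value offsets, so a single column chunk can
   	// hold more value bytes than String's int32 offsets allow.
   	bldr := array.NewLargeStringBuilder(memory.DefaultAllocator)
   	defer bldr.Release()
   	bldr.Append("hello")
   	bldr.Append("world")
   	arr := bldr.NewLargeStringArray()
   	defer arr.Release()

   	rec := array.NewRecord(schema, []arrow.Array{arr}, int64(arr.Len()))
   	defer rec.Release()
   	tbl := array.NewTableFromRecords(schema, []arrow.Record{rec})
   	defer tbl.Release()

   	f, _ := os.Create("large_string.parquet")
   	defer f.Close()

   	// WithStoreSchema embeds the Arrow schema in the file metadata so a
   	// reader can round-trip LargeString instead of plain String.
   	writerProps := parquet.NewWriterProperties()
   	arrowProps := pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema())
   	_ = pqarrow.WriteTable(tbl, f, tbl.NumRows(), writerProps, arrowProps)
   }
   ```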
   
   ### Are these changes tested?
   
   Unit tests are included.
   
   ### Are there any user-facing changes?
   
   
   * Closes: https://github.com/apache/arrow/issues/33875




[GitHub] [arrow] zeroshade commented on a diff in pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #33965:
URL: https://github.com/apache/arrow/pull/33965#discussion_r1093715942


##########
go/parquet/file/record_reader.go:
##########
@@ -751,14 +751,15 @@ type byteArrayRecordReader struct {
 	valueBuf []parquet.ByteArray
 }
 
-func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
+func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, dtype arrow.DataType, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
 	if mem == nil {
 		mem = memory.DefaultAllocator
 	}
 
-	dt := arrow.BinaryTypes.Binary
-	if descr.LogicalType().Equals(schema.StringLogicalType{}) {
-		dt = arrow.BinaryTypes.String
+	dt, ok := dtype.(arrow.BinaryDataType)
+	// arrow.DecimalType will also come through here, which we want to treat as binary
+	if !ok {
+		dt = arrow.BinaryTypes.Binary

Review Comment:
   I don't like requiring the `arrow.DataType` here, as part of the point of this interface was that you could create a record reader without needing to explicitly code the underlying datatype; it determines it from the parquet schema. Could we instead use the metadata to determine whether or not we need to use a LargeString/LargeBinary builder by checking how large the total data column is? Or some other way of determining it? I'm not sure if that would work, but it would be preferable if it is possible.
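
   For reference, a minimal sketch of the metadata-based check being floated here (assuming the Go `parquet/file` and `metadata` APIs; the function name is hypothetical, and as the discussion below points out, this ignores dictionary expansion):

   ```go
   import (
   	"math"

   	"github.com/apache/arrow/go/v11/parquet/file"
   )

   // needsLargeBinary sums a column's uncompressed chunk sizes across all
   // row groups and reports whether the total could overflow the int32
   // offsets of a regular Binary/String array.
   func needsLargeBinary(rdr *file.Reader, colIdx int) (bool, error) {
   	var total int64
   	for rg := 0; rg < rdr.NumRowGroups(); rg++ {
   		chunk, err := rdr.MetaData().RowGroup(rg).ColumnChunk(colIdx)
   		if err != nil {
   			return false, err
   		}
   		total += chunk.TotalUncompressedSize()
   	}
   	return total > math.MaxInt32, nil
   }
   ```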





[GitHub] [arrow] zeroshade commented on a diff in pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #33965:
URL: https://github.com/apache/arrow/pull/33965#discussion_r1093505946


##########
go/parquet/pqarrow/encode_arrow.go:
##########
@@ -421,12 +425,21 @@ func writeDenseArrow(ctx *arrowWriteContext, cw file.ColumnChunkWriter, leafArr
 			wr.WriteBatchSpaced(leafArr.(*array.Float64).Float64Values(), defLevels, repLevels, leafArr.NullBitmapBytes(), int64(leafArr.Data().Offset()))
 		}
 	case *file.ByteArrayColumnChunkWriter:
-		if leafArr.DataType().ID() != arrow.STRING && leafArr.DataType().ID() != arrow.BINARY {
-			return xerrors.New("invalid column type to write to ByteArray")
+		var offsets []int64
+		switch leafArr.DataType().ID() {
+		case arrow.STRING, arrow.BINARY:
+			off32 := leafArr.(binaryarr).ValueOffsets()
+			offsets = make([]int64, len(off32))
+			for i, o := range off32 {
+				offsets[i] = int64(o)
+			}
+		case arrow.LARGE_STRING, arrow.LARGE_BINARY:
+			offsets = leafArr.(binary64arr).ValueOffsets()
+		default:
+			return xerrors.New(fmt.Sprintf("invalid column type to write to ByteArray: %s", leafArr.DataType().Name()))

Review Comment:
   Rather than paying the cost of copying the offsets, I'd just duplicate the loop at line 453 based on whether you're using int32 or int64 offsets. Duplicating those four lines of code is better than copying a potentially large offset array just to convert int32 to int64.
   
   When go1.20 releases, I'd consider it viable to make go1.18 our minimum requirement, at which point we can write a generic version of the loop as a function and have that just work. But for now, let's duplicate the loop and avoid copying the offset slice.
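
   For illustration, that future generic version might look like this sketch (go1.18+, hypothetical name; the real loop lives in encode_arrow.go):

   ```go
   // byteArraysFromOffsets slices the shared value buffer into one
   // parquet.ByteArray per element, handling int32 and int64 offsets
   // without copying the offset slice.
   func byteArraysFromOffsets[T int32 | int64](data []byte, offsets []T) []parquet.ByteArray {
   	if len(offsets) < 2 {
   		return nil
   	}
   	out := make([]parquet.ByteArray, len(offsets)-1)
   	for i := range out {
   		out[i] = parquet.ByteArray(data[offsets[i]:offsets[i+1]])
   	}
   	return out
   }
   ```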





[GitHub] [arrow] zeroshade merged pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade merged PR #33965:
URL: https://github.com/apache/arrow/pull/33965




[GitHub] [arrow] ursabot commented on pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #33965:
URL: https://github.com/apache/arrow/pull/33965#issuecomment-1414311751

   ['Python', 'R'] benchmarks have a high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7d6340037d464279a32aeb84e014b0ba...1adf31978295407faab403138f171dd3/)
   




[GitHub] [arrow] minyoung commented on a diff in pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "minyoung (via GitHub)" <gi...@apache.org>.
minyoung commented on code in PR #33965:
URL: https://github.com/apache/arrow/pull/33965#discussion_r1093573505


##########
go/parquet/pqarrow/encode_arrow.go:
##########
@@ -421,12 +425,21 @@ func writeDenseArrow(ctx *arrowWriteContext, cw file.ColumnChunkWriter, leafArr
 			wr.WriteBatchSpaced(leafArr.(*array.Float64).Float64Values(), defLevels, repLevels, leafArr.NullBitmapBytes(), int64(leafArr.Data().Offset()))
 		}
 	case *file.ByteArrayColumnChunkWriter:
-		if leafArr.DataType().ID() != arrow.STRING && leafArr.DataType().ID() != arrow.BINARY {
-			return xerrors.New("invalid column type to write to ByteArray")
+		var offsets []int64
+		switch leafArr.DataType().ID() {
+		case arrow.STRING, arrow.BINARY:
+			off32 := leafArr.(binaryarr).ValueOffsets()
+			offsets = make([]int64, len(off32))
+			for i, o := range off32 {
+				offsets[i] = int64(o)
+			}
+		case arrow.LARGE_STRING, arrow.LARGE_BINARY:
+			offsets = leafArr.(binary64arr).ValueOffsets()
+		default:
+			return xerrors.New(fmt.Sprintf("invalid column type to write to ByteArray: %s", leafArr.DataType().Name()))

Review Comment:
   That makes sense. Will shuffle the code around to use `ValueOffsets` as-is.





[GitHub] [arrow] minyoung commented on a diff in pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "minyoung (via GitHub)" <gi...@apache.org>.
minyoung commented on code in PR #33965:
URL: https://github.com/apache/arrow/pull/33965#discussion_r1093873932


##########
go/parquet/file/record_reader.go:
##########
@@ -751,14 +751,15 @@ type byteArrayRecordReader struct {
 	valueBuf []parquet.ByteArray
 }
 
-func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
+func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, dtype arrow.DataType, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
 	if mem == nil {
 		mem = memory.DefaultAllocator
 	}
 
-	dt := arrow.BinaryTypes.Binary
-	if descr.LogicalType().Equals(schema.StringLogicalType{}) {
-		dt = arrow.BinaryTypes.String
+	dt, ok := dtype.(arrow.BinaryDataType)
+	// arrow.DecimalType will also come through here, which we want to treat as binary
+	if !ok {
+		dt = arrow.BinaryTypes.Binary

Review Comment:
   > Could we instead use the metadata to determine whether or not we need to use a LargeString/LargeBinary builder by checking how large the total data column is?
   
   This is a little tricky to do. We could potentially do something like https://github.com/apache/arrow/commit/8b526e5afe5dc1b9f2a8b264e2693f406161cbf2, but that only sums the column chunk sizes and does not take the column chunk dictionary into account. Imagine 2M rows each referencing the same 1k string. The column chunk would have one string value in the dictionary, and all the rows would use that dictionary value. At [decoding time](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/internal/utils/typed_rle_dict.gen.go#L1137), the values would get filled out to >2G.
   
   In the context of `pqarrow.WithStoreSchema()`, it might also be a little surprising if the arrow schema says that a column is `arrow.LargeString` but you get an `arrow.String` back (because the total data column size is low). That said, since this specific change only applies to `pqarrow.WithStoreSchema()`, we could revert it and say that reads will always return `arrow.String`.
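
   To put rough numbers on that scenario (illustrative only; index size is a loose upper bound, RLE shrinks it further):

   ```go
   package main

   import "fmt"

   func main() {
   	// 2M rows all referencing one ~1 KiB dictionary value: chunk metadata
   	// reflects on-disk size, which stays tiny, while decoding fills the
   	// value out for every row.
   	const numRows, valueLen = 2_000_000, 1024

   	onDisk := int64(valueLen) + int64(numRows)*2 // one dict entry + ~2-byte indices
   	decoded := int64(numRows) * int64(valueLen)  // what the record reader materializes

   	fmt.Printf("on disk: ~%d MB, decoded: ~%d MB\n", onDisk>>20, decoded>>20)
   	// Output: on disk: ~3 MB, decoded: ~1953 MB
   }
   ```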





[GitHub] [arrow] zeroshade commented on a diff in pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #33965:
URL: https://github.com/apache/arrow/pull/33965#discussion_r1094751946


##########
go/parquet/file/record_reader.go:
##########
@@ -751,14 +751,15 @@ type byteArrayRecordReader struct {
 	valueBuf []parquet.ByteArray
 }
 
-func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
+func newByteArrayRecordReader(descr *schema.Column, info LevelInfo, dtype arrow.DataType, mem memory.Allocator, bufferPool *sync.Pool) RecordReader {
 	if mem == nil {
 		mem = memory.DefaultAllocator
 	}
 
-	dt := arrow.BinaryTypes.Binary
-	if descr.LogicalType().Equals(schema.StringLogicalType{}) {
-		dt = arrow.BinaryTypes.String
+	dt, ok := dtype.(arrow.BinaryDataType)
+	// arrow.DecimalType will also come through here, which we want to treat as binary
+	if !ok {
+		dt = arrow.BinaryTypes.Binary

Review Comment:
   hmm, okay that's fair. I think it's fine as is for now. I agree that if the stored schema says `arrow.LargeString` you should get that back.





[GitHub] [arrow] ursabot commented on pull request #33965: GH-33875: [Go] Handle writing LargeString and LargeBinary types

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #33965:
URL: https://github.com/apache/arrow/pull/33965#issuecomment-1414311135

   Benchmark runs are scheduled for baseline = a0d0ecee71b99b45f2c1b269c70033d39b982d00 and contender = fd2c80fb22d26d78b574b8f397979137b9c615b6. fd2c80fb22d26d78b574b8f397979137b9c615b6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/04072879a3b04626880d558118590b57...b4ca7250e7a64ce5b4d132a52a4bdebe/)
   [Failed :arrow_down:0.24% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/2e6ba3c9abb040f6b3c799aedd9f0e84...bfa9fe0433f54a4a8e67cbd29348035b/)
   [Finished :arrow_down:4.08% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7d6340037d464279a32aeb84e014b0ba...1adf31978295407faab403138f171dd3/)
   [Finished :arrow_down:0.22% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bc9c1705b9424d529b8de2a94d0a1618...f34e8120f89e42cd9dc2fb3e1a0b7b7f/)
   Buildkite builds:
   [Finished] [`fd2c80fb` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2310)
   [Failed] [`fd2c80fb` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2336)
   [Finished] [`fd2c80fb` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2307)
   [Finished] [`fd2c80fb` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2328)
   [Finished] [`a0d0ecee` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2309)
   [Failed] [`a0d0ecee` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2335)
   [Finished] [`a0d0ecee` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2306)
   [Finished] [`a0d0ecee` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2327)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   

