You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "zeroshade (via GitHub)" <gi...@apache.org> on 2023/04/03 13:34:28 UTC

[GitHub] [arrow] zeroshade commented on a diff in pull request #34854: GH-34853: [Go] Add TotalRecordSize, TotalArraySize

zeroshade commented on code in PR #34854:
URL: https://github.com/apache/arrow/pull/34854#discussion_r1155956646


##########
go/arrow/util/byte_size.go:
##########
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+)
+
+func isArrayDataNil(arrayData arrow.ArrayData) bool {
+	if arrayData == nil {
+		return true
+	}
+	switch v := arrayData.(type) {
+	case *array.Data:
+		if v == nil {
+			return true
+		}
+	default:
+		panic("unknown array data type")
+	}
+	return false

Review Comment:
   probably simpler to do:
   
   ```go
   if arrayData == nil {
       return true
   }
   
   if v, ok := arrayData.(*array.Data); ok {
       return v == nil
   }
   panic("unknown ArrayData type")
   ```
   



##########
go/arrow/array/record.go:
##########
@@ -180,6 +179,14 @@ func (rec *simpleRecord) SetColumn(i int, arr arrow.Array) (arrow.Record, error)
 	return NewRecord(rec.schema, arrs, rec.rows), nil
 }
 
+func (rec *simpleRecord) TotalBufferSize() uint64 {
+	var sz uint64
+	for _, arr := range rec.arrs {
+		sz += uint64(arr.Data().Len())
+	}
+	return sz
+}

Review Comment:
   This isn't accurate, this just computes NCols * NRows. 
   
   In addition, I'm not sure `TotalBufferSize` is a good name for this given that these would be separate buffers.
   
   The linked documentation is referring to the `compute.Datum` type for the "total buffer size". It might be better to name this something like `TotalBytes`?
   
   And implementing it would be accurately something like:
   
   ```go
   func (rec *simpleRecord) TotalBytes() (sz uint64) {
       for _, arr := range rec.arrs {
           for _, buf := range arr.Data().Buffers() {
               if buf != nil {
                   sz += buf.Len()
               }
           }
       }
   }
   ```



##########
go/arrow/util/byte_size.go:
##########
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+)
+
+func isArrayDataNil(arrayData arrow.ArrayData) bool {
+	if arrayData == nil {
+		return true
+	}
+	switch v := arrayData.(type) {
+	case *array.Data:
+		if v == nil {
+			return true
+		}
+	default:
+		panic("unknown array data type")
+	}
+	return false
+}
+
+func totalArrayDataSize(arrayData arrow.ArrayData, seenBuffers map[string]struct{}) int64 {

Review Comment:
   would be more efficient to just use a `map[*memory.Buffer]struct{}` instead rather than using `fmt.Sprintf("%p", buf)` no need to convert to string for the map key



##########
go/arrow/util/byte_size.go:
##########
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+)
+
+func isArrayDataNil(arrayData arrow.ArrayData) bool {
+	if arrayData == nil {
+		return true
+	}
+	switch v := arrayData.(type) {
+	case *array.Data:
+		if v == nil {
+			return true
+		}
+	default:
+		panic("unknown array data type")
+	}
+	return false
+}
+
+func totalArrayDataSize(arrayData arrow.ArrayData, seenBuffers map[string]struct{}) int64 {
+	var sum int64
+	for _, buf := range arrayData.Buffers() {
+		if buf != nil {
+			_, ok := seenBuffers[fmt.Sprintf("%p", buf)]
+			if !ok {
+				sum += int64(buf.Len())
+				seenBuffers[fmt.Sprintf("%p", buf)] = struct{}{}
+			}
+		}
+	}
+	for _, child := range arrayData.Children() {
+		sum += totalArrayDataSize(child, seenBuffers)
+	}
+	dict := arrayData.Dictionary()
+	if !isArrayDataNil(dict) {
+		sum += totalArrayDataSize(dict, seenBuffers)
+	}
+	return sum
+} 
+
+
+func totalArraySize(arr arrow.Array, seenBuffers map[string]struct{}) int64 {
+	return totalArrayDataSize(arr.Data(), seenBuffers)
+} 
+
+func totalRecordSize(record arrow.Record, seenBuffers map[string]struct{}) int64 {
+	var sum int64
+	for _, c := range record.Columns() {
+		sum += totalArraySize(c, seenBuffers)
+	}
+	return sum
+} 
+
+func TotalArraySize(arr arrow.Array) int64 {
+	seenBuffer := make(map[string]struct{})
+	return totalArraySize(arr, seenBuffer)
+}
+
+func TotalRecordSize(record arrow.Record) int64 {
+	seenBuffer := make(map[string]struct{})
+	return totalRecordSize(record, seenBuffer)
+}

Review Comment:
   should add a newline to the end of the file



##########
go/arrow/util/byte_size.go:
##########
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+)
+
+func isArrayDataNil(arrayData arrow.ArrayData) bool {
+	if arrayData == nil {
+		return true
+	}
+	switch v := arrayData.(type) {
+	case *array.Data:
+		if v == nil {
+			return true
+		}
+	default:
+		panic("unknown array data type")
+	}
+	return false
+}
+
+func totalArrayDataSize(arrayData arrow.ArrayData, seenBuffers map[string]struct{}) int64 {
+	var sum int64
+	for _, buf := range arrayData.Buffers() {
+		if buf != nil {
+			_, ok := seenBuffers[fmt.Sprintf("%p", buf)]
+			if !ok {
+				sum += int64(buf.Len())
+				seenBuffers[fmt.Sprintf("%p", buf)] = struct{}{}
+			}

Review Comment:
   can simplify this:
   
   ```go
   var void = struct{}{}
   for _, buf := range arrayData.Buffers() {
       if buf == nil {
           continue
       }
       if _, ok := seenBuffers[buf]; !ok {
           sum += int64(buf.Len())
           seenBuffers[buf] = void
       }
   }
   ```
   



##########
go/arrow/util/byte_size.go:
##########
@@ -0,0 +1,83 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+)
+
+func isArrayDataNil(arrayData arrow.ArrayData) bool {
+	if arrayData == nil {
+		return true
+	}
+	switch v := arrayData.(type) {
+	case *array.Data:
+		if v == nil {
+			return true
+		}
+	default:
+		panic("unknown array data type")
+	}
+	return false
+}
+
+func totalArrayDataSize(arrayData arrow.ArrayData, seenBuffers map[string]struct{}) int64 {
+	var sum int64
+	for _, buf := range arrayData.Buffers() {
+		if buf != nil {
+			_, ok := seenBuffers[fmt.Sprintf("%p", buf)]
+			if !ok {
+				sum += int64(buf.Len())
+				seenBuffers[fmt.Sprintf("%p", buf)] = struct{}{}
+			}
+		}
+	}
+	for _, child := range arrayData.Children() {
+		sum += totalArrayDataSize(child, seenBuffers)
+	}
+	dict := arrayData.Dictionary()
+	if !isArrayDataNil(dict) {
+		sum += totalArrayDataSize(dict, seenBuffers)
+	}
+	return sum
+} 
+
+
+func totalArraySize(arr arrow.Array, seenBuffers map[string]struct{}) int64 {
+	return totalArrayDataSize(arr.Data(), seenBuffers)
+} 
+
+func totalRecordSize(record arrow.Record, seenBuffers map[string]struct{}) int64 {
+	var sum int64
+	for _, c := range record.Columns() {
+		sum += totalArraySize(c, seenBuffers)
+	}
+	return sum
+} 
+
+func TotalArraySize(arr arrow.Array) int64 {
+	seenBuffer := make(map[string]struct{})
+	return totalArraySize(arr, seenBuffer)
+}
+
+func TotalRecordSize(record arrow.Record) int64 {
+	seenBuffer := make(map[string]struct{})
+	return totalRecordSize(record, seenBuffer)
+}

Review Comment:
   Can you add docstring comments to these plz? thanks



##########
go/arrow/util/byte_size_test.go:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package util_test
+
+import (
+	"strings"
+	"testing"
+
+	"github.com/apache/arrow/go/v12/arrow"
+	"github.com/apache/arrow/go/v12/arrow/array"
+	"github.com/apache/arrow/go/v12/arrow/memory"
+	"github.com/apache/arrow/go/v12/arrow/util"
+	"github.com/stretchr/testify/assert"
+)
+
+func TestTotalArraySizeBasic(t *testing.T) {

Review Comment:
   Please add tests for the nil cases and for reused buffers and dictionaries please



##########
go/arrow/array/record.go:
##########
@@ -180,6 +179,14 @@ func (rec *simpleRecord) SetColumn(i int, arr arrow.Array) (arrow.Record, error)
 	return NewRecord(rec.schema, arrs, rec.rows), nil
 }
 
+func (rec *simpleRecord) TotalBufferSize() uint64 {

Review Comment:
   Since you added `TotalRecordSize` in utils, is there actually a need for this separate method? If so, can it just call `TotalRecordSize` and add this to the `arrow.Record` interface?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org