Posted to user@arrow.apache.org by "McDonald, Ben" <be...@hpe.com> on 2022/04/15 22:28:37 UTC

[C++] Null indices and byte lengths of string columns

Hello,

I have been writing some code to read Parquet files and it would be useful if there was an easy way to get the number of bytes in a string column as well as the null indices of that column. I would have expected this to be available in metadata somewhere, but I have not seen any way to query that from the API and don’t see anything like this using `parquet-tools` to inspect the files.

Is there any way to get the null indices of a Parquet string column besides reading the whole file and manually checking for nulls?

Is there any way to get the byte lengths of string columns without reading each string and summing the number of bytes of each string?

Thank you.

Best,
Ben McDonald

Re: [C++] Null indices and byte lengths of string columns

Posted by Antoine Pitrou <an...@python.org>.
On Mon, 18 Apr 2022 13:09:52 -0700
Micah Kornfield <em...@gmail.com> wrote:
> Note that uncompressed size is encoded size so can be substantially smaller
> than a simple concatenated string buffer

Indeed, the only reliable way to get the desired information is to
actually read and decode the Parquet data.

Regards

Antoine.



Re: [C++] Null indices and byte lengths of string columns

Posted by Micah Kornfield <em...@gmail.com>.
Note that the uncompressed size is the encoded size, so it can be substantially
smaller than a simple concatenated string buffer.


Re: [C++] Null indices and byte lengths of string columns

Posted by Weston Pace <we...@gmail.com>.
From a pure metadata-only perspective you should be able to get the
size of the column and possibly a null count (for parquet files where
statistics are stored).  However, you will not be able to get the
indices of the nulls.

The null count and column size are going to come from the parquet
metadata and you will need to use the parquet APIs to get this
information.  In pyarrow this would be:

```
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).statistics.null_count
1
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_compressed_size
122
>>> pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).total_uncompressed_size
119
```

In the C++ API you will want to look at `parquet::ParquetFileReader::metadata`, which returns the `parquet::FileMetaData` for the file.
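
The numbers above are per row group, so a whole-column figure means summing over every row group. A rough sketch of that loop, assuming only the pyarrow metadata attributes already shown plus `num_row_groups` and `statistics.has_null_count`; the function names are made up for illustration and are not part of pyarrow:

```python
def column_metadata_totals(file_metadata, column_index):
    """Sum per-row-group metadata for one column without reading data pages.

    `file_metadata` is expected to look like pyarrow's FileMetaData
    (`num_row_groups`, `row_group(i).column(j)`).
    """
    null_count = 0  # becomes None if any row group lacks the statistic
    compressed = 0
    uncompressed = 0
    for i in range(file_metadata.num_row_groups):
        chunk = file_metadata.row_group(i).column(column_index)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size
        stats = chunk.statistics  # None when the writer stored no statistics
        if stats is None or not stats.has_null_count:
            null_count = None
        elif null_count is not None:
            null_count += stats.null_count
    return null_count, compressed, uncompressed


def column_metadata_totals_for_file(path, column_index):
    # Imported lazily so the helper above has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    return column_metadata_totals(pq.ParquetFile(path).metadata, column_index)
```

Calling `column_metadata_totals_for_file('/tmp/foo.parquet', 0)` would return a `(null_count, compressed, uncompressed)` tuple. The null count comes back as `None` when any row group was written without statistics, and the sizes are those of the encoded pages, not of the decoded strings.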


Re: [C++] Null indices and byte lengths of string columns

Posted by "McDonald, Ben" <be...@hpe.com>.
It seems that these options require reading into `ArrayData`. I have been using `ReadBatch` to read directly into a malloced C buffer to avoid having to create the additional copy, which is why I was hoping there would be a way to get this from the file metadata or some operation on the file rather than from the data that has already been read into an Arrow data structure.

So, the only way that I could do this today would be to read into an `ArrayData` and then call an `arrow::compute` function? There is no way to get the info from the file?

Best,
Ben McDonald



Re: [C++] Null indices and byte lengths of string columns

Posted by Niranda Perera <ni...@gmail.com>.
Hi Ben,

I believe you could use arrow::compute for this.

On Fri, Apr 15, 2022 at 6:28 PM McDonald, Ben <be...@hpe.com> wrote:

> Hello,
>
>
>
> I have been writing some code to read Parquet files and it would be useful
> if there was an easy way to get the number of bytes in a string column as
> well as the null indices of that column. I would have expected this to be
> available in metadata somewhere, but I have not seen any way to query that
> from the API and don’t see anything like this using `parquet-tools` to
> inspect the files.
>
>
>
> Is there any way to get the null indices of a Parquet string column
> besides reading the whole file and manually checking for nulls?
>
There is an internal method for this [1], but unfortunately I don't think it
is exposed to the outside. One possible solution is to call compute::is_null
and pass the result to compute::indices_nonzero.
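
To make clear what that pair of calls produces, here is a plain-Python model of it. It is only a sketch: it assumes Arrow's validity bitmap layout (bit i, least-significant bit first within each byte, is 1 when slot i is valid), and the function name and sample data are invented for illustration. In pyarrow the real thing would presumably be `pc.indices_nonzero(pc.is_null(arr))`.

```python
def null_indices(validity_bitmap, length, offset=0):
    """Return the indices of null slots, given an Arrow-style validity bitmap.

    Bit i (least-significant bit first within each byte) is 1 when slot i is
    valid and 0 when it is null. A missing bitmap (None) means no nulls.
    """
    if validity_bitmap is None:
        return []
    indices = []
    for i in range(length):
        bit = offset + i
        is_valid = (validity_bitmap[bit // 8] >> (bit % 8)) & 1
        if not is_valid:
            indices.append(i)
    return indices


# Slots: "a", null, "bc", null, "d" -> validity bits 1,0,1,0,1 = 0b00010101
print(null_indices(bytes([0b00010101]), length=5))  # [1, 3]
```

Either way the data has to be in memory first; nothing here reads the null positions from the Parquet file itself.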


>
>
> Is there any way to get the byte lengths of string columns without reading
> each string and summing the number of bytes of each string?
>
Do you want the non-null byte length?
If not, you can simply take the offsets buffer from ArrayData (int32 for
string arrays, int64 for large_string arrays) and take the last value. That
would be the full byte size of the string array, assuming the array is not
sliced and its first offset is 0.
If yes, I believe you can achieve this by using the VisitArrayDataInline/
VisitNullBitmapInline methods [2].
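
A small plain-Python illustration of the two cases, as a sketch only: the offsets and validity values are made-up sample data and the function names are hypothetical. It subtracts the first offset so that it also works when the offsets do not start at 0.

```python
def total_string_bytes(offsets):
    """Bytes spanned by all slots: last offset minus first offset."""
    return offsets[-1] - offsets[0]


def non_null_string_bytes(offsets, validity):
    """Bytes belonging to valid slots only; `validity` is one bool per slot."""
    return sum(
        offsets[i + 1] - offsets[i]
        for i, is_valid in enumerate(validity)
        if is_valid
    )


# Values: "a", null, "bc", null, "d". Null slots usually have zero length,
# but the format does not require it, so slot 3 here spans 2 unused bytes.
offsets = [0, 1, 1, 3, 5, 6]
validity = [True, False, True, False, True]
print(total_string_bytes(offsets))               # 6
print(non_null_string_bytes(offsets, validity))  # 4
```

The two numbers only differ when a null slot occupies bytes in the data buffer, which the Arrow format permits.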


>
> Thank you.
>
>
>
> Best,
>
> Ben McDonald
>

[1]
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/compute/api_vector.h#L226
[2]
https://github.com/apache/arrow/blob/d36b2b3392ed78b294b565c3bd3f32eb6675b23a/cpp/src/arrow/visit_data_inline.h#L224

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>