Posted to dev@arrow.apache.org by 1057445597 <10...@qq.com.INVALID> on 2022/04/11 02:24:58 UTC

Re: construct dataset for s3 by ParquetDatasetFactory failed

This is a folder that contains some parquet files. Do you mean that ParquetDatasetFactory can only be used for a single file, while FileSystemDatasetFactory can be used for folders? Or can you tell me how to use ParquetDatasetFactory correctly? What do I need to make sure of? For example, what should I pay attention to with the metadata_path parameter? It would be best to have an example. The reason I want to use ParquetDatasetFactory is that the FileSystemDatasetFactory workflow seems to be as follows:


```
FileSystemDatasetFactory--->get a dataset
dataset->GetFragments--------->get fragments for parquet files in the folder
for fragment in fragments ------->construct a scanner builder---->Finish()--->get a scanner
scanner ->ToTable() --->get a table (read the file to memory)


// I want to filter some columns before ToTable(), but it seems that only the Table class has a ColumnNames() function
```
Is this the wrong way?
My ultimate goal is to use Arrow to read Parquet files from S3 for TensorFlow training.
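
To make the question concrete, here is roughly what that flow looks like in my code (just a sketch; `s3fs`, `selector`, and `format` are set up the same way as in the snippet quoted below, with the commented-out FileSelector/FileSystemFactoryOptions lines enabled):

```
// Sketch of the flow above (Arrow C++ 7.0 dataset API, ds = arrow::dataset).
auto factory = ds::FileSystemDatasetFactory::Make(
                   s3fs, selector, format, ds::FileSystemFactoryOptions{})
                   .ValueOrDie();
auto dataset = factory->Finish().ValueOrDie();

auto fragments = dataset->GetFragments().ValueOrDie();
for (auto maybe_fragment : fragments) {
  auto fragment = maybe_fragment.ValueOrDie();
  // One scanner per fragment, then read the whole fragment into memory.
  ds::ScannerBuilder builder(dataset->schema(), fragment,
                             std::make_shared<ds::ScanOptions>());
  auto scanner = builder.Finish().ValueOrDie();
  auto table = scanner->ToTable().ValueOrDie();
  // This is where I would like to keep only some columns.
}
```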





------------------ Original Message ------------------
From: "dev" <weston.pace@gmail.com>;
Sent: Saturday, April 9, 2022, 11:38 AM
To: "dev" <dev@arrow.apache.org>;

Subject: Re: construct dataset for s3 by ParquetDatasetFactory failed



Is `iceberg-test/warehouse/test/metadata` a parquet file? I only ask
because there is no extension. The commented out
FileSystemDatasetFactory is only accessing bucket_uri so it would
potentially succeed even if the metadata file did not exist.

On Fri, Apr 8, 2022 at 1:48 AM 1057445597 <1057445597@qq.com.invalid> wrote:
>
> I want to use ParquetDatasetFactory to create a dataset for S3, but it failed! The error message is as follows:
>
>
> /build/apache-arrow-7.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: IOError: Path does not exist 'iceberg-test/warehouse/test/metadata' /lib/x86_64-linux-gnu/libarrow.so.700(+0x10430bb)[0x7f4ee6fe50bb] /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f4ee6fe52fd] /lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17e)[0x7f4ee7104a2e] ./example(+0xd97d)[0x564087f3e97d] ./example(+0x8bc2)[0x564087f39bc2] ./example(+0x94c8)[0x564087f3a4c8] ./example(+0x9fb4)[0x564087f3afb4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f4ee572b0b3] ./example(+0x69fe)[0x564087f379fe] Aborted (core dumped)
>
>
> In the following code snippet, there is a commented-out line that uses FileSystemDatasetFactory to create the dataset, and that works well. Can't a dataset be created through a ParquetDatasetFactory?
>
>
> std::shared_ptr<ds::Dataset> GetDatasetFromS3(const std::string& access_key,
>                                               const std::string& secret_key,
>                                               const std::string& endpoint_override,
>                                               const std::string& bucket_uri) {
>   EnsureS3Initialized();
>
>   S3Options s3Options = S3Options::FromAccessKey(access_key, secret_key);
>   s3Options.endpoint_override = endpoint_override;
>   s3Options.scheme = "http";
>
>   std::shared_ptr<S3FileSystem> s3fs = S3FileSystem::Make(s3Options).ValueOrDie();
>
>   std::string path;
>   std::stringstream ss;
>   ss << "s3://" << access_key << ":" << secret_key
>      << "@" << K_METADATA_PATH
>      << "?scheme=http&endpoint_override=" << endpoint_override;
>   auto fs = arrow::fs::FileSystemFromUri(ss.str(), &path).ValueOrDie();
>   // auto fileInfo = fs->GetFileInfo().ValueOrDie();
>
>   auto format = std::make_shared<ParquetFileFormat>();
>
>   // FileSelector selector;
>   // selector.base_dir = bucket_uri;
>
>   // FileSystemFactoryOptions options;
>   ds::ParquetFactoryOptions options;
>
>   std::string metadata_path = bucket_uri;
>
>   ds::FileSource source(bucket_uri, s3fs);
>   // auto factory = ds::ParquetDatasetFactory::Make(source, bucket_uri, fs, format, options).ValueOrDie();
>   auto factory = ds::ParquetDatasetFactory::Make(path, fs, format, options).ValueOrDie();
>
>   // auto factory = FileSystemDatasetFactory::Make(s3fs, selector, format, options).ValueOrDie();
>   return factory->Finish().ValueOrDie();
> }

Re: construct dataset for s3 by ParquetDatasetFactory failed

Posted by 1057445597 <10...@qq.com.INVALID>.
Thank you for your previous reply. I still have some questions I want to ask.



I found that the RecordBatchReader reads fewer rows at a time than each row_group contains, meaning that a row_group needs to be read twice by the RecordBatchReader. So what is the default batch size for a RecordBatchReader?
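
For reference, this is roughly how I am reading right now (a sketch; `dataset` comes from the FileSystemDatasetFactory as before, and I am only assuming that ScannerBuilder::BatchSize is the knob that controls the number of rows per batch):

```
// Sketch: scan the whole dataset through a RecordBatchReader.
auto builder = dataset->NewScan().ValueOrDie();
// Assumption on my side: BatchSize() controls rows per ReadNext() batch.
if (!builder->BatchSize(64 * 1024).ok()) { /* handle error */ }
auto scanner = builder->Finish().ValueOrDie();

auto reader = scanner->ToRecordBatchReader().ValueOrDie();
std::shared_ptr<arrow::RecordBatch> batch;
while (reader->ReadNext(&batch).ok() && batch != nullptr) {
  // batch->num_rows() is what I see being smaller than my row_group size
}
```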


Also, is there any good advice if I have to read by row_group? I have a lot of parquet files stored on S3. If I convert the scanner to a RecordBatchReader, I just loop over ReadNext(). But if I want to read by row_group, I find that I have to call `dataset->GetFragments()`, then iterate through the fragments and call SplitByRowGroups() to split each fragment again. A scanner is then constructed for each split fragment, and that scanner's ToTable() is called to read the data.
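
In code, the row_group path I describe looks roughly like this (a sketch; the downcast to ds::ParquetFileFragment and the literal(true) predicate are my guesses at how SplitByRowGroups() should be called):

```
// Sketch: read the dataset one row_group at a time.
namespace cp = arrow::compute;

auto fragments = dataset->GetFragments().ValueOrDie();
for (auto maybe_fragment : fragments) {
  auto fragment = maybe_fragment.ValueOrDie();
  // SplitByRowGroups() lives on ParquetFileFragment, so downcast
  // (all of my files are parquet).
  auto parquet_fragment =
      std::static_pointer_cast<ds::ParquetFileFragment>(fragment);
  auto row_group_fragments =
      parquet_fragment->SplitByRowGroups(cp::literal(true)).ValueOrDie();

  for (const auto& rg_fragment : row_group_fragments) {
    // One scanner per row_group fragment, then ToTable() for that row_group.
    ds::ScannerBuilder builder(dataset->schema(), rg_fragment,
                               std::make_shared<ds::ScanOptions>());
    auto scanner = builder.Finish().ValueOrDie();
    auto table = scanner->ToTable().ValueOrDie();
  }
}
```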


Finally, is there a performance difference between ToTable() and ReadNext()?






Re: construct dataset for s3 by ParquetDatasetFactory failed

Posted by Weston Pace <we...@gmail.com>.
ParquetDatasetFactory should only be used when you have a "_metadata"
file that describes which files are in your dataset.  Some dataset
creators (e.g. Dask) can create this file.  This saves time because
you do not have to list directories to find all the files in your
dataset.  This is described in the python docs[1] this way:

> Some processing frameworks such as Dask (optionally) use a _metadata file
> with partitioned datasets which includes information about the schema and
> the row group metadata of the full dataset. Using such a file can give a more
> efficient creation of a parquet Dataset, since it does not need to infer the
> schema and crawl the directories for all Parquet files (this is especially the
> case for filesystems where accessing files is expensive). The
> parquet_dataset() function allows us to create a Dataset from a partitioned
> dataset with a _metadata file:

You can only use ParquetDatasetFactory if you have one of these
"_metadata" files.

> The reason I want to use ParquetDatasetFactory is because using the
> FileSystemDatasetFactory process seems to as follows
> ...
> I want to filt some columns before ToTable(), But it seems that only
> struct table has the function of ColumnNames()

To get a list of columns in your dataset before you load the dataset
you can use the FileSystemDatasetFactory to create a Dataset and then
access the arrow::dataset::Dataset::schema property[2].
You can then pass the list of columns you want to read when you create
the scanner.

> FileSystemDatasetFactory--->get a dataset
> dataset->GetFragments--------->get fragments for parquet files in the folder
> for fragment in fragments ------->construct a scanner builder---->Finish()--->get a scanner
> scanner ->ToTable() --->get a table (read the file to memory)

You should not have to call dataset->GetFragments.  You should not
create a scanner from a fragment.  Instead you can create a scanner
from the dataset.
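
Putting those two points together, roughly (an untested sketch; `s3fs`, `selector`, and `format` are the ones from your snippet, and "a"/"b" stand in for whichever columns you actually want):

```
// Sketch: one dataset, inspect its schema, then scan only some columns,
// directly from the dataset (no GetFragments needed).
auto factory = ds::FileSystemDatasetFactory::Make(
                   s3fs, selector, format, ds::FileSystemFactoryOptions{})
                   .ValueOrDie();
auto dataset = factory->Finish().ValueOrDie();

// Column names are available before any data is read.
std::vector<std::string> all_columns = dataset->schema()->field_names();

auto builder = dataset->NewScan().ValueOrDie();
if (!builder->Project({"a", "b"}).ok()) { /* handle error */ }
auto scanner = builder->Finish().ValueOrDie();
auto table = scanner->ToTable().ValueOrDie();  // only the projected columns
```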

There are a few examples.  This example[3] shows how to do projection.
In the C++ API the selection of columns is sometimes called
"projection".  In the example I linked the code is loading all columns
AND one extra dynamic column (b_large).  However, you can also use the
same approach to load fewer columns.  You can see that `names` and
`exprs` are created.  These vectors define which columns will be
loaded.  To load fewer columns you would only add the columns you want
to these vectors.
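
Trimmed down to just selecting two existing columns, the names/exprs style from that example looks roughly like this (again a sketch; "a" and "b" are placeholders for your own column names):

```
// Sketch: expression-based projection, keeping only existing columns
// instead of adding a derived one.
namespace cp = arrow::compute;

std::vector<std::string> names = {"a", "b"};
std::vector<cp::Expression> exprs = {cp::field_ref("a"), cp::field_ref("b")};

auto builder = dataset->NewScan().ValueOrDie();
if (!builder->Project(exprs, names).ok()) { /* handle error */ }
auto scanner = builder->Finish().ValueOrDie();
auto table = scanner->ToTable().ValueOrDie();  // contains only "a" and "b"
```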

[1] https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets
[2] https://github.com/apache/arrow/blob/e453ffeff233c358ec934a53a33b8b4b1d4e299b/cpp/src/arrow/dataset/dataset.h#L151
[3] https://github.com/apache/arrow/blob/e453ffeff233c358ec934a53a33b8b4b1d4e299b/cpp/examples/arrow/dataset_documentation_example.cc#L244
