Posted to user@flink.apache.org by Ruben Laguna <ru...@gmail.com> on 2020/11/09 08:27:48 UTC

Table SQL Filesystem CSV recursive directory traversal

Is it possible?

For Dataset I've found [1] :

Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);
// pass the configuration to the data source
DataSet<String> logs =
    env.readTextFile("file:///path/with.nested/files")
       .withParameters(parameters);


But can I achieve something similar with the Table SQL?

I have the following directory structure
/myfiles/20201010/00/00restoffilename1.csv
/myfiles/20201010/00/00restoffilename2.csv
...
/myfiles/20201010/00/00restoffilename3000.csv
/myfiles/20201010/01/01restoffilename1.csv
....
/myfiles/20201010/FF/FFrestoffilename3000.csv

So for each day I have 256 subdirectories, from 00 to FF, and each of those
directories can have 1000-3000 files; I would like to load all those files
in one go.

[1]:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#recursive-traversal-of-the-input-path-directory

-- 
/Rubén

Re: Table SQL Filesystem CSV recursive directory traversal

Posted by Danny Chan <da...@apache.org>.
In the current master code base, all FileInputFormats add files under the
given paths recursively by default (see e.g. the #addFilesInDir method).

So recursive traversal should be supported by default for SQL.
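To illustrate what recursive enumeration of an input path means here, the
sketch below walks a directory tree and collects every regular file in the
nested subdirectories. This is a simplified stand-in using plain java.nio,
not Flink's actual #addFilesInDir implementation; the class and method
names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecursiveEnumeration {

    // Collect every regular file under `root`, descending into
    // subdirectories, similar in spirit to recursive file enumeration
    // in a FileInputFormat.
    static List<Path> enumerate(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    files.addAll(enumerate(entry)); // recurse into subdirectory
                } else {
                    files.add(entry);
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny /myfiles/20201010/{00,01}/... layout in a temp dir.
        Path root = Files.createTempDirectory("myfiles");
        Path d0 = Files.createDirectories(root.resolve("20201010/00"));
        Path d1 = Files.createDirectories(root.resolve("20201010/01"));
        Files.createFile(d0.resolve("00restoffilename1.csv"));
        Files.createFile(d0.resolve("00restoffilename2.csv"));
        Files.createFile(d1.resolve("01restoffilename1.csv"));
        System.out.println(enumerate(root).size()); // prints 3
    }
}
```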


Re: Table SQL Filesystem CSV recursive directory traversal

Posted by Timo Walther <tw...@apache.org>.
Hi Ruben,

by looking at the code, it seems you should be able to do that. At least 
for batch workloads we are using 
org.apache.flink.formats.csv.CsvFileSystemFormatFactory.CsvInputFormat 
which is a FileInputFormat that supports the mentioned configuration option.

The problem is that this might not have been exposed via SQL properties 
yet. So you would need to write your own property-to-InputFormat factory 
that does it similar to:

https://github.com/apache/flink/blob/master/flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvFileSystemFormatFactory.java

What you could do is create your own factory that extends the one above so 
you can set additional properties. Not a nice solution, but a workaround 
for now.

More information on how to write your own factory can also be found here:

https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sourceSinks.html

I hope this helps.

Regards,
Timo
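
The extend-and-set-extra-properties workaround Timo describes can be
sketched abstractly as below. The interface and class names are simplified
stand-ins invented for this example; they are not Flink's real
CsvFileSystemFormatFactory API, which you would subclass instead.

```java
import java.util.HashMap;
import java.util.Map;

public class RecursiveCsvFactorySketch {

    // Stand-in for a property-to-InputFormat factory.
    interface FormatFactory {
        Map<String, String> resolveProperties(Map<String, String> userProps);
    }

    // Stand-in for the base CSV factory: passes user properties through.
    static class BaseCsvFactory implements FormatFactory {
        @Override
        public Map<String, String> resolveProperties(Map<String, String> userProps) {
            return new HashMap<>(userProps);
        }
    }

    // The workaround: extend the base factory and force the extra
    // property before the InputFormat is built from the properties.
    static class RecursiveCsvFactory extends BaseCsvFactory {
        @Override
        public Map<String, String> resolveProperties(Map<String, String> userProps) {
            Map<String, String> props = super.resolveProperties(userProps);
            props.put("recursive.file.enumeration", "true");
            return props;
        }
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("format", "csv");
        Map<String, String> resolved =
            new RecursiveCsvFactory().resolveProperties(user);
        System.out.println(resolved.get("recursive.file.enumeration")); // prints true
    }
}
```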
