Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2017/02/17 08:32:34 UTC

CSV sink partitioning and bucketing

Hi to all,
in my use case I need to write my Row objects to an output folder as
CSV on HDFS, creating/overwriting subfolders based on an attribute
(for example, one subfolder per value of a specified column).
It would also be useful to bucket the data inside those folders by
number of lines, i.e. no file inside those directories may contain more
than 1000 lines.

For example, if I have a dataset (of Row) containing people, I need to write
it as CSV into an output folder X partitioned by year (where each
file cannot have more than 1000 rows), like:

X/1990/file1
   /1990/file2
   /1991/file1
etc..
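The layout rule I have in mind can be sketched in plain Java (class and method names are illustrative, not an existing Flink API): one subfolder per year, with a new file started whenever the current one reaches 1000 lines.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the desired bucketing rule: records are grouped into a
// subfolder per key (year), and a new file is started whenever the
// current one reaches 1000 lines. Names here are illustrative only.
public class BucketPathSketch {
    private static final int MAX_LINES_PER_FILE = 1000;
    private final Map<String, Integer> linesWritten = new HashMap<>();

    // Returns the relative path (e.g. "1990/file1") that the next
    // record for this year should be appended to.
    public String pathFor(String year) {
        int count = linesWritten.merge(year, 1, Integer::sum);
        int fileIndex = (count - 1) / MAX_LINES_PER_FILE + 1; // 1-based
        return year + "/file" + fileIndex;
    }
}
```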

Does something like that exist in Flink?
In principle I could use Hive for this, but at the moment I'd rather avoid
adding another component to our pipeline... Moreover, my feeling is that
very few people are using Flink with Hive... am I wrong?
Any advice on how to proceed?

Best,
Flavio

Re: CSV sink partitioning and bucketing

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Flavio,

Flink does not come with an OutputFormat that creates buckets, but it
should not be too hard to implement one.

However, if you want a solution fast, I would try the following approach:
1) Search for a Hadoop OutputFormat that buckets Strings based on a key
(<Key, String>).
2) Implement a mapper that converts Row into a String and extracts the key.
3) Use the Hadoop OutputFormat with Flink's HadoopOutputFormat wrapper.

Depending on the output format you might want to partition and sort the
data on the key before writing it out.
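Step 2 above can be sketched in plain Java (the record layout and method names are illustrative; a real implementation would use Flink's Row type and run inside a MapFunction): turn a record into a (key, csvLine) pair, where the key is the partition column the Hadoop OutputFormat will bucket on.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Illustrative sketch of the mapper from step 2: convert a record's
// fields into one CSV line and extract the partition key (here: the
// year column), producing the (key, value) pair a key-bucketing
// Hadoop OutputFormat expects.
public class RowToKeyedCsv {
    public static Map.Entry<String, String> toKeyedCsv(Object[] fields, int keyIndex) {
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) csv.append(',');
            csv.append(fields[i]);
        }
        return new SimpleEntry<>(String.valueOf(fields[keyIndex]), csv.toString());
    }
}
```

Note this sketch does no CSV quoting or escaping; fields containing commas or newlines would need that before this approach is safe on real data.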

Best, Fabian
