Posted to user@flink.apache.org by sohimankotia <so...@gmail.com> on 2017/11/21 13:46:25 UTC

How to Create Sample Data from HDFS File using Flink ?

Hi,

I have a directory in HDFS containing 20 files with 150 million records.

I just want a random sample of 20 million records from that directory.
I see that Flink has a few sampling implementations:
https://github.com/eBay/Flink/tree/master/flink-java/src/main/java/org/apache/flink/api/java/sampling

Can someone provide a code example showing how to use them?

Here is my code to read from the HDFS file:

	final org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat<LongWritable, Text> inputFormat =
			HadoopInputs.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, hdfsPath);

	final DataSource<Tuple2<LongWritable, Text>> input =
			environment.createInput(inputFormat).withParameters(configs);

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: How to Create Sample Data from HDFS File using Flink ?

Posted by Timo Walther <tw...@apache.org>.

Hi,

The sampling functions are exposed in
org.apache.flink.api.java.utils.DataSetUtils, so you can write
something like:

final HadoopInputFormat<LongWritable, Text> inputFormat = 
HadoopInputs.readHadoopFile(new TextInputFormat(), LongWritable.class, 
Text.class, hdfsPath);

final DataSet<Tuple2<LongWritable, Text>> input = 
environment.createInput(inputFormat).withParameters(configs);

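// sampleWithSize(input, withReplacement, numSamples) returns a fixed-size sample;
// sample(input, withReplacement, fraction) takes a fraction in [0, 1] instead.
final DataSet<Tuple2<LongWritable, Text>> output =
DataSetUtils.sampleWithSize(input, false, 20000000);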

output.print();
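
For intuition about what drawing a fixed number of records without replacement involves, here is a minimal plain-Java sketch of reservoir sampling (Algorithm R). It is only an illustration and should not be read as Flink's actual implementation; the class name ReservoirSampleSketch and the method sample are hypothetical and are not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical illustration class; not part of Flink.
class ReservoirSampleSketch {

    // Returns up to k records chosen uniformly at random, without replacement,
    // from a stream whose length is not known in advance (Algorithm R).
    static <T> List<T> sample(Iterator<T> records, int k, long seed) {
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k records fill the reservoir directly.
                reservoir.add(record);
            } else {
                // nextDouble() is in [0, 1), so slot is uniform in [0, seen).
                long slot = (long) (random.nextDouble() * seen);
                // The new record replaces an existing one with probability k / seen.
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Scaled-down stand-in for "20 million out of 150 million": 20 out of 150.
        Iterator<Integer> records = IntStream.range(0, 150).boxed().iterator();
        List<Integer> picked = sample(records, 20, 42L);
        System.out.println(picked.size() + " records sampled: " + picked);
    }
}
```

The sketch makes a single pass over the input and keeps only k records in memory, which is why this style of sampling suits inputs too large to load at once.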

Regards,
Timo



