Posted to user@flink.apache.org by Alexandros Papadopoulos <al...@gmail.com> on 2014/09/22 12:12:00 UTC

TPC-H Benchmark

Hello all,

   I am trying to run some relational queries on Flink over YARN.
I found two repos (https://github.com/stratosphere/stratosphere-tpch,
https://github.com/project-flink/flink-perf ) with Java and Scala
implementations of some of the benchmark queries.
Running some of them at scale factor 64, reading the dataset
seems to be the bottleneck.
Since I am new to the Flink community: is there a way to implement those
queries more efficiently?
Also, are there any results of this benchmark for Flink on YARN?

Thanks in advance,

Alex

Re: TPC-H Benchmark

Posted by Robert Metzger <rm...@apache.org>.
Hi Alex,

The "stratosphere-tpch" programs are written against our old Scala API and we
haven't really fine-tuned them, so they may not be optimally implemented.

We haven't benchmarked Flink explicitly on YARN, but I don't expect the
results to differ from non-YARN setups. We use YARN just for
deploying our JobManager and TaskManagers and then run everything as we
do with direct installations.
The execution is exactly the same for YARN and non-YARN setups.





Re: TPC-H Benchmark

Posted by Fabian Hueske <fh...@apache.org>.
Hi Alex,

these jobs are implemented so that they read text data from HDFS.
This is a very inefficient (yet very portable and easy-to-use) format for
reading relational data.
There are several formats that are much better suited to reading relational
data, such as Hive's ORC or Parquet (also in the Apache Incubator).

The performance problems with text files are manifold:
- The data representation is not native but must be parsed (CPU intensive).
- The data representation is inefficient (an integer might need several
characters where 4 bytes would suffice).
- All data must be read, even columns that are not used by the query.
- There is no support for pushing filters down for early filtering.
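To make the first two points concrete, here is a minimal, self-contained Java sketch (not from the original thread; class and method names are hypothetical) comparing a text encoding of an integer, which must be re-parsed on every read, with a fixed 4-byte binary encoding:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodingCost {
    // Text encoding: the integer is written as ASCII digits, so it is
    // variable-length and must be parsed back on every read.
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    }

    // Binary encoding: a fixed 4-byte representation that can be read
    // directly, with no parsing.
    static byte[] asBinary(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        int orderKey = 1_234_567_890;
        byte[] text = asText(orderKey);
        byte[] binary = asBinary(orderKey);
        System.out.println("text bytes:   " + text.length);   // 10 digits -> 10 bytes
        System.out.println("binary bytes: " + binary.length); // always 4 bytes
        // Reading back: text needs a parse step, binary is a direct load.
        int fromText = Integer.parseInt(new String(text, StandardCharsets.US_ASCII));
        int fromBinary = ByteBuffer.wrap(binary).getInt();
        System.out.println(fromText == orderKey && fromBinary == orderKey); // true
    }
}
```

Columnar formats like ORC and Parquet store values in such native binary layouts, which is one reason they read much faster than delimited text.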

You could port the jobs to use an ORC or Parquet format. Either use
Hadoop's InputFormats (Flink supports those) or port them to Flink
InputFormats (which are very similar to Hadoop's). Using Hadoop's formats
might have a little overhead but will be easier...
Having said that, it is not uncommon that I/O is the bottleneck in data
processing systems.
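As a rough illustration of the column-pruning point above (a sketch, not code from the benchmark repos; the class name is made up and the row is a shortened lineitem-style example), a text reader must scan and split every '|'-delimited field of a row even when the query needs only one column:

```java
public class ProjectExtendedPrice {
    // Extract only l_extendedprice (the 6th '|'-delimited field of a
    // lineitem row). The other fields are discarded, but they still had
    // to be read and split -- exactly the overhead a columnar format
    // such as ORC or Parquet avoids by storing each column separately.
    static double extendedPrice(String lineitemRow) {
        String[] fields = lineitemRow.split("\\|"); // '|' is regex-special, so escape it
        return Double.parseDouble(fields[5]);
    }

    public static void main(String[] args) {
        // A shortened, made-up row in TPC-H's pipe-delimited text layout.
        String row = "1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13";
        System.out.println(extendedPrice(row)); // prints 21168.23
    }
}
```

With a columnar input format, only the bytes of the requested column are read from disk, so the cost of the unused columns disappears entirely.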

Let us know, if you need any help.

Cheers, Fabian

