Posted to dev@flink.apache.org by Hawin Jiang <ha...@gmail.com> on 2015/06/04 07:48:42 UTC

Apache Flink transactions

Hi Admin,

 

Do we have insert, update, and remove operations in Apache Flink?

For example: I have 10 million records in my test file. I want to add one
record, update one record, and remove one record from this test file.

How can I implement this with Flink?

Thanks.

Best regards

Email: hawin.jiang@gmail.com


Re: Apache Flink transactions

Posted by Fabian Hueske <fh...@gmail.com>.
Comparing the performance of systems is not easy, and the results depend on
many factors, such as the configuration, the data, and the jobs.

That being said, the numbers that Bill reported for WordCount make complete
sense, as Stephan pointed out in his response (Flink does not feature
hash-based aggregations yet).
So there are definitely use cases where Spark outperforms Flink, but there
are also cases where both systems perform similarly or Flink is faster.
For example, more complex jobs benefit a lot from Flink's pipelined
execution, and Flink's built-in iterations are very fast, especially
delta iterations.

Best, Fabian



2015-06-10 0:53 GMT+02:00 Hawin Jiang <ha...@gmail.com>:

> Hey Aljoscha
>
> I also sent an email to Bill asking for the latest test results. Judging
> from Bill's email, Apache Spark's performance looks better than Flink's.
> What are your thoughts?
>
>
>
> Best regards
> Hawin
>
>
>
> On Tue, Jun 9, 2015 at 2:29 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> Hi,
>> we don't have any current performance numbers. But the queries mentioned
>> on the benchmark page should be easy to implement in Flink. It could be
>> interesting if someone ported these queries and ran them with exactly the
>> same data on the same machines.
>>
>> Bill Sparks wrote on the mailing list some days ago (
>> http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@cray.com%3e).
>> He seems to be running some tests to compare Flink, Spark and MapReduce.
>>
>> Regards,
>> Aljoscha
>>
>> On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang <ha...@gmail.com>
>> wrote:
>>
>>> Hi Aljoscha
>>>
>>> I want to know what is the apache flink performance if I run the same
>>> SQL as below.
>>> Do you have any apache flink benchmark information?
>>> Such as: https://amplab.cs.berkeley.edu/benchmark/
>>> Thanks.
>>>
>>>
>>>
>>> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>>>
>>> Query 1A: 32,888 results; Query 1B: 3,331,851 results; Query 1C:
>>> 89,974,976 results.
>>> [Benchmark chart elided: median response times in seconds for Redshift,
>>> Impala, Shark, Hive, and Tez; see the AMPLab benchmark page.]
>>>
>>>
>>> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek <al...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> actually, what do you want to know about Flink SQL?
>>>>
>>>> Aljoscha
>>>>
>>>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <ha...@gmail.com>
>>>> wrote:
>>>> > Thanks all
>>>> >
>>>> > Actually, I want to know more info about Flink SQL and Flink
>>>> performance
>>>> > Here is the Spark benchmark. Maybe you already saw it before.
>>>> > https://amplab.cs.berkeley.edu/benchmark/
>>>> >
>>>> > Thanks.
>>>> >
>>>> >
>>>> >
>>>> > Best regards
>>>> > Hawin
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> If you want to append data to a data set that is store as files
>>>> (e.g., on
>>>> >> HDFS), you can go for a directory structure as follows:
>>>> >>
>>>> >> dataSetRootFolder
>>>> >>   - part1
>>>> >>     - 1
>>>> >>     - 2
>>>> >>     - ...
>>>> >>   - part2
>>>> >>     - 1
>>>> >>     - ...
>>>> >>   - partX
>>>> >>
>>>> >> Flink's file format supports recursive directory scans such that you
>>>> can
>>>> >> add new subfolders to dataSetRootFolder and read the full data set.
>>>> >>
>>>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>>>> >>>
>>>> >>> Hi,
>>>> >>> I think the example could be made more concise by using the Table
>>>> API.
>>>> >>>
>>>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>>> >>>
>>>> >>> Please let us know if you have questions about that, it is still
>>>> quite
>>>> >>> new.
>>>> >>>
>>>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com>
>>>> wrote:
>>>> >>> > Hi Aljoscha
>>>> >>> >
>>>> >>> > Thanks for your reply.
>>>> >>> > Do you have any tips for Flink SQL.
>>>> >>> > I know that Spark support ORC format. How about Flink SQL?
>>>> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231
>>>> lines of
>>>> >>> > code.
>>>> >>> > How to make that as simple as possible by flink.
>>>> >>> > I am going to use Flink in my future project.  Sorry for so many
>>>> >>> > questions.
>>>> >>> > I believe that you guys will make a world difference.
>>>> >>> >
>>>> >>> >
>>>> >>> > @Chiwan
>>>> >>> > You made a very good example for me.
>>>> >>> > Thanks a lot
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > --
>>>> >>> > View this message in context:
>>>> >>> >
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>>> >>> > archive at Nabble.com.
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
>

Re: Apache Flink transactions

Posted by Hawin Jiang <ha...@gmail.com>.
Hey Aljoscha

I also sent an email to Bill asking for the latest test results. Judging
from Bill's email, Apache Spark's performance looks better than Flink's.
What are your thoughts?



Best regards
Hawin



On Tue, Jun 9, 2015 at 2:29 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> we don't have any current performance numbers. But the queries mentioned
> on the benchmark page should be easy to implement in Flink. It could be
> interesting if someone ported these queries and ran them with exactly the
> same data on the same machines.
>
> Bill Sparks wrote on the mailing list some days ago (
> http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@cray.com%3e).
> He seems to be running some tests to compare Flink, Spark and MapReduce.
>
> Regards,
> Aljoscha
>
> On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang <ha...@gmail.com> wrote:
>
>> Hi Aljoscha
>>
>> I want to know what is the apache flink performance if I run the same SQL
>> as below.
>> Do you have any apache flink benchmark information?
>> Such as: https://amplab.cs.berkeley.edu/benchmark/
>> Thanks.
>>
>>
>>
>> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>>
>> Query 1A: 32,888 results; Query 1B: 3,331,851 results; Query 1C:
>> 89,974,976 results.
>> [Benchmark chart elided: median response times in seconds for Redshift,
>> Impala, Shark, Hive, and Tez; see the AMPLab benchmark page.]
>>
>>
>> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek <al...@apache.org>
>> wrote:
>>
>>> Hi,
>>> actually, what do you want to know about Flink SQL?
>>>
>>> Aljoscha
>>>
>>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <ha...@gmail.com>
>>> wrote:
>>> > Thanks all
>>> >
>>> > Actually, I want to know more info about Flink SQL and Flink
>>> performance
>>> > Here is the Spark benchmark. Maybe you already saw it before.
>>> > https://amplab.cs.berkeley.edu/benchmark/
>>> >
>>> > Thanks.
>>> >
>>> >
>>> >
>>> > Best regards
>>> > Hawin
>>> >
>>> >
>>> >
>>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com>
>>> wrote:
>>> >>
>>> >> If you want to append data to a data set that is store as files
>>> (e.g., on
>>> >> HDFS), you can go for a directory structure as follows:
>>> >>
>>> >> dataSetRootFolder
>>> >>   - part1
>>> >>     - 1
>>> >>     - 2
>>> >>     - ...
>>> >>   - part2
>>> >>     - 1
>>> >>     - ...
>>> >>   - partX
>>> >>
>>> >> Flink's file format supports recursive directory scans such that you
>>> can
>>> >> add new subfolders to dataSetRootFolder and read the full data set.
>>> >>
>>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>>> >>>
>>> >>> Hi,
>>> >>> I think the example could be made more concise by using the Table
>>> API.
>>> >>>
>>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>> >>>
>>> >>> Please let us know if you have questions about that, it is still
>>> quite
>>> >>> new.
>>> >>>
>>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
>>> >>> > Hi Aljoscha
>>> >>> >
>>> >>> > Thanks for your reply.
>>> >>> > Do you have any tips for Flink SQL.
>>> >>> > I know that Spark support ORC format. How about Flink SQL?
>>> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines
>>> of
>>> >>> > code.
>>> >>> > How to make that as simple as possible by flink.
>>> >>> > I am going to use Flink in my future project.  Sorry for so many
>>> >>> > questions.
>>> >>> > I believe that you guys will make a world difference.
>>> >>> >
>>> >>> >
>>> >>> > @Chiwan
>>> >>> > You made a very good example for me.
>>> >>> > Thanks a lot
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > View this message in context:
>>> >>> >
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>> >>> > archive at Nabble.com.
>>> >>
>>> >>
>>> >
>>>
>>
>>
>


Re: Apache Flink transactions

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
we don't have any current performance numbers. But the queries mentioned on
the benchmark page should be easy to implement in Flink. It could be
interesting if someone ported these queries and ran them with exactly the
same data on the same machines.
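For illustration, the benchmark's Query 1 could be ported to Flink's Scala
DataSet API roughly as follows. This is only a sketch: the input path, the
CSV layout, and the threshold X are assumptions, not code from this thread.

```scala
import org.apache.flink.api.scala._

// Assumed schema of the benchmark's "rankings" table.
case class Ranking(pageURL: String, pageRank: Int, avgDuration: Int)

object AmplabQuery1 {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Path is a placeholder; the benchmark stores rankings as CSV-like files.
    val rankings = env.readCsvFile[Ranking]("hdfs:///benchmark/rankings")
    val x = 1000 // the benchmark's threshold X (assumed value)
    // SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
    val result = rankings
      .filter(_.pageRank > x)
      .map(r => (r.pageURL, r.pageRank))
    result.writeAsCsv("hdfs:///benchmark/query1-result")
    env.execute("AMPLab Query 1")
  }
}
```

Running the same job on the same data and hardware as the published
benchmark would be needed before drawing any performance conclusions.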

Bill Sparks wrote on the mailing list some days ago (
http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@cray.com%3e).
He seems to be running some tests to compare Flink, Spark and MapReduce.

Regards,
Aljoscha

On Mon, Jun 8, 2015 at 9:09 PM, Hawin Jiang <ha...@gmail.com> wrote:

> Hi Aljoscha
>
> I want to know what is the apache flink performance if I run the same SQL
> as below.
> Do you have any apache flink benchmark information?
> Such as: https://amplab.cs.berkeley.edu/benchmark/
> Thanks.
>
>
>
> SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
>
> Query 1A: 32,888 results; Query 1B: 3,331,851 results; Query 1C:
> 89,974,976 results.
> [Benchmark chart elided: median response times in seconds for Redshift,
> Impala, Shark, Hive, and Tez; see the AMPLab benchmark page.]
>
>
> On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> Hi,
>> actually, what do you want to know about Flink SQL?
>>
>> Aljoscha
>>
>> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <ha...@gmail.com>
>> wrote:
>> > Thanks all
>> >
>> > Actually, I want to know more info about Flink SQL and Flink performance
>> > Here is the Spark benchmark. Maybe you already saw it before.
>> > https://amplab.cs.berkeley.edu/benchmark/
>> >
>> > Thanks.
>> >
>> >
>> >
>> > Best regards
>> > Hawin
>> >
>> >
>> >
>> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com>
>> wrote:
>> >>
>> >> If you want to append data to a data set that is store as files (e.g.,
>> on
>> >> HDFS), you can go for a directory structure as follows:
>> >>
>> >> dataSetRootFolder
>> >>   - part1
>> >>     - 1
>> >>     - 2
>> >>     - ...
>> >>   - part2
>> >>     - 1
>> >>     - ...
>> >>   - partX
>> >>
>> >> Flink's file format supports recursive directory scans such that you
>> can
>> >> add new subfolders to dataSetRootFolder and read the full data set.
>> >>
>> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>> >>>
>> >>> Hi,
>> >>> I think the example could be made more concise by using the Table API.
>> >>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>> >>>
>> >>> Please let us know if you have questions about that, it is still quite
>> >>> new.
>> >>>
>> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
>> >>> > Hi Aljoscha
>> >>> >
>> >>> > Thanks for your reply.
>> >>> > Do you have any tips for Flink SQL.
>> >>> > I know that Spark support ORC format. How about Flink SQL?
>> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines
>> of
>> >>> > code.
>> >>> > How to make that as simple as possible by flink.
>> >>> > I am going to use Flink in my future project.  Sorry for so many
>> >>> > questions.
>> >>> > I believe that you guys will make a world difference.
>> >>> >
>> >>> >
>> >>> > @Chiwan
>> >>> > You made a very good example for me.
>> >>> > Thanks a lot
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > View this message in context:
>> >>> >
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
>> >>> > archive at Nabble.com.
>> >>
>> >>
>> >
>>
>
>

Re: Apache Flink transactions

Posted by Hawin Jiang <ha...@gmail.com>.
Hi Aljoscha

I want to know how Apache Flink would perform if I ran the same SQL as
below.
Do you have any Apache Flink benchmark information?
Such as: https://amplab.cs.berkeley.edu/benchmark/
Thanks.



SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Query 1A: 32,888 results
Query 1B: 3,331,851 results
Query 1C: 89,974,976 results

[Benchmark charts elided. Median response time in seconds per query,
old data:]

System                    | 1A     | 1B     | 1C
--------------------------|--------|--------|-------
Redshift (HDD) - Current  | 2.49   | 2.61   | 9.46
Impala - Disk - 1.2.3     | 12.015 | 12.015 | 37.085
Impala - Mem - 1.2.3      | 2.17   | 3.01   | 36.04
Shark - Disk - 0.8.1      | 6.6    | 7      | 22.4
Shark - Mem - 0.8.1       | 1.7    | 1.8    | 3.6
Hive - 0.12 YARN          | 50.49  | 59.93  | 43.34
Tez - 0.2.0               | 28.22  | 36.35  | 26.44


On Mon, Jun 8, 2015 at 2:03 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> actually, what do you want to know about Flink SQL?
>
> Aljoscha
>
> On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <ha...@gmail.com> wrote:
> > Thanks all
> >
> > Actually, I want to know more info about Flink SQL and Flink performance
> > Here is the Spark benchmark. Maybe you already saw it before.
> > https://amplab.cs.berkeley.edu/benchmark/
> >
> > Thanks.
> >
> >
> >
> > Best regards
> > Hawin
> >
> >
> >
> > On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com> wrote:
> >>
> >> If you want to append data to a data set that is store as files (e.g.,
> on
> >> HDFS), you can go for a directory structure as follows:
> >>
> >> dataSetRootFolder
> >>   - part1
> >>     - 1
> >>     - 2
> >>     - ...
> >>   - part2
> >>     - 1
> >>     - ...
> >>   - partX
> >>
> >> Flink's file format supports recursive directory scans such that you can
> >> add new subfolders to dataSetRootFolder and read the full data set.
> >>
> >> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
> >>>
> >>> Hi,
> >>> I think the example could be made more concise by using the Table API.
> >>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
> >>>
> >>> Please let us know if you have questions about that, it is still quite
> >>> new.
> >>>
> >>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
> >>> > Hi Aljoscha
> >>> >
> >>> > Thanks for your reply.
> >>> > Do you have any tips for Flink SQL.
> >>> > I know that Spark support ORC format. How about Flink SQL?
> >>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines of
> >>> > code.
> >>> > How to make that as simple as possible by flink.
> >>> > I am going to use Flink in my future project.  Sorry for so many
> >>> > questions.
> >>> > I believe that you guys will make a world difference.
> >>> >
> >>> >
> >>> > @Chiwan
> >>> > You made a very good example for me.
> >>> > Thanks a lot
> >>> >
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > View this message in context:
> >>> >
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
> >>> > Sent from the Apache Flink User Mailing List archive. mailing list
> >>> > archive at Nabble.com.
> >>
> >>
> >
>

Re: Apache Flink transactions

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
actually, what do you want to know about Flink SQL?

Aljoscha

On Sat, Jun 6, 2015 at 2:22 AM, Hawin Jiang <ha...@gmail.com> wrote:
> Thanks all
>
> Actually, I want to know more info about Flink SQL and Flink performance
> Here is the Spark benchmark. Maybe you already saw it before.
> https://amplab.cs.berkeley.edu/benchmark/
>
> Thanks.
>
>
>
> Best regards
> Hawin
>
>
>
> On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com> wrote:
>>
>> If you want to append data to a data set that is store as files (e.g., on
>> HDFS), you can go for a directory structure as follows:
>>
>> dataSetRootFolder
>>   - part1
>>     - 1
>>     - 2
>>     - ...
>>   - part2
>>     - 1
>>     - ...
>>   - partX
>>
>> Flink's file format supports recursive directory scans such that you can
>> add new subfolders to dataSetRootFolder and read the full data set.
>>
>> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>>>
>>> Hi,
>>> I think the example could be made more concise by using the Table API.
>>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>>
>>> Please let us know if you have questions about that, it is still quite
>>> new.
>>>
>>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
>>> > Hi Aljoscha
>>> >
>>> > Thanks for your reply.
>>> > Do you have any tips for Flink SQL.
>>> > I know that Spark support ORC format. How about Flink SQL?
>>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines of
>>> > code.
>>> > How to make that as simple as possible by flink.
>>> > I am going to use Flink in my future project.  Sorry for so many
>>> > questions.
>>> > I believe that you guys will make a world difference.
>>> >
>>> >
>>> > @Chiwan
>>> > You made a very good example for me.
>>> > Thanks a lot
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>>> > Sent from the Apache Flink User Mailing List archive. mailing list
>>> > archive at Nabble.com.
>>
>>
>

Re: Apache Flink transactions

Posted by Hawin Jiang <ha...@gmail.com>.
Thanks all

Actually, I want to know more info about Flink SQL and Flink performance
Here is the Spark benchmark. Maybe you already saw it before.
https://amplab.cs.berkeley.edu/benchmark/

Thanks.



Best regards
Hawin



On Fri, Jun 5, 2015 at 1:35 AM, Fabian Hueske <fh...@gmail.com> wrote:

> If you want to append data to a data set that is store as files (e.g., on
> HDFS), you can go for a directory structure as follows:
>
> dataSetRootFolder
>   - part1
>     - 1
>     - 2
>     - ...
>   - part2
>     - 1
>     - ...
>   - partX
>
> Flink's file format supports recursive directory scans such that you can
> add new subfolders to dataSetRootFolder and read the full data set.
>
> 2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:
>
>> Hi,
>> I think the example could be made more concise by using the Table API.
>> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>>
>> Please let us know if you have questions about that, it is still quite
>> new.
>>
>> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
>> > Hi Aljoscha
>> >
>> > Thanks for your reply.
>> > Do you have any tips for Flink SQL.
>> > I know that Spark support ORC format. How about Flink SQL?
>> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines of
>> code.
>> > How to make that as simple as possible by flink.
>> > I am going to use Flink in my future project.  Sorry for so many
>> questions.
>> > I believe that you guys will make a world difference.
>> >
>> >
>> > @Chiwan
>> > You made a very good example for me.
>> > Thanks a lot
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
>> > Sent from the Apache Flink User Mailing List archive. mailing list
>> archive at Nabble.com.
>>
>
>

Re: Apache Flink transactions

Posted by Fabian Hueske <fh...@gmail.com>.
If you want to append data to a data set that is stored as files (e.g., on
HDFS), you can use a directory structure as follows:

dataSetRootFolder
  - part1
    - 1
    - 2
    - ...
  - part2
    - 1
    - ...
  - partX

Flink's file format supports recursive directory scans such that you can
add new subfolders to dataSetRootFolder and read the full data set.
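A minimal sketch of reading such a nested layout follows. The
`recursive.file.enumeration` flag is the documented input-format parameter
for this; the concrete paths are assumptions.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object RecursiveScan {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Tell the input format to descend into part1/, part2/, ... subfolders.
    val params = new Configuration()
    params.setBoolean("recursive.file.enumeration", true)
    val fullDataSet = env
      .readTextFile("hdfs:///dataSetRootFolder")
      .withParameters(params)
    fullDataSet.writeAsText("hdfs:///output/fullDataSet")
    env.execute("recursive directory scan")
  }
}
```

With this in place, "inserting" data amounts to dropping a new subfolder
under dataSetRootFolder; the next job run picks it up automatically.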

2015-06-05 9:58 GMT+02:00 Aljoscha Krettek <al...@apache.org>:

> Hi,
> I think the example could be made more concise by using the Table API.
> http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
>
> Please let us know if you have questions about that, it is still quite new.
>
> On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
> > Hi Aljoscha
> >
> > Thanks for your reply.
> > Do you have any tips for Flink SQL.
> > I know that Spark support ORC format. How about Flink SQL?
> > BTW, for TPCHQuery10 example, you have implemented it by 231 lines of
> code.
> > How to make that as simple as possible by flink.
> > I am going to use Flink in my future project.  Sorry for so many
> questions.
> > I believe that you guys will make a world difference.
> >
> >
> > @Chiwan
> > You made a very good example for me.
> > Thanks a lot
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
> > Sent from the Apache Flink User Mailing List archive. mailing list
> archive at Nabble.com.
>

Re: Apache Flink transactions

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
I think the example could be made more concise by using the Table API.
http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

Please let us know if you have questions about that, it is still quite new.
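To give an idea, a filter-and-project query could be sketched with the
Table API of that era roughly like this. The schema and expressions are
assumptions for illustration, not code from this thread.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._

case class Ranking(pageURL: String, pageRank: Int)

object TableApiSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val rankings = env.fromElements(
      Ranking("a.example.com", 10),
      Ranking("b.example.com", 99))
    // Convert the DataSet to a Table and express the query declaratively,
    // instead of hand-writing filter/map transformations.
    val result = rankings.toTable
      .where('pageRank > 50)
      .select('pageURL, 'pageRank)
      .toDataSet[Ranking]
    result.print()
  }
}
```

The declarative form stays much shorter than the equivalent hand-rolled
DataSet program, which is the point being made about TPCHQuery10.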

On Fri, Jun 5, 2015 at 9:03 AM, hawin <ha...@gmail.com> wrote:
> Hi Aljoscha
>
> Thanks for your reply.
> Do you have any tips for Flink SQL.
> I know that Spark support ORC format. How about Flink SQL?
> BTW, for TPCHQuery10 example, you have implemented it by 231 lines of code.
> How to make that as simple as possible by flink.
> I am going to use Flink in my future project.  Sorry for so many questions.
> I believe that you guys will make a world difference.
>
>
> @Chiwan
> You made a very good example for me.
> Thanks a lot
>
>
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Apache Flink transactions

Posted by hawin <ha...@gmail.com>.
Hi Aljoscha

Thanks for your reply.
Do you have any tips for Flink SQL?
I know that Spark supports the ORC format. How about Flink SQL?
BTW, the TPCHQuery10 example is implemented in 231 lines of code.
How can it be made as simple as possible with Flink?
I am going to use Flink in my future project.  Sorry for so many questions.
I believe that you guys will make a world difference.


@Chiwan 
You made a very good example for me.  
Thanks a lot





--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1494.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Apache Flink transactions

Posted by Aljoscha Krettek <al...@apache.org>.
Yes, this code seems very reasonable. :D

The way to use this to "modify" a file on HDFS is to read the file,
filter out some elements, and write a new, modified file that does not
contain the filtered-out elements. As said before, neither Flink nor
HDFS allows in-place modification of files.
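That read-filter-rewrite cycle could look roughly like this; the paths and
the filter predicate are assumptions for illustration.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object RewriteWithoutDeleted {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Read the existing file, drop the records to "delete",
    // and write the surviving records to a new file.
    val lines = env.readTextFile("hdfs:///data/records.txt")
    val kept = lines.filter(line => !line.startsWith("obsolete"))
    kept.writeAsText("hdfs:///data/records.v2.txt", WriteMode.OVERWRITE)
    env.execute("rewrite file without deleted records")
  }
}
```

Downstream jobs then read records.v2.txt; the original file can be kept
or removed out of band once the new version is in place.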

On Fri, Jun 5, 2015 at 4:55 AM, Chiwan Park <ch...@icloud.com> wrote:
> Basically, Flink uses the data model of functional programming: every DataSet is immutable. This means we cannot modify a DataSet; we can only create a new DataSet with the modification applied. Update and delete queries are expressed by writing out a filtered DataSet.
> The following Scala sample shows select, insert, update, and remove queries in Flink. (I'm not sure this is best practice.)
>
> case class MyType(id: Int, value1: String, value2: String)
>
> // load data (you can use readCsvFile, or something else.)
> val data = env.fromElements(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))
>
> // selecting
> // same as SELECT * FROM data WHERE id = 1
> val selectedData1 = data.filter(_.id == 1)
> // same as SELECT value1 FROM data WHERE id = 1
> val selectedData2 = data.filter(_.id == 1).map(_.value1)
>
> // removing is the same as selecting, as follows
> // same as DELETE FROM data WHERE id = 1; the DataSet data is not changed, the result is removedData
> val removedData = data.filter(_.id != 1)
>
> // inserting
> // same as INSERT INTO data (id, value1, value2) VALUES (3, "new", "data")
> val newData = env.fromElements(MyType(3, "new", "data"))
> val insertedData = data.union(newData)
>
> // updating
> // same as UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1; the DataSet data is not changed.
> val updatedData = data.map { x => if (x.id == 1) MyType(x.id, "updated", "data") else x }
>
> Regards,
> Chiwan Park
>
>> On Jun 5, 2015, at 9:22 AM, hawin <ha...@gmail.com> wrote:
>>
>> Hi  Chiwan
>>
>> Thanks for your information.  I knew Flink is not DBMS. I want to know what
>> is the flink way to select, insert, update and delete data on HDFS.
>>
>>
>> @Till
>> Maybe union is a way to insert data. But I think it will cost some
>> performance issue.
>>
>>
>> @Stephan
>> Thanks for your suggestion.  I have checked apache flink roadmap.  SQL on
>> flink will be released on Q3/Q4 2015. Will it support insertion, deletion
>> and update data on HDFS?
>> You guys already provided a nice example for selecting data on HDFS.  Such
>> as: TPCHQuery10 and TPCHQuery3.
>> Do you have other examples for inserting, updating and removing data on HDFS
>> by Apache flink
>>
>> Thanks
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1491.html
>> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
>
>
>
>

Re: Apache Flink transactions

Posted by Chiwan Park <ch...@icloud.com>.
Basically, Flink uses the data model of functional programming. All DataSets are immutable. This means we cannot modify a DataSet; we can only create a new DataSet with the modification applied. Update and delete queries are represented by writing out a filtered DataSet.
The following Scala sample shows select, insert, update, and remove queries in Flink. (I’m not sure this is best practice.)

case class MyType(id: Int, value1: String, value2: String)

// load data (you can use readCsvFile, or something else.)
val data = env.fromElements(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))

// selecting
// same as SELECT * FROM data WHERE id = 1
val selectedData1 = data.filter(_.id == 1)
// same as SELECT value1 FROM data WHERE id = 1
val selectedData2 = data.filter(_.id == 1).map(_.value1)

// removing is same as selecting such as following
// same as DELETE FROM data WHERE id = 1, but DataSet data is not changed. the result is removedData
val removedData = data.filter(_.id != 1)

// inserting
// same as INSERT INTO data (id, value1, value2) VALUES (3, “new”, “data”)
val newData = env.fromElements(MyType(3, "new", "data"))
val insertedData = data.union(newData) 

// updating
// UPDATE data SET value1 = “updated”, value2 = “data” WHERE id = 1, but DataSet data is not changed.
val updatedData = data.map { x => if (x.id == 1) MyType(x.id, "updated", "data") else x }
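
As a quick local sanity check of the logic above: plain Scala collections expose the same combinators (filter, map, and ++ for union) that the DataSet API mirrors, so the queries can be tried without a cluster. This is only an analogue of the Flink job, not Flink code:

```scala
// Local sketch (plain Scala collections, not Flink): the operators used above --
// filter, map, union -- mirror the standard immutable-collection combinators,
// so the query logic can be sanity-checked without a cluster.
object ImmutableQueryDemo {
  case class MyType(id: Int, value1: String, value2: String)

  val data = Seq(
    MyType(0, "test", "test2"),
    MyType(1, "hello", "flink"),
    MyType(2, "flink", "good"))

  // SELECT * FROM data WHERE id = 1
  val selected = data.filter(_.id == 1)

  // DELETE FROM data WHERE id = 1 -- `data` itself is unchanged
  val removed = data.filter(_.id != 1)

  // INSERT INTO data VALUES (3, "new", "data") -- union with a new dataset
  val inserted = data ++ Seq(MyType(3, "new", "data"))

  // UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1
  val updated = data.map(x => if (x.id == 1) MyType(x.id, "updated", "data") else x)
}
```

Note that after every "query", the original `data` value still holds all three records; each result is a new collection, which is exactly the DataSet behaviour.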

Regards,
Chiwan Park

> On Jun 5, 2015, at 9:22 AM, hawin <ha...@gmail.com> wrote:
> 
> Hi  Chiwan
> 
> Thanks for your information.  I know Flink is not a DBMS; I want to know the
> Flink way to select, insert, update, and delete data on HDFS.
> 
> 
> @Till
> Maybe union is a way to insert data, but I think it will have some
> performance cost.
> 
> 
> @Stephan
> Thanks for your suggestion.  I have checked the Apache Flink roadmap.  SQL on
> Flink will be released in Q3/Q4 2015. Will it support inserting, deleting,
> and updating data on HDFS?
> You have already provided nice examples for selecting data on HDFS, such
> as TPCHQuery10 and TPCHQuery3.
> Do you have other examples for inserting, updating, and removing data on HDFS
> with Apache Flink?
> 
> Thanks 
> 
> 
> 
> 
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1491.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.





Re: Apache Flink transactions

Posted by hawin <ha...@gmail.com>.
Hi  Chiwan

Thanks for your information.  I know Flink is not a DBMS; I want to know the
Flink way to select, insert, update, and delete data on HDFS.


@Till
Maybe union is a way to insert data, but I think it will have some
performance cost.


@Stephan
Thanks for your suggestion.  I have checked the Apache Flink roadmap.  SQL on
Flink will be released in Q3/Q4 2015. Will it support inserting, deleting,
and updating data on HDFS?
You have already provided nice examples for selecting data on HDFS, such
as TPCHQuery10 and TPCHQuery3.
Do you have other examples for inserting, updating, and removing data on HDFS
with Apache Flink?

Thanks 




--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Apache-Flink-transactions-tp1457p1491.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Re: Apache Flink transactions

Posted by Till Rohrmann <tr...@apache.org>.
But what you can do to simulate an insert is to read the new data into a
separate DataSet and then apply a union operator on the new and old DataSets.

Cheers,
Till
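
The idea can be sketched with local files standing in for HDFS (plain Scala I/O, not Flink; the file names are made up for illustration): read the existing records, union them with the new ones, and write the result as a new file, leaving the original untouched.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Sketch of "insert by union" with local files standing in for HDFS
// (plain Scala I/O, not Flink): the existing file is never modified;
// the old records plus the new ones are written out as a fresh file.
object InsertByUnion {
  def unionIntoNewFile(oldFile: Path, newRecords: Seq[String]): Path = {
    // read the existing dataset
    val oldRecords = Files.readAllLines(oldFile).asScala.toSeq
    // "union" old and new, then write a *new* output file
    val outFile = Files.createTempFile("records-out", ".csv")
    Files.write(outFile, (oldRecords ++ newRecords).asJava)
    outFile
  }
}
```

In a real Flink job the two inputs would be DataSets and the combination would be the `union` operator, but the write-a-new-output pattern is the same.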

On Thu, Jun 4, 2015 at 9:00 AM, Chiwan Park <ch...@icloud.com> wrote:

> Hi.
>
> Flink is not a DBMS, so there are no direct equivalents of insert, update,
> or remove.
> But you can use the map[1] or filter[2] operations to create a modified
> DataSet.
>
> I recommend some slides[3][4] to help you understand Flink's concepts.
>
> Regards,
> Chiwan Park
>
> [1]
> http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#map
> [2]
> http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#filter
> [3]
> http://www.slideshare.net/robertmetzger1/introduction-to-apache-flink-palo-alto-meetup
> [4]
> http://www.slideshare.net/dataArtisans/flink-training-dataset-api-basics
>
>
> > On Jun 4, 2015, at 2:48 PM, Hawin Jiang <ha...@gmail.com> wrote:
> >
> > Hi  Admin
> >
> >
> >
> > Do we have insert, update and remove operations on Apache Flink?
> >
> > For example:  I have 10 million records in my test file.  I want to add
> one
> > record, update one record and remove one record from this test file.
> >
> > How to implement it by Flink?
> >
> > Thanks.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Best regards
> >
> > Email: hawin.jiang@gmail.com
> >
>
>
>
>
>

Re: Apache Flink transactions

Posted by Chiwan Park <ch...@icloud.com>.
Hi.

Flink is not a DBMS, so there are no direct equivalents of insert, update, or remove.
But you can use the map[1] or filter[2] operations to create a modified DataSet.

I recommend some slides[3][4] to help you understand Flink's concepts.

Regards,
Chiwan Park

[1] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#map
[2] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#filter
[3] http://www.slideshare.net/robertmetzger1/introduction-to-apache-flink-palo-alto-meetup
[4] http://www.slideshare.net/dataArtisans/flink-training-dataset-api-basics
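
As a minimal local illustration of that approach (plain Scala collections, not the DataSet API): the "modified dataset" is a new value, and the original is left intact.

```scala
// Minimal illustration of the map/filter approach on an immutable collection
// (plain Scala, not Flink): each transformation yields a new value and the
// original is unchanged.
object MapFilterDemo {
  val records    = Seq(1, 2, 3, 4)
  val withoutTwo = records.filter(_ != 2) // "remove" a record
  val doubled    = records.map(_ * 2)     // "update" every record
}
```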


> On Jun 4, 2015, at 2:48 PM, Hawin Jiang <ha...@gmail.com> wrote:
> 
> Hi  Admin
> 
> 
> 
> Do we have insert, update and remove operations on Apache Flink?
> 
> For example:  I have 10 million records in my test file.  I want to add one
> record, update one record and remove one record from this test file. 
> 
> How to implement it by Flink?
> 
> Thanks.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Best regards
> 
> Email: hawin.jiang@gmail.com
>