Posted to mapreduce-user@hadoop.apache.org by ch huang <ju...@gmail.com> on 2014/10/17 10:58:37 UTC

how to copy data between two hdfs cluster fastly?

Hi, mailing list:
I am currently using distcp to migrate data from CDH4.4 to CDH5.1. Copying
small files works well, but transferring large amounts of data is very
slow. Can anyone recommend a better method? Thanks.

Re: how to copy data between two hdfs cluster fastly?

Posted by ch huang <ju...@gmail.com>.

Some files; the total size is 2 TB, and the block size is 128 MB.

On Sat, Oct 18, 2014 at 2:26 AM, Shivram Mani <sm...@pivotal.io> wrote:

> What is your approx input size ?
> Do you have multiple files or is this one large file ?
> What is your block size (source and destination cluster) ?
>
> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>
>> no ,all default
>>
>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>
>>> Did you specified how many map tasks?
>>>
>>>
>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>>
>>>> hi,maillist:
>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1 , i
>>>> find when copy small file,it very good, but when transfer big data ,it very
>>>> slow ,any good method recommand? thanks
>>>>
>>>
>>>
>>
>
>
> --
> Thanks
> Shivram
>

Re: how to copy data between two hdfs cluster fastly?

Posted by Shivram Mani <sm...@pivotal.io>.

If you still want to use distcp:

1. Break the file into smaller files (only if you have the luxury of doing
this).

2. Use the "-m" option to set the number of mappers.

(Each map task will aim to copy (total bytes across all files) /
numSplits. distcp uses the UniformSizeInputFormat by default.)

3. By default, distcp uses a throttled input stream, which is limited to
100 MB per second per map. You can tune this based on your network
bandwidth using the "-bandwidth" option.
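
As a rough sketch of points 2 and 3, the two options might be combined as
shown below. The cluster paths, the map count of 40 and the bandwidth value
of 200 are placeholders chosen for illustration, not recommendations from
this thread. The script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical values: adjust all of these to your own clusters.
SRC="hdfs://from_cluster/data"   # placeholder source path
DST="hdfs://to_cluster/data"     # placeholder destination path
MAPS=40                          # -m: maximum number of map tasks (simultaneous copies)
BANDWIDTH_MB=200                 # -bandwidth: per-map limit in MB/second

# Build the distcp command line from the settings above.
CMD="hadoop distcp -m $MAPS -bandwidth $BANDWIDTH_MB $SRC $DST"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Once the printed command looks right, it can be pasted into a shell on a
node that has the Hadoop client configured.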

On Fri, Oct 17, 2014 at 10:24 PM, Shivram Mani <sm...@pivotal.io> wrote:

> Distcp is pretty restrictive w.r.t parallelizing data copy. If all that
> you are doing is one large file, distcp wouldn't make this any faster.
>
> In distcp, files are the lowest level of granularity. So increasing # of
> maps, may not necessarily increase the overall throughput.
>
> The default number of mappers if i’m not wrong is 20 for distcp. If all
> you were doing was to copy a large file, only one map task is effectively
> used
>
> On Fri, Oct 17, 2014 at 8:18 PM, ch huang <ju...@gmail.com> wrote:
>
>> yes
>>
>> On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <st...@gmail.com>
>> wrote:
>>
>>> Distcp?
>>> On 17 Oct 2014 20:51, "Alexander Pivovarov" <ap...@gmail.com>
>>> wrote:
>>>
>>>> try to run on dest cluster datanode
>>>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>>>
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io>
>>>> wrote:
>>>>
>>>>> What is your approx input size ?
>>>>> Do you have multiple files or is this one large file ?
>>>>> What is your block size (source and destination cluster) ?
>>>>>
>>>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>>> no ,all default
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Did you specified how many map tasks?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi,maillist:
>>>>>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1
>>>>>>>> , i find when copy small file,it very good, but when transfer big data ,it
>>>>>>>> very slow ,any good method recommand? thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Shivram
>>>>>
>>>>
>>>>
>>
>
>
> --
> Thanks
> Shivram
>



-- 
Thanks
Shivram

Re: how to copy data between two hdfs cluster fastly?

Posted by Shivram Mani <sm...@pivotal.io>.

Distcp is pretty restrictive with respect to parallelizing a data copy. If
all you are copying is one large file, distcp won't make it any faster.

In distcp, a file is the lowest level of granularity, so increasing the
number of maps may not necessarily increase the overall throughput.

The default number of mappers for distcp is 20, if I'm not wrong. If all
you are doing is copying one large file, only one map task is effectively
used.
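
To make the granularity point concrete, here is a back-of-the-envelope
estimate. It assumes the 2 TB mentioned in this thread, 20 maps, and a
per-map throttle of 100 MB/second; it also assumes the files are similar
in size in the first case. These are illustrative lower bounds on copy
time, not measurements.

```shell
#!/bin/sh
# Illustrative assumptions only; none of these numbers were measured.
TOTAL_MB=$((2 * 1024 * 1024))   # 2 TB expressed in MB
MAPS=20                         # assumed number of map tasks
LIMIT_MB_S=100                  # assumed per-map throttle in MB/second

# Many similar-sized files: each map copies roughly total / maps.
PER_MAP_MB=$((TOTAL_MB / MAPS))
SPREAD_SECS=$((PER_MAP_MB / LIMIT_MB_S))

# One large file: a single map has to copy everything.
SINGLE_SECS=$((TOTAL_MB / LIMIT_MB_S))

echo "spread over $MAPS maps: $PER_MAP_MB MB per map, at least $SPREAD_SECS s"
echo "one large file, one map: at least $SINGLE_SECS s"
```

Under these assumptions, the single-file case takes about 20 times longer,
which is why adding maps does not help when the data is one big file.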

On Fri, Oct 17, 2014 at 8:18 PM, ch huang <ju...@gmail.com> wrote:

> yes
>
> On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <st...@gmail.com>
> wrote:
>
>> Distcp?
>> On 17 Oct 2014 20:51, "Alexander Pivovarov" <ap...@gmail.com> wrote:
>>
>>> try to run on dest cluster datanode
>>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io> wrote:
>>>
>>>> What is your approx input size ?
>>>> Do you have multiple files or is this one large file ?
>>>> What is your block size (source and destination cluster) ?
>>>>
>>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>>>>
>>>>> no ,all default
>>>>>
>>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>>>>
>>>>>> Did you specified how many map tasks?
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hi,maillist:
>>>>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1
>>>>>>> , i find when copy small file,it very good, but when transfer big data ,it
>>>>>>> very slow ,any good method recommand? thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Shivram
>>>>
>>>
>>>
>


-- 
Thanks
Shivram

Re: how to copy data between two hdfs cluster fastly?

Posted by ch huang <ju...@gmail.com>.

yes

On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky <st...@gmail.com>
wrote:

> Distcp?
> On 17 Oct 2014 20:51, "Alexander Pivovarov" <ap...@gmail.com> wrote:
>
>> try to run on dest cluster datanode
>> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>>
>>
>>
>> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io> wrote:
>>
>>> What is your approx input size ?
>>> Do you have multiple files or is this one large file ?
>>> What is your block size (source and destination cluster) ?
>>>
>>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>>>
>>>> no ,all default
>>>>
>>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>>>
>>>>> Did you specified how many map tasks?
>>>>>
>>>>>
>>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>>> hi,maillist:
>>>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1 ,
>>>>>> i find when copy small file,it very good, but when transfer big data ,it
>>>>>> very slow ,any good method recommand? thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Shivram
>>>
>>
>>

Re: how to copy data between two hdfs cluster fastly?

Posted by Jakub Stransky <st...@gmail.com>.

Distcp?
On 17 Oct 2014 20:51, "Alexander Pivovarov" <ap...@gmail.com> wrote:

> try to run on dest cluster datanode
> $ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....
>
>
>
> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io> wrote:
>
>> What is your approx input size ?
>> Do you have multiple files or is this one large file ?
>> What is your block size (source and destination cluster) ?
>>
>> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>>
>>> no ,all default
>>>
>>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>>
>>>> Did you specified how many map tasks?
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>>>
>>>>> hi,maillist:
>>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1 ,
>>>>> i find when copy small file,it very good, but when transfer big data ,it
>>>>> very slow ,any good method recommand? thanks
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks
>> Shivram
>>
>
>

Re: how to copy data between two hdfs cluster fastly?

Posted by Alexander Pivovarov <ap...@gmail.com>.

Try running this on a datanode in the destination cluster:
$ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....



On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io> wrote:

> What is your approx input size ?
> Do you have multiple files or is this one large file ?
> What is your block size (source and destination cluster) ?
>
> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>
>> no ,all default
>>
>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>
>>> Did you specified how many map tasks?
>>>
>>>
>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>>
>>>> hi,maillist:
>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1 , i
>>>> find when copy small file,it very good, but when transfer big data ,it very
>>>> slow ,any good method recommand? thanks
>>>>
>>>
>>>
>>
>
>
> --
> Thanks
> Shivram
>

Re: how to copy data between two hdfs cluster fastly?

Posted by Alexander Pivovarov <ap...@gmail.com>.

try to run on dest cluster datanode
$ hadoop fs -cp hdfs://from_cluster/....    hdfs://to_cluster/....



On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani <sm...@pivotal.io> wrote:

> What is your approx input size ?
> Do you have multiple files or is this one large file ?
> What is your block size (source and destination cluster) ?
>
> On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:
>
>> no ,all default
>>
>> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>>
>>> Did you specified how many map tasks?
>>>
>>>
>>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>>
>>>> hi,maillist:
>>>>              i now use distcp to migrate data from CDH4.4 to CDH5.1 , i
>>>> find when copy small file,it very good, but when transfer big data ,it very
>>>> slow ,any good method recommand? thanks
>>>>
>>>
>>>
>>
>
>
> --
> Thanks
> Shivram
>

Re: how to copy data between two hdfs cluster fastly?

Posted by Shivram Mani <sm...@pivotal.io>.

What is your approximate input size?
Do you have multiple files, or is this one large file?
What is your block size (source and destination cluster)?

On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:

> No, everything is at the default.
>
> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>
>> Did you specify how many map tasks to use?
>>
>>
>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>
>>> Hi, mailing list:
>>> I am currently using distcp to migrate data from CDH 4.4 to CDH 5.1.
>>> Copying small files works very well, but transferring large data is very
>>> slow. Can anyone recommend a better method? Thanks.
>>>
>>
>>
>

-- 
Thanks
Shivram

Re: Spark vs Tez

Posted by Alexander Pivovarov <ap...@gmail.com>.

AMPLab, the creators of Spark, ran some benchmarks:
https://amplab.cs.berkeley.edu/benchmark/

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Does anybody have any performance figures on how Spark stacks up
> against Tez? If you don’t have figures, does anybody have an opinion? Spark
> seems so popular but I’m not really seeing why.
> B.
>

Re: Dynamically set map / reducer memory

Posted by Girish Lingappa <gl...@pivotal.io>.

Peter

If you are using Oozie to launch the MR jobs, you can specify the memory
requirements in the workflow action specific to each job, in the workflow
XML you are using to launch the job. If you are writing your own driver
program to launch the jobs, you can still set these parameters in the job
configuration you are using to launch the job.
In the case where you modified mapred-site.xml to set your memory
requirements, did you change that on the client machine where you are
launching the job?
Please share more details on the setup and the way you are launching the
jobs so we can better understand the problem you are facing.
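
As an illustration of per-job settings, here is a minimal dry-run sketch that passes memory options on the command line with -D. It assumes the driver is run through ToolRunner (or otherwise parses generic options); if it is not, the -D flags are ignored. Note that mapreduce.reduce.java.opts applies to reducer JVMs only, and the mapper counterpart is mapreduce.map.java.opts. The jar name, class names, paths, container sizes, and the 80% heap rule of thumb are all assumptions made for the example, and YARN may round container requests up to the scheduler's minimum allocation.

```shell
#!/bin/sh
# Dry run: builds per-job option strings and prints the commands.

# Rule-of-thumb heap: 80% of the container size in MB (an assumption, not a Hadoop default).
heap_for() { echo $(( $1 * 80 / 100 )); }

JOB1_MAP_MB=320      # container for Job 1 mappers (about 250 MB of heap needed)
JOB2_REDUCE_MB=2560  # container for Job 2 reducers (about 2 GB of heap needed)

JOB1_OPTS="-D mapreduce.map.memory.mb=$JOB1_MAP_MB -D mapreduce.map.java.opts=-Xmx$(heap_for $JOB1_MAP_MB)m"
JOB2_OPTS="-D mapreduce.reduce.memory.mb=$JOB2_REDUCE_MB -D mapreduce.reduce.java.opts=-Xmx$(heap_for $JOB2_REDUCE_MB)m"

# my-jobs.jar, the class names, and the in/out paths are hypothetical.
echo "hadoop jar my-jobs.jar com.example.Job1 $JOB1_OPTS in out1"
echo "hadoop jar my-jobs.jar com.example.Job2 $JOB2_OPTS out1 out2"
```

The map side of the second job would follow the same pattern with the mapreduce.map.* properties.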

Girish

On Fri, Oct 17, 2014 at 11:24 AM, peter 2 <re...@gmail.com> wrote:

>  Hi guys,
> I am trying to run a few MR jobs in succession. Some of the jobs don't
> need much memory and others do. I want to be able to tell Hadoop how
> much memory should be allocated for the mappers of each job.
> I know how to increase the memory for a mapper JVM through the mapred
> XML.
> I tried manually setting mapreduce.reduce.java.opts = -Xmx<someNumber>m,
> but it wasn't picked up by the mapper JVM; the global setting was always
> picked up.
>
> In summary:
> Job 1 - Mappers need only 250 MB of RAM
> Job 2 - Mapper
>            Reducer need around 2 GB
>
> I don't want to be able to set those restrictions prior to submitting the
> job to my hadoop cluster.
>

Dynamically set map / reducer memory

Posted by peter 2 <re...@gmail.com>.

Hi guys,
I am trying to run a few MR jobs in succession. Some of the jobs don't
need much memory and others do. I want to be able to tell Hadoop
how much memory should be allocated for the mappers of each job.
I know how to increase the memory for a mapper JVM through the mapred XML.
I tried manually setting mapreduce.reduce.java.opts=-Xmx<someNumber>m,
but it wasn't picked up by the mapper JVM; the global setting was always
picked up.

In summary:
Job 1 - Mappers need only 250 MB of RAM
Job 2 - Mapper
            Reducer need around 2 GB

I don't want to be able to set those restrictions prior to submitting 
the job to my hadoop cluster.

Re: Spark vs Tez

Posted by Alexander Pivovarov <ap...@gmail.com>.

There is going to be a Spark engine for Hive (in addition to MR and Tez).

The Spark API is available for Java and Python as well.

The Tez engine is available now and it's quite stable. As for speed: for
complex queries it shows a 10x-20x improvement over the MR engine.
For example, one of my queries takes 30 min using MR (about 100 MR jobs); if I
switch to Tez, it finishes in 100 sec.
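
For anyone who wants to repeat this kind of comparison, here is a small dry-run sketch that prints one Hive invocation per engine using the hive.execution.engine setting. The query file name report.sql is hypothetical, and the script prints the commands instead of running them. The last lines check the arithmetic on the figures quoted above: 30 minutes against 100 seconds.

```shell
#!/bin/sh
# Dry run: print the same query invocation once per execution engine.
QUERY_FILE="report.sql"   # hypothetical query file

for ENGINE in mr tez; do
  CMD="hive --hiveconf hive.execution.engine=$ENGINE -f $QUERY_FILE"
  echo "time $CMD"
done

# Speedup implied by the figures in this message: 30 min (MR) vs 100 s (Tez).
MR_SECONDS=$((30 * 60))
TEZ_SECONDS=100
SPEEDUP=$((MR_SECONDS / TEZ_SECONDS))
echo "speedup=${SPEEDUP}x"
```

That ratio falls inside the 10x-20x range mentioned for complex queries.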

I'm using HDP-2.1.5 (hive-0.13.1, tez 0.4.1)

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It was my understanding that Spark is for faster batch processing. Tez is
> the new execution engine that replaces MapReduce and is also supposed to
> speed up batch processing. Is that not correct?
> B.
>
>
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  What aspects of Tez and Spark are you comparing? They have different
> purposes and thus are not directly comparable, as far as I understand.
>
> Regards,
> Shahab
>
> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>

Re: Spark vs Tez

Posted by Mohan Radhakrishnan <ra...@gmail.com>.

Is Tez's architecture similar to Akka's distributed architecture? I think
I remember Jonas Bonér mentioning, during a presentation on distributed
computing, Akka's support for protocols like Raft. What makes Tez
more scalable in this regard?

Thanks,
Mohan

On Sun, Oct 19, 2014 at 5:26 PM, Niels Basjes <Ni...@basjes.nl> wrote:

> Very interesting!
> What makes Tez more scalable than Spark?
> What architectural "thing" makes the difference?
>
> Niels Basjes
> On Oct 19, 2014 3:07 AM, "Jeff Zhang" <zj...@gmail.com> wrote:
>
>> Tez has a feature called pre-warm, which launches JVMs before you use them,
>> and you can reuse the containers afterwards. So it is also suitable for
>> interactive queries, and IMO it is more stable and scalable than Spark.
>>
>> On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>
>>> It is my understanding that one of the big differences between Tez and
>>> Spark is that a Tez-based query still has the startup overhead of
>>> starting JVMs on the YARN cluster. Spark-based queries are immediately
>>> executed on "already running JVMs".
>>>
>>> So for interactive dashboards Spark seems more suitable.
>>>
>>> Did I understand correctly?
>>>
>>> Niels Basjes
>>> On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:
>>>
>>>> Spark and Tez both make MR faster; there is no doubt about that.
>>>>
>>>> They also provide new features like DAG execution, which is quite
>>>> important for interactive query processing. From this perspective, you
>>>> could view them as wrappers around MR that try to handle the intermediate
>>>> buffers (files) more efficiently, which is a big pain point in MR.
>>>>
>>>> They also both try to use memory as the buffer instead of only the
>>>> filesystem. Spark has the concept of an RDD, which is quite interesting
>>>> but also limited.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>>>> adaryl.wakefield@hotmail.com> wrote:
>>>>
>>>>>   It was my understanding that Spark is for faster batch processing. Tez
>>>>> is the new execution engine that replaces MapReduce and is also supposed to
>>>>> speed up batch processing. Is that not correct?
>>>>> B.
>>>>>
>>>>>
>>>>>
>>>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: Spark vs Tez
>>>>>
>>>>>  What aspects of Tez and Spark are you comparing? They have different
>>>>> purposes and thus are not directly comparable, as far as I understand.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>>>> adaryl.wakefield@hotmail.com> wrote:
>>>>>
>>>>>>   Does anybody have any performance figures on how Spark stacks up
>>>>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>>>>> seems so popular but I’m not really seeing why.
>>>>>> B.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

Re: Spark vs Tez

Posted by Niels Basjes <Ni...@basjes.nl>.

Very interesting!
What makes Tez more scalable than Spark?
What architectural "thing" makes the difference?

Niels Basjes
On Oct 19, 2014 3:07 AM, "Jeff Zhang" <zj...@gmail.com> wrote:

> Tez has a feature called pre-warm, which launches JVMs before you use them,
> and you can reuse the containers afterwards. So it is also suitable for
> interactive queries, and IMO it is more stable and scalable than Spark.
>
> On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>
>> It is my understanding that one of the big differences between Tez and
>> Spark is that a Tez-based query still has the startup overhead of
>> starting JVMs on the YARN cluster. Spark-based queries are immediately
>> executed on "already running JVMs".
>>
>> So for interactive dashboards Spark seems more suitable.
>>
>> Did I understand correctly?
>>
>> Niels Basjes
>> On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:
>>
>>> Spark and Tez both make MR faster; there is no doubt about that.
>>>
>>> They also provide new features like DAG execution, which is quite
>>> important for interactive query processing. From this perspective, you
>>> could view them as wrappers around MR that try to handle the intermediate
>>> buffers (files) more efficiently, which is a big pain point in MR.
>>>
>>> They also both try to use memory as the buffer instead of only the
>>> filesystem. Spark has the concept of an RDD, which is quite interesting
>>> but also limited.
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   It was my understanding that Spark is for faster batch processing. Tez
>>>> is the new execution engine that replaces MapReduce and is also supposed to
>>>> speed up batch processing. Is that not correct?
>>>> B.
>>>>
>>>>
>>>>
>>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Re: Spark vs Tez
>>>>
>>>>  What aspects of Tez and Spark are you comparing? They have different
>>>> purposes and thus are not directly comparable, as far as I understand.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>>> adaryl.wakefield@hotmail.com> wrote:
>>>>
>>>>>   Does anybody have any performance figures on how Spark stacks up
>>>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>>>> seems so popular but I’m not really seeing why.
>>>>> B.
>>>>>
>>>>
>>>>
>>>
>>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>

Re: Spark vs Tez

Posted by Niels Basjes <Ni...@basjes.nl>.

Very interesting!
What makes Tez more scalable than Spark?
What architectural "thing" makes the difference?

Niels Basjes
On Oct 19, 2014 3:07 AM, "Jeff Zhang" <zj...@gmail.com> wrote:

> Tez has a feature called pre-warm which will launch JVM before you use it
> and you can reuse the container afterwards. So it is also suitable for
> interactive queries and is more stable and scalable than spark IMO.
>
> On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>
>> It is my understanding that one of the big differences between Tez and
>> Spark is is that a Tez based query still has the startup overhead of
>> starting JVMs on the Yarn cluster. Spark based queries are immediately
>> executed on "already running JVMs".
>>
>> So for interactive dashboards Spark seems more suitable.
>>
>> Did I understand correctly?
>>
>> Niels Basjes
>> On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:
>>
>>> Spark and tez both make MR faster, this has no doubt.
>>>
>>> They also provide new features like DAG, which is quite important for
>>> interactive query processing.  From this perspective, you could view them
>>> as a wrapper around MR and try to handle the intermediary buffer(files)
>>> more efficiently.  It is a big pain in MR.
>>>
>>> Also they both try to use Memory as the buffer instead of only
>>> filesystems.   Spark has a concept RDD, which is quite interesting and also
>>> limited.
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   It was my understanding that Spark is faster batch processing. Tez
>>>> is the new execution engine that replaces MapReduce and is also supposed to
>>>> speed up batch processing. Is that not correct?
>>>> B.
>>>>
>>>>
>>>>
>>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Re: Spark vs Tez
>>>>
>>>>  What aspects of Tez and Spark are you comparing? They have different
>>>> purposes and thus not directly comparable, as far as I understand.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>>> adaryl.wakefield@hotmail.com> wrote:
>>>>
>>>>>   Does anybody have any performance figures on how Spark stacks up
>>>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>>>> seems so popular but I’m not really seeing why.
>>>>> B.
>>>>>
>>>>
>>>>
>>>
>>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>

Re: Spark vs Tez

Posted by Jeff Zhang <zj...@gmail.com>.

Tez has a feature called pre-warm, which launches JVMs before you need them,
and the containers can be reused afterwards. So Tez is also suitable for
interactive queries, and IMO it is more stable and scalable than Spark.
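
To make the pre-warm idea concrete, here is a minimal toy model in Python. It
is only a sketch: ContainerPool, JVM_STARTUP_SEC and QUERY_WORK_SEC are
hypothetical names and made-up timings, not Tez or YARN APIs. It shows that a
cold pool pays the JVM startup cost on the first query, while a pre-warmed pool
never puts that cost on the query's critical path.

```python
# Toy model of container pre-warming and reuse; this is NOT the Tez API.
# All names and timings are assumptions chosen for illustration only.
JVM_STARTUP_SEC = 5.0   # assumed cost of launching one JVM container
QUERY_WORK_SEC = 2.0    # assumed cost of the actual query work


class ContainerPool:
    def __init__(self, prewarmed=0):
        # Pre-warmed containers were started before any query arrived.
        self.idle = prewarmed

    def run_query(self):
        """Return the latency a user would see for one query."""
        if self.idle > 0:
            self.idle -= 1               # reuse an already running JVM
            startup = 0.0
        else:
            startup = JVM_STARTUP_SEC    # cold start on the critical path
        self.idle += 1                   # keep the container for later reuse
        return startup + QUERY_WORK_SEC


cold = ContainerPool(prewarmed=0)
warm = ContainerPool(prewarmed=1)
cold_latencies = [cold.run_query() for _ in range(3)]
warm_latencies = [warm.run_query() for _ in range(3)]
print(cold_latencies)  # [7.0, 2.0, 2.0]
print(warm_latencies)  # [2.0, 2.0, 2.0]
```

In this model only the very first query on a cold pool is slow; reuse alone
fixes every later query, and pre-warming fixes the first one as well.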

On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes <Ni...@basjes.nl> wrote:

> It is my understanding that one of the big differences between Tez and
> Spark is is that a Tez based query still has the startup overhead of
> starting JVMs on the Yarn cluster. Spark based queries are immediately
> executed on "already running JVMs".
>
> So for interactive dashboards Spark seems more suitable.
>
> Did I understand correctly?
>
> Niels Basjes
> On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:
>
>> Spark and tez both make MR faster, this has no doubt.
>>
>> They also provide new features like DAG, which is quite important for
>> interactive query processing.  From this perspective, you could view them
>> as a wrapper around MR and try to handle the intermediary buffer(files)
>> more efficiently.  It is a big pain in MR.
>>
>> Also they both try to use Memory as the buffer instead of only
>> filesystems.   Spark has a concept RDD, which is quite interesting and also
>> limited.
>>
>>
>>
>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   It was my understanding that Spark is faster batch processing. Tez is
>>> the new execution engine that replaces MapReduce and is also supposed to
>>> speed up batch processing. Is that not correct?
>>> B.
>>>
>>>
>>>
>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Spark vs Tez
>>>
>>>  What aspects of Tez and Spark are you comparing? They have different
>>> purposes and thus not directly comparable, as far as I understand.
>>>
>>> Regards,
>>> Shahab
>>>
>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   Does anybody have any performance figures on how Spark stacks up
>>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>>> seems so popular but I’m not really seeing why.
>>>> B.
>>>>
>>>
>>>
>>
>>


-- 
Best Regards

Jeff Zhang

Re: Spark vs Tez

Posted by Mohan Radhakrishnan <ra...@gmail.com>.

I remember that Spark uses Akka clusters. Isn't that totally different from
other distributed technologies?

Thanks,
Mohan

On Sat, Oct 18, 2014 at 1:52 PM, Niels Basjes <Ni...@basjes.nl> wrote:

> It is my understanding that one of the big differences between Tez and
> Spark is is that a Tez based query still has the startup overhead of
> starting JVMs on the Yarn cluster. Spark based queries are immediately
> executed on "already running JVMs".
>
> So for interactive dashboards Spark seems more suitable.
>
> Did I understand correctly?
>
> Niels Basjes
> On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:
>
>> Spark and tez both make MR faster, this has no doubt.
>>
>> They also provide new features like DAG, which is quite important for
>> interactive query processing.  From this perspective, you could view them
>> as a wrapper around MR and try to handle the intermediary buffer(files)
>> more efficiently.  It is a big pain in MR.
>>
>> Also they both try to use Memory as the buffer instead of only
>> filesystems.   Spark has a concept RDD, which is quite interesting and also
>> limited.
>>
>>
>>
>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   It was my understanding that Spark is faster batch processing. Tez is
>>> the new execution engine that replaces MapReduce and is also supposed to
>>> speed up batch processing. Is that not correct?
>>> B.
>>>
>>>
>>>
>>>  *From:* Shahab Yunus <sh...@gmail.com>
>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Spark vs Tez
>>>
>>>  What aspects of Tez and Spark are you comparing? They have different
>>> purposes and thus not directly comparable, as far as I understand.
>>>
>>> Regards,
>>> Shahab
>>>
>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   Does anybody have any performance figures on how Spark stacks up
>>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>>> seems so popular but I’m not really seeing why.
>>>> B.
>>>>
>>>
>>>
>>
>>

Re: Spark vs Tez

Posted by Niels Basjes <Ni...@basjes.nl>.

It is my understanding that one of the big differences between Tez and
Spark is that a Tez-based query still has the startup overhead of
starting JVMs on the YARN cluster. Spark-based queries are executed
immediately on "already running JVMs".

So for interactive dashboards Spark seems more suitable.

Did I understand correctly?

Niels Basjes
On Oct 17, 2014 8:30 PM, "Gavin Yue" <yu...@gmail.com> wrote:

> Spark and tez both make MR faster, this has no doubt.
>
> They also provide new features like DAG, which is quite important for
> interactive query processing.  From this perspective, you could view them
> as a wrapper around MR and try to handle the intermediary buffer(files)
> more efficiently.  It is a big pain in MR.
>
> Also they both try to use Memory as the buffer instead of only
> filesystems.   Spark has a concept RDD, which is quite interesting and also
> limited.
>
>
>
> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   It was my understanding that Spark is faster batch processing. Tez is
>> the new execution engine that replaces MapReduce and is also supposed to
>> speed up batch processing. Is that not correct?
>> B.
>>
>>
>>
>>  *From:* Shahab Yunus <sh...@gmail.com>
>> *Sent:* Friday, October 17, 2014 1:12 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Spark vs Tez
>>
>>  What aspects of Tez and Spark are you comparing? They have different
>> purposes and thus not directly comparable, as far as I understand.
>>
>> Regards,
>> Shahab
>>
>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Does anybody have any performance figures on how Spark stacks up
>>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>>> seems so popular but I’m not really seeing why.
>>> B.
>>>
>>
>>
>
>

Re: Spark vs Tez

Posted by Gavin Yue <yu...@gmail.com>.

Spark and Tez both make MR-style processing faster; there is no doubt about that.

They also provide new features such as DAG execution, which is quite important
for interactive query processing. From this perspective, you could view them
as wrappers around MR that try to handle the intermediate buffers (files)
more efficiently, which is a big pain point in MR.

Both also try to use memory as the buffer instead of only the
filesystem. Spark has the RDD concept, which is quite interesting but also
limited.
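
The point about intermediate buffers can be sketched with a toy two-stage word
count in plain Python. This is an illustration under stated assumptions, not
Spark or Tez code: stage_tokenize and stage_count are hypothetical helpers, and
a local temp file stands in for the HDFS output that chained MR jobs write
between stages.

```python
import json
import os
import tempfile


def stage_tokenize(lines):
    """Stage 1: split lines into lower-cased words."""
    for line in lines:
        for word in line.split():
            yield word.lower()


def stage_count(words):
    """Stage 2: count occurrences of each word."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts


lines = ["Spark and Tez", "Tez and MR"]

# MR-style chaining: stage 1 materializes its output to a file,
# and stage 2 has to read that file back in.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stage1.json")
    with open(path, "w") as f:
        json.dump(list(stage_tokenize(lines)), f)
    with open(path) as f:
        mr_style = stage_count(json.load(f))

# DAG-style: stage 2 consumes stage 1 directly from memory.
dag_style = stage_count(stage_tokenize(lines))

assert mr_style == dag_style
print(dag_style)  # {'spark': 1, 'and': 2, 'tez': 2, 'mr': 1}
```

Both paths give the same answer; the difference is only the extra write and
read between the stages, which is the overhead a DAG engine tries to avoid.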

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It was my understanding that Spark is faster batch processing. Tez is
> the new execution engine that replaces MapReduce and is also supposed to
> speed up batch processing. Is that not correct?
> B.
>
>
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  What aspects of Tez and Spark are you comparing? They have different
> purposes and thus not directly comparable, as far as I understand.
>
> Regards,
> Shahab
>
> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>

Re: Spark vs Tez

Posted by Alexander Pivovarov <ap...@gmail.com>.

There is going to be a Spark engine for Hive (in addition to MR and Tez).

The Spark API is available for Java and Python as well.

The Tez engine is available now and it's quite stable. As for speed: for
complex queries it shows a 10x-20x improvement over the MR engine.
E.g. one of my queries runs for 30 min using MR (about 100 MR jobs); if I
switch to Tez, it is done in 100 sec.

I'm using HDP-2.1.5 (hive-0.13.1, tez 0.4.1)
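
As a quick sanity check on the figures quoted above (30 min on MR versus 100 sec on Tez), the arithmetic can be worked through as follows. The variable names are only illustrative.

```python
# Figures taken from the message above: 30 minutes on MR, 100 seconds on Tez.
mr_seconds = 30 * 60
tez_seconds = 100

speedup = mr_seconds / tez_seconds
print(f"MR: {mr_seconds} s, Tez: {tez_seconds} s, speedup: {speedup:.0f}x")
```

That single query works out to 18x, which sits inside the 10x-20x range reported for complex queries.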

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It was my understanding that Spark is faster batch processing. Tez is
> the new execution engine that replaces MapReduce and is also supposed to
> speed up batch processing. Is that not correct?
> B.
>
>
>
>  *From:* Shahab Yunus <sh...@gmail.com>
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  What aspects of Tez and Spark are you comparing? They have different
> purposes and thus not directly comparable, as far as I understand.
>
> Regards,
> Shahab
>
> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

It was my understanding that Spark is for faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct?
B.

From: Shahab Yunus 
Sent: Friday, October 17, 2014 1:12 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

What aspects of Tez and Spark are you comparing? They have different purposes and thus not directly comparable, as far as I understand. 

Regards,
Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
  B.

Dynamically set map / reducer memory

Posted by peter 2 <re...@gmail.com>.

Hi guys,
I am trying to run a few MR jobs in succession; some of the jobs don't
need that much memory and others do. I want to be able to tell Hadoop
how much memory should be allocated for the mappers of each job.
I know how to increase the memory for a mapper JVM through the mapred XML.
I tried manually setting mapreduce.reduce.java.opts=-Xmx<someNumber>m,
but it wasn't picked up by the mapper JVM; the global setting was always
picked up.

In summation:
Job 1 - Mappers need only 250 MB of RAM
Job 2 - Mapper
            Reducer need around 2 GB
I don't want to be able to set those restrictions prior to submitting 
the job to my hadoop cluster.
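
One thing worth noting about the setting tried above: in Hadoop 2, mapreduce.reduce.java.opts only applies to reducer JVMs, while mapper JVMs read mapreduce.map.java.opts (with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb sizing the containers). A minimal sketch of building per-job overrides follows. It assumes the job driver goes through ToolRunner/GenericOptionsParser so that -D options are honored, and that the cluster config does not mark these properties as final. The helper name, the 80% heap-to-container ratio and the container sizes are illustrative assumptions, not values from this thread.

```python
def job_memory_args(map_mb=None, reduce_mb=None, heap_percent=80):
    """Build -D options that override task memory for a single job.

    map_mb / reduce_mb are container sizes in MB; the JVM heap is set to
    heap_percent of the container (an assumed rule of thumb).
    """
    args = []
    for task, container_mb in (("map", map_mb), ("reduce", reduce_mb)):
        if container_mb is None:
            continue  # leave the cluster default for this task type
        heap_mb = container_mb * heap_percent // 100
        args.append(f"-Dmapreduce.{task}.memory.mb={container_mb}")
        args.append(f"-Dmapreduce.{task}.java.opts=-Xmx{heap_mb}m")
    return args


# Job 1: small mappers only. Job 2: roughly 2 GB heaps for mappers and reducers.
job1_args = job_memory_args(map_mb=320)
job2_args = job_memory_args(map_mb=2560, reduce_mb=2560)
print(" ".join(job1_args))
print(" ".join(job2_args))
```

The resulting options would go after the main class on the hadoop jar command line, before the job's own arguments. The same keys can also be set on the job's Configuration in the driver before the Job object is created. YARN may round container requests up to its minimum allocation, so the effective container size can be larger than requested.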

Re: Spark vs Tez

Posted by Shahab Yunus <sh...@gmail.com>.

What aspects of Tez and Spark are you comparing? They have different
purposes and thus are not directly comparable, as far as I understand.

Regards,
Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Does anybody have any performance figures on how Spark stacks up
> against Tez? If you don’t have figures, does anybody have an opinion? Spark
> seems so popular but I’m not really seeing why.
> B.
>

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

My comment was in response to the suggestion to use PySpark. Perhaps I misunderstand what PySpark is. It was my understanding that it let you work with Spark in Python. Is that not correct?

B.

From: Edward Capriolo 
Sent: Tuesday, October 21, 2014 11:06 AM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

Scala is not an interpreted language. From my non-authoritative view, it seems to have 2-3 (thousand) more compile phases than Java, and as a result some of the things you are doing that look like they are "interpreted" are actually macros that get converted into "usually" efficient Java code.

About Scala in general, I have a few complaints. The interop is kind of clunky. I have to work in Scala and run into stuff like this: the JSON mapper in Scala works, that is, until one property in my Scala object is actually a Java object, and then it does not. Or I should be able to call a Java method from Scala, but cannot figure out how to turn a Comparator into a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to write code the way you want to, and then if you do not use an immutable collection or some other fancy Scala construct, people get on your case that you're not writing idiomatic Scala (even though few agree on what that really is).

Generally people have a large capacity to assume, "I'm smart, I know Java, and I learned Lisp in school, so this Scala stuff is going to be a breeze." Don't make that assumption. You will not be proficient in writing Scala for months. You likely won't be able to hire anyone who has done much production Scala. And everyone will come up to you and say, "So I'm trying to (sort list|cast objects|simple thing) in Scala. I can do it in Java, but YOU'RE THE EXPERT, and I'm wondering how to do it in Scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

  Yeah, compared to something as performant as java...
  </sarcasm>

  On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:

    Using an interpreted scripting language with something that is billing
    itself as being fast doesn’t sound like the best idea...
    B.
    *From:* Russell Jurney <ma...@gmail.com>
    *Sent:* Saturday, October 18, 2014 7:38 AM
    *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
    *Subject:* Re: Spark vs Tez
    Check out PySpark. No Scala required.

    On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
    <adaryl.wakefield@hotmail.com <ma...@hotmail.com>> wrote:

        “The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.”
        This is why I’m looking for reasons to avoid Spark. In my mind, it’s
        one more thing to have to master and doesn’t really have anything to
        offer that can’t be done with other tools that are already inside my
        skillset. I spoke with some software engineers recently and
        basically the discussion boiled down to if you need to master Java
        or Scala go with Java. Three months into Java I don’t want to stop
        that and start learning Scala.
        B.
        *From:* kartik saxena <kartik.sxn@gmail.com>
        *Sent:* Friday, October 17, 2014 1:12 PM
        *To:* user@hadoop.apache.org
        *Subject:* Re: Spark vs Tez
        I did a performance benchmark during my summer internship . I am
        currently a grad student. Can't reveal much about the specific
        project but Spark is still faster than around 4-5th iteration of Tez
        of the same query/dataset. By Iteration I mean utilizing the
        "hot-container" property of Apache Tez  . See latest release of Tez
        and some hortonworks tutorials on their website.
        The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.
        Thanks
        On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
        <adaryl.wakefield@hotmail.com> wrote:

            Does anybody have any performance figures on how Spark stacks up
            against Tez? If you don’t have figures, does anybody have an
            opinion? Spark seems so popular but I’m not really seeing why.
            B.

    --
    Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
    datasyndrome.com

Re: Spark vs Tez

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

@edwardcapriolo, funny running into you over here in the hadoop community.
=)

FWIW, I have the same perspective and had the same experience with Scala and
Spark. 
(I had LISP/Scheme in College too. =)

Additionally, with the JDK8 enhancements (lambda expressions, method
references, etc.), there is less motivation to move to Scala.

Specifically, with Spark, take a look at this:
http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>   
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.

From:  Edward Capriolo <ed...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Tuesday, October 21, 2014 at 12:06 PM
To:  "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Spark vs Tez

Scala is not an interpreted language. From my non-authoritative view, it
seems to have 2-3 (thousand) more compile phases than Java, and as a result
some of the things you are doing that look like they are "interpreted" are
actually macros that get converted into "usually" efficient Java code.

About Scala in general, I have a few complaints. The interop is kind of
clunky. I have to work in Scala and run into stuff like this: the JSON
mapper in Scala works, that is, until one property in my Scala object is
actually a Java object, and then it does not. Or I should be able to call a
Java method from Scala, but cannot figure out how to turn a Comparator into
a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to
write code the way you want to, and then if you do not use an immutable
collection or some other fancy Scala construct, people get on your case that
you're not writing idiomatic Scala (even though few agree on what that
really is).

Generally people have a large capacity to assume, "I'm smart, I know Java,
and I learned Lisp in school, so this Scala stuff is going to be a breeze."
Don't make that assumption. You will not be proficient in writing Scala for
months. You likely won't be able to hire anyone who has done much production
Scala. And everyone will come up to you and say, "So I'm trying to (sort
list|cast objects|simple thing) in Scala. I can do it in Java, but YOU'RE
THE EXPERT, and I'm wondering how to do it in Scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:
> Yeah, compared to something as performant as java...
> </sarcasm>
> 
> On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:
>> Using an interpreted scripting language with something that is billing
>> itself as being fast doesn’t sound like the best idea...
>> B.
>> *From:* Russell Jurney <russell.jurney@gmail.com>
>> *Sent:* Saturday, October 18, 2014 7:38 AM
>> *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
>> *Subject:* Re: Spark vs Tez
>> Check out PySpark. No Scala required.
>> 
>> On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
>> <adaryl.wakefield@hotmail.com> wrote:
>> 
>>     “The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.”
>>     This is why I’m looking for reasons to avoid Spark. In my mind, it’s
>>     one more thing to have to master and doesn’t really have anything to
>>     offer that can’t be done with other tools that are already inside my
>>     skillset. I spoke with some software engineers recently and
>>     basically the discussion boiled down to if you need to master Java
>>     or Scala go with Java. Three months into Java I don’t want to stop
>>     that and start learning Scala.
>>     B.
>>     *From:* kartik saxena <kartik.sxn@gmail.com>
>>     *Sent:* Friday, October 17, 2014 1:12 PM
>>     *To:* user@hadoop.apache.org
>>     *Subject:* Re: Spark vs Tez
>>     I did a performance benchmark during my summer internship . I am
>>     currently a grad student. Can't reveal much about the specific
>>     project but Spark is still faster than around 4-5th iteration of Tez
>>     of the same query/dataset. By Iteration I mean utilizing the
>>     "hot-container" property of Apache Tez  . See latest release of Tez
>>     and some hortonworks tutorials on their website.
>>     The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.
>>     Thanks
>>     On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
>>     <adaryl.wakefield@hotmail.com> wrote:
>> 
>>         Does anybody have any performance figures on how Spark stacks up
>>         against Tez? If you don’t have figures, does anybody have an
>>         opinion? Spark seems so popular but I’m not really seeing why.
>>         B.
>> 
>> 
>> 
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>> datasyndrome.com

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

My comment was in response to the suggestion to use PySpark. Perhaps I misunderstand what PySpark is. It was my understanding that it let you work with Spark in Python. Is that not correct?

B.

From: Edward Capriolo 
Sent: Tuesday, October 21, 2014 11:06 AM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

scala is not an interpreted language, from my non authoritative view it seems to have 2-3 (thousand) more compile phases than java and as a result some of the things you are doing that look like they are "interpreted" are actually macro's that get converted into "usually" efficient java code.  

About scala in general. I have a few complains. The inter op is kinda clunky, I have to work in scala and run into stuff, like the json mapper in scala works! that is until one property in my scala object is actually a java object, then it does not or I should be able to call a method in java from scala but can not figure out how to turn a Comparator into a Comparator[_: <Any]. 

The immutability aspect i find to be a real PITA. It becomes really hard to write code the way you want to and then if you do not use an immutable collection or some other fancy scala construct people get on your case that your not writing idiomatic scala (even though few agree on what that really is).

Generally people have a large capacity to assume, "I'm smart, I know java, and I learned lisp in school so this scala stuff is going to be a breeze" Don't make that assumption. You will not be proficient in writing scala for months. You likely wont be able to hire anyone that has done much production scala. And everyone will come up to you and say "so Im trying to (sort list|cast objects|simple thing) in scala. I can do it in java but YOUR THE EXPERT and wondering how to do in in scala". 

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

  Yeah, compared to something as performant as java...
  </sarcasm>

  On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:

    Using an interpreted scripting language with something that is billing
    itself as being fast doesn’t sound like the best idea...
    B.
    *From:* Russell Jurney <ma...@gmail.com>
    *Sent:* Saturday, October 18, 2014 7:38 AM
    *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
    *Subject:* Re: Spark vs Tez
    Check out PySpark. No Scala required.

    On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
    <adaryl.wakefield@hotmail.com <ma...@hotmail.com>> wrote:

        “The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.”
        This is why I’m looking for reasons to avoid Spark. In my mind, it’s
        one more thing to have to master and doesn’t really have anything to
        offer that can’t be done with other tools that are already inside my
        skillset. I spoke with some software engineers recently and
        basically the discussion boiled down to if you need to master Java
        or Scala go with Java. Three months into Java I don’t want to stop
        that and start learning Scala.
        B.
        *From:* kartik saxena
        <javascript:_e(%7B%7D,'cvml','kartik.sxn@gmail.com');>
        *Sent:* Friday, October 17, 2014 1:12 PM
        *To:* javascript:_e(%7B%7D,'cvml','user@hadoop.apache.org');
        *Subject:* Re: Spark vs Tez
        I did a performance benchmark during my summer internship . I am
        currently a grad student. Can't reveal much about the specific
        project but Spark is still faster than around 4-5th iteration of Tez
        of the same query/dataset. By Iteration I mean utilizing the
        "hot-container" property of Apache Tez  . See latest release of Tez
        and some hortonworks tutorials on their website.
        The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.
        Thanks
        On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
        <javascript:_e(%7B%7D,'cvml','adaryl.wakefield@hotmail.com');> wrote:

            Does anybody have any performance figures on how Spark stacks up
            against Tez? If you don’t have figures, does anybody have an
            opinion? Spark seems so popular but I’m not really seeing why.
            B.

    --
    Russell Jurney twitter.com/rjurney
    <http://twitter.com/rjurney>russell.jurney@gmail.com
    <ma...@gmail.com> datasyndrome.com
    <http://datasyndrome.com/>

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

My comment was in response to the suggestion to use PySpark. Perhaps I misunderstand what PySpark is. It was my understanding that it let you work with Spark in Python. Is that not correct?

B.

From: Edward Capriolo 
Sent: Tuesday, October 21, 2014 11:06 AM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

scala is not an interpreted language, from my non authoritative view it seems to have 2-3 (thousand) more compile phases than java and as a result some of the things you are doing that look like they are "interpreted" are actually macro's that get converted into "usually" efficient java code.  

About scala in general. I have a few complains. The inter op is kinda clunky, I have to work in scala and run into stuff, like the json mapper in scala works! that is until one property in my scala object is actually a java object, then it does not or I should be able to call a method in java from scala but can not figure out how to turn a Comparator into a Comparator[_: <Any]. 

The immutability aspect i find to be a real PITA. It becomes really hard to write code the way you want to and then if you do not use an immutable collection or some other fancy scala construct people get on your case that your not writing idiomatic scala (even though few agree on what that really is).

Generally people have a large capacity to assume, "I'm smart, I know java, and I learned lisp in school so this scala stuff is going to be a breeze" Don't make that assumption. You will not be proficient in writing scala for months. You likely wont be able to hire anyone that has done much production scala. And everyone will come up to you and say "so Im trying to (sort list|cast objects|simple thing) in scala. I can do it in java but YOUR THE EXPERT and wondering how to do in in scala". 

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

  Yeah, compared to something as performant as java...
  </sarcasm>

  On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:

    Using an interpreted scripting language with something that is billing
    itself as being fast doesn’t sound like the best idea...
    B.
    *From:* Russell Jurney <ma...@gmail.com>
    *Sent:* Saturday, October 18, 2014 7:38 AM
    *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
    *Subject:* Re: Spark vs Tez
    Check out PySpark. No Scala required.

    On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
    <adaryl.wakefield@hotmail.com <ma...@hotmail.com>> wrote:

        “The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.”
        This is why I’m looking for reasons to avoid Spark. In my mind, it’s
        one more thing to have to master and doesn’t really have anything to
        offer that can’t be done with other tools that are already inside my
        skillset. I spoke with some software engineers recently and
        basically the discussion boiled down to if you need to master Java
        or Scala go with Java. Three months into Java I don’t want to stop
        that and start learning Scala.
        B.
        *From:* kartik saxena
        <javascript:_e(%7B%7D,'cvml','kartik.sxn@gmail.com');>
        *Sent:* Friday, October 17, 2014 1:12 PM
        *To:* javascript:_e(%7B%7D,'cvml','user@hadoop.apache.org');
        *Subject:* Re: Spark vs Tez
        I did a performance benchmark during my summer internship . I am
        currently a grad student. Can't reveal much about the specific
        project but Spark is still faster than around 4-5th iteration of Tez
        of the same query/dataset. By Iteration I mean utilizing the
        "hot-container" property of Apache Tez  . See latest release of Tez
        and some hortonworks tutorials on their website.
        The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.
        Thanks
        On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
        <adaryl.wakefield@hotmail.com> wrote:

            Does anybody have any performance figures on how Spark stacks up
            against Tez? If you don’t have figures, does anybody have an
            opinion? Spark seems so popular but I’m not really seeing why.
            B.

    --
    Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
    datasyndrome.com

Re: Spark vs Tez

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

@edwardcapriolo, funny running into you over here in the hadoop community.
=)

FWIW, I have the same perspective and had the same experience with Scala and
Spark. 
(I had LISP/Scheme in College too. =)

Additionally, with the JDK8 enhancements (lambda expressions, method
references, etc.), there is less motivation to move to Scala.
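
As a rough illustration of the JDK 8 features mentioned above (this is not taken from the linked post, and it uses only the standard library rather than the Spark API), the sketch below sorts a list first with an anonymous inner class and then with a method reference, and filters it with lambdas. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: the class and method names are hypothetical.
class Java8Sketch {

    // Pre-Java 8 style: the comparator is an anonymous inner class.
    static List<String> sortByLengthOldStyle(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    // Java 8 style: a method reference builds the same comparator.
    static List<String> sortByLength(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Comparator.comparingInt(String::length));
        return copy;
    }

    // Java 8 streams with a lambda and a method reference:
    // keep words longer than minLength, then upper-case them.
    static List<String> upperCaseLongerThan(List<String> words, int minLength) {
        return words.stream()
                .filter(w -> w.length() > minLength)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> engines = Arrays.asList("spark", "tez", "mapreduce");
        System.out.println(sortByLength(engines));
        System.out.println(upperCaseLongerThan(engines, 3));
    }
}
```

Both sort methods return the same ordering; the Java 8 version simply drops the boilerplate, which is the kind of gap that used to push people toward Scala.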

Specifically, with Spark, take a look at this:
http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>   
healthmarketscience.com

This information transmitted in this email message is for the intended
recipient only and may contain confidential and/or privileged material. If
you received this email in error and are not the intended recipient, or the
person responsible to deliver it to the intended recipient, please contact
the sender at the email above and delete this email and any attachments and
destroy any copies thereof. Any review, retransmission, dissemination,
copying or other use of, or taking any action in reliance upon, this
information by persons or entities other than the intended recipient is
strictly prohibited.

From:  Edward Capriolo <ed...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Tuesday, October 21, 2014 at 12:06 PM
To:  "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Spark vs Tez

Scala is not an interpreted language. From my non-authoritative view it
seems to have 2-3 (thousand) more compile phases than java, and as a result
some of the things you are doing that look like they are "interpreted" are
actually macros that get converted into "usually" efficient java code.

About scala in general: I have a few complaints. The interop is kinda
clunky. I have to work in scala and I run into stuff like this: the json
mapper in scala works! That is, until one property in my scala object is
actually a java object; then it does not. Or I should be able to call a
java method from scala, but I cannot figure out how to turn a Comparator
into a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to
write code the way you want to, and then if you do not use an immutable
collection or some other fancy scala construct, people get on your case that
you're not writing idiomatic scala (even though few agree on what that
really is).

Generally people have a large capacity to assume, "I'm smart, I know java,
and I learned lisp in school, so this scala stuff is going to be a breeze."
Don't make that assumption. You will not be proficient in writing scala for
months. You likely won't be able to hire anyone who has done much production
scala. And everyone will come up to you and say, "So I'm trying to (sort
list|cast objects|simple thing) in scala. I can do it in java, but YOU'RE
THE EXPERT and I'm wondering how to do it in scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:
> Yeah, compared to something as performant as java...
> </sarcasm>
> 
> On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:
>> Using an interpreted scripting language with something that is billing
>> itself as being fast doesn’t sound like the best idea...
>> B.
>> *From:* Russell Jurney <russell.jurney@gmail.com>
>> *Sent:* Saturday, October 18, 2014 7:38 AM
>> *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
>> *Subject:* Re: Spark vs Tez
>> Check out PySpark. No Scala required.
>> 
>> On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
>> <adaryl.wakefield@hotmail.com> wrote:
>> 
>>     “The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.”
>>     This is why I’m looking for reasons to avoid Spark. In my mind, it’s
>>     one more thing to have to master and doesn’t really have anything to
>>     offer that can’t be done with other tools that are already inside my
>>     skillset. I spoke with some software engineers recently and
>>     basically the discussion boiled down to if you need to master Java
>>     or Scala go with Java. Three months into Java I don’t want to stop
>>     that and start learning Scala.
>>     B.
>>     *From:* kartik saxena <kartik.sxn@gmail.com>
>>     *Sent:* Friday, October 17, 2014 1:12 PM
>>     *To:* user@hadoop.apache.org
>>     *Subject:* Re: Spark vs Tez
>>     I did a performance benchmark during my summer internship . I am
>>     currently a grad student. Can't reveal much about the specific
>>     project but Spark is still faster than around 4-5th iteration of Tez
>>     of the same query/dataset. By Iteration I mean utilizing the
>>     "hot-container" property of Apache Tez  . See latest release of Tez
>>     and some hortonworks tutorials on their website.
>>     The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.
>>     Thanks
>>     On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
>>     <adaryl.wakefield@hotmail.com> wrote:
>>
>>         Does anybody have any performance figures on how Spark stacks up
>>         against Tez? If you don’t have figures, does anybody have an
>>         opinion? Spark seems so popular but I’m not really seeing why.
>>         B.
>> 
>> 
>> 
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>> datasyndrome.com

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

My comment was in response to the suggestion to use PySpark. Perhaps I misunderstand what PySpark is. It was my understanding that it let you work with Spark in Python. Is that not correct?

B.

From: Edward Capriolo 
Sent: Tuesday, October 21, 2014 11:06 AM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

Scala is not an interpreted language. From my non-authoritative view it seems to have 2-3 (thousand) more compile phases than java, and as a result some of the things you are doing that look like they are "interpreted" are actually macros that get converted into "usually" efficient java code.

About scala in general: I have a few complaints. The interop is kinda clunky. I have to work in scala and I run into stuff like this: the json mapper in scala works! That is, until one property in my scala object is actually a java object; then it does not. Or I should be able to call a java method from scala, but I cannot figure out how to turn a Comparator into a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to write code the way you want to, and then if you do not use an immutable collection or some other fancy scala construct, people get on your case that you're not writing idiomatic scala (even though few agree on what that really is).

Generally people have a large capacity to assume, "I'm smart, I know java, and I learned lisp in school, so this scala stuff is going to be a breeze." Don't make that assumption. You will not be proficient in writing scala for months. You likely won't be able to hire anyone who has done much production scala. And everyone will come up to you and say, "So I'm trying to (sort list|cast objects|simple thing) in scala. I can do it in java, but YOU'RE THE EXPERT and I'm wondering how to do it in scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

  Yeah, compared to something as performant as java...
  </sarcasm>

  On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:

    Using an interpreted scripting language with something that is billing
    itself as being fast doesn’t sound like the best idea...
    B.
    *From:* Russell Jurney <ma...@gmail.com>
    *Sent:* Saturday, October 18, 2014 7:38 AM
    *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
    *Subject:* Re: Spark vs Tez
    Check out PySpark. No Scala required.

    On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
    <adaryl.wakefield@hotmail.com <ma...@hotmail.com>> wrote:

        “The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.”
        This is why I’m looking for reasons to avoid Spark. In my mind, it’s
        one more thing to have to master and doesn’t really have anything to
        offer that can’t be done with other tools that are already inside my
        skillset. I spoke with some software engineers recently and
        basically the discussion boiled down to if you need to master Java
        or Scala go with Java. Three months into Java I don’t want to stop
        that and start learning Scala.
        B.
        *From:* kartik saxena <kartik.sxn@gmail.com>
        *Sent:* Friday, October 17, 2014 1:12 PM
        *To:* user@hadoop.apache.org
        *Subject:* Re: Spark vs Tez
        *Subject:* Re: Spark vs Tez
        I did a performance benchmark during my summer internship . I am
        currently a grad student. Can't reveal much about the specific
        project but Spark is still faster than around 4-5th iteration of Tez
        of the same query/dataset. By Iteration I mean utilizing the
        "hot-container" property of Apache Tez  . See latest release of Tez
        and some hortonworks tutorials on their website.
        The only problem with Spark adoption is the steep learning curve of
        Scala , and understanding the API properly.
        Thanks
        On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
        <adaryl.wakefield@hotmail.com> wrote:

            Does anybody have any performance figures on how Spark stacks up
            against Tez? If you don’t have figures, does anybody have an
            opinion? Spark seems so popular but I’m not really seeing why.
            B.

    --
    Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
    datasyndrome.com

Re: Spark vs Tez

Posted by Edward Capriolo <ed...@gmail.com>.

Scala is not an interpreted language. From my non-authoritative view it
seems to have 2-3 (thousand) more compile phases than java, and as a result
some of the things you are doing that look like they are "interpreted" are
actually macros that get converted into "usually" efficient java code.

About scala in general: I have a few complaints. The interop is kinda
clunky. I have to work in scala and I run into stuff like this: the json
mapper in scala works! That is, until one property in my scala object is
actually a java object; then it does not. Or I should be able to call a
java method from scala, but I cannot figure out how to turn a Comparator
into a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to
write code the way you want to, and then if you do not use an immutable
collection or some other fancy scala construct, people get on your case that
you're not writing idiomatic scala (even though few agree on what that
really is).

Generally people have a large capacity to assume, "I'm smart, I know java,
and I learned lisp in school, so this scala stuff is going to be a breeze."
Don't make that assumption. You will not be proficient in writing scala for
months. You likely won't be able to hire anyone who has done much production
scala. And everyone will come up to you and say, "So I'm trying to (sort
list|cast objects|simple thing) in scala. I can do it in java, but YOU'RE
THE EXPERT and I'm wondering how to do it in scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

> Yeah, compared to something as performant as java...
> </sarcasm>
>
> On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:
>
>> Using an interpreted scripting language with something that is billing
>> itself as being fast doesn’t sound like the best idea...
>> B.
>> *From:* Russell Jurney <ma...@gmail.com>
>> *Sent:* Saturday, October 18, 2014 7:38 AM
>> *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
>> *Subject:* Re: Spark vs Tez
>> Check out PySpark. No Scala required.
>>
>> On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
>> <adaryl.wakefield@hotmail.com <ma...@hotmail.com>>
>> wrote:
>>
>>     “The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.”
>>     This is why I’m looking for reasons to avoid Spark. In my mind, it’s
>>     one more thing to have to master and doesn’t really have anything to
>>     offer that can’t be done with other tools that are already inside my
>>     skillset. I spoke with some software engineers recently and
>>     basically the discussion boiled down to if you need to master Java
>>     or Scala go with Java. Three months into Java I don’t want to stop
>>     that and start learning Scala.
>>     B.
>>     *From:* kartik saxena <kartik.sxn@gmail.com>
>>     *Sent:* Friday, October 17, 2014 1:12 PM
>>     *To:* user@hadoop.apache.org
>>     *Subject:* Re: Spark vs Tez
>>     *Subject:* Re: Spark vs Tez
>>     I did a performance benchmark during my summer internship . I am
>>     currently a grad student. Can't reveal much about the specific
>>     project but Spark is still faster than around 4-5th iteration of Tez
>>     of the same query/dataset. By Iteration I mean utilizing the
>>     "hot-container" property of Apache Tez  . See latest release of Tez
>>     and some hortonworks tutorials on their website.
>>     The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.
>>     Thanks
>>     On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
>>     <adaryl.wakefield@hotmail.com> wrote:
>>
>>         Does anybody have any performance figures on how Spark stacks up
>>         against Tez? If you don’t have figures, does anybody have an
>>         opinion? Spark seems so popular but I’m not really seeing why.
>>         B.
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com
>> datasyndrome.com
>>
>

Re: Spark vs Tez

Posted by Edward Capriolo <ed...@gmail.com>.

Scala is not an interpreted language. From my non-authoritative view it
seems to have 2-3 (thousand) more compile phases than Java, and as a result
some of the things you are doing that look like they are "interpreted" are
actually macros that get converted into "usually" efficient Java code.

About Scala in general: I have a few complaints. The interop is kinda
clunky. I have to work in Scala and run into stuff like this: the JSON
mapper in Scala works! That is, until one property in my Scala object is
actually a Java object, and then it does not. Or I should be able to call a
Java method from Scala, but cannot figure out how to turn a Comparator into
a Comparator[_ <: Any].

The immutability aspect I find to be a real PITA. It becomes really hard to
write code the way you want to, and then if you do not use an immutable
collection or some other fancy Scala construct, people get on your case that
you're not writing idiomatic Scala (even though few agree on what that really
is).

Generally people have a large capacity to assume, "I'm smart, I know Java,
and I learned Lisp in school, so this Scala stuff is going to be a breeze."
Don't make that assumption. You will not be proficient in writing Scala for
months. You likely won't be able to hire anyone who has done much
production Scala. And everyone will come up to you and say, "So I'm trying to
(sort list|cast objects|simple thing) in Scala. I can do it in Java, but
YOU'RE THE EXPERT and I'm wondering how to do it in Scala."

On Tue, Oct 21, 2014 at 10:04 AM, Tim Randles <tr...@lanl.gov> wrote:

> Yeah, compared to something as performant as java...
> </sarcasm>
>
> On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:
>
>> Using an interpreted scripting language with something that is billing
>> itself as being fast doesn’t sound like the best idea...
>> B.
>> *From:* Russell Jurney <ma...@gmail.com>
>> *Sent:* Saturday, October 18, 2014 7:38 AM
>> *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
>> *Subject:* Re: Spark vs Tez
>> Check out PySpark. No Scala required.
>>
>> On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
>> <adaryl.wakefield@hotmail.com <ma...@hotmail.com>>
>> wrote:
>>
>>     “The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.”
>>     This is why I’m looking for reasons to avoid Spark. In my mind, it’s
>>     one more thing to have to master and doesn’t really have anything to
>>     offer that can’t be done with other tools that are already inside my
>>     skillset. I spoke with some software engineers recently and
>>     basically the discussion boiled down to if you need to master Java
>>     or Scala go with Java. Three months into Java I don’t want to stop
>>     that and start learning Scala.
>>     B.
>>     *From:* kartik saxena
>>     <kartik.sxn@gmail.com>
>>     *Sent:* Friday, October 17, 2014 1:12 PM
>>     *To:* user@hadoop.apache.org
>>     *Subject:* Re: Spark vs Tez
>>     I did a performance benchmark during my summer internship . I am
>>     currently a grad student. Can't reveal much about the specific
>>     project but Spark is still faster than around 4-5th iteration of Tez
>>     of the same query/dataset. By Iteration I mean utilizing the
>>     "hot-container" property of Apache Tez  . See latest release of Tez
>>     and some hortonworks tutorials on their website.
>>     The only problem with Spark adoption is the steep learning curve of
>>     Scala , and understanding the API properly.
>>     Thanks
>>     On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
>>     <adaryl.wakefield@hotmail.com> wrote:
>>
>>         Does anybody have any performance figures on how Spark stacks up
>>         against Tez? If you don’t have figures, does anybody have an
>>         opinion? Spark seems so popular but I’m not really seeing why.
>>         B.
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney
>> <http://twitter.com/rjurney>russell.jurney@gmail.com
>> <ma...@gmail.com> datasyndrome.com
>> <http://datasyndrome.com/>
>>
>

Re: Spark vs Tez

Posted by Tim Randles <tr...@lanl.gov>.

Yeah, compared to something as performant as java...
</sarcasm>

On 10/20/2014 10:16 PM, Adaryl "Bob" Wakefield, MBA wrote:
> Using an interpreted scripting language with something that is billing
> itself as being fast doesn’t sound like the best idea...
> B.
> *From:* Russell Jurney <ma...@gmail.com>
> *Sent:* Saturday, October 18, 2014 7:38 AM
> *To:* user@hadoop.apache.org <ma...@hadoop.apache.org>
> *Subject:* Re: Spark vs Tez
> Check out PySpark. No Scala required.
>
> On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA
> <adaryl.wakefield@hotmail.com <ma...@hotmail.com>> wrote:
>
>     “The only problem with Spark adoption is the steep learning curve of
>     Scala , and understanding the API properly.”
>     This is why I’m looking for reasons to avoid Spark. In my mind, it’s
>     one more thing to have to master and doesn’t really have anything to
>     offer that can’t be done with other tools that are already inside my
>     skillset. I spoke with some software engineers recently and
>     basically the discussion boiled down to if you need to master Java
>     or Scala go with Java. Three months into Java I don’t want to stop
>     that and start learning Scala.
>     B.
>     *From:* kartik saxena
>     <kartik.sxn@gmail.com>
>     *Sent:* Friday, October 17, 2014 1:12 PM
>     *To:* user@hadoop.apache.org
>     *Subject:* Re: Spark vs Tez
>     I did a performance benchmark during my summer internship . I am
>     currently a grad student. Can't reveal much about the specific
>     project but Spark is still faster than around 4-5th iteration of Tez
>     of the same query/dataset. By Iteration I mean utilizing the
>     "hot-container" property of Apache Tez  . See latest release of Tez
>     and some hortonworks tutorials on their website.
>     The only problem with Spark adoption is the steep learning curve of
>     Scala , and understanding the API properly.
>     Thanks
>     On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA
>     <adaryl.wakefield@hotmail.com> wrote:
>
>         Does anybody have any performance figures on how Spark stacks up
>         against Tez? If you don’t have figures, does anybody have an
>         opinion? Spark seems so popular but I’m not really seeing why.
>         B.
>
>
>
> --
> Russell Jurney twitter.com/rjurney
> <http://twitter.com/rjurney>russell.jurney@gmail.com
> <ma...@gmail.com> datasyndrome.com
> <http://datasyndrome.com/>

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

Using an interpreted scripting language with something that is billing itself as being fast doesn’t sound like the best idea...
B.

From: Russell Jurney 
Sent: Saturday, October 18, 2014 7:38 AM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

Check out PySpark. No Scala required.

On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  “The only problem with Spark adoption is the steep learning curve of Scala , and understanding the API properly.” 

  This is why I’m looking for reasons to avoid Spark. In my mind, it’s one more thing to have to master and doesn’t really have anything to offer that can’t be done with other tools that are already inside my skillset. I spoke with some software engineers recently and basically the discussion boiled down to if you need to master Java or Scala go with Java. Three months into Java I don’t want to stop that and start learning Scala.

  B.
  From: kartik saxena 
  Sent: Friday, October 17, 2014 1:12 PM
  To: user@hadoop.apache.org
  Subject: Re: Spark vs Tez

  I did a performance benchmark during my summer internship . I am currently a grad student. Can't reveal much about the specific project but Spark is still faster than around 4-5th iteration of Tez of the same query/dataset. By Iteration I mean utilizing the "hot-container" property of Apache Tez  . See latest release of Tez and some hortonworks tutorials on their website.  

  The only problem with Spark adoption is the steep learning curve of Scala , and understanding the API properly. 

  Thanks

  On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com> wrote:

    Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
    B.

-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Spark vs Tez

Posted by Russell Jurney <ru...@gmail.com>.

Check out PySpark. No Scala required.

On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   “The only problem with Spark adoption is the steep learning curve of
> Scala , and understanding the API properly.”
>
> This is why I’m looking for reasons to avoid Spark. In my mind, it’s one
> more thing to have to master and doesn’t really have anything to offer that
> can’t be done with other tools that are already inside my skillset. I spoke
> with some software engineers recently and basically the discussion boiled
> down to if you need to master Java or Scala go with Java. Three months into
> Java I don’t want to stop that and start learning Scala.
>
>  B.
>   *From:* kartik saxena
> <kartik.sxn@gmail.com>
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  I did a performance benchmark during my summer internship . I am
> currently a grad student. Can't reveal much about the specific project but
> Spark is still faster than around 4-5th iteration of Tez of the same
> query/dataset. By Iteration I mean utilizing the "hot-container" property
> of Apache Tez  . See latest release of Tez and some hortonworks tutorials
> on their website.
>
> The only problem with Spark adoption is the steep learning curve of Scala
> , and understanding the API properly.
>
> Thanks
>
> On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>


-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Spark vs Tez

Posted by Russell Jurney <ru...@gmail.com>.

Check out PySpark. No Scala required.

On Friday, October 17, 2014, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   “The only problem with Spark adoption is the steep learning curve of
> Scala and of understanding the API properly.”
>
> This is why I’m looking for reasons to avoid Spark. In my mind, it’s one
> more thing to have to master, and it doesn’t really offer anything that
> can’t be done with other tools already inside my skillset. I spoke with
> some software engineers recently, and the discussion basically boiled down
> to this: if you need to master either Java or Scala, go with Java. Three
> months into Java, I don’t want to stop and start learning Scala.
>
>  B.
>   *From:* kartik saxena <kartik.sxn@gmail.com>
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  I did a performance benchmark during my summer internship (I am
> currently a grad student). I can't reveal much about the specific project,
> but Spark is still faster than around the 4th or 5th iteration of Tez on
> the same query/dataset. By "iteration" I mean utilizing the "hot-container"
> property of Apache Tez. See the latest release of Tez and some of the
> Hortonworks tutorials on their website.
>
> The only problem with Spark adoption is the steep learning curve of Scala
> and of understanding the API properly.
>
> Thanks
>
> On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>


-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

“The only problem with Spark adoption is the steep learning curve of Scala and of understanding the API properly.”

This is why I’m looking for reasons to avoid Spark. In my mind, it’s one more thing to have to master, and it doesn’t really offer anything that can’t be done with other tools already inside my skillset. I spoke with some software engineers recently, and the discussion basically boiled down to this: if you need to master either Java or Scala, go with Java. Three months into Java, I don’t want to stop and start learning Scala.

B.
From: kartik saxena 
Sent: Friday, October 17, 2014 1:12 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

I did a performance benchmark during my summer internship (I am currently a grad student). I can't reveal much about the specific project, but Spark is still faster than around the 4th or 5th iteration of Tez on the same query/dataset. By "iteration" I mean utilizing the "hot-container" property of Apache Tez. See the latest release of Tez and some of the Hortonworks tutorials on their website.

The only problem with Spark adoption is the steep learning curve of Scala and of understanding the API properly.

Thanks

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

  Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
  B.

Re: Spark vs Tez

Posted by kartik saxena <ka...@gmail.com>.

I did a performance benchmark during my summer internship (I am currently
a grad student). I can't reveal much about the specific project, but Spark
is still faster than around the 4th or 5th iteration of Tez on the same
query/dataset. By "iteration" I mean utilizing the "hot-container" property
of Apache Tez. See the latest release of Tez and some of the Hortonworks
tutorials on their website.

The only problem with Spark adoption is the steep learning curve of Scala
and of understanding the API properly.

Thanks

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Does anybody have any performance figures on how Spark stacks up
> against Tez? If you don’t have figures, does anybody have an opinion? Spark
> seems so popular but I’m not really seeing why.
> B.
>

Re: Spark vs Tez

Posted by Shahab Yunus <sh...@gmail.com>.

What aspects of Tez and Spark are you comparing? They serve different
purposes and thus are not directly comparable, as far as I understand.

Regards,
Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Does anybody have any performance figures on how Spark stacks up
> against Tez? If you don’t have figures, does anybody have an opinion? Spark
> seems so popular but I’m not really seeing why.
> B.
>

Re: Spark vs Tez

Posted by Alexander Pivovarov <ap...@gmail.com>.

AMPLab, the creator of Spark, did some benchmarks:
https://amplab.cs.berkeley.edu/benchmark/

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Does anybody have any performance figures on how Spark stacks up
> against Tez? If you don’t have figures, does anybody have an opinion? Spark
> seems so popular but I’m not really seeing why.
> B.
>

Spark vs Tez

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.

Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
B.

Re: how to copy data between two hdfs cluster fastly?

Posted by Shivram Mani <sm...@pivotal.io>.

What is your approximate input size?
Do you have multiple files, or is this one large file?
What is your block size (on the source and destination clusters)?

On Fri, Oct 17, 2014 at 4:19 AM, ch huang <ju...@gmail.com> wrote:

> No, everything is at the defaults.
>
> On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:
>
>> Did you specify how many map tasks to use?
>>
>>
>> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>>
>>> Hi, mailing list:
>>> I am using distcp to migrate data from CDH4.4 to CDH5.1. Copying small
>>> files works well, but transferring large data is very slow. Can anyone
>>> recommend a better method? Thanks.
>>>
>>
>>
>

-- 
Thanks
Shivram

Re: how to copy data between two hdfs cluster fastly?

Posted by ch huang <ju...@gmail.com>.

No, everything is at the defaults.

On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu <az...@gmail.com> wrote:

> Did you specify how many map tasks to use?
>
>
> On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:
>
>> Hi, mailing list:
>> I am using distcp to migrate data from CDH4.4 to CDH5.1. Copying small
>> files works well, but transferring large data is very slow. Can anyone
>> recommend a better method? Thanks.
>>
>
>

Re: how to copy data between two hdfs cluster fastly?

Posted by Azuryy Yu <az...@gmail.com>.

Did you specify how many map tasks to use?
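
As a rough illustration of why the map count matters, the sketch below builds a distcp command line with an explicit `-m` value and estimates how much data each map would copy. The endpoints, ports, and the use of an `hftp://` source are hypothetical assumptions, and the 2 TiB total is taken from the size reported elsewhere in this thread. The even split per map is an idealization: as far as I know, distcp in this Hadoop generation assigns whole files to maps, so extra maps only help when there are enough files to share out.

```python
def distcp_command(src, dst, num_maps):
    """Build the argument list for a distcp run with an explicit map count."""
    # -m caps the number of simultaneous copies (map tasks).
    return ["hadoop", "distcp", "-m", str(num_maps), src, dst]


def gib_per_map(total_bytes, num_maps):
    """Average data per map task in GiB, assuming a perfectly even split."""
    return total_bytes / num_maps / 1024**3


# Hypothetical endpoints; replace with the real NameNode addresses.
SRC = "hftp://source-nn.example.com:50070/data"
DST = "hdfs://dest-nn.example.com:8020/data"

if __name__ == "__main__":
    total = 2 * 1024**4  # assumed total size: 2 TiB
    for maps in (20, 100):
        print(" ".join(distcp_command(SRC, DST, maps)))
        print(f"  about {gib_per_map(total, maps):.1f} GiB per map")
```

The script only prints the commands, so it can be run without a Hadoop installation; the printed line is what would be executed on a cluster node.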

On Fri, Oct 17, 2014 at 4:58 PM, ch huang <ju...@gmail.com> wrote:

> Hi, mailing list:
> I am using distcp to migrate data from CDH4.4 to CDH5.1. Copying small
> files works well, but transferring large data is very slow. Can anyone
> recommend a better method? Thanks.
>
