Posted to hdfs-user@hadoop.apache.org by Jonathan Aquilina <ja...@eagleeyet.net> on 2015/02/23 08:00:46 UTC

recombining split files after data is processed

 

Hey all, 

I understand that the purpose of splitting files is to distribute the
data to multiple core and task nodes in a cluster. My question is:
after the output is complete, is there a way to combine all the parts
into a single file?

-- 
Regards,
Jonathan Aquilina
Founder Eagle Eye T
 

Re: recombining split files after data is processed

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
You could attach the hadoop dfs command as a bootstrap action or as an
extra step:
http://stackoverflow.com/questions/12055595/emr-how-to-join-files-into-one
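
As a rough sketch (the script wrapper, paths, and file names below are
placeholders of mine, not from the thread), the merge could be a small
shell script that you run on the master node, or attach as an extra step
after the main job:

    #!/bin/bash
    # merge_output.sh -- hypothetical post-processing script; run it on
    # the master node, or attach it as an EMR step once the main job has
    # finished. /hdfs/output/path and merged.txt are placeholder names.

    # Concatenate all part-* files from the job output into one file on
    # the local filesystem of the master node.
    hadoop dfs -getmerge /hdfs/output/path /home/hadoop/merged.txt

    # Optionally push the single merged file back into HDFS (or to S3).
    hadoop dfs -put /home/hadoop/merged.txt /hdfs/output/merged.txt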

BR,
 Alex


> On 23 Feb 2015, at 08:10, Jonathan Aquilina <ja...@eagleeyet.net> wrote:
> 
> Thanks Alex. Where would that command be placed: in a mapper, in a reducer, or run as a standalone command? Here at work we are looking to use Amazon EMR to do our number crunching, and we have access to the master node but not really to the rest of the cluster. Can this be added as a step to be run after the initial processing?
> 
>  
> ---
> Regards,
> Jonathan Aquilina
> Founder Eagle Eye T
> On 2015-02-23 08:05, Alexander Alten-Lorenz wrote:
> 
>> Hi,
>>  
>> You can use a single reducer (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name
>> 
>>  
>> BR,
>>  Alex
>>  
>> 
>>> On 23 Feb 2015, at 08:00, Jonathan Aquilina <jaquilina@eagleeyet.net> wrote:
>>> 
>>> Hey all,
>>> 
>>> I understand that the purpose of splitting files is to distribute the data to multiple core and task nodes in a cluster. My question is: after the output is complete, is there a way to combine all the parts into a single file?
>>> 
>>>  
>>> -- 
>>> Regards,
>>> Jonathan Aquilina
>>> Founder Eagle Eye T



Re: recombining split files after data is processed

Posted by Jonathan Aquilina <ja...@eagleeyet.net>.
 

Thanks Alex. Where would that command be placed: in a mapper, in a
reducer, or run as a standalone command? Here at work we are looking to
use Amazon EMR to do our number crunching, and we have access to the
master node but not really to the rest of the cluster. Can this be added
as a step to be run after the initial processing?

---
Regards,
Jonathan Aquilina
Founder Eagle Eye T

On 2015-02-23 08:05, Alexander Alten-Lorenz wrote: 

> Hi, 
> 
> You can use a single reducer (http://wiki.apache.org/hadoop/HowManyMapsAndReduces [1]) for smaller datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name 
> 
> BR, 
> Alex 
> 
>> On 23 Feb 2015, at 08:00, Jonathan Aquilina <ja...@eagleeyet.net> wrote: 
>> 
>> Hey all, 
>> 
>> I understand that the purpose of splitting files is to distribute the data to multiple core and task nodes in a cluster. My question is: after the output is complete, is there a way to combine all the parts into a single file? 
>> 
>> -- 
>> Regards,
>> Jonathan Aquilina
>> Founder Eagle Eye T
 

Links:
------
[1] http://wiki.apache.org/hadoop/HowManyMapsAndReduces


Re: recombining split files after data is processed

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Hi,

You can use a single reducer (http://wiki.apache.org/hadoop/HowManyMapsAndReduces) for smaller datasets, or 'getmerge': hadoop dfs -getmerge /hdfs/path local_file_name
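
A minimal sketch of both options (the jar name, class name, and paths are
placeholders; note that the -D generic option only takes effect if the
job uses ToolRunner/GenericOptionsParser):

    # Option 1: force a single reducer so the job itself writes exactly
    # one part file (only sensible for smaller datasets, since a single
    # reducer becomes a bottleneck).
    hadoop jar my-job.jar MyJobClass -D mapred.reduce.tasks=1 /input /output

    # Option 2: merge the part files of an already finished job into one
    # file on the local filesystem of the machine running the command.
    hadoop dfs -getmerge /hdfs/path local_file_name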


BR,
 Alex


> On 23 Feb 2015, at 08:00, Jonathan Aquilina <ja...@eagleeyet.net> wrote:
> 
> Hey all,
> 
> I understand that the purpose of splitting files is to distribute the data to multiple core and task nodes in a cluster. My question is: after the output is complete, is there a way to combine all the parts into a single file?
> 
>  
> -- 
> Regards,
> Jonathan Aquilina
> Founder Eagle Eye T

