Posted to hdfs-user@hadoop.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/12/22 07:40:36 UTC

Merging files

Is it possible to merge files from several different HDFS locations into one
file at another HDFS location?

Re: Merging files

Posted by Edward Capriolo <ed...@gmail.com>.
https://github.com/edwardcapriolo/filecrush

^ Another option

On Sun, Dec 23, 2012 at 1:20 AM, Mohit Anchlia <mo...@gmail.com> wrote:

> Thanks for the info. I was trying not to use nfs because my data size
> might be 10-20GB in size for every merge I perform. I'll use pig instead.
>
> In dstcp I checked and none of the directories are duplicate. Looking at
> the logs it looks like it's failing because all those directories have
> sub-directories of the same name.
>
> On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <td...@maprtech.com> wrote:
>
>> A pig script should work quite well.
>>
>> I also note that the file paths have maprfs in them.  This implies that
>> you are using MapR and could simply use the normal linux command cat to
>> concatenate the files if you mount the files using NFS (depending on
>> volume, of course).  For small amounts of data, this would work very well.
>>  For large amounts of data, you would be better with some kind of
>> map-reduce program.  Your Pig script is just the sort of thing.
>>
>> Keep in mind if you write a map-reduce program (or pig script) that you
>> will wind up with as many outputs as you have reducers.  If you have only a
>> single reducer, you will get one output file, but that will mean that only
>> a single process will do all the writing.  That would be no faster than
>> using the cat + NFS method above.  Having multiple reducers will allow you
>> to have write parallelism.
>>
>> The error message that distcp is giving you is a little odd, however,
>> since it implies that some of your input files are repeated.  Is that
>> possible?
>>
>>
>>
>> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <mo...@gmail.com> wrote:
>>
>>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>>> write a pig script to load from multiple paths
>>>
>>>
>>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input,
>>> there are duplicated files in the sources:
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>>
>>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>>
>>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>>
>>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>>
>>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>>
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>
>>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>>
>>>
>>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <td...@maprtech.com> wrote:
>>>
>>>> The technical term for this is "copying".  You may have heard of it.
>>>>
>>>> It is a subject of such long technical standing that many do not
>>>> consider it worthy of detailed documentation.
>>>>
>>>> Distcp effects a similar process and can be modified to combine the
>>>> input files into a single file.
>>>>
>>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>>
>>>>
>>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <ba...@gmail.com> wrote:
>>>>
>>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>>
>>>>>
>>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>>
>>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>>> source streams into it. Or to save code time, an identity job with a
>>>>>> single reducer (you may not get control over ordering this way).
>>>>>>
>>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <
>>>>>> mohitanchlia@gmail.com> wrote:
>>>>>> > Is it possible to merge files from different locations from HDFS
>>>>>> location
>>>>>> > into one file into HDFS location?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harsh J
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Merging files

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks for the info. I was trying not to use NFS because my data might be
10-20 GB for every merge I perform. I'll use Pig instead.

For distcp, I checked and none of the directories are duplicates. Looking at
the logs, it looks like it's failing because all those directories have
sub-directories of the same name.

On Sat, Dec 22, 2012 at 2:05 PM, Ted Dunning <td...@maprtech.com> wrote:

> A pig script should work quite well.
>
> I also note that the file paths have maprfs in them.  This implies that
> you are using MapR and could simply use the normal linux command cat to
> concatenate the files if you mount the files using NFS (depending on
> volume, of course).  For small amounts of data, this would work very well.
>  For large amounts of data, you would be better with some kind of
> map-reduce program.  Your Pig script is just the sort of thing.
>
> Keep in mind if you write a map-reduce program (or pig script) that you
> will wind up with as many outputs as you have reducers.  If you have only a
> single reducer, you will get one output file, but that will mean that only
> a single process will do all the writing.  That would be no faster than
> using the cat + NFS method above.  Having multiple reducers will allow you
> to have write parallelism.
>
> The error message that distcp is giving you is a little odd, however,
> since it implies that some of your input files are repeated.  Is that
> possible?
>
>
>
> On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <mo...@gmail.com> wrote:
>
>> Tried distcp but it fails. Is there a way to merge them? Or else I could
>> write a pig script to load from multiple paths
>>
>>
>> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
>> are duplicated files in the sources:
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
>> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>>
>> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>>
>> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>>
>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>>
>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>
>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>>
>>
>> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <td...@maprtech.com> wrote:
>>
>>> The technical term for this is "copying".  You may have heard of it.
>>>
>>> It is a subject of such long technical standing that many do not
>>> consider it worthy of detailed documentation.
>>>
>>> Distcp effects a similar process and can be modified to combine the
>>> input files into a single file.
>>>
>>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>>
>>>
>>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <ba...@gmail.com> wrote:
>>>
>>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>>
>>>>
>>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>>> Yes, via the simple act of opening a target stream and writing all
>>>>> source streams into it. Or to save code time, an identity job with a
>>>>> single reducer (you may not get control over ordering this way).
>>>>>
>>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <
>>>>> mohitanchlia@gmail.com> wrote:
>>>>> > Is it possible to merge files from different locations from HDFS
>>>>> location
>>>>> > into one file into HDFS location?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Merging files

Posted by Ted Dunning <td...@maprtech.com>.
A pig script should work quite well.

I also note that the file paths have maprfs in them.  This implies that
you are using MapR and could simply use the normal Linux command cat to
concatenate the files if you mount them over NFS (depending on volume, of
course).  For small amounts of data, this would work very well.  For large
amounts of data, you would be better off with some kind of map-reduce
program.  Your Pig script is just the sort of thing.

Keep in mind that if you write a map-reduce program (or Pig script) you
will wind up with as many output files as you have reducers.  If you have
only a single reducer, you will get one output file, but that means that
only a single process does all the writing.  That would be no faster than
using the cat + NFS method above.  Having multiple reducers gives you write
parallelism.
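
For what it's worth, here is a minimal, untested sketch of that kind of
identity-style merge job (plain-text inputs assumed; the class name and
argument handling are made up for illustration). The last argument is the
output directory, everything before it is an input path. Note that the lines
come out sorted by content rather than in their original order:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical merge job: funnels every input line through the reducers,
// producing exactly as many output files as there are reducers.
public class MergeFiles extends Configured implements Tool {

  public static class LineMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Pass each line through unchanged; the byte offset key is dropped.
      context.write(line, NullWritable.get());
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "merge-files");
    job.setJarByClass(MergeFiles.class);
    job.setMapperClass(LineMapper.class);
    // Default (identity) reducer. One reducer => a single output file but a
    // single writer; more reducers => more files but parallel writes.
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    for (int i = 0; i < args.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MergeFiles(), args));
  }
}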

The error message that distcp is giving you is a little odd, however, since
it implies that some of your input files are repeated.  Is that possible?



On Sat, Dec 22, 2012 at 12:53 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> Tried distcp but it fails. Is there a way to merge them? Or else I could
> write a pig script to load from multiple paths
>
>
> org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
> are duplicated files in the sources:
> maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
> maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
>
> at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
>
> at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
>
> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
>
> at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>
> at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
>
>
> On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <td...@maprtech.com> wrote:
>
>> The technical term for this is "copying".  You may have heard of it.
>>
>> It is a subject of such long technical standing that many do not consider
>> it worthy of detailed documentation.
>>
>> Distcp effects a similar process and can be modified to combine the input
>> files into a single file.
>>
>> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>>
>>
>> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <ba...@gmail.com> wrote:
>>
>>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>>
>>>
>>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:
>>>
>>>> Yes, via the simple act of opening a target stream and writing all
>>>> source streams into it. Or to save code time, an identity job with a
>>>> single reducer (you may not get control over ordering this way).
>>>>
>>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mo...@gmail.com>
>>>> wrote:
>>>> > Is it possible to merge files from different locations from HDFS
>>>> location
>>>> > into one file into HDFS location?
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>
>

Re: Merging files

Posted by Mohit Anchlia <mo...@gmail.com>.
Tried distcp but it fails. Is there a way to merge them? Or else I could
write a pig script to load from multiple paths.


org.apache.hadoop.tools.DistCp$DuplicationException: Invalid input, there
are duplicated files in the sources:
maprfs:/user/apuser/web-analytics/flume-output/2012/12/20/22/output/appinfo,
maprfs:/user/apuser/web-analytics/flume-output/2012/12/21/00/output/appinfo
    at org.apache.hadoop.tools.DistCp.checkDuplication(DistCp.java:1419)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1222)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:675)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:910)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)


On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning <td...@maprtech.com> wrote:

> The technical term for this is "copying".  You may have heard of it.
>
> It is a subject of such long technical standing that many do not consider
> it worthy of detailed documentation.
>
> Distcp effects a similar process and can be modified to combine the input
> files into a single file.
>
> http://hadoop.apache.org/docs/r1.0.4/distcp.html
>
>
> On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <ba...@gmail.com> wrote:
>
>> Can you please attach HOW-TO links for the alternatives you mentioned?
>>
>>
>> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Yes, via the simple act of opening a target stream and writing all
>>> source streams into it. Or to save code time, an identity job with a
>>> single reducer (you may not get control over ordering this way).
>>>
>>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mo...@gmail.com>
>>> wrote:
>>> > Is it possible to merge files from different locations from HDFS
>>> location
>>> > into one file into HDFS location?
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: Merging files

Posted by Ted Dunning <td...@maprtech.com>.
The technical term for this is "copying".  You may have heard of it.

It is a subject of such long technical standing that many do not consider
it worthy of detailed documentation.

Distcp effects a similar process and can be modified to combine the input
files into a single file.

http://hadoop.apache.org/docs/r1.0.4/distcp.html


On Sat, Dec 22, 2012 at 10:54 AM, Barak Yaish <ba...@gmail.com> wrote:

> Can you please attach HOW-TO links for the alternatives you mentioned?
>
>
> On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Yes, via the simple act of opening a target stream and writing all
>> source streams into it. Or to save code time, an identity job with a
>> single reducer (you may not get control over ordering this way).
>>
>> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>> > Is it possible to merge files from different locations from HDFS
>> location
>> > into one file into HDFS location?
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Merging files

Posted by Barak Yaish <ba...@gmail.com>.
Can you please attach HOW-TO links for the alternatives you mentioned?

On Sat, Dec 22, 2012 at 10:46 AM, Harsh J <ha...@cloudera.com> wrote:

> Yes, via the simple act of opening a target stream and writing all
> source streams into it. Or to save code time, an identity job with a
> single reducer (you may not get control over ordering this way).
>
> On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > Is it possible to merge files from different locations from HDFS location
> > into one file into HDFS location?
>
>
>
> --
> Harsh J
>

Re: Merging files

Posted by Harsh J <ha...@cloudera.com>.
Yes, via the simple act of opening a target stream and writing all
source streams into it. Or to save code time, an identity job with a
single reducer (you may not get control over ordering this way).
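
The first option is only a few lines against the FileSystem API. A rough,
untested sketch (it assumes plain files sitting directly under the source
directories and that the paths live on the default filesystem; the class
name and argument handling are just for illustration):

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical single-process merge: copies every file under the given
// source directories into one target file, in listing order.
public class StreamMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Last argument is the target file, everything before it is a source dir.
    Path target = new Path(args[args.length - 1]);
    OutputStream out = fs.create(target);            // the one target stream
    try {
      for (int i = 0; i < args.length - 1; i++) {
        for (FileStatus stat : fs.listStatus(new Path(args[i]))) {
          if (stat.isDir()) {
            continue;                                // skip sub-directories
          }
          InputStream in = fs.open(stat.getPath());
          try {
            IOUtils.copyBytes(in, out, conf, false); // append this source file
          } finally {
            in.close();
          }
        }
      }
    } finally {
      out.close();
    }
  }
}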

On Sat, Dec 22, 2012 at 12:10 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> Is it possible to merge files from different locations from HDFS location
> into one file into HDFS location?



-- 
Harsh J
