Posted to hdfs-user@hadoop.apache.org by Pankaj Gupta <pa...@brightroll.com> on 2012/11/01 00:20:05 UTC
Reading part of file using Map Reduce
Hi,
Is it possible to run a MapReduce job on part of a file in HDFS? The use case is a single HDFS file used as a stream that stores all log events of a particular kind: new data keeps being appended to the file while MapReduce processes the older data. One option, of course, would be to copy part of the data into a separate file and give that to MapReduce, but I was wondering whether that extra copy can be avoided.
Thanks,
Pankaj
Re: Reading part of file using Map Reduce
Posted by Harsh J <ha...@cloudera.com>.
IIRC you can do this, but MR had some issues if you passed it a
non-closed (but sync'd upon) file for splitting.
However, if you run into similar issues, try generating your own
splits over the big file via FileInputFormat#getSplits(…), which will
then work.
On Thu, Nov 1, 2012 at 4:50 AM, Pankaj Gupta <pa...@brightroll.com> wrote:
> Hi,
>
> Is it possible to run a MapReduce job on part of a file in HDFS? The use case is a single HDFS file used as a stream that stores all log events of a particular kind: new data keeps being appended to the file while MapReduce processes the older data. One option, of course, would be to copy part of the data into a separate file and give that to MapReduce, but I was wondering whether that extra copy can be avoided.
>
> Thanks,
> Pankaj
--
Harsh J
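
For illustration, here is a rough sketch of the arithmetic a custom getSplits(…) override might perform when only the already-written part of a growing file should be processed. It is plain Java with no Hadoop dependency. The class PrefixSplits, the nested Range type, and the parameters snapshotLength and splitSize are hypothetical names, not part of any Hadoop API. In a real job each range would presumably be wrapped in a FileSplit and returned from a FileInputFormat subclass, and the record reader would still need to cope with a record cut off at the snapshot boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes byte ranges covering only the first
// snapshotLength bytes of a file, ignoring anything appended later.
class PrefixSplits {

    // One byte range of the file: [start, start + length).
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // snapshotLength: number of bytes considered "old" data (assumed to be
    //                 recorded by the caller before the job is submitted).
    // splitSize:      desired maximum number of bytes per split.
    static List<Range> computeSplits(long snapshotLength, long splitSize) {
        if (snapshotLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException(
                "snapshotLength must be >= 0 and splitSize must be > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < snapshotLength; start += splitSize) {
            // The last range is shortened so it never passes the snapshot.
            long length = Math.min(splitSize, snapshotLength - start);
            ranges.add(new Range(start, length));
        }
        return ranges;
    }
}
```

Calling PrefixSplits.computeSplits(250L, 100L) yields three ranges: (0, 100), (100, 100) and (200, 50). Data appended after the 250-byte snapshot falls outside every range, so it would be left for a later job.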