Posted to mapreduce-user@hadoop.apache.org by "Brian C. Huffman" <bh...@etinternational.com> on 2013/05/08 16:21:33 UTC

MapReduce - FileInputFormat and Locality

All,

I'm trying to understand how the current FileInputFormat implements 
locality.  As far as I can tell, it calculates splits using getSplits, and 
each split will contain the node that hosts the first block of data in 
that split.  Is my understanding correct?

Looking at the FileInputFormat for the old API (mapred), it appears that 
it does more to implement locality, using getSplitHosts to "return the 
hosts that contribute most for a given split".
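
To make "hosts that contribute most" concrete, here is a minimal sketch of the idea rather than the actual Hadoop code. The Block class, the topHosts helper, and the host names are hypothetical stand-ins invented for illustration. As I understand it, the real getSplitHosts also takes rack topology into account, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitHostsSketch {
    /** Hypothetical stand-in for a block location: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Ranks hosts by how many bytes of [start, start + size) they store; returns the top n. */
    static List<String> topHosts(List<Block> blocks, long start, long size, int n) {
        long end = start + size;
        Map<String, Long> bytes = new HashMap<>();
        for (Block blk : blocks) {
            // Bytes of this block that fall inside the split (non-positive means no overlap).
            long overlap = Math.min(end, blk.offset + blk.length) - Math.max(start, blk.offset);
            if (overlap <= 0) {
                continue;
            }
            for (String host : blk.hosts) {
                bytes.merge(host, overlap, Long::sum);
            }
        }
        List<String> ranked = new ArrayList<>(bytes.keySet());
        ranked.sort((x, y) -> {
            int byBytes = Long.compare(bytes.get(y), bytes.get(x)); // most bytes first
            return byBytes != 0 ? byBytes : x.compareTo(y);          // ties broken by name
        });
        return ranked.subList(0, Math.min(n, ranked.size()));
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(0, 128, "A", "B", "C"),
                new Block(128, 128, "C", "D", "E"));
        // Split covers bytes [120, 248): 8 bytes of block 0, 120 bytes of block 1.
        System.out.println(topHosts(blocks, 120, 128, 3)); // prints [C, D, E]
    }
}
```

In this toy layout, taking the hosts of the first block would give A, B, C, while ranking by contributed bytes gives C, D, E.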

If my understanding is correct, why was the locality logic changed in the new API?

Thanks,
Brian


Re: MapReduce - FileInputFormat and Locality

Posted by Ted Dunning <td...@maprtech.com>.
I think that you just said what the OP said.

Your two cases reduce to the same single case that they had.

Whether this matters is another question, but it seems like it could in
cases where splits != blocks, especially if a split starts near the end of
a block, which could give an illusion of locality.
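
A rough way to put a number on that illusion is to compute how much of a split actually lies in the block that holds its first byte. This is only a back-of-the-envelope sketch: it assumes a uniform block size, and the class and method names are made up for illustration.

```java
class LocalityFractionSketch {
    /** Fraction of a split's bytes that fall inside the block containing its first byte. */
    static double firstBlockFraction(long blockSize, long splitStart, long splitSize) {
        // End offset of the block that contains splitStart (uniform block size assumed).
        long blockEnd = (splitStart / blockSize + 1) * blockSize;
        long inFirstBlock = Math.min(splitSize, blockEnd - splitStart);
        return (double) inFirstBlock / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Split aligned with a block: all of it is in the first block.
        System.out.println(firstBlockFraction(128 * mb, 0, 128 * mb));        // prints 1.0
        // Split starting 8 MB before the end of a block: only 8 of 128 MB are in it.
        System.out.println(firstBlockFraction(128 * mb, 120 * mb, 128 * mb)); // prints 0.0625
    }
}
```

In the second case a task scheduled on a first-block host would still read most of its input from elsewhere, unless that host happens to hold a replica of the next block too.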

My guess is that, since data locality is typically very high, this
doesn't matter much.


On Wed, May 8, 2013 at 3:00 PM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

> I think you misread it.
>
> If a given split has only one block, it uses all the locations of that
> block.
>
> If it so happens that a given split has multiple blocks, it uses all the
> locations of the first block.
>
>  HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
>
> On May 8, 2013, at 7:21 AM, Brian C. Huffman wrote:
>
> All,
>
> I'm trying to understand how the current FileInputFormat implements
> locality.  As far as I can tell, it calculates splits using getSplit and
> each split will contain the node that hosts the first block of data in that
> split.  Is my understanding correct?
>
> Looking at the FileInputFormat for the old API (mapred), it appears that
> it does more to implement locality, using getSplitHosts to "return the
> hosts that contribute most for a given split"
>
> If I understand correctly, why was this changed?
>
> Thanks,
> Brian
>
>
>

Re: MapReduce - FileInputFormat and Locality

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
I think you misread it.

If a given split has only one block, it uses all the locations of that block.

If it so happens that a given split has multiple blocks, it uses all the locations of the first block.
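
The rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Hadoop source: Block, blockIndex, and splitHosts are hypothetical stand-ins, and the host names are invented.

```java
import java.util.Arrays;
import java.util.List;

class FirstBlockHostsSketch {
    /** Hypothetical stand-in for a block location: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Index of the block whose byte range contains the given file offset. */
    static int blockIndex(List<Block> blocks, long offset) {
        for (int i = 0; i < blocks.size(); i++) {
            Block blk = blocks.get(i);
            if (offset >= blk.offset && offset < blk.offset + blk.length) {
                return i;
            }
        }
        throw new IllegalArgumentException("Offset " + offset + " is outside the file");
    }

    /** All replica hosts of the block containing the split's first byte. */
    static String[] splitHosts(List<Block> blocks, long splitStart) {
        return blocks.get(blockIndex(blocks, splitStart)).hosts;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(0, 128, "A", "B", "C"),
                new Block(128, 128, "C", "D", "E"));
        // A split starting at offset 120 begins in block 0, so it gets block 0's hosts
        // no matter how far it extends into block 1.
        System.out.println(Arrays.toString(splitHosts(blocks, 120))); // prints [A, B, C]
    }
}
```

Whether the split covers one block or several, the hosts returned are every replica location of the block where the split starts.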

HTH,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/


On May 8, 2013, at 7:21 AM, Brian C. Huffman wrote:

> All,
> 
> I'm trying to understand how the current FileInputFormat implements locality.  As far as I can tell, it calculates splits using getSplit and each split will contain the node that hosts the first block of data in that split.  Is my understanding correct?
> 
> Looking at the FileInputFormat for the old API (mapred), it appears that it does more to implement locality, using getSplitHosts to "return the hosts that contribute most for a given split"
> 
> If I understand correctly, why was this changed?
> 
> Thanks,
> Brian
> 

