Posted to mapreduce-issues@hadoop.apache.org by "Dmytro Molkov (JIRA)" <ji...@apache.org> on 2010/05/05 04:02:03 UTC

[jira] Created: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Implement getFileBlockLocations in HarFilesystem
------------------------------------------------

                 Key: MAPREDUCE-1752
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Dmytro Molkov


To run MapReduce efficiently on data that has been HAR'ed, it would be great to implement getFileBlockLocations for a given filename.
This way the JobTracker will have information about data locality and will schedule tasks appropriately.
I believe the overhead introduced by doing lookups in the index files can be smaller than that of copying data over the wire.
I will upload a patch shortly, but would love to get some feedback on this. Any ideas on how to test it are very welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Dmytro Molkov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated MAPREDUCE-1752:
-------------------------------------

    Status: Patch Available  (was: Open)

I will submit the patch for Hudson testing. When someone has time, I would appreciate a review.

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Patrick Kling (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927178#action_12927178 ] 

Patrick Kling commented on MAPREDUCE-1752:
------------------------------------------

There is something really strange about the semantics of the offsets and lengths returned by this. Consider the following part file consisting of 3 blocks containing a file f starting at offset 896 with length 512:

{code}
+---------------+
| ...           |
+---------------+
0           

+-----------+---+
| ...       | f |
+-----------+---+
512         896

+-----------+---+
| f         |...|
+-----------+---+
1024        1408
{code}

Calling getFileBlockLocations on this file will return 2 LocatedBlocks: b1=<offset=0, length=512>, b2=<offset=512, length=512>. This indicates that b1 contains the first 512 bytes of f, even though in fact it only contains the first 128 bytes. This is a problem when the client uses these LocatedBlocks to detect whether a portion of f has been corrupted.

I can think of 2 possible ways of fixing this:

1) Fix the offset of the returned blocks by subtracting hstatus.getStartIndex() (i.e., the offset of f in the part file) from the block offset. This would return b1=<offset=-384, length=512> and b2=<offset=128, length=512>, indicating to the client that the first 384 bytes of b1 are not part of f and correctly indicating the length of each block. In a way, this is similar to how FSNamesystem.getBlockLocations returns entire blocks even if the caller asks for a range that covers only part of these blocks.

2) Fix the length on the first block returned to reflect the portion of f that is contained in this block, i.e., return b1=<offset=128, length=128>, b2=<offset=128, length=512>. This seems somewhat less clean to me but avoids negative offsets. Also, it would break the convention that all blocks of a file with the exception of the last block are the same length.
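
The arithmetic behind the two options can be checked with a minimal sketch. It is not taken from either patch: the class and method names are hypothetical, block locations are modelled as plain {offset, length} pairs, and no Hadoop classes are used. For option 2 the sketch assumes offsets are measured from the start of f and clips every block, including the last one, to the bytes of f it holds, which goes a little further than the wording above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HarFileSystem: models the two proposed fixes
// with {offset, length} pairs instead of real block location objects.
final class HarBlockOffsets {

    // Option 1: keep whole-block lengths and shift each part-file block offset
    // by the start of f in the part file, so offsets may become negative.
    static long[][] shiftOffsets(long fileStart, long fileLen, long blockSize) {
        List<long[]> out = new ArrayList<>();
        long first = (fileStart / blockSize) * blockSize; // first block that touches f
        for (long b = first; b < fileStart + fileLen; b += blockSize) {
            out.add(new long[] {b - fileStart, blockSize});
        }
        return out.toArray(new long[0][]);
    }

    // Option 2 (as assumed here): clip every block to the bytes of f it holds
    // and express the offsets relative to the start of f.
    static long[][] clipToFile(long fileStart, long fileLen, long blockSize) {
        List<long[]> out = new ArrayList<>();
        long end = fileStart + fileLen;
        long first = (fileStart / blockSize) * blockSize;
        for (long b = first; b < end; b += blockSize) {
            long lo = Math.max(b, fileStart);
            long hi = Math.min(b + blockSize, end);
            out.add(new long[] {lo - fileStart, hi - lo});
        }
        return out.toArray(new long[0][]);
    }

    public static void main(String[] args) {
        // The example above: f starts at 896 with length 512, block size 512.
        System.out.println(Arrays.deepToString(shiftOffsets(896, 512, 512))); // [[-384, 512], [128, 512]]
        System.out.println(Arrays.deepToString(clipToFile(896, 512, 512)));   // [[0, 128], [128, 384]]
    }
}
```

With clipping, the lengths add up to the length of f (128 + 384 = 512), at the cost of a first block that is shorter than the block size.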

[jira] Assigned: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned MAPREDUCE-1752:
-------------------------------------------

    Assignee: Dmytro Molkov

[jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated MAPREDUCE-1752:
----------------------------------------------

    Component/s: harchive

[jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Dmytro Molkov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated MAPREDUCE-1752:
-------------------------------------

    Attachment: MAPREDUCE-1752.2.patch

Finally got back to this JIRA.
Attached is the patch that we tested internally and are currently using. It does add overhead at initial job submission, but it gives you locality when you run the job, which is a reasonable tradeoff.

We were thinking of eventually taking it one step further, so that the splits created by the job client at job submission refer to the part files of the HAR directly. That way the only piece of infrastructure accessing the HAR index file would be the client, and the MR tasks would go directly to the specific offsets inside the part files of the HAR. But this seems like another JIRA.
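
As a rough illustration of that follow-up idea (not something the attached patch does), translating a split expressed relative to an archived file into a range of the part file is a single addition plus a bounds check. The class and method names below are hypothetical, and the start offset of the file in the part file is assumed to come from the HAR index.

```java
// Hypothetical helper: maps a split given relative to an archived file onto
// the part file that physically stores it. No Hadoop classes are involved.
final class HarSplitTranslator {

    // Returns {offset in part file, length}, never reading past the end of the archived file.
    static long[] toPartFileRange(long fileStartInPart, long fileLen, long splitStart, long splitLen) {
        if (splitStart < 0 || splitLen < 0 || splitStart > fileLen) {
            throw new IllegalArgumentException("split lies outside the archived file");
        }
        long len = Math.min(splitLen, fileLen - splitStart);
        return new long[] {fileStartInPart + splitStart, len};
    }

    public static void main(String[] args) {
        // A file stored at offset 896 of the part file, 512 bytes long;
        // the split covers bytes 128..511 of that file.
        long[] range = toPartFileRange(896, 512, 128, 384);
        System.out.println(range[0] + " " + range[1]); // 1024 384
    }
}
```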

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Rodrigo Schmidt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872392#action_12872392 ] 

Rodrigo Schmidt commented on MAPREDUCE-1752:
--------------------------------------------

I've been following this discussion.

I think Dmytro's idea makes a lot of sense, especially for big jobs that read from big files. In such cases, the performance gains from having local reads would easily compensate for the extra delay at setup time.

The idea behind it is to use files stored in hadoop archives as input for mapreduce jobs. I don't think this method will be used elsewhere.

Using har to store mapreduce files that are stable (won't change anymore) but still necessary for read queries is a huge win for the namenode scalability, since it reduces the number of objects it has to store in memory.

[jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Dmytro Molkov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmytro Molkov updated MAPREDUCE-1752:
-------------------------------------

    Attachment: MR-1752.patch

Attaching a general idea for the patch. I will work on the unit test for this one soon.
Let me know if there are obvious problems with this approach.

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Dmytro Molkov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927186#action_12927186 ] 

Dmytro Molkov commented on MAPREDUCE-1752:
------------------------------------------

In the second case it would of course be b1=<offset=0, length=128>.

I personally like the second way of fixing it more, since it gives predictable offsets. For the file f the block locations would start with offset 0 and the total length would sum up to the total length of the file. The problem with it might be that the block location of the first block will have length different from the actual block length in this file.
The way block locations are returned currently, each of them except for the last one has the length of the block and starts at an offset that is a multiple of the block length. And even when I call getBlockLocations with an offset and length different from 0 and status.getLength(), I am not guaranteed to get a result where the lengths sum to the requested length and the smallest block location offset equals the offset provided.

That said, I think that the second approach fits better into this system unless having blocks of different lengths will be a problem.
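
The property argued for here (locations start at offset 0 and their lengths sum to the length of the file) is easy to state as a check. The sketch below is only an illustration with hypothetical names, using {offset, length} pairs rather than real block location objects. The clipped values assume the last block is also trimmed to 384 bytes, which this thread does not settle.

```java
// Hypothetical check, not part of the patch: verifies that a list of
// {offset, length} pairs starts at 0, has no gaps, and covers exactly fileLen bytes.
final class BlockLocationInvariant {

    static boolean coversFile(long[][] locations, long fileLen) {
        long expectedOffset = 0;
        for (long[] loc : locations) {
            if (loc[0] != expectedOffset || loc[1] <= 0) {
                return false; // gap, overlap, or empty location
            }
            expectedOffset += loc[1];
        }
        return expectedOffset == fileLen;
    }

    public static void main(String[] args) {
        long[][] clipped = {{0, 128}, {128, 384}};   // second approach, last block trimmed (assumption)
        long[][] unclipped = {{0, 512}, {512, 512}}; // what the current patch is reported to return
        System.out.println(coversFile(clipped, 512));   // true
        System.out.println(coversFile(unclipped, 512)); // false
    }
}
```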

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871891#action_12871891 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---------------------------------------------------

Also, the approach is quite expensive.  It requires
# read masterIndex
#- fs.open(masterIndex)
#- fs.getFileStatus(masterIndex)
#- read from datanode
# read archiveIndex
#- fs.open(archiveIndex)
#- read from datanode
# fs.getFileStatus(part)

[jira] Updated: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-1752:
-------------------------------------

    Fix Version/s: 0.22.0

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870770#action_12870770 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---------------------------------------------------

Hi Dmytro, are you still working on this?  Will you upload a patch soon?

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872668#action_12872668 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---------------------------------------------------

> I guess my approach was making it right and then looking at the ways we can optimize it rather than trying to hack up a fast solution right from the start.
> Do you have any other ideas that may be worth exploring?

Yes, your approach totally makes sense.  A potential improvement would be caching the masterIndex, archiveIndex and all the file statuses since a client calls getBlockLocation(..) multiple times for submitting a job.
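
One possible shape for such a cache, as a sketch only: parse the index once per archive path and reuse the result on later calls. The class name, the string key, and the loader function are all assumptions made for illustration. Nothing here reflects how HarFileSystem is actually structured, and the loader merely stands in for reading and parsing masterIndex and archiveIndex.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache: V stands for whatever parsed representation of the
// HAR index files a client would want to reuse between calls.
final class HarIndexCache<V> {
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;      // expensive step: open and read the index files
    private final AtomicInteger loads = new AtomicInteger();

    HarIndexCache(Function<String, V> loader) {
        this.loader = loader;
    }

    // Loads the index for an archive at most once; later calls hit the map.
    V get(String archivePath) {
        return cache.computeIfAbsent(archivePath, path -> {
            loads.incrementAndGet();
            return loader.apply(path);
        });
    }

    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        HarIndexCache<String> indexCache = new HarIndexCache<>(path -> "parsed index of " + path);
        for (int i = 0; i < 3; i++) {
            indexCache.get("/user/data/logs.har"); // made-up path
        }
        System.out.println(indexCache.loadCount()); // 1
    }
}
```

A real cache would also need an invalidation story, although archives that are never rewritten after creation make that easier.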

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864148#action_12864148 ] 

dhruba borthakur commented on MAPREDUCE-1752:
---------------------------------------------

Sounds like a good idea. +1

The idea is to make the contents of a Har file work well with FileInputFormat or CombineFileInputFormat, isn't it? In that case, you can look at TestCombineFileInputFormat and see if you can extend it to test the case when the input file(s) are har files.

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871871#action_12871871 ] 

Tsz Wo (Nicholas), SZE commented on MAPREDUCE-1752:
---------------------------------------------------

Dmytro, the patch does not compile.

[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Posted by "Dmytro Molkov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872372#action_12872372 ] 

Dmytro Molkov commented on MAPREDUCE-1752:
------------------------------------------

Nicholas, I do see that this approach is somewhat expensive. However, it gives us locality when we are running a job.
And this time will only be added to the job setup time, right?

I guess my approach was making it right and then looking at the ways we can optimize it rather than trying to hack up a fast solution right from the start.
Do you have any other ideas that may be worth exploring?
