Posted to mapreduce-issues@hadoop.apache.org by "Mac Yang (JIRA)" <ji...@apache.org> on 2011/04/29 03:17:03 UTC

[jira] [Created] (MAPREDUCE-2459) Cache HAR filesystem metadata

Cache HAR filesystem metadata
-----------------------------

                 Key: MAPREDUCE-2459
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2459
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: harchive
            Reporter: Mac Yang
            Assignee: Mac Yang


Each HAR file system has two index files that contain information on how files are stored in the part files. During the block location calculation, these indexes are reread for every file in the archive. Caching the indexes and the status of the part files will greatly reduce the number of name node operations during job setup.
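
The caching idea described above can be illustrated with a minimal sketch. This is not the code from the attached patches: the class HarIndexCacheSketch, the ArchiveMetadata stand-in, the loader function, and the archive path are all hypothetical names chosen for illustration. The sketch only shows the general pattern of parsing an archive's metadata once and reusing it for every file in that archive.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch of the caching idea; not the code in HarFileSystem.java.
class HarIndexCacheSketch {

    // Stand-in for the parsed index contents and part-file status of one archive.
    static final class ArchiveMetadata {
        final String archivePath;

        ArchiveMetadata(String archivePath) {
            this.archivePath = archivePath;
        }
    }

    private final Map<String, ArchiveMetadata> cache = new ConcurrentHashMap<>();
    private final Function<String, ArchiveMetadata> loader;

    // The loader stands in for the expensive step: reading and parsing the index files.
    HarIndexCacheSketch(Function<String, ArchiveMetadata> loader) {
        this.loader = loader;
    }

    // Returns cached metadata, invoking the loader only on the first request per archive.
    ArchiveMetadata get(String archivePath) {
        return cache.computeIfAbsent(archivePath, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        HarIndexCacheSketch cache = new HarIndexCacheSketch(path -> {
            loads.incrementAndGet();
            return new ArchiveMetadata(path);
        });
        // Simulate block-location lookups for 1000 files in the same (made-up) archive.
        for (int i = 0; i < 1000; i++) {
            cache.get("/user/example/data.har");
        }
        System.out.println("index loads: " + loads.get());
    }
}
```

In this sketch the loader runs once per archive rather than once per file, which is the reduction in repeated index reads that the description is after.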

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mac Yang updated MAPREDUCE-2459:
--------------------------------

    Attachment: MAPREDUCE-2459.2.patch

[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mac Yang updated MAPREDUCE-2459:
--------------------------------

    Attachment: MAPREDUCE-2459.1.patch

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036890#comment-13036890 ] 

Hudson commented on MAPREDUCE-2459:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #690 (See [https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/690/])
    MAPREDUCE-2459. Cache HAR filesystem metadata. (Mac Yang via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1125428
Files : 
* /hadoop/mapreduce/trunk/CHANGES.txt
* /hadoop/mapreduce/trunk/src/tools/org/apache/hadoop/fs/HarFileSystem.java


[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mac Yang updated MAPREDUCE-2459:
--------------------------------

    Status: Patch Available  (was: Open)

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033383#comment-13033383 ] 

Hadoop QA commented on MAPREDUCE-2459:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12479166/MAPREDUCE-2459.2.patch
  against trunk revision 1102515.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

    +1 system test framework.  The patch passed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/245//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/245//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/245//console

This message is automatically generated.

[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-2459:
-------------------------------------

    Affects Version/s:     (was: 0.23.0)
        Fix Version/s: 0.23.0

[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mac Yang updated MAPREDUCE-2459:
--------------------------------

    Status: Open  (was: Patch Available)

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027156#comment-13027156 ] 

Hadoop QA commented on MAPREDUCE-2459:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12477799/MAPREDUCE-2459.1.patch
  against trunk revision 1097679.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed these core unit tests:
                  org.apache.hadoop.cli.TestMRCLI
                  org.apache.hadoop.tools.TestHadoopArchives
                  org.apache.hadoop.tools.TestHarFileSystem

    -1 contrib tests.  The patch failed contrib unit tests.

    -1 system test framework.  The patch failed system test framework compile.

Test results: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/199//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/199//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/199//console

This message is automatically generated.

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033350#comment-13033350 ] 

Mac Yang commented on MAPREDUCE-2459:
-------------------------------------

Mahadev, thanks for the feedback. I have updated the patch with the following changes:
- Removed '_' from harMetaCache
- Added a modification timestamp check that reparses the index files if necessary. This addresses the case where the archive is overwritten between two reads from the same process
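
A rough sketch of the timestamp check described in the second item follows. It is not the contents of MAPREDUCE-2459.2.patch: the class TimestampCheckedCacheSketch, the IndexSource interface, and its two methods are hypothetical, and the parsed index is reduced to a String for brevity. The sketch assumes that looking up a modification time is cheaper than rereading and reparsing the index files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a timestamp-validated cache; not the committed patch.
class TimestampCheckedCacheSketch {

    // Assumed abstraction over the file system; both methods are illustrative.
    interface IndexSource {
        long modificationTime(String archivePath); // cheap status lookup
        String parseIndex(String archivePath);     // expensive read and parse
    }

    // A cached parse result together with the timestamp it was built from.
    private static final class Entry {
        final long modificationTime;
        final String parsedIndex;

        Entry(long modificationTime, String parsedIndex) {
            this.modificationTime = modificationTime;
            this.parsedIndex = parsedIndex;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final IndexSource source;

    TimestampCheckedCacheSketch(IndexSource source) {
        this.source = source;
    }

    // Reparses only when there is no entry or the archive's timestamp has changed.
    synchronized String get(String archivePath) {
        long current = source.modificationTime(archivePath);
        Entry entry = cache.get(archivePath);
        if (entry == null || entry.modificationTime != current) {
            entry = new Entry(current, source.parseIndex(archivePath));
            cache.put(archivePath, entry);
        }
        return entry.parsedIndex;
    }
}
```

If the archive is overwritten between two reads, the stored timestamp no longer matches and the next call to get replaces the stale entry instead of returning it.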


[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-2459:
-------------------------------------

    Affects Version/s: 0.23.0

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036867#comment-13036867 ] 

Mahadev konar commented on MAPREDUCE-2459:
------------------------------------------

+1 lgtm. I'll commit it to trunk.

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032524#comment-13032524 ] 

Mahadev konar commented on MAPREDUCE-2459:
------------------------------------------

Mac, it looks like the tests are failing (especially TestHarFileSystem). The patch looks good to me. Is there any particular reason for using an _ in front of the following variable?

{noformat}
_harMetaCache
{noformat}

Also, is this meant for trunk only?




[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mac Yang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mac Yang updated MAPREDUCE-2459:
--------------------------------

    Status: Patch Available  (was: Open)

[jira] [Commented] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037407#comment-13037407 ] 

Hudson commented on MAPREDUCE-2459:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #686 (See [https://builds.apache.org/hudson/job/Hadoop-Mapreduce-trunk/686/])
    MAPREDUCE-2459. Cache HAR filesystem metadata. (Mac Yang via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1125428
Files : 
* /hadoop/mapreduce/trunk/CHANGES.txt
* /hadoop/mapreduce/trunk/src/tools/org/apache/hadoop/fs/HarFileSystem.java


[jira] [Updated] (MAPREDUCE-2459) Cache HAR filesystem metadata

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-2459:
-------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this to trunk. Thanks, Mac!
