Posted to common-issues@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2010/02/02 20:02:19 UTC

[jira] Created: (HADOOP-6532) Path objects are heavy

Path objects are heavy
----------------------

                 Key: HADOOP-6532
                 URL: https://issues.apache.org/jira/browse/HADOOP-6532
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs
            Reporter: Tsz Wo (Nicholas), SZE


Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier because it contains a URI.  A Path is roughly 3 times the size of a String.  Will post some numbers in the comments.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828732#action_12828732 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------

- 10^6 Path objects
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5993061      239722440  java.lang.String
   2:       1000000      144000000  java.net.URI
   3:       1003050      111550720  [C
   4:       1000000       24000000  org.apache.hadoop.fs.Path
{noformat}
A Path object ~= 239 + 144 + 111 + 24 = 518 bytes
Running time was 23986 ms

- 10^6 String objects
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1002961      111456288  [C
   2:       1002973       40118920  java.lang.String
{noformat}
A String object ~= 111 + 40 = 151 bytes
Running time was 5834 ms

- 10^6 MyPath objects (MyPath only contains a String)
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1002961      111456256  [C
   2:       1002973       40118920  java.lang.String
   3:       1000000       24000000  org.apache.hadoop.examples.Fun$MyPath
{noformat}
A MyPath object ~= 111 + 40 + 24 = 175 bytes
Running time was 6828 ms
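
The per-object estimates above can be reproduced from the histogram byte totals. The following is a minimal sketch of that arithmetic only; the class and method names are made up for illustration, and it does not re-run the measurement (Fun.java is not reproduced here). Using the unrounded totals, a Path comes to about 519 bytes and a String to about 152 bytes, a ratio of roughly 3.4. The totals include a few thousand JVM-internal String and char[] instances, so the figures are approximate.

```java
/** Hypothetical helper: derives bytes-per-object from the histogram totals above. */
final class HistogramMath {
  /** Number of Path / String / MyPath objects created in each run (10^6). */
  static final long OBJECTS = 1000000L;

  /** Sums the per-class byte totals of one run and divides by the object count. */
  static double bytesPerObject(long... classTotals) {
    long sum = 0;
    for (long total : classTotals) {
      sum += total;
    }
    return (double) sum / OBJECTS;
  }

  public static void main(String[] args) {
    // Byte totals copied from the three histograms in this comment.
    double path = bytesPerObject(239722440L, 144000000L, 111550720L, 24000000L);
    double string = bytesPerObject(111456288L, 40118920L);
    double myPath = bytesPerObject(111456256L, 40118920L, 24000000L);

    System.out.printf("Path        ~ %.1f bytes%n", path);
    System.out.printf("String      ~ %.1f bytes%n", string);
    System.out.printf("MyPath      ~ %.1f bytes%n", myPath);
    System.out.printf("Path/String ~ %.2f%n", path / string);
  }
}
```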



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828807#action_12828807 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------

> Is it possible to keep the Path object, but avoid materializing the fields? 
Yes, we only have to get rid of Path.uri.

> Another alternative might be to have Path, like String, contain only a char[]....
That's correct.  char[] is better than String.



[jira] Updated: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6532:
-------------------------------------------

    Description: 
Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier because it contains a URI.  A Path is roughly 3 times the size of a String.  See some numbers in the comments.

A major benefit of decreasing Path size is that ls, archive, etc. become feasible on directories with many files.

  was:Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier since it contains URI.  The size of a Path is roughly 3 times of a String.  Will post some numbers in the comments.




[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828758#action_12828758 ] 

Eli Collins commented on HADOOP-6532:
-------------------------------------

Path is essentially a thin wrapper around URI, so we'll end up reimplementing the URI functionality that we use (e.g. the constructor error checking and the normalize, resolve, etc. functions), which means the memory savings won't be the same as just making uri a String. Also, there are quite a few callers of Path#toUri that we'll need to convert if we want to avoid creating URI objects. Most of them don't actually want a URI, though, so that shouldn't be hard. I'm +1 on removing the URI member and making Path implement the needed URI functionality explicitly; I just want to point these issues out.
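
To make the trade-off concrete, here is a minimal sketch of a String-backed path that builds a URI only when a caller asks for one. StringBackedPath is a hypothetical stand-in, not Hadoop's Path and not a proposed patch; it uses only java.net.URI calls that exist in the JDK. Note that syntax checking moves from the constructor to toUri() in this sketch, which is exactly the kind of behavior change that would have to be handled explicitly.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical String-backed path; illustrates lazy URI creation only. */
final class StringBackedPath {
  /** The only per-instance state: no URI member is kept. */
  private final String location;

  StringBackedPath(String location) {
    if (location == null || location.isEmpty()) {
      throw new IllegalArgumentException("Cannot create a path from an empty string");
    }
    this.location = location;
  }

  /** Parses and normalizes a URI on demand; the result is not cached. */
  URI toUri() {
    try {
      return new URI(location).normalize();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  @Override
  public String toString() {
    return location;
  }

  public static void main(String[] args) {
    StringBackedPath p = new StringBackedPath("hdfs://nn:8020/a/./b");
    System.out.println(p + " -> " + p.toUri());
  }
}
```

Callers that only need the string form never pay for a URI; callers of toUri() pay for parsing on each call, so hot paths would want to avoid it.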



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837011#action_12837011 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------

Oops, my previous comment was supposed to be posted on HADOOP-6467.  Sorry.



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828756#action_12828756 ] 

Doug Cutting commented on HADOOP-6532:
--------------------------------------

Another alternative might be to have Path, like String, contain only a char[].  Then it would have the same weight as String, but still be a distinct type with its own methods, constructors, etc.
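
A rough sketch of what such a type could look like follows. CharArrayPath is hypothetical and assumes plain slash-separated paths with no scheme or authority; it only illustrates that a distinct type with its own methods can sit directly on a char[].

```java
/** Hypothetical char[]-backed path type; a sketch, not Hadoop's Path. */
final class CharArrayPath {
  /** The only per-instance state, comparable to what a String carries. */
  private final char[] chars;

  CharArrayPath(String s) {
    this.chars = s.toCharArray();
  }

  /** Counts non-empty components: "/" has depth 0, "/a/b" has depth 2. */
  int depth() {
    int depth = 0;
    boolean inComponent = false;
    for (char c : chars) {
      if (c == '/') {
        inComponent = false;
      } else if (!inComponent) {
        inComponent = true;
        depth++;
      }
    }
    return depth;
  }

  @Override
  public String toString() {
    return new String(chars);
  }

  public static void main(String[] args) {
    CharArrayPath p = new CharArrayPath("/user/data/part-0");
    System.out.println(p + " has depth " + p.depth());
  }
}
```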



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828742#action_12828742 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------

So I suggest that we replace the Path.uri with a String.

Below are the details of the testing environment:
{noformat}
java.version = 1.6.0_15
java.runtime.name = Java(TM) SE Runtime Environment
java.runtime.version = 1.6.0_15-b03
java.vm.version = 14.1-b02
java.vm.vendor = Sun Microsystems Inc.
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
java.vm.specification.version = 1.0
java.specification.version = 1.6
os.arch = amd64
os.name = Linux
os.version = 2.6.9-55.ELsmp
{noformat}



[jira] Updated: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6532:
-------------------------------------------

    Attachment: Fun.java

Fun.java: a test program.



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828754#action_12828754 ] 

Hong Tang commented on HADOOP-6532:
-----------------------------------

Is it possible to keep the Path object, but avoid materializing the fields?



[jira] Commented: (HADOOP-6532) Path objects are heavy

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837009#action_12837009 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------

I ran wordcount with the v2 patch twice.  Both took 16 mins.  Not sure why it ran slower than the previous patch.
{noformat}
-bash-3.1$ date; time $H ${WC_CMD} ${HAR_FULL}/${DIR} ${TT_WC}3
Mon Feb 22 23:23:03 UTC 2010
10/02/22 23:23:19 INFO input.FileInputFormat: Total input paths to process : 100000
10/02/22 23:33:53 INFO mapred.JobClient: Running job: job_201002042035_75937
10/02/22 23:33:54 INFO mapred.JobClient:  map 0% reduce 0%
...
10/02/22 23:39:08 INFO mapred.JobClient:  map 100% reduce 100%
...
10/02/22 23:39:14 INFO mapred.JobClient:     Reduce input records=17729

real    16m11.171s
user    2m24.217s
sys     0m39.462s
{noformat}
Anyway, below are some improvement suggestions:
- Store harPath.depth() in a variable so that the same value is not computed repeatedly.

- Unfortunately, Path is quite expensive (see HADOOP-6532).  It is better to check child.startsWith(parentString) before creating thisPath.  That is, replace
{code}
+        String child = lineFeed.substring(0, lineFeed.indexOf(" "));
+        Path thisPath = new Path(child);
+        if ((child.startsWith(parentString)) && (thisPath.depth() == harPath.depth() + 1)) {
+          ...
{code}
with
{code}
+        String child = lineFeed.substring(0, lineFeed.indexOf(" "));
+        if (child.startsWith(parentString)) {
+          Path thisPath = new Path(child);
+          if (thisPath.depth() == harPath.depth() + 1) {
+            ...
{code}

- Also, some lines are longer than 80 characters.
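
The first two suggestions can be sketched in a self-contained form. Everything here is illustrative: ChildFilter, depthOf and the sample index lines are assumptions, and depthOf merely stands in for new Path(s).depth() so that the example runs without Hadoop. A call counter shows how many of the expensive operations the cheap prefix check avoids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of "cheap String check first, expensive object second". */
final class ChildFilter {
  /** Counts calls to the stand-in for the expensive Path construction. */
  static int expensiveCalls = 0;

  /** Stand-in for new Path(s).depth(): counts non-empty components. */
  static int depthOf(String s) {
    expensiveCalls++;
    int depth = 0;
    for (String part : s.split("/")) {
      if (!part.isEmpty()) {
        depth++;
      }
    }
    return depth;
  }

  /** Applies the same two checks as the suggested code, in the suggested order. */
  static List<String> directChildren(String parent, List<String> indexLines) {
    int parentDepth = depthOf(parent); // computed once, outside the loop
    List<String> children = new ArrayList<String>();
    for (String line : indexLines) {
      String child = line.substring(0, line.indexOf(' '));
      if (child.startsWith(parent)) {             // cheap String prefix test
        if (depthOf(child) == parentDepth + 1) {  // expensive step, only if needed
          children.add(child);
        }
      }
    }
    return children;
  }

  public static void main(String[] args) {
    // Made-up lines; only "name up to the first space" follows the quoted code.
    List<String> lines = Arrays.asList(
        "/a/b file 0", "/a/b/c file 1", "/x/y file 2", "/a file 3");
    System.out.println(directChildren("/a", lines));
    System.out.println("expensive calls: " + expensiveCalls);
  }
}
```

As in the quoted code, startsWith is a plain string prefix test, so it serves only as an inexpensive pre-filter ahead of the depth comparison.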

