Posted to common-issues@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2010/02/02 20:02:19 UTC
[jira] Created: (HADOOP-6532) Path objects are heavy
Path objects are heavy
----------------------
Key: HADOOP-6532
URL: https://issues.apache.org/jira/browse/HADOOP-6532
Project: Hadoop Common
Issue Type: Improvement
Components: fs
Reporter: Tsz Wo (Nicholas), SZE
Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier since it contains a URI. The size of a Path is roughly 3 times that of a String. Will post some numbers in the comments.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828732#action_12828732 ]
Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------
- 10^6 Path objects
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5993061      239722440  java.lang.String
   2:       1000000      144000000  java.net.URI
   3:       1003050      111550720  [C
   4:       1000000       24000000  org.apache.hadoop.fs.Path
{noformat}
A Path object ~= 239 + 144 + 111 + 24 = 518 bytes
Running time was 23986 ms
- 10^6 String objects
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1002961      111456288  [C
   2:       1002973       40118920  java.lang.String
{noformat}
A String object ~= 111 + 40 = 151 bytes
Running time was 5834 ms
- 10^6 MyPath objects (MyPath only contains a String)
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1002961      111456256  [C
   2:       1002973       40118920  java.lang.String
   3:       1000000       24000000  org.apache.hadoop.examples.Fun$MyPath
{noformat}
A MyPath object ~= 111 + 40 + 24 = 175 bytes
Running time was 6828 ms
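The per-object estimates above can be re-derived from the histogram byte totals. The following is only a sketch of that arithmetic (the class and method names are made up for illustration, and this is not the attached Fun.java): it sums the #bytes column of each histogram and divides by the 10^6 objects created. Summing before dividing gives 519 bytes for Path rather than 518, because the estimate above truncates each term to whole megabytes first. The histograms also include a few thousand Strings and char[] arrays that belong to the JVM itself, so all of these figures are approximate.

```java
public class FootprintMath {
    /**
     * Sums the byte totals taken from one histogram and divides by the
     * number of objects the test created (integer division).
     */
    static long bytesPerObject(long objects, long... histogramBytes) {
        long total = 0;
        for (long b : histogramBytes) {
            total += b;
        }
        return total / objects;
    }

    public static void main(String[] args) {
        long n = 1000000L;
        // Byte totals copied from the three histograms in this comment.
        long path = bytesPerObject(n, 239722440L, 144000000L, 111550720L, 24000000L);
        long string = bytesPerObject(n, 111456288L, 40118920L);
        long myPath = bytesPerObject(n, 111456256L, 40118920L, 24000000L);

        System.out.println("Path   ~ " + path + " bytes");
        System.out.println("String ~ " + string + " bytes");
        System.out.println("MyPath ~ " + myPath + " bytes");
        System.out.println("Path / String ~ " + ((double) path / string));
    }
}
```

With these inputs the Path-to-String ratio comes out at about 3.4, consistent with the "roughly 3 times" in the description.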
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828807#action_12828807 ]
Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------
> Is it possible to keep the Path object, but avoid materializing the fields?
Yes, we only have to get rid of Path.uri.
> Another alternative might be to have Path, like String, contain only a char[]....
That's correct. char[] is better than String.
[jira] Updated: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo (Nicholas), SZE updated HADOOP-6532:
-------------------------------------------
Description:
Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier since it contains a URI. The size of a Path is roughly 3 times that of a String. See some numbers in the comments.
A major impact of decreasing Path size is allowing ls, archive, etc. on directories with many files.
was: Compared with java.lang.String, org.apache.hadoop.fs.Path is much heavier since it contains a URI. The size of a Path is roughly 3 times that of a String. Will post some numbers in the comments.
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828758#action_12828758 ]
Eli Collins commented on HADOOP-6532:
-------------------------------------
Path is essentially a thin wrapper around URI, so we'll end up reimplementing the URI functionality that we use (e.g. the constructor error checking and the normalize and resolve functions). The memory savings therefore won't be the same as just making uri a String. Also, there are quite a few callers of Path#toUri which we'll need to convert if we want to avoid creating URI objects. Most of them don't actually want a URI, though, so that shouldn't be hard. I'm +1 on removing the URI member and making Path implement the needed URI functionality explicitly; I just want to point these issues out.
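As a rough illustration of the trade-off described here, the sketch below keeps only a String per instance, borrows java.net.URI once in the constructor for error checking and normalization, and builds a URI again only when toUri() is called. LazyUriPath is a hypothetical name, not the Hadoop class, and it assumes scheme-less paths; the real Path also handles schemes, authorities, Windows drive letters and resolve, all of which are omitted.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical sketch; this is not org.apache.hadoop.fs.Path. */
public final class LazyUriPath {
    // The only per-instance state: the normalized path string.
    private final String path;

    public LazyUriPath(String pathString) {
        if (pathString == null || pathString.isEmpty()) {
            throw new IllegalArgumentException(
                "Cannot create a path from a null or empty string");
        }
        // Borrow URI's checking and normalization once, then drop the URI.
        this.path = parse(pathString).normalize().getPath();
    }

    /** Builds a fresh URI on demand instead of caching one per instance. */
    public URI toUri() {
        return parse(path);
    }

    private static URI parse(String p) {
        try {
            // Arguments are (scheme, host, path, fragment); illegal
            // characters in the path are quoted by this constructor.
            return new URI(null, null, p, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    @Override
    public String toString() {
        return path;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof LazyUriPath && path.equals(((LazyUriPath) o).path);
    }

    @Override
    public int hashCode() {
        return path.hashCode();
    }

    public static void main(String[] args) {
        LazyUriPath p = new LazyUriPath("/user/./foo/../bar");
        System.out.println(p);          // normalized path string
        System.out.println(p.toUri());  // URI created only at this point
    }
}
```

In this design, callers that only need the string form never pay for a URI after construction. The constructor still allocates a temporary URI, so this reduces retained heap rather than construction time.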
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837011#action_12837011 ]
Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------
Oops, my previous comment was supposed to be posted on HADOOP-6467. Sorry.
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828756#action_12828756 ]
Doug Cutting commented on HADOOP-6532:
--------------------------------------
Another alternative might be to have Path, like String, contain only a char[]. Then it would have the same weight as String, but still be a distinct type with its own methods, constructors, etc.
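A minimal sketch of this idea, assuming a hypothetical class named CharArrayPath (not part of Hadoop): the only field is a char[], while the class remains a distinct type with its own constructor and methods. The depth() shown here counts path components in the same spirit as Path#depth but is written for illustration only. The size comparison with String depends on the JVM: the numbers in this thread come from Java 6, whereas since Java 9 String is backed by a byte[] with compact strings, so a char[] field is not necessarily smaller on newer JVMs.

```java
import java.util.Arrays;

/** Hypothetical sketch of a path type whose only field is a char[]. */
public final class CharArrayPath {
    private final char[] chars;

    public CharArrayPath(String pathString) {
        if (pathString == null || pathString.isEmpty()) {
            throw new IllegalArgumentException(
                "Cannot create a path from a null or empty string");
        }
        this.chars = pathString.toCharArray();
    }

    /** Number of components: "/" is 0, "/user/foo/bar" is 3. */
    public int depth() {
        if (chars.length == 1 && chars[0] == '/') {
            return 0;
        }
        int depth = 1;
        // Every separator after the first character starts a new component.
        for (int i = 1; i < chars.length; i++) {
            if (chars[i] == '/') {
                depth++;
            }
        }
        return depth;
    }

    @Override
    public String toString() {
        return new String(chars);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CharArrayPath
            && Arrays.equals(chars, ((CharArrayPath) o).chars);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(chars);
    }

    public static void main(String[] args) {
        CharArrayPath p = new CharArrayPath("/user/foo/bar");
        System.out.println(p + " has depth " + p.depth());
    }
}
```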
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828742#action_12828742 ]
Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------
So I suggest that we replace the Path.uri with a String.
Below are the details of the testing environment:
{noformat}
java.version = 1.6.0_15
java.runtime.name = Java(TM) SE Runtime Environment
java.runtime.version = 1.6.0_15-b03
java.vm.version = 14.1-b02
java.vm.vendor = Sun Microsystems Inc.
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
java.vm.specification.version = 1.0
java.specification.version = 1.6
os.arch = amd64
os.name = Linux
os.version = 2.6.9-55.ELsmp
{noformat}
[jira] Updated: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo (Nicholas), SZE updated HADOOP-6532:
-------------------------------------------
Attachment: Fun.java
Fun.java: a test program.
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828754#action_12828754 ]
Hong Tang commented on HADOOP-6532:
-----------------------------------
Is it possible to keep the Path object, but avoid materializing the fields?
[jira] Commented: (HADOOP-6532) Path objects are heavy
Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837009#action_12837009 ]
Tsz Wo (Nicholas), SZE commented on HADOOP-6532:
------------------------------------------------
I ran wordcount with the v2 patch twice. Both took 16 mins. Not sure why it ran slower than the previous patch.
{noformat}
-bash-3.1$ date; time $H ${WC_CMD} ${HAR_FULL}/${DIR} ${TT_WC}3
Mon Feb 22 23:23:03 UTC 2010
10/02/22 23:23:19 INFO input.FileInputFormat: Total input paths to process : 100000
10/02/22 23:33:53 INFO mapred.JobClient: Running job: job_201002042035_75937
10/02/22 23:33:54 INFO mapred.JobClient: map 0% reduce 0%
...
10/02/22 23:39:08 INFO mapred.JobClient: map 100% reduce 100%
...
10/02/22 23:39:14 INFO mapred.JobClient: Reduce input records=17729
real 16m11.171s
user 2m24.217s
sys 0m39.462s
{noformat}
Anyway, below are some improvement suggestions:
- use a variable to store harPath.depth() so that the same value won't be computed again and again.
- Unfortunately, Path is quite expensive (see HADOOP-6532). It is better to check child.startsWith(parentString) before creating thisPath, i.e. replace
{code}
+ String child = lineFeed.substring(0, lineFeed.indexOf(" "));
+ Path thisPath = new Path(child);
+ if ((child.startsWith(parentString)) && (thisPath.depth() == harPath.depth() + 1)) {
+ ...
{code}
with
{code}
+ String child = lineFeed.substring(0, lineFeed.indexOf(" "));
+ if (child.startsWith(parentString)) {
+ Path thisPath = new Path(child);
+ if (thisPath.depth() == harPath.depth() + 1) {
+ ...
{code}
- Also, some lines are longer than 80 characters.
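The first two suggestions can be sketched in a self-contained form. Everything below is an assumption made for illustration: SimplePath is a stand-in that counts how often it is constructed (it is not the Hadoop Path), the index-line format is invented, and parentString is taken to be the parent directory's path string. The sketch computes the parent depth once and constructs a path object only for lines that pass the cheap startsWith check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChildFilter {
    /** Hypothetical stand-in for an expensive path type; counts constructions. */
    static final class SimplePath {
        static int created = 0;
        private final String path;

        SimplePath(String path) {
            this.path = path;
            created++;
        }

        int depth() {
            if (path.equals("/")) {
                return 0;
            }
            int depth = 1;
            for (int i = 1; i < path.length(); i++) {
                if (path.charAt(i) == '/') {
                    depth++;
                }
            }
            return depth;
        }
    }

    /** Returns the entries that are direct children of parentString. */
    static List<String> directChildren(List<String> indexLines, String parentString) {
        // Suggestion 1: compute the parent depth once, outside the loop.
        int childDepth = new SimplePath(parentString).depth() + 1;
        List<String> result = new ArrayList<String>();
        for (String lineFeed : indexLines) {
            int space = lineFeed.indexOf(" ");
            if (space < 0) {
                continue; // malformed line in this made-up format
            }
            String child = lineFeed.substring(0, space);
            // Suggestion 2: cheap string check first, object creation second.
            if (child.startsWith(parentString)) {
                SimplePath thisPath = new SimplePath(child);
                if (thisPath.depth() == childDepth) {
                    result.add(child);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "/dir/a file", "/dir/sub/b file", "/other/c file", "/dir dir");
        System.out.println(directChildren(lines, "/dir"));
        System.out.println("paths created: " + SimplePath.created);
    }
}
```

With the four sample lines, one line is rejected by startsWith alone, so one fewer path object is created than in the unconditional version. On an index with 100000 entries, most of which lie outside the listed directory, the saving would be correspondingly larger.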