Posted to dev@hbase.apache.org by "Bryan Duxbury (JIRA)" <ji...@apache.org> on 2008/02/09 02:06:07 UTC
[jira] Created: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Performance: Scanners and getRow return maps with duplicate data
----------------------------------------------------------------
Key: HBASE-430
URL: https://issues.apache.org/jira/browse/HBASE-430
Project: Hadoop HBase
Issue Type: Improvement
Reporter: Bryan Duxbury
Priority: Minor
Right now, whenever we get back multiple cells' worth of data at a time, we do so in a map of HStoreKey->byte[]. This means that, at the very least, a Text row and a long timestamp are duplicated for every cell, which wastes a fair amount of space. It also means we have to do a lot of translation every time.
We could create a new Writable that contains just one row, one timestamp, and a map of Text->byte[].
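To make the proposal concrete, here is a minimal sketch of the two shapes. It is not the HBase code: CellKey stands in for HStoreKey, String stands in for Text, and CompactRow and fromFlat are hypothetical names. It also assumes every entry in the flat map belongs to the same row and timestamp.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for HStoreKey: every cell carries its own row and timestamp.
final class CellKey implements Comparable<CellKey> {
  final String row;
  final String column;
  final long timestamp;

  CellKey(String row, String column, long timestamp) {
    this.row = row;
    this.column = column;
    this.timestamp = timestamp;
  }

  public int compareTo(CellKey other) {
    int c = row.compareTo(other.row);
    if (c == 0) c = column.compareTo(other.column);
    if (c == 0) c = Long.compare(timestamp, other.timestamp);
    return c;
  }
}

// Hypothetical compact form: one row, one timestamp, then column -> value.
final class CompactRow {
  final String row;
  final long timestamp;
  final Map<String, byte[]> columns = new TreeMap<String, byte[]>();

  CompactRow(String row, long timestamp) {
    this.row = row;
    this.timestamp = timestamp;
  }

  // Collapses a flat per-cell map into the compact form.
  // Assumption: all entries share one row and one timestamp.
  // Returns null when the flat map is empty.
  static CompactRow fromFlat(Map<CellKey, byte[]> flat) {
    CompactRow result = null;
    for (Map.Entry<CellKey, byte[]> e : flat.entrySet()) {
      CellKey key = e.getKey();
      if (result == null) {
        result = new CompactRow(key.row, key.timestamp);
      }
      result.columns.put(key.column, e.getValue());
    }
    return result;
  }
}
```

In the flat form the row and timestamp are stored once per cell; in the compact form they are stored once per row, and only the column name varies.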
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury reassigned HBASE-430:
-----------------------------------
Assignee: Bryan Duxbury
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577611#action_12577611 ]
stack commented on HBASE-430:
-----------------------------
Patch is missing DeprecatedScannerInterface
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Status: Patch Available (was: Open)
Please review.
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577718#action_12577718 ]
stack commented on HBASE-430:
-----------------------------
Ok on the mapping. I don't know of a way around the copy. Go ahead and apply I'd say.
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576397#action_12576397 ]
Bryan Duxbury commented on HBASE-430:
-------------------------------------
I'm thinking we could add a class called RowResult, which has a Text row and HBaseMapWritable<Text, Cell> columns. This itself can be a writable.
Then, when you call scanner.next, you can just return a single record from the method instead of a bool + modify parameters. If it's null, there's nothing more left.
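A rough sketch of what that calling convention could look like follows. The names SketchRowResult, SketchScanner, and ListScanner are hypothetical, String stands in for Text, and byte[] stands in for Cell; none of this is taken from the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed RowResult: one row plus its columns.
final class SketchRowResult {
  final String row;
  final Map<String, byte[]> columns;

  SketchRowResult(String row, Map<String, byte[]> columns) {
    this.row = row;
    this.columns = columns;
  }
}

// Hypothetical scanner shape: next() hands back one record, or null when done.
interface SketchScanner {
  SketchRowResult next();
}

// In-memory scanner used only to exercise the calling convention.
final class ListScanner implements SketchScanner {
  private final Iterator<SketchRowResult> rows;

  ListScanner(List<SketchRowResult> rows) {
    this.rows = rows.iterator();
  }

  public SketchRowResult next() {
    return rows.hasNext() ? rows.next() : null;
  }
}

final class ScanLoop {
  // The caller no longer passes in a key and a map to be filled;
  // it loops until next() returns null.
  static List<String> rowsSeen(SketchScanner scanner) {
    List<String> seen = new ArrayList<String>();
    SketchRowResult result;
    while ((result = scanner.next()) != null) {
      seen.add(result.row);
    }
    return seen;
  }
}
```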
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Resolution: Fixed
Fix Version/s: 0.2.0
Status: Resolved (was: Patch Available)
I just committed this to trunk.
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577695#action_12577695 ]
stack commented on HBASE-430:
-----------------------------
Last patch builds successfully on my machine. Before committing, you might make the following changes:
In src/java/org/apache/hadoop/hbase/client/HBaseAdmin.java remove:
+ // HStoreKey key = (HStoreKey) e.getKey();
In src/java/org/apache/hadoop/hbase/client/HConnectionManager.java remove:
+ // for (Map.Entry<Text, Cell> e: values.entrySet()) {
+ // if (e.getKey().equals(COL_REGIONINFO)) {
+ // // HRegionInfo info = new HRegionInfo();
+ // // info = (HRegionInfo) Writables.getWritable(
+ // // e.getValue().getValue(), info);
+ //
+ // }
+ // }
In src/java/org/apache/hadoop/hbase/io/RowResult.java, copyright should be 2008, not 2007. That's kinda sweet that you have it implement Map.
What's going on here?
+ public Set<Text> keySet() {
+ Set<Text> result = new HashSet<Text>();
+ for (Writable w : cells.keySet()) {
+ result.add((Text)w);
+ }
+ return result;
+ }
Are you trying to protect against client alterations of the underlying cells? If so, wrap in a call to Collections.unmodifiableSet instead: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Collections.html#unmodifiableSet(java.util.Set)
There are a few other places where you do similar copies.
+1 on patch after consideration of above.
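For reference, the difference between copying the key set and wrapping it can be sketched as below. KeySetViews is a hypothetical class, and the sketch assumes the backing map's keys already have the type the caller wants, which may not hold for the real RowResult.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical holder with a backing map, used to contrast the two approaches.
final class KeySetViews {
  private final Map<String, byte[]> cells = new TreeMap<String, byte[]>();

  void put(String column, byte[] value) {
    cells.put(column, value);
  }

  // Copying approach: allocates and fills a new set on every call.
  // The result is a snapshot and does not see later changes to cells.
  Set<String> keySetCopy() {
    return new HashSet<String>(cells.keySet());
  }

  // View approach: a read-only wrapper over the live key set, with no copy.
  // Mutating calls on the result throw UnsupportedOperationException.
  Set<String> keySetView() {
    return Collections.unmodifiableSet(cells.keySet());
  }
}
```

The view is cheaper per call, but callers observe later changes to the backing map; the copy costs O(n) per call and is isolated from them.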
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577167#action_12577167 ]
stack commented on HBASE-430:
-----------------------------
Rather than System.currentTimeMillis(), should you use HConstants.LATEST_TIMESTAMP (or whatever it's called)?
{code}
Index: src/java/org/apache/hadoop/hbase/HMerge.java
===================================================================
--- src/java/org/apache/hadoop/hbase/HMerge.java (revision 635107)
+++ src/java/org/apache/hadoop/hbase/HMerge.java (working copy)
@@ -335,8 +335,9 @@
root = new HRegion(rootTableDir, hlog, fs, conf,
HRegionInfo.rootRegionInfo, null, null);
- HScannerInterface rootScanner = root.getScanner(COL_REGIONINFO_ARRAY,
- new Text(), System.currentTimeMillis(), null);
+ HScannerInterface rootScanner =
+ root.getScanner(COL_REGIONINFO_ARRAY, new Text(),
+ System.currentTimeMillis(), null);
{code}
Made other comments up on IRC about removing the commented-out TODO, and that there is a Writables.getHRegionInfo if you want to use it instead of the other Writables methods... but otherwise, the patch is great, removing a bunch of duplicated Map making every time we get a row.
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Attachment: 430.patch
This patch makes HRegionInterface use RowResults. HTable still reconstitutes the Map<HStoreKey, byte[]> that the client-side scanners use. Ultimately I'd like to change scanners to use RowResults everywhere instead of the existing interface, but this might be a good incremental step. All unit tests pass.
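The client-side reconstitution step might look roughly like the following. This is a sketch under assumptions, not the HTable code: FlatKey stands in for HStoreKey, SketchCell stands in for Cell and is assumed to carry both a value and a timestamp, and FlatMapBuilder.toFlat is a hypothetical helper.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for HStoreKey.
final class FlatKey implements Comparable<FlatKey> {
  final String row;
  final String column;
  final long timestamp;

  FlatKey(String row, String column, long timestamp) {
    this.row = row;
    this.column = column;
    this.timestamp = timestamp;
  }

  public int compareTo(FlatKey other) {
    int c = row.compareTo(other.row);
    if (c == 0) c = column.compareTo(other.column);
    if (c == 0) c = Long.compare(timestamp, other.timestamp);
    return c;
  }
}

// Hypothetical stand-in for Cell; the timestamp field is an assumption.
final class SketchCell {
  final byte[] value;
  final long timestamp;

  SketchCell(byte[] value, long timestamp) {
    this.value = value;
    this.timestamp = timestamp;
  }
}

final class FlatMapBuilder {
  // Expands one compact row back into the per-cell map the old
  // client-side interface expects, re-attaching the row to every key.
  static SortedMap<FlatKey, byte[]> toFlat(String row, Map<String, SketchCell> columns) {
    SortedMap<FlatKey, byte[]> flat = new TreeMap<FlatKey, byte[]>();
    for (Map.Entry<String, SketchCell> e : columns.entrySet()) {
      SketchCell cell = e.getValue();
      flat.put(new FlatKey(row, e.getKey(), cell.timestamp), cell.value);
    }
    return flat;
  }
}
```

With this arrangement the duplicated row only reappears on the client, after the data has crossed the wire in compact form.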
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Attachment: 430-v4.patch
This patch removes the HDeprecatedScannerInterface nonsense.
[jira] Commented: (HBASE-430) Performance: Scanners and getRow
return maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576709#action_12576709 ]
Bryan Duxbury commented on HBASE-430:
-------------------------------------
I have a simple implementation for the RowResult class itself already. It'll be pretty easy to use from a consumption standpoint (implements Map<Text, Cell>). However, if I change HScannerInterface#next from boolean(HStoreKey, SortedMap<Text, byte[]>) to RowResult(), there will be far-reaching changes throughout the code. This is because all scanners everywhere, including the internal ones used in the region server, implement HScannerInterface at some point.
Overall I don't think it's going to be a challenging change, beyond the fact that all the mechanics of advancing scanners are pretty hairy.
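One way to limit the blast radius of such a signature change would be an adapter that wraps an old-style scanner in the new shape. The sketch below is only an illustration of that idea and is not what the patch does: OldStyleScanner, NewStyleScanner, MutableKey, AdaptedRow, and ScannerAdapter are all hypothetical names, with String standing in for Text.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical mutable key holder, mirroring how the old next() fills in a key.
final class MutableKey {
  String row;
  long timestamp;
}

// Old shape: fills caller-supplied holders, returns false when exhausted.
interface OldStyleScanner {
  boolean next(MutableKey key, SortedMap<String, byte[]> results);
}

// Hypothetical record returned by the new shape.
final class AdaptedRow {
  final String row;
  final SortedMap<String, byte[]> columns;

  AdaptedRow(String row, SortedMap<String, byte[]> columns) {
    this.row = row;
    this.columns = columns;
  }
}

// New shape: returns one record per call, or null when exhausted.
interface NewStyleScanner {
  AdaptedRow next();
}

// Bridges the two so existing scanner implementations can stay unchanged.
final class ScannerAdapter implements NewStyleScanner {
  private final OldStyleScanner delegate;

  ScannerAdapter(OldStyleScanner delegate) {
    this.delegate = delegate;
  }

  public AdaptedRow next() {
    MutableKey key = new MutableKey();
    SortedMap<String, byte[]> columns = new TreeMap<String, byte[]>();
    if (!delegate.next(key, columns)) {
      return null;
    }
    return new AdaptedRow(key.row, columns);
  }
}
```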
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Attachment: 430-v5.patch
Removed the commented code from HBaseAdmin and HConnectionManager.
RowResult has to do that mapping for keySet et al because HBaseMapWritable is a map of <Writable, Writable>, not a map of <Text, Cell>. The mapping does the cast. Is there a simpler way?
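One possible alternative, offered only as a sketch, is an unchecked double cast wrapped in Collections.unmodifiableSet. Marker stands in for Writable and ColumnName for Text; both, along with CastingKeySet, are hypothetical. The cast is only sound while every key ever put into the backing map really is a ColumnName, so it trades the per-call copy for a compile-time warning and a runtime invariant.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Writable.
interface Marker {}

// Hypothetical stand-in for Text.
final class ColumnName implements Marker {
  final String name;

  ColumnName(String name) {
    this.name = name;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof ColumnName && ((ColumnName) o).name.equals(name);
  }

  @Override
  public int hashCode() {
    return name.hashCode();
  }
}

final class CastingKeySet {
  // Loosely typed backing map, like a map of <Writable, Writable>.
  private final Map<Marker, Marker> cells = new HashMap<Marker, Marker>();

  // Only ColumnName keys can enter through this method, which is what
  // keeps the unchecked cast below safe.
  void put(ColumnName column, Marker value) {
    cells.put(column, value);
  }

  @SuppressWarnings("unchecked")
  Set<ColumnName> keySet() {
    // Set<Marker> cannot be cast to Set<ColumnName> directly, so the cast
    // goes through Set<?>. No elements are copied.
    Set<ColumnName> typed = (Set<ColumnName>) (Set<?>) cells.keySet();
    return Collections.unmodifiableSet(typed);
  }
}
```

If a key of another type ever reached the map, the failure would surface later as a ClassCastException at the point of use rather than at the cast, which is the main argument for keeping the explicit copy.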
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Attachment: 430-v2.patch
This patch incorporates suggestions, passes tests. Looking for another run of tests and a +1 to commit.
[jira] Updated: (HBASE-430) Performance: Scanners and getRow return
maps with duplicate data
Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Duxbury updated HBASE-430:
--------------------------------
Attachment: 430-v3.patch
Forgot to svn add my new RowResult class.