Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/05/21 20:28:16 UTC

[jira] Created: (HADOOP-1398) Add in-memory caching of data

Add in-memory caching of data
-----------------------------

                 Key: HADOOP-1398
                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
             Project: Hadoop
          Issue Type: New Feature
          Components: contrib/hbase
            Reporter: Jim Kellerman
            Priority: Minor


Bigtable provides two in-memory caches: one for row/column data and one for disk blocks.

The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.

One complication of the block cache is that all data is read through a SequenceFile.Reader, which ultimately reads data off disk via an RPC proxy for ClientProtocol. This implies that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader.
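The LRU mechanism asked for above can be sketched in Java with a LinkedHashMap in access order; the class name and capacity handling here are illustrative, not part of any HBase patch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache sketch (illustrative, not HBase code): a
 *  LinkedHashMap in access order evicts the least-recently-used
 *  entry once the configured capacity is exceeded. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  LruCache(int capacity) {
    super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity; // evict the stalest entry past capacity
  }
}
```

With capacity 2, putting a and b, touching a, then putting c evicts b rather than a.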


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-1398:
----------------------------------

    Priority: Trivial  (was: Minor)

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Issue Comment Edited: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560991#action_12560991 ] 

tomwhite edited comment on HADOOP-1398 at 1/21/08 3:50 AM:
------------------------------------------------------------

bq. In the below from HStoreFile, blockCacheEnabled method argument is not being passed to the MapFile constructors.

Thanks - this had the effect of never enabling the cache! I've fixed this.

bq. Out of interest, did you regenerate the thrift or hand-edit it? Changes look right - just wondering.

I regenerated using the latest thrift trunk.

bq. Default ReferenceMap constructor makes for hard keys and soft values. If value has been let go by the GC, does the corresponding key just stay in the Map?

No, both the key and the value are removed from the map - I checked the source.

This patch also includes changes to HBase Shell so you can alter a table to enable block caching.

      was (Author: tomwhite):
    bq. In the below from HStoreFile, blockCacheEnabled method argument is not being passed to the MapFile constructors.

Thanks - this had the effect of never enabling the cache! I've fixed this.

bq. Out of interest, did you regenerate the thrift or hand-edit it? Changes look right - just wondering.

I regenerated using the latest thrift trunk.

bq. Default ReferenceMap constructor makes for hard keys and soft values. If value has been let go by the GC, does the corresponding key just stay in the Map?

Yes - I checked the source.

This patch also includes changes to HBase Shell so you can alter a table to enable block caching.
  
> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560607#action_12560607 ] 

stack commented on HADOOP-1398:
-------------------------------

(... continuing IRC discussion).

I didn't realize HColumnDescriptor was versioned.  It doesn't seem to have been added by either Jim or me.  Someone smarter, no doubt.  So my comment that this change is incompatible doesn't hold, since I see you have code to make HCD migrate itself.  Nice.

In the below from HStoreFile, blockCacheEnabled method argument is not being passed to the MapFile constructors.

{code}
+  public synchronized MapFile.Reader getReader(final FileSystem fs,
+      final Filter bloomFilter, final boolean blockCacheEnabled)
+  throws IOException {
+    
+    if (isReference()) {
+      return new HStoreFile.HalfMapFileReader(fs,
+          getMapFilePath(reference).toString(), conf, 
+          reference.getFileRegion(), reference.getMidkey(), bloomFilter);
+    }
+    return new BloomFilterMapFile.Reader(fs, getMapFilePath().toString(),
+        conf, bloomFilter);
+  }
{code}

Out of interest, did you regenerate the thrift or hand-edit it?  Changes look right -- just wondering.

Default ReferenceMap constructor makes for hard keys and soft values.  If value has been let go by the GC, does the corresponding key just stay in the Map?

Otherwise, patch looks great Tom.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560091#action_12560091 ] 

Tom White commented on HADOOP-1398:
-----------------------------------

I'm trying to add a new parameter to HColumnDescriptor and would appreciate a little guidance. Do I need to worry about the version number? Is the order of the serialized fields important? It would be nice to group together the caching related ones if possible, so the block cache parameter would naturally sit next to the inMemory one. Ditto for the Thrift representation - how does it handle versioning? Thanks.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache-v2.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560099#action_12560099 ] 

Jim Kellerman commented on HADOOP-1398:
---------------------------------------

Tom,

Yes, we need to start versioning everything that goes out to disk. And if we make an incompatible change, we either need to correct for it on the fly or augment the migration tool (hbase.util.Migrate.java).
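The versioning Jim describes is commonly done by writing a version number ahead of the fields and branching on it when reading back; a minimal sketch under assumed field names (hypothetical record, not HColumnDescriptor's actual format):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Sketch of versioned on-disk serialization (hypothetical record,
 *  not HColumnDescriptor): a version byte written first lets a newer
 *  reader parse records written by an older one. */
class VersionedRecord {
  static final byte VERSION = 2;

  boolean inMemory;          // present since version 1
  boolean blockCacheEnabled; // added in version 2

  void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);
    out.writeBoolean(inMemory);
    out.writeBoolean(blockCacheEnabled);
  }

  void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    inMemory = in.readBoolean();
    // Field added in v2: default to false when reading an old record.
    blockCacheEnabled = version >= 2 && in.readBoolean();
  }
}
```

An incompatible layout change (reordering existing fields, say) would still need the on-the-fly correction or migration-tool work described above.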


> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache-v2.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: commons-collections-3.2.jar

New dependency to go in src/contrib/hbase/lib/.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: hadoop-blockcache-v4.patch

bq. In the below from HStoreFile, blockCacheEnabled method argument is not being passed to the MapFile constructors.

Thanks - this had the effect of never enabling the cache! I've fixed this.

bq. Out of interest, did you regenerate the thrift or hand-edit it? Changes look right - just wondering.

I regenerated using the latest thrift trunk.

bq. Default ReferenceMap constructor makes for hard keys and soft values. If value has been let go by the GC, does the corresponding key just stay in the Map?

Yes - I checked the source.

This patch also includes changes to HBase Shell so you can alter a table to enable block caching.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561327#action_12561327 ] 

Tom White commented on HADOOP-1398:
-----------------------------------

I ran some benchmarks of PerformanceEvaluation with and without block caching enabled. The setup was similar to that described in http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation, with three machines on EC2: one running the namenode and HBase master, one running a datanode and a region server, and one running a datanode and the PerformanceEvaluation program.

Number of operations per second:

||Experiment||Block cache disabled||Block cache enabled||
|sequential reads|119|182|
|random reads|110|123|

I've seen quite a lot of variation in the results of PerformanceEvaluation, so I'm reluctant to read too much into these figures (at face value they show roughly a 53% improvement for sequential reads and 12% for random reads). But I think we can at least say that the block cache doesn't seem to slow down the system.


> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.1.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561191#action_12561191 ] 

stack commented on HADOOP-1398:
-------------------------------

Patch looks good Tom.  I changed my mind since IRC this morning.  Now I think hbase should align with the parent and not add new features from feature freeze until after we make the 0.16 branch (kick me on IRC if you think differently).

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.1.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Assigned: (HADOOP-1398) Add in-memory caching of data

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman reassigned HADOOP-1398:
-------------------------------------

    Assignee: Tom White

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Assignee: Tom White
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.1.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559267#action_12559267 ] 

stack commented on HADOOP-1398:
-------------------------------

Tom: Ignore comment above on LruMap.  I just reread it.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: hadoop-blockcache.patch

Here is an initial implementation - feedback would be much appreciated.

BlockFSInputStream reads an FSInputStream in a block-oriented manner and caches blocks. There's also a BlockMapFile.Reader that uses a BlockFSInputStream to read the MapFile data. HStore uses a BlockMapFile.Reader to read the first HStoreFile, at startup and after compaction. New HStoreFiles produced after memcache flushes are read using a regular reader in order to keep memory use fixed. Currently, block caching is configured by the hbase properties hbase.hstore.blockCache.maxSize (defaults to 0, no cache) and hbase.hstore.blockCache.blockSize (defaults to 64k). (It would be desirable to make caches configurable on a per-column-family basis; the current way is just a stopgap.)
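The block-oriented read path can be sketched as follows; the real BlockFSInputStream is in the patch, so this is only an illustration of the idea, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of block-oriented reads over a random-access source
 *  (hypothetical names, not the patch's code): positions map to
 *  fixed-size blocks, and whole blocks are cached so nearby reads
 *  hit memory instead of going back to the source. */
class BlockReader {
  interface RandomAccessSource {
    /** Read up to len bytes at the given absolute offset; -1 at EOF. */
    int read(long offset, byte[] buf, int off, int len);
    long length();
  }

  private final RandomAccessSource source;
  private final int blockSize;
  private final Map<Long, byte[]> cache = new HashMap<>(); // LRU or soft refs in practice

  BlockReader(RandomAccessSource source, int blockSize) {
    this.source = source;
    this.blockSize = blockSize;
  }

  /** Read one byte at pos, faulting in the containing block on a miss. */
  int read(long pos) {
    if (pos < 0 || pos >= source.length()) return -1;
    long blockIndex = pos / blockSize;
    byte[] block = cache.get(blockIndex);
    if (block == null) {
      long start = blockIndex * blockSize;
      int len = (int) Math.min(blockSize, source.length() - start);
      block = new byte[len];
      int n = 0;
      while (n < len) {                 // fill the whole block
        int r = source.read(start + n, block, n, len - n);
        if (r < 0) break;
        n += r;
      }
      cache.put(blockIndex, block);
    }
    return block[(int) (pos % blockSize)] & 0xff;
  }
}
```

Note the tradeoff the thread discusses below: a single-byte read still faults in a full block, which is why block size matters for random-read workloads.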

I've also had to push details of the block caching implementation up to MapFile.Reader, which is undesirable. The problem is that the streams are opened in the constructor of SequenceFile.Reader, which is called by the constructor of MapFile.Reader, so there is no opportunity to set the final fields blockSize and maxBlockCacheSize on a subclass of MapFile.Reader before the stream is opened. I think the proper solution is to have an explicit open method on SequenceFile.Reader, but I'm not sure about the impact of this since it would be an incompatible change. Perhaps do this in conjunction with HADOOP-2604?
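The constructor-ordering problem is general to Java: a superclass constructor runs before subclass field initializers, so streams opened there cannot see subclass configuration. The explicit-open idea can be sketched like this (hypothetical names, not the actual SequenceFile API):

```java
/** Sketch of the deferred-open pattern (hypothetical classes): the
 *  base class can skip opening in its constructor so a subclass may
 *  finish initializing its own final fields first. */
class BaseReader {
  protected boolean opened;

  BaseReader(boolean deferOpen) {
    if (!deferOpen) open(); // legacy behavior: open immediately
  }

  /** Subclasses call this once their own fields are set. */
  protected void open() {
    opened = true; // real code would open the underlying streams here
  }
}

class BlockCachedReader extends BaseReader {
  final int blockSize; // must be assigned before open() can use it

  BlockCachedReader(int blockSize) {
    super(true);            // defer: don't open from the super constructor
    this.blockSize = blockSize;
    open();                 // now safe: blockSize is initialized
  }
}
```

The v2 patch takes essentially this route, with a protected open() on MapFile.Reader.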

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559498#action_12559498 ] 

Tom White commented on HADOOP-1398:
-----------------------------------

bq. You pass 'length' in the below but it's not used:

It is used in the subclass of SequenceFile.Reader by BlockFSInputStream.

bq. Do you have any numbers for how it improves throughput when cached blocks are 'hot'?

I haven't got any numbers yet (working on them), but random reads will suffer in general since a whole 64KB block is retrieved to just read a single key/value. The Bigtable paper talks about reducing the block size to 8KB (see section 7).

bq. What do we need to add to make it easy to enable/disable this feature on a per-column basis? Currently, edits to column config require taking the column offline. Changing this configuration looks safe to do while the column stays online. Would you agree?

Agreed. I think that dynamically editing a column descriptor should go in a separate jira issue. For now, I was planning on just adding the new parameters to HColumnDescriptor. Does the version number need bumping in this case?

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: hadoop-blockcache-v2.patch

A second patch with minimal changes to MapFile.Reader - there is now a protected open() method for subclasses that wish to defer opening the streams until further initialization has been carried out.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache-v2.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Commented: (HADOOP-1398) Add in-memory caching of data

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559208#action_12559208 ] 

stack commented on HADOOP-1398:
-------------------------------

Patch looks great Tom.

You pass 'length' in the below but it's not used:

{code}
+    protected FSDataInputStream openFile(FileSystem fs, Path file,
+        int bufferSize, long length) throws IOException {
+      return fs.open(file, bufferSize);
{code}

I presume you have plans for it later?

You have confidence in the LruMap class?  You don't have unit tests (though these things are hard to test).  I ask because though small, sometimes these kinds of classes can prove a little tricky....

Do you have any numbers for how it improves throughput when cached blocks are 'hot'?  And you talked of a slight 'cost'.  Do you have rough numbers for that too?  (Playing with the datanode, adjusting the size of the CRC blocks, a similar type of blocking to what you have here, there was no discernible difference across sizes.)

What do we need to add to make it easy to enable/disable this feature on a per-column basis?  Currently, edits to column config require taking the column offline.  Changing this configuration looks safe to do while the column stays online.  Would you agree?

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: hadoop-blockcache-v4.1.patch

Fixing the v4 patch, which was corrupt.

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: commons-collections-3.2.jar, hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache-v4.1.patch, hadoop-blockcache-v4.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader



[jira] Updated: (HADOOP-1398) Add in-memory caching of data

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-1398:
------------------------------

    Attachment: hadoop-blockcache-v3.patch

This version (v3) changes the cache to a memory-sensitive cache, implemented using SoftReferences (http://commons.apache.org/collections/api-release/org/apache/commons/collections/map/ReferenceMap.html). See HADOOP-2624 for background.

Also, block caching can be enabled on a per-column-family basis. The block size is a system-wide setting - this could be made adjustable on a per-column basis in the future, if deemed necessary.

I'm still looking at a performance comparison.
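Direct use of java.lang.ref.SoftReference gives the flavour of what ReferenceMap provides; this sketch is illustrative only (ReferenceMap itself additionally purges cleared entries automatically via a reference queue, which is why both key and value disappear after collection):

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a memory-sensitive cache (illustrative, not the patch):
 *  hard keys mapping to softly referenced values, which the GC may
 *  reclaim under memory pressure instead of throwing OutOfMemoryError. */
class SoftValueCache<K, V> {
  private final Map<K, SoftReference<V>> map = new HashMap<>();

  void put(K key, V value) {
    map.put(key, new SoftReference<>(value));
  }

  /** Returns null if absent or if the GC has cleared the value. */
  V get(K key) {
    SoftReference<V> ref = map.get(key);
    if (ref == null) return null;
    V value = ref.get();
    if (value == null) map.remove(key); // purge the stale key eagerly
    return value;
  }

  int size() { return map.size(); }
}
```

The appeal over a fixed-capacity LRU is that the cache shrinks automatically when the heap is under pressure, rather than needing a tuned maximum size.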

> Add in-memory caching of data
> -----------------------------
>
>                 Key: HADOOP-1398
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1398
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Jim Kellerman
>            Priority: Trivial
>         Attachments: hadoop-blockcache-v2.patch, hadoop-blockcache-v3.patch, hadoop-blockcache.patch
>
>
> Bigtable provides two in-memory caches: one for row/column data and one for disk block caches.
> The size of each cache should be configurable, data should be loaded lazily, and the cache managed by an LRU mechanism.
> One complication of the block cache is that all data is read through a SequenceFile.Reader which ultimately reads data off of disk via a RPC proxy for ClientProtocol. This would imply that the block caching would have to be pushed down to either the DFSClient or SequenceFile.Reader
