You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/09/12 04:25:44 UTC
[jira] Issue Comment Edited: (HADOOP-3638) Cache the iFile index files in memory to reduce seeks during map output serving

    [ https://issues.apache.org/jira/browse/HADOOP-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630461#action_12630461 ] 

chris.douglas edited comment on HADOOP-3638 at 9/11/08 7:24 PM:
----------------------------------------------------------------

This looks good. A few suggestions:
* In IndexCache::getIndexInformation, the write to the timestamp isn't guranteed to be [atomic|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]. Moving it into the synchronized block above it, using an AtomicLong, or setting it to be volatile should all work (realistically, it's hugely unlikely to happen, be significant, and have a measurable performance impact, but it's worth noting).
* synchronizing on the intern'd value of a string isn't safe. String maintains its pool of intern'd Strings until the JVM shuts down, so using it here will cause each mapId to add an entry to this pool without ever removing it. Further, depending on the semantics of intern() for thread safety seems risky. It makes more sense to use a more conventional synchronization strategy.
* Why is DataOutputStream imported in IFileInputStream?
* The semantics of the comparator sorting the LRU cache aren't consistent. If o1 is null and o2 is not, o1 < o2 and o2 < o1.
* Is it necessary to copy readLong and writeLong from java.io classes? Since both are Filter\*Stream classes, isn't it sufficient to wrap each in the appropriate Data\*Stream?
* readIndexFile doesn't need to set the timeStamp; it's set in getIndexInformation after the call to readIndexFileToCache, which is more appropriate. If the LRU cache were replaced with some other metric, it would make sense to pass in an IndexInformation object from cache to readIndexFile, and fill in the timestamp in a subclass associated with the cache. IndexInformation::readIndexFile would fill in the index table, and mapId but setting the timestamp field could be (and effectively is) left to the cache. We're unlikely to add new caches, but this keeps the timestamp associated with the LRU cache and not the index reader.
* In IndexCache::readIndexFileToCache, the "excess" memory is calculated behind a block synchronized on the IndexCache instance, then that amount is freed in a block synchronized on a member field. It's possible, even likely, that multiple threads will free more memory than is necessary. For example, let the cache have a large, cached index in memory M0 with the lowest timestamp. If T1..Tn call readIndexFileToCache, say T1 is the first to grab the lock on cache in IndexCache::freeIndexInformation. It's legal for T2..Tn to get blocked in freeIndexInformation before T1 frees M0. T2..Tn will have calculated the number of bytes to free assuming M0 was still present, which is overly aggressive.

      was (Author: chris.douglas):
    This looks good. A few suggestions:
* In IndexCache::getIndexInformation, the write to the timestamp isn't guranteed to be [atomic|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]. Moving it into the synchronized block above it, using an AtomicLong, or setting it to be volatile should all work.
* synchronizing on the intern'd value of a string isn't safe. String maintains its pool of intern'd Strings until the JVM shuts down, so using it here will cause each mapId to add an entry to this pool without ever removing it. Further, depending on the semantics of intern() for thread safety seems risky. It makes more sense to use a more conventional synchronization strategy.
* Why is DataOutputStream imported in IFileInputStream?
* The semantics of the comparator sorting the LRU cache aren't consistent. If o1 is null and o2 is not, o1 < o2 and o2 < o1.
* Is it necessary to copy readLong and writeLong from java.io classes? Since both are Filter\*Stream classes, isn't it sufficient to wrap each in the appropriate Data\*Stream?
* readIndexFile doesn't need to set the timeStamp; it's set in getIndexInformation after the call to readIndexFileToCache, which is more appropriate. If the LRU cache were replaced with some other metric, it would make sense to pass in an IndexInformation object from cache to readIndexFile, and fill in the timestamp in a subclass associated with the cache. IndexInformation::readIndexFile would fill in the index table, and mapId but setting the timestamp field could be (and effectively is) left to the cache. We're unlikely to add new caches, but this keeps the timestamp associated with the LRU cache and not the index reader.
* In IndexCache::readIndexFileToCache, the "excess" memory is calculated behind a block synchronized on the IndexCache instance, then that amount is freed in a block synchronized on a member field. It's possible, even likely, that multiple threads will free more memory than is necessary. For example, let the cache have a large, cached index in memory M0 with the lowest timestamp. If T1..Tn call readIndexFileToCache, say T1 is the first to grab the lock on cache in IndexCache::freeIndexInformation. It's legal for T2..Tn to get blocked in freeIndexInformation before T1 frees M0. T2..Tn will have calculated the number of bytes to free assuming M0 was still present, which is overly aggressive.
  
> Cache the iFile index files in memory to reduce seeks during map output serving
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-3638
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3638
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3638-v1.patch, hadoop-3638-v2.patch
>
>
> The iFile index files can be cached in memory to reduce seeks during map output serving.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.