Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2009/01/25 08:55:59 UTC

[jira] Created: (HIVE-250) shared memory java dbm for map-side joins

shared memory java dbm for map-side joins
-----------------------------------------

                 Key: HIVE-250
                 URL: https://issues.apache.org/jira/browse/HIVE-250
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Joydeep Sen Sarma
            Assignee: Joydeep Sen Sarma


can use either:
- sdbm: http://freshmeat.net/projects/solingerjavasdbm/
- jdbm: http://sourceforge.net/projects/jdbm/

Both need modifications to use file mmaps instead of regular file I/O. Will do some testing to see if there's a major difference between the two.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-250:
----------------------------

       Resolution: Fixed
    Fix Version/s: 0.4.0
     Release Note: HIVE-250. Shared memory java dbm for map-side joins. (Joydeep Sen Sarma via zshao)
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks Joydeep!




[jira] Commented: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667056#action_12667056 ] 

Joydeep Sen Sarma commented on HIVE-250:
----------------------------------------

I think we had looked at Berkeley DB as well; it seemed way too complicated to modify. It's a few lines' mod in jdbm (fingers crossed).




[jira] Commented: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667351#action_12667351 ] 

Joydeep Sen Sarma commented on HIVE-250:
----------------------------------------

OK - I guess I was quite off on this.

These modules use a Java object cache: they deserialize objects from the underlying dbm files. The object cache (as implemented) cannot be shared across processes. Mmapping the file and then reading it has some performance advantages (fewer system calls), but doesn't help much from a memory perspective. If the file is small and there are lots of readers, the file system will hold the file in cache anyway (and that's 1X memory consumption).
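
A minimal sketch of the mmap read path described above, assuming a made-up file layout of fixed-width (int key, long value) records sorted by key. The class name MmapTable and the layout are hypothetical; this is not jdbm or sdbm code. It only shows lookups served by absolute reads from a read-only mapping, with no read() system call per lookup.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration: fixed-width records of (int key, long value), sorted by key.
final class MmapTable {
    static final int RECORD_BYTES = Integer.BYTES + Long.BYTES;

    private final MappedByteBuffer buf;
    private final int count;

    MmapTable(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            // A single mapping is limited to Integer.MAX_VALUE bytes.
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            count = (int) (ch.size() / RECORD_BYTES);
        }
    }

    /** Binary search over the mapped records; returns null if the key is absent. */
    Long get(int key) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int k = buf.getInt(mid * RECORD_BYTES); // absolute read, no system call
            if (k < key) {
                lo = mid + 1;
            } else if (k > key) {
                hi = mid - 1;
            } else {
                return buf.getLong(mid * RECORD_BYTES + Integer.BYTES);
            }
        }
        return null;
    }

    /** Writes sorted records with ordinary file I/O (helper for the example only). */
    static void write(Path file, int[] sortedKeys, long[] values) throws IOException {
        ByteBuffer out = ByteBuffer.allocate(sortedKeys.length * RECORD_BYTES);
        for (int i = 0; i < sortedKeys.length; i++) {
            out.putInt(sortedKeys[i]).putLong(values[i]);
        }
        Files.write(file, out.array());
    }
}
```

Several processes mapping the same file read-only are backed by the same page-cache pages, which is the 1X memory consumption mentioned above. Any Java objects built from those bytes still live in each process's own heap.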

So for v1, I would say just use one of these modules with efficient deserialization (jdbm uses Java object serialization by default, and that's way too expensive), and trust file caching to do its job. The object cache in the dbm (configurable) will have to be kept to a reasonable size, quite likely a subset of the entire data size.
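
A rough sketch of the v1 idea, assuming a size-bounded LRU object cache in front of a byte-level store, with a hand-written decoder in place of Java object serialization. All names here (BoundedObjectCache, the store and decoder functions, the length-prefixed encoding) are hypothetical and are not jdbm's API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical names throughout; this is not jdbm's API.
final class BoundedObjectCache<K, V> {
    private final Map<K, V> lru;
    private final Function<K, byte[]> store;   // stands in for the dbm file read
    private final Function<byte[], V> decoder; // hand-written deserializer
    int decodes = 0;                           // number of deserializations performed

    BoundedObjectCache(final int maxEntries,
                       Function<K, byte[]> store,
                       Function<byte[], V> decoder) {
        this.store = store;
        this.decoder = decoder;
        // accessOrder=true makes LinkedHashMap iterate from least to most recently used.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    V get(K key) {
        V v = lru.get(key);
        if (v == null) {
            byte[] raw = store.apply(key);
            if (raw == null) {
                return null;
            }
            v = decoder.apply(raw); // the expensive step the cache tries to avoid
            decodes++;
            lru.put(key, v);
        }
        return v;
    }

    /** Compact encoding: 4-byte length followed by UTF-8 bytes. */
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + utf8.length)
                .putInt(utf8.length).put(utf8).array();
    }

    static String decode(byte[] raw) {
        int len = ByteBuffer.wrap(raw).getInt();
        return new String(raw, Integer.BYTES, len, StandardCharsets.UTF_8);
    }
}
```

With maxEntries smaller than the data set, repeated lookups of hot keys skip decoding, while cold keys pay for a decode on each miss. The raw bytes are expected to come from the file system cache.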

If we get time, we can do something smarter: the object cache should hold pointers to data in the underlying mmap'ed file. This would mean that data across all object caches is shared (we will still have to pay for object overhead, though). With the same amount of memory, the object cache can be sized much larger (since it's not storing data anymore).
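
One way the pointer-based cache could look, as a sketch only. It assumes a made-up file layout of repeated [int key][int length][value bytes] records, and the class name MmapPointerCache is hypothetical. The cache stores only an (offset, length) pair per key, and lookups return a view over the mapped bytes instead of a copy.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical layout: repeated [int key][int length][length bytes of value].
final class MmapPointerCache {
    private final ByteBuffer mapped;                           // backed by the OS page cache
    private final Map<Integer, int[]> index = new HashMap<>(); // key -> {offset, length}

    MmapPointerCache(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
        // One pass to record where each value lives; no value bytes are copied.
        int pos = 0;
        while (pos < mapped.limit()) {
            int key = mapped.getInt(pos);
            int len = mapped.getInt(pos + 4);
            index.put(key, new int[] {pos + 8, len});
            pos += 8 + len;
        }
    }

    /** Returns a zero-copy, read-only view of the value bytes, or null if absent. */
    ByteBuffer get(int key) {
        int[] ptr = index.get(key);
        if (ptr == null) {
            return null;
        }
        ByteBuffer view = mapped.duplicate(); // own position/limit, same memory
        view.limit(ptr[0] + ptr[1]);
        view.position(ptr[0]);
        return view.slice();
    }

    /** Writes the example layout with ordinary file I/O (helper for the example only). */
    static void write(Path file, Map<Integer, String> rows) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (Map.Entry<Integer, String> e : rows.entrySet()) {
                byte[] v = e.getValue().getBytes(StandardCharsets.UTF_8);
                out.writeInt(e.getKey());
                out.writeInt(v.length);
                out.write(v);
            }
        }
    }
}
```

The per-entry cost here is the index entry and a small buffer view object, which is the object overhead mentioned above. The value bytes themselves stay in the shared mapping.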

--
I did do some experiments:
- replaced file I/O with mmapped buffers in jdbm: works perfectly
- the bulk of the compute cost is in object deserialization. On my T60 with a single thread, lookup throughput went from 6.5K op/s to 41K op/s when deserialization was avoided (by making the object cache large enough to store all values). CPU bound, of course.






[jira] Updated: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-250:
----------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-250:
----------------------------

    Attachment: hive.250.1.patch




[jira] Commented: (HIVE-250) shared memory java dbm for map-side joins

Posted by "Prasad Chakka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667053#action_12667053 ] 

Prasad Chakka commented on HIVE-250:
------------------------------------

If they have to be modified to use mmap, then why not Berkeley DB Java Edition?

