Posted to server-dev@james.apache.org by "Ioan Eugen Stan (Created) (JIRA)" <ji...@apache.org> on 2012/02/29 17:43:57 UTC

[jira] [Created] (MAILBOX-170) Store mailboxes in HDFS SequenceFile

Store mailboxes in HDFS SequenceFile
------------------------------------

                 Key: MAILBOX-170
                 URL: https://issues.apache.org/jira/browse/MAILBOX-170
             Project: James Mailbox
          Issue Type: Improvement
          Components: hbase
    Affects Versions: 0.4
            Reporter: Ioan Eugen Stan
            Assignee: Ioan Eugen Stan
             Fix For: 0.5


The current implementation stores messages directly in HBase. I believe a better approach is to store the messages in HDFS SequenceFiles as <mail_ID>: <message_data> pairs. HBase will store the offsets into the SequenceFile of each mailbox for fast access, similar to a Hadoop MapFile.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


[jira] [Commented] (MAILBOX-170) Store mailboxes in HDFS SequenceFile

Posted by "Ioan Eugen Stan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220934#comment-13220934 ] 

Ioan Eugen Stan commented on MAILBOX-170:
-----------------------------------------

As I've mentioned, I disagree with the split, but I will do as you suggested and start a new implementation. I'm going for mailbox-hbase-hdfs. If there are no other objections, I will start working on this in the near future, but I will probably not be able to maintain/improve both implementations.

Cheers, 
                


[jira] [Commented] (MAILBOX-170) Store mailboxes in HDFS SequenceFile

Posted by "Eric Charles (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220056#comment-13220056 ] 

Eric Charles commented on MAILBOX-170:
--------------------------------------

Quick answer:
- IMAP queries can be 'give me all mails with foo in the body'.
- Please start a new mailbox-hadoop submodule for the things you describe.
Thx again, Eric


                


[jira] [Commented] (MAILBOX-170) Store mailboxes in HDFS SequenceFile

Posted by "Eric Charles (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219979#comment-13219979 ] 

Eric Charles commented on MAILBOX-170:
--------------------------------------

Hi Ioan, imho storing the raw mail in an hdfs sequence file can be an option.

We will need to measure the efficiency of this implementation compared to a pure hbase one (you know the story: "... hdfs is for very very large files...").

Besides a distributed mailbox locker (JAMES-1388), we also need a mechanism to query the mailbox efficiently (for imap search queries, for example, which are not covered in the current hbase impl either).

btw, please ensure the existing mailbox-hbase remains as such (without hadoop), and start the implementation in a mailbox-hadoop project.

                


[jira] [Commented] (MAILBOX-170) Store mailboxes in HDFS SequenceFile

Posted by "Ioan Eugen Stan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220006#comment-13220006 ] 

Ioan Eugen Stan commented on MAILBOX-170:
-----------------------------------------

Hello Eric, long post ahead :)

First, could you please explain what you meant by efficiently querying the mailbox? I don't follow.

Second, I don't believe a pure HBase implementation is the best. Let me explain why: HBase can't handle large emails, and storing them inside HBase will lead to performance issues (I have some experience with this from working for my current employer). That's why I'm planning to move the message implementation to HDFS.

Basically, I wish to create an mbox on steroids: a replicated mbox that provides indexed access to messages. I plan to store mailboxes as SequenceFiles and store in HBase the offset of the key-value pair that holds each message.

Message additions will be appends, and we will use ZK locking to synchronize write access between multiple instances of James. Deletes will be instant markers plus MR jobs that do the permanent clean-up: create a copy of the old file containing only the messages that are not deleted, then update the references in HBase. Reads will be done by opening the file, seeking to the stored offset, and retrieving the message. I plan to mimic the Hadoop MapFile in HBase. I don't wish to use the MapFile directly because it uses two files instead of one (each file uses 150 bytes of RAM plus one block, so it is not good with millions of mailboxes, especially when we have HBase). All the metadata will be stored in HBase as it is now, for fast access; the same will (maybe) apply to message headers.
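The append / delete-marker / seek-read / clean-up cycle described above can be sketched in plain Java. This is only an illustration under loud assumptions: a local file stands in for the HDFS SequenceFile, an in-memory TreeMap stands in for the HBase offset table, `synchronized` stands in for the ZK lock, and the clean-up runs inline rather than as an MR job. The class name MailboxFileSketch and the record layout are hypothetical and not part of James or Hadoop.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch only: local file = SequenceFile stand-in, in-memory maps = HBase stand-in. */
public class MailboxFileSketch {
    private File file;
    // uid -> byte offset of the record (in the proposal this index lives in HBase)
    private final TreeMap<Long, Long> offsets = new TreeMap<>();
    // uids flagged as deleted but still physically present in the file
    private final Set<Long> deleted = new HashSet<>();

    public MailboxFileSketch(File file) {
        this.file = file;
    }

    /** Appends one record laid out as [uid: long][length: int][body] and indexes its offset. */
    public synchronized void append(long uid, byte[] body) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long offset = raf.length();
            raf.seek(offset);
            raf.writeLong(uid);
            raf.writeInt(body.length);
            raf.write(body);
            offsets.put(uid, offset);
        }
    }

    /** Looks up the offset, seeks, and reads the body; returns null if unknown or deleted. */
    public synchronized byte[] read(long uid) throws IOException {
        Long offset = offsets.get(uid);
        if (offset == null || deleted.contains(uid)) {
            return null;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            raf.readLong(); // skip the stored uid
            byte[] body = new byte[raf.readInt()];
            raf.readFully(body);
            return body;
        }
    }

    /** Instant delete: only a marker is recorded, the file is untouched. */
    public synchronized void markDeleted(long uid) {
        deleted.add(uid);
    }

    /** Clean-up: copies live messages in uid order into an empty target file, then swaps the index. */
    public synchronized void compact(File emptyTarget) throws IOException {
        MailboxFileSketch copy = new MailboxFileSketch(emptyTarget);
        for (Map.Entry<Long, Long> entry : offsets.entrySet()) {
            byte[] body = read(entry.getKey());
            if (body != null) {
                copy.append(entry.getKey(), body);
            }
        }
        file = emptyTarget;
        offsets.clear();
        offsets.putAll(copy.offsets);
        deleted.clear();
    }

    public synchronized long fileLength() {
        return file.length();
    }
}
```

Because the index is sorted by UID, walking it also yields messages in ascending UID order. In a real implementation the index update and the file swap would have to be coordinated across James instances, which this sketch does not attempt.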

Messages will be stored with the UID as key (UIDs are ascending), which means we can also iterate over them for bulk loads.
Also, because a file is stored in HDFS and replicated, we can have good performance since readers can access it from many nodes. I have to see the message access pattern to optimize this. Replication is done per file, so we can replicate frequently accessed mailboxes more times than usual => good performance on reads, because we can read in parallel, as they are immutable ;).

I plan to implement a special type of Writable that will allow us to stream the message from HBase and avoid loading the whole message in memory. BytesWritable is fine for a start, but it uses readFully to load the whole value of a SequenceFile entry (== our message), so big messages will cause problems.
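To illustrate the difference between readFully and streaming, here is a minimal plain java.io sketch. It is not a Hadoop Writable and makes no claim about how the planned one would look; the class StreamingValueSketch, the method copyValue, and the [length: int][body] record layout are assumptions made for the example. The point is only that the body is copied through a fixed-size buffer, so memory use does not grow with message size.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Sketch only: copies a length-prefixed value in chunks instead of loading it whole. */
public final class StreamingValueSketch {
    private StreamingValueSketch() {
    }

    /**
     * Reads one [length: int][body] record from in and copies the body to out,
     * holding at most bufferSize bytes in memory. Returns the body length.
     */
    public static int copyValue(InputStream in, OutputStream out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0) {
            throw new IOException("negative record length: " + length);
        }
        byte[] buffer = new byte[bufferSize];
        int remaining = length;
        while (remaining > 0) {
            // Never read past the end of this record, even if more records follow.
            int n = data.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("record truncated, " + remaining + " bytes missing");
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return length;
    }
}
```

A caller could pass the socket or response stream as out, so a large message would flow from storage to the client without ever being fully buffered.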

I plan to use the Hadoop FileSystem class, so we will use the same distributed filesystem that HBase uses => this means the implementation could run on any distributed fs supported by HBase.

I also think HBase is intimately tied to Hadoop and things will not change in the near future, so not taking advantage of that is kind of a dumb thing to do.

Basically that's all. With enough free time, I think we can make James run in a cluster.

Cheers, 


                