Posted to server-dev@james.apache.org by "Mihai Soloi (JIRA)" <ji...@apache.org> on 2012/12/31 15:10:13 UTC

[jira] [Updated] (MAILBOX-173) [gsoc2012] Distributed mailbox indexing over HBase/HDFS

     [ https://issues.apache.org/jira/browse/MAILBOX-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mihai Soloi updated MAILBOX-173:
--------------------------------

    Attachment: MAILBOX-173.patch

This patch implements an inverted index, stored in an HBase table, for searching through the mails in a mailbox.

The structure of the index is as follows.

   1. mailboxID is a java.util.UUID
   2. the fields are now enums; what is stored is a single byte that identifies the enum field.
   3. the terms in each field are tokenized using Lucene's org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer, but some fields are not tokenized due to their nature (SENT_DATE, for example)
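
As a rough illustration of item 2, a field enum that carries a one-byte identifier could look like the sketch below. The enum name, the SUBJECT and BODY constants, the byte values and the tokenized flag are all assumptions made for illustration; only SENT_DATE is named in this message, and none of this is taken from the patch itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and byte values are illustrative, not taken from the patch.
enum IndexField {
    SUBJECT((byte) 1, true),
    BODY((byte) 2, true),
    SENT_DATE((byte) 3, false); // not tokenized: the whole value is one term

    // Reverse lookup table from the stored byte back to the enum constant.
    private static final Map<Byte, IndexField> BY_ID = new HashMap<>();
    static {
        for (IndexField f : values()) {
            BY_ID.put(f.id, f);
        }
    }

    private final byte id;
    private final boolean tokenized;

    IndexField(byte id, boolean tokenized) {
        this.id = id;
        this.tokenized = tokenized;
    }

    // The single byte that would be written into the index.
    byte id() {
        return id;
    }

    boolean isTokenized() {
        return tokenized;
    }

    static IndexField fromId(byte id) {
        IndexField f = BY_ID.get(id);
        if (f == null) {
            throw new IllegalArgumentException("Unknown field id: " + id);
        }
        return f;
    }
}
```

Storing one byte per field keeps the keys short, and the reverse lookup lets a reader map a stored byte back to its field.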

The row key is composed of all the above byte arrays concatenated, so that searching through the HBase table is fast, as is lookup on a specific mailbox and field in the mail. The mailID is the qualifier in the static column family (there is only one column family), so mail IDs are found with relative ease.
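
The row layout described above can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not the patch's code: the class and method names are hypothetical, the UUID is assumed to be serialized as its two 64-bit halves, and terms are assumed to be UTF-8 encoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch of the layout: [mailboxID (16 bytes)][field id (1 byte)][term bytes].
final class IndexRowKey {
    private IndexRowKey() {
    }

    // Prefix shared by every row of one field in one mailbox.
    static byte[] prefix(UUID mailboxId, byte fieldId) {
        return ByteBuffer.allocate(17)
                .putLong(mailboxId.getMostSignificantBits())
                .putLong(mailboxId.getLeastSignificantBits())
                .put(fieldId)
                .array();
    }

    // Full row key: the prefix followed by the (assumed UTF-8) term bytes.
    static byte[] rowKey(UUID mailboxId, byte fieldId, String term) {
        byte[] prefix = prefix(mailboxId, fieldId);
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + termBytes.length)
                .put(prefix)
                .put(termBytes)
                .array();
    }

    // True when the row key starts with the given prefix.
    static boolean hasPrefix(byte[] row, byte[] prefix) {
        if (row.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (row[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because HBase keeps rows sorted by key, all terms of one field in one mailbox sit next to each other under this layout, which is what makes lookups by mailbox and field cheap.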

This covers the mail document itself. The flags are stored in a single row in the table (one row for each mailbox) and can be found easily by a scan. Each of the rows currently has an empty value; in the future we may be able to store data there related to the term frequency in the document.

What currently works are searches based on text, flags, headers, all criteria, uid and uid ranges. These are implemented using Filters inside an Endpoint Coprocessor, which has the benefit of less data transfer over the network and distributed processing on each region.
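
To illustrate why filtering on the server side reduces network transfer, here is an in-memory model of a uid range check applied to the qualifiers of a matched row. It deliberately does not use the HBase Filter or coprocessor APIs; the class name, method name and the assumption that mail IDs are longs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model only: this does not use the HBase Filter or coprocessor API.
final class UidRangeFilterSketch {
    private UidRangeFilterSketch() {
    }

    // Keeps the mail IDs (assumed to be longs) that fall within [low, high], bounds included.
    static List<Long> filter(List<Long> qualifiers, long low, long high) {
        List<Long> kept = new ArrayList<>();
        for (long uid : qualifiers) {
            if (uid >= low && uid <= high) {
                kept.add(uid);
            }
        }
        return kept;
    }
}
```

If a check like this runs on each region, only the surviving mail IDs need to travel back to the client instead of every qualifier in the row.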
                
> [gsoc2012] Distributed mailbox indexing over HBase/HDFS
> -------------------------------------------------------
>
>                 Key: MAILBOX-173
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-173
>             Project: James Mailbox
>          Issue Type: New Feature
>          Components: hbase, lucene, store
>            Reporter: Ioan Eugen Stan
>            Assignee: Ioan Eugen Stan
>              Labels: gsoc, gsoc2012, mentor
>         Attachments: MAILBOX-173.patch
>
>
> James provides a module called Lucene Mailbox Index that knows how to index emails. Indexing is done by providing a suitable Lucene Directory implementation that will store the index and allow searching. Lucene comes with a file system Directory, a JDBC Directory and a few other implementations to store the index in a file system or in a database.
> In order to provide distributed search, we should write a Directory implementation that will store the index in HBase. Such an implementation is described very well here [1].
> [1] http://www.infoq.com/articles/LuceneHbase
