Posted to dev@accumulo.apache.org by "Keith Turner (Created) (JIRA)" <ji...@apache.org> on 2012/02/10 17:40:59 UTC

[jira] [Created] (ACCUMULO-387) Map reduce directly over files

Map reduce directly over files
------------------------------

                 Key: ACCUMULO-387
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
             Project: Accumulo
          Issue Type: New Feature
            Reporter: Keith Turner
             Fix For: 1.5.0


Support map reduce jobs that directly read Accumulo files.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209030#comment-13209030 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

For mapper locality, we can use a tablet's last location; I think clone copies this.  Alternatively, we could choose the node that holds the majority of the blocks for the tablet's files.
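
The two locality options above could be sketched roughly as follows. This is only an illustration under assumptions: LocalityChooser, hostWithMostBlocks and preferredHost are hypothetical names, not Accumulo code, and the per-block replica hosts are passed in as plain lists rather than looked up from HDFS.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Accumulo. In a real input format the
// replica hosts would come from HDFS block-location lookups.
class LocalityChooser {

  // Counts, per host, how many blocks of the tablet's files it holds and
  // returns the host with the highest count (ties are broken arbitrarily).
  // Each inner list names the replica hosts of one block.
  static Optional<String> hostWithMostBlocks(Map<String, List<List<String>>> blockHostsByFile) {
    Map<String, Integer> blocksPerHost = new HashMap<>();
    for (List<List<String>> blocks : blockHostsByFile.values()) {
      for (List<String> replicaHosts : blocks) {
        for (String host : replicaHosts) {
          blocksPerHost.merge(host, 1, Integer::sum);
        }
      }
    }
    return blocksPerHost.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }

  // Prefers the tablet's last location when one is recorded (null if none),
  // otherwise falls back to the host holding the most blocks.
  static Optional<String> preferredHost(String lastLocation,
                                        Map<String, List<List<String>>> blockHostsByFile) {
    if (lastLocation != null) {
      return Optional.of(lastLocation);
    }
    return hostWithMostBlocks(blockHostsByFile);
  }

  public static void main(String[] args) {
    Map<String, List<List<String>>> files = new HashMap<>();
    files.put("f1.rf", Arrays.asList(Arrays.asList("node1", "node2"), Arrays.asList("node1", "node3")));
    files.put("f2.rf", Arrays.asList(Arrays.asList("node1", "node2")));
    // No last location recorded, so the block count decides.
    System.out.println(preferredHost(null, files).orElse("no host"));
  }
}
```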
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.5.0
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Updated] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-387:
----------------------------------

    Fix Version/s:     (was: 1.5.0)
                   1.4.1
    
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Issue Comment Edited] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210658#comment-13210658 ] 

Keith Turner edited comment on ACCUMULO-387 at 2/17/12 11:13 PM:
-----------------------------------------------------------------

This input format could run against offline tables.  It does not care whether you clone or not, but it will only start if the table is offline.  This is easy to achieve: just clone the table and take it offline.  This is simpler than trying to adjust settings to disable compactions and writes, settings that may change over time.

One drawback of this approach is that the current code to take a table offline is asynchronous.  It starts a table going offline, but does not wait for that to happen.  The input format could probably get around this pretty easily.  It could check that the table state is offline and then wait until there are no locations in the metadata table.  Once there are no locations, it could start computing input splits.
                
      was (Author: kturner):
    This input format could run against offline tables.  It does not care whether you clone or not, but it will only start if the table is offline.  This is easy to achieve: just clone the table and take it offline.  This is simpler than trying to adjust settings to disable compactions and reads, settings that may change over time.

One drawback of this approach is that the current code to take a table offline is asynchronous.  It starts a table going offline, but does not wait for that to happen.  The input format could probably get around this pretty easily.  It could check that the table state is offline and then wait until there are no locations in the metadata table.  Once there are no locations, it could start computing input splits.
                  
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210677#comment-13210677 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

I was looking at the current input format code to determine the best way to share code with this new direct file input format.  I realized a good solution may be to create a new Scanner that reads directly from files.  The current Accumulo input format could then just use this new scanner, so a new input format would not be needed.  However, the computation of input splits would need to be specialized.  One nice thing about doing it this way is that the change can be made in InputFormatBase, and then AccumuloRowInputFormat and AccumuloInputFormat both get the ability to read directly over files.

This scanner would also be useful for any application outside of map reduce that wants to read the files of a table directly.  One possible way to achieve this is with an API like the following.  Calls to create batch writers and batch scanners could throw an UnsupportedOperationException or just pass through to the wrapped instance.  Of course, the batch scanner could be supported directly on files too, but I would not implement this until there is a need for it.

{noformat}

  DirectFileInstance instance = new DirectFileInstance(new ZooKeeperInstance(...));

  // The call below creates a scanner that reads directly from a table's files.
  // It could check that the table (and the tablets in the range) are offline.
  Scanner scanner = instance.getConnector(...).createScanner(...);

{noformat}
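
The wrapping idea above (scans served from files, write calls either rejected or passed through) might look something like this stripped-down sketch. Everything here is a stand-in made up for illustration: TableClient and DirectFileClient are not Accumulo types, and the offline check and file reader are injected as plain functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for a client interface; NOT Accumulo's real API.
interface TableClient {
  List<String> scan(String table);
  void write(String table, String entry);
}

// Wraps an existing client and serves scans from a table's files instead.
class DirectFileClient implements TableClient {
  private final TableClient wrapped;                        // the normal, server-backed client
  private final Predicate<String> tableIsOffline;           // assumed offline check
  private final Function<String, List<String>> fileReader;  // assumed file-level reader
  private final boolean passThroughWrites;

  DirectFileClient(TableClient wrapped, Predicate<String> tableIsOffline,
                   Function<String, List<String>> fileReader, boolean passThroughWrites) {
    this.wrapped = wrapped;
    this.tableIsOffline = tableIsOffline;
    this.fileReader = fileReader;
    this.passThroughWrites = passThroughWrites;
  }

  @Override
  public List<String> scan(String table) {
    // Reading files underneath an online table is unsafe, so refuse.
    if (!tableIsOffline.test(table)) {
      throw new IllegalStateException("table " + table + " is not offline");
    }
    return fileReader.apply(table);
  }

  @Override
  public void write(String table, String entry) {
    // The two options from the comment: delegate to the wrapped client, or reject.
    if (passThroughWrites) {
      wrapped.write(table, entry);
    } else {
      throw new UnsupportedOperationException("direct file access is read-only");
    }
  }

  public static void main(String[] args) {
    List<String> sink = new ArrayList<>();
    TableClient live = new TableClient() {
      public List<String> scan(String table) { return Arrays.asList("from tablet server"); }
      public void write(String table, String entry) { sink.add(entry); }
    };
    TableClient direct = new DirectFileClient(
        live, table -> table.equals("snapshot"), table -> Arrays.asList("row1", "row2"), false);
    System.out.println(direct.scan("snapshot"));
  }
}
```

Rejecting writes keeps the read-only nature of the wrapper obvious to callers; passing them through is more convenient but hides the fact that scans and writes then go through different paths.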
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Updated] (ACCUMULO-387) Support map reduce directly over files

Posted by "Eric Newton (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton updated ACCUMULO-387:
---------------------------------

    Assignee: Eric Newton
    
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.5.0
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "John Vines (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211414#comment-13211414 ] 

John Vines commented on ACCUMULO-387:
-------------------------------------

We need to make sure the security documentation is clear enough that users know they need read access to the files in order to access them directly, both from map reduce and locally. We should also look into integrating the authenticator / authorizer into this scanner if possible.
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210341#comment-13210341 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

I am thinking the clone operation will be left up to the user, to give them the flexibility that you mentioned.  The input format will just work against a table.  So you could clone and compact if you plan to read the table multiple times.  You could also compact and then clone, which will save disk space.  We just need to properly document how to clone a table such that automatic compactions are disabled.


                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Billie Rinaldi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211383#comment-13211383 ] 

Billie Rinaldi commented on ACCUMULO-387:
-----------------------------------------

ACCUMULO-418 is related to this.  A table could have RFiles of widely varying sizes, which could make map reduce jobs run poorly unless RFile is splittable.
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210662#comment-13210662 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

We could add an option to clone table that starts the cloned table off in the offline state, so we never even bother loading the tablets for the clone.  This would be a fairly simple change.

One other nice thing about this approach is that tables in the offline state can be deleted, so the cloned table never has to come online.  It basically ends up being a snapshot of the tablets' files that prevents garbage collection of those files.

                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Joey Echeverria (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210315#comment-13210315 ] 

Joey Echeverria commented on ACCUMULO-387:
------------------------------------------

I would only force compact if you're going to run over the clone multiple times. Otherwise you're going to pay the same cost of reading the non-sequential data, plus writing it back out, plus reading the sequential copy. I don't see how that's a net win.
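
To put rough numbers on that trade-off, here is a small sketch. The cost figures are arbitrary units chosen purely for illustration (nothing here was measured), and CompactionBreakEven is a made-up name.

```java
// Illustrative cost model only; the units and inputs are assumptions.
class CompactionBreakEven {

  // Every pass reads the uncompacted (non-sequential) files.
  static double costWithoutCompaction(double scatteredRead, int passes) {
    return passes * scatteredRead;
  }

  // Compaction reads the scattered data once and writes it back out;
  // afterwards every pass reads the sequential copy.
  static double costWithCompaction(double scatteredRead, double write,
                                   double sequentialRead, int passes) {
    return scatteredRead + write + passes * sequentialRead;
  }

  // Smallest number of passes at which compacting first is strictly cheaper,
  // or -1 if that never happens within maxPasses.
  static int minPassesForCompactionToWin(double scatteredRead, double write,
                                         double sequentialRead, int maxPasses) {
    for (int passes = 1; passes <= maxPasses; passes++) {
      if (costWithCompaction(scatteredRead, write, sequentialRead, passes)
          < costWithoutCompaction(scatteredRead, passes)) {
        return passes;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Scattered read 10, write 10, sequential read 5 (arbitrary units).
    System.out.println(minPassesForCompactionToWin(10, 10, 5, 100)); // prints 5
  }
}
```

With these made-up inputs a single pass costs 25 with compaction versus 10 without, and compaction only comes out ahead from the fifth pass onward.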
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Updated] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-387:
----------------------------------

    Summary: Support map reduce directly over files  (was: Map reduce directly over files)
    
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210658#comment-13210658 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

This input format could run against offline tables.  It does not care whether you clone or not, but it will only start if the table is offline.  This is easy to achieve: just clone the table and take it offline.  This is simpler than trying to adjust settings to disable compactions and reads, settings that may change over time.

One drawback of this approach is that the current code to take a table offline is asynchronous.  It starts a table going offline, but does not wait for that to happen.  The input format could probably get around this pretty easily.  It could check that the table state is offline and then wait until there are no locations in the metadata table.  Once there are no locations, it could start computing input splits.
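
The wait described above could be as simple as the polling loop sketched here. It is a sketch under assumptions: OfflineWaiter is a hypothetical name, and the table-state check and the count of assigned locations are injected as suppliers instead of being read from ZooKeeper and the metadata table.

```java
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

// Hypothetical helper; the suppliers stand in for real table-state and
// metadata-table lookups.
class OfflineWaiter {

  // Returns true once the table state is offline and no tablet has an
  // assigned location; returns false if the table is not marked offline,
  // the poll limit is reached, or the thread is interrupted.
  static boolean waitUntilOffline(BooleanSupplier tableStateIsOffline,
                                  IntSupplier assignedLocationCount,
                                  long pollMillis, int maxPolls) {
    if (!tableStateIsOffline.getAsBoolean()) {
      return false; // never start against a table that is not going offline
    }
    for (int poll = 0; poll < maxPolls; poll++) {
      if (assignedLocationCount.getAsInt() == 0) {
        return true; // all tablets unloaded; safe to compute input splits
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    int[] stillAssigned = {2};
    // Simulates tablets unloading one at a time between polls.
    boolean ready = waitUntilOffline(() -> true, () -> stillAssigned[0]--, 10L, 20);
    System.out.println("ready to compute splits: " + ready);
  }
}
```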
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205538#comment-13205538 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

I think this is much easier now that we have clone table.  I think the following needs to be done:

 * Clone table (pass in options to disable writes and major compactions on clone)
 * Run map reduce over files referenced by clone
 * Delete clone

This would need a special input format that instantiates the iterator stack in the mapper for each tablet.  Doing this instead of reading the files directly is important for the following reasons.

 * The iterator stack will properly process updates and deletes that were made
 * The iterator stack will only read the data in a file that falls within a tablet.  This is important because tablets can reference files that contain data outside of a tablet, data that could have been deleted in another tablet.  Using the iterator stack will prevent this from happening.
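
Both points can be seen in a toy model. The sketch below is heavily simplified and is not how Accumulo's iterators are implemented: it assumes one value per row (no column families, qualifiers or visibilities), keeps only the newest version of each row, and treats a tablet as the row range (prevEndRow, endRow]. All class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of what an iterator stack provides when reading a tablet's files.
class TabletReadSketch {

  static final class Entry {
    final String row;
    final long timestamp;
    final boolean delete;   // true for a delete marker
    final String value;

    Entry(String row, long timestamp, boolean delete, String value) {
      this.row = row;
      this.timestamp = timestamp;
      this.delete = delete;
      this.value = value;
    }
  }

  // A tablet covers (prevEndRow, endRow]; null means unbounded on that side.
  static boolean inTablet(String row, String prevEndRow, String endRow) {
    return (prevEndRow == null || row.compareTo(prevEndRow) > 0)
        && (endRow == null || row.compareTo(endRow) <= 0);
  }

  // Merges all files, drops rows outside the tablet, keeps the newest entry
  // per row, and suppresses rows whose newest entry is a delete marker.
  static Map<String, String> readTablet(List<List<Entry>> files, String prevEndRow, String endRow) {
    TreeMap<String, Entry> newest = new TreeMap<>();
    for (List<Entry> file : files) {
      for (Entry e : file) {
        if (!inTablet(e.row, prevEndRow, endRow)) {
          continue; // data in a shared file that belongs to another tablet
        }
        Entry seen = newest.get(e.row);
        boolean newer = seen == null
            || e.timestamp > seen.timestamp
            || (e.timestamp == seen.timestamp && e.delete); // deletes win ties
        if (newer) {
          newest.put(e.row, e);
        }
      }
    }
    TreeMap<String, String> visible = new TreeMap<>();
    for (Entry e : newest.values()) {
      if (!e.delete) {
        visible.put(e.row, e.value);
      }
    }
    return visible;
  }

  public static void main(String[] args) {
    List<Entry> file1 = Arrays.asList(
        new Entry("a", 1, false, "old"), new Entry("m", 1, false, "v1"), new Entry("z", 1, false, "far"));
    List<Entry> file2 = Arrays.asList(
        new Entry("a", 3, false, "new"), new Entry("m", 2, true, null), new Entry("n", 2, false, "v2"));
    // Tablet (b, p]: "a" and "z" are out of range, "m" was deleted.
    System.out.println(readTablet(Arrays.asList(file1, file2), "b", "p")); // prints {n=v2}
  }
}
```

Reading the two files naively would have returned five rows, including the deleted "m" and rows that belong to neighbouring tablets.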


                
> Map reduce directly over files
> ------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210085#comment-13210085 ] 

Jason Rutherglen commented on ACCUMULO-387:
-------------------------------------------

+1. Hive currently runs slowly on HBase, which is a major drawback.  However, I wonder whether the tablet files should be 'force' compacted first, to ensure sequential reads by the map reduce jobs.
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Resolved] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-387.
-----------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 1.4.1)
                   1.4.0
         Assignee: Keith Turner  (was: Eric Newton)
    
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.4.0
>
>
> Support map reduce jobs that directly read Accumulo files.


        

[jira] [Commented] (ACCUMULO-387) Support map reduce directly over files

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210343#comment-13210343 ] 

Keith Turner commented on ACCUMULO-387:
---------------------------------------

Something we may want to consider for 1.5, if this catches on, is making RFile splittable.  This would help in the case you mentioned, where you compact tablets down to one file.
                
> Support map reduce directly over files
> --------------------------------------
>
>                 Key: ACCUMULO-387
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-387
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>             Fix For: 1.4.1
>
>
> Support map reduce jobs that directly read Accumulo files.
