Posted to dev@accumulo.apache.org by "John Vines (Created) (JIRA)" <ji...@apache.org> on 2012/03/09 18:10:59 UTC

[jira] [Created] (ACCUMULO-454) RFile Input Format

RFile Input Format
------------------

                 Key: ACCUMULO-454
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-454
             Project: Accumulo
          Issue Type: New Feature
          Components: client
            Reporter: John Vines
            Assignee: Billie Rinaldi
             Fix For: 1.4.1


We currently provide InputFormats for reading from Accumulo, and output formats both for writing directly to Accumulo and for writing RFiles. But we provide no mechanism for running a MapReduce job over existing RFiles, which may be useful for optimizing data flow. We already have input formats that use RFiles directly for input (the offline input format Keith just finished), but that still relies on the Accumulo structure. We should also create an input format that reads RFiles directly, like the other standard file input formats.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Created] (ACCUMULO-454) RFile Input Format

Posted by Aaron Cordova <aa...@cordovas.org>.
Smooth.

On Mar 9, 2012, at 12:36 PM, Keith Turner wrote:

> On Fri, Mar 9, 2012 at 12:31 PM, Aaron Cordova <aa...@cordovas.org> wrote:
>> Does Keith's input format apply the necessary Accumulo iterators to provide a sane view of the data to MapReduce?
>> 
> 
> Yes, the input format sets up the iterator stack using system iterators
> and any iterators configured for the table.  It also respects tablet
> boundaries when reading files to avoid stale data.


Re: [jira] [Created] (ACCUMULO-454) RFile Input Format

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Mar 9, 2012 at 12:31 PM, Aaron Cordova <aa...@cordovas.org> wrote:
> Does Keith's input format apply the necessary Accumulo iterators to provide a sane view of the data to MapReduce?
>

Yes, the input format sets up the iterator stack using system iterators
and any iterators configured for the table.  It also respects tablet
boundaries when reading files to avoid stale data.
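
As a rough illustration of why a raw read of the files can differ from the view an iterator stack provides, here is a toy model in plain Java. It does not use the Accumulo API: Entry, mergeSorted, and saneView are hypothetical names, and the only behaviors modeled are keeping the newest version of each cell and honoring delete markers. The real stack does more (for example, applying any iterators configured on the table), so treat this as a sketch under those assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class IteratorStackSketch {

    /** Hypothetical, simplified stand-in for one key/value entry in an RFile. */
    static final class Entry {
        final String row;
        final String column;
        final long timestamp;
        final boolean deleted; // true means this entry is a delete marker
        final String value;

        Entry(String row, String column, long timestamp, boolean deleted, String value) {
            this.row = row;
            this.column = column;
            this.timestamp = timestamp;
            this.deleted = deleted;
            this.value = value;
        }
    }

    /** Sort by row, then column, then newest timestamp first; delete markers win ties. */
    static final Comparator<Entry> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        return Boolean.compare(b.deleted, a.deleted);
    };

    /** Raw view: every entry from every file, merged into one sorted list. */
    static List<Entry> mergeSorted(List<List<Entry>> files) {
        List<Entry> merged = new ArrayList<>();
        for (List<Entry> file : files) {
            merged.addAll(file);
        }
        merged.sort(ORDER);
        return merged;
    }

    /** Filtered view: newest version per cell only, with deleted cells suppressed. */
    static List<Entry> saneView(List<List<Entry>> files) {
        List<Entry> out = new ArrayList<>();
        String lastRow = null;
        String lastColumn = null;
        for (Entry e : mergeSorted(files)) {
            if (e.row.equals(lastRow) && e.column.equals(lastColumn)) {
                continue; // older than an entry already seen for this cell
            }
            lastRow = e.row;
            lastColumn = e.column;
            if (!e.deleted) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two files holding overlapping cells: an update to r1 and a delete of r2.
        List<Entry> olderFile = Arrays.asList(
            new Entry("r1", "c1", 1L, false, "a"),
            new Entry("r2", "c1", 1L, false, "x"));
        List<Entry> newerFile = Arrays.asList(
            new Entry("r1", "c1", 2L, false, "b"),
            new Entry("r2", "c1", 2L, true, ""));
        List<List<Entry>> files = Arrays.asList(olderFile, newerFile);

        System.out.println("raw entries: " + mergeSorted(files).size());
        for (Entry e : saneView(files)) {
            System.out.println(e.row + " " + e.column + " = " + e.value);
        }
    }
}
```

In this model the raw merge still contains the superseded value for r1 and the deleted value for r2, while the filtered view contains only the current value of r1. An input format that reads RFiles with no such filtering would hand the raw form to the mapper.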

Re: [jira] [Created] (ACCUMULO-454) RFile Input Format

Posted by Aaron Cordova <aa...@cordovas.org>.
Does Keith's input format apply the necessary Accumulo iterators to provide a sane view of the data to MapReduce?

And what you're proposing is an input format that works over RFiles where perhaps multiple versions of the same row/column don't exist in multiple files and where there are no delete markers, etc?

On Mar 9, 2012, at 12:10 PM, John Vines (Created) (JIRA) wrote:

> RFile Input Format
> ------------------
> 
>                 Key: ACCUMULO-454
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-454
>             Project: Accumulo
>          Issue Type: New Feature
>          Components: client
>            Reporter: John Vines
>            Assignee: Billie Rinaldi
>             Fix For: 1.4.1
> 
> 
> We currently provide InputFormats for reading from Accumulo and output formats for both direct input as well as outputting RFiles. But we provide no mechanism for doing a mapreduce over existing RFiles, which may be useful for optimizing data flow. We already have input formats which use RFiles directly for input (The offline input format Keith just finished), but that still relies on the Accumulo structure. We should go ahead and also create an input format that just hits RFiles like the other standard file input formats.
> 
> 
> 


Re: [jira] [Created] (ACCUMULO-454) RFile Input Format

Posted by John Vines <jo...@ugov.gov>.
The intent of this utility is not to run over files that are in Accumulo.
The intent is to support a multiphase MapReduce where you want to ingest
both the final records generated and some of the intermediate data. You
shouldn't have to output one file type to continue processing and then
have to translate that file format into an RFile.

Also, a note for the future: when you have commentary or a question on a
ticket, keep it in JIRA rather than responding via email. I think it will
make it easier for the implementer to make design decisions, as well as
for end users to understand design intent.

John

On Fri, Mar 9, 2012 at 12:32 PM, Aaron Cordova <aa...@cordovas.org> wrote:

> Does Keith's input format apply the necessary Accumulo iterators to
> provide a sane view of the data to MapReduce?
>
> And what you're proposing is an input format that works over RFiles where
> perhaps multiple versions of the same row/column don't exist in multiple
> files and where there are no delete markers, etc?
>
> On Mar 9, 2012, at 12:10 PM, John Vines (Created) (JIRA) wrote:
>
> > RFile Input Format
> > ------------------
> >
> >                 Key: ACCUMULO-454
> >                 URL: https://issues.apache.org/jira/browse/ACCUMULO-454
> >             Project: Accumulo
> >          Issue Type: New Feature
> >          Components: client
> >            Reporter: John Vines
> >            Assignee: Billie Rinaldi
> >             Fix For: 1.4.1
> >
> >
> > We currently provide InputFormats for reading from Accumulo and output
> > formats for both direct input as well as outputting RFiles. But we provide
> > no mechanism for doing a mapreduce over existing RFiles, which may be
> > useful for optimizing data flow. We already have input formats which use
> > RFiles directly for input (The offline input format Keith just finished),
> > but that still relies on the Accumulo structure. We should go ahead and
> > also create an input format that just hits RFiles like the other standard
> > file input formats.
> >
> >
> >
>
>