Posted to dev@parquet.apache.org by Julien Le Dem <ju...@dremio.com> on 2016/01/13 20:54:24 UTC

Parquet-cpp

Hello Nong, Wes, Stephen, Deepak and Aliaksei
I wanted to introduce you to each other as you are all looking at
Parquet-cpp.

I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
see you already doing this):
https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp

Nong is a committer and can merge pull requests (he also understands that
code base very well).
Other committers can too; feel free to ping us if you need help.
Obviously, you don't need to be a committer to give others reviews (you
just need one to approve and merge).

-- 
Julien

Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
I am happy to help out with the patch maintenance when there are conflicts.
With PARQUET-437 we'll want to write more unit tests which will help make
sure we aren't breaking each other's code.

On Mon, Jan 25, 2016 at 2:33 PM, Aliaksei Sandryhaila <as...@gmail.com>
wrote:

> Hi Ryan,
>
> This sounds very reasonable. I am not arguing that we should disregard the
> standard Apache approach to promoting contributors to committers. I am just
> pointing out that without input from current committers it is hard for us to
> contribute productively to the project. As a consequence, it is hard for us
> to demonstrate our fit to become committers in the future. This leaves us in
> a deadlock, which can be resolved either by more feedback from existing
> committers or by making us committers sooner.
>
> I understand that most committers on the Parquet project are working on
> the Java implementation, so it can be harder for them to review patches for
> parquet-cpp. In this regard, how about the following protocol for
> parquet-cpp pull requests: After contributors review and revise a pull
> request and agree that it is in good shape, we will ask a designated
> committer to review and commit the pull request. So far we have been asking
> Nong; if there is a better designated committer for parquet-cpp, please let
> us know.
>
> Thank you,
> Aliaksei.
>
>
>
> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>
>> Hi everyone,
>>
>> Sorry about the current backlog on the parquet-cpp side. Most of the
>> current committer base works on the Java implementation so it's either slow
>> or not reliable for us to do those reviews.
>>
>> I think the best way to move forward is to review patches for each other.
>> That will keep those issues progressing, make it easy for committers to
>> validate the commit, and -- most importantly -- to build a trail of
>> contributions that we can look at to vote in new committers.
>>
>> I completely sympathize with the need for committers on the CPP project,
>> but I don't think this will take a long time given the current level of
>> activity. We're really just trying to build confidence that:
>>
>> 1. You produce quality contributions and understand the codebase
>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>> 3. You defer judgment and ask others when you don't know
>> 4. You respect others and interact professionally
>>
>> I don't think any of those are that hard to demonstrate, but I'd be
>> uncomfortable not validating committers like we normally do. Especially in
>> this situation, where I could easily see the amount of work you guys are
>> doing adding up pretty quickly!
>>
>> Does that sound like a reasonable path forward?
>>
>> rb
>>
>>
>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>
>>> Hi Nong and Julien,
>>>
>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>> outstanding. Wes, Deepak, and I have been reviewing each other's pull
>>> requests. At this point, the patches need to be reviewed and approved by
>>> Parquet committers in order to be committed to master.
>>>
>>> Unfortunately, there is not much activity on this side of the project.
>>> The lack of response from current committers is holding us back, and we
>>> have to repeatedly rebase our patches, merge multiple pull requests
>>> together, and overall step on each other's toes.
>>>
>>> Is it possible to make Wes, Deepak, and me committers on the project, so
>>> we can contribute to parquet-cpp more efficiently?
>>>
>>> Thanks,
>>> Aliaksei.
>>>
>>>
>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>
>>>> Folks,
>>>>
>>>> We're working on a pretty solid patch queue.
>>>>
>>>> independent patches
>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>
>>>> interdependent patches (order to apply patches)
>>>> PARQUET-437 (MOSTLY REVIEWED):
>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>
>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>> PARQUET-451 & PARQUET-453:
>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>
>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433):
>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>
>>>> I'm going to take a breather and work on some other things this weekend,
>>>> but I'll be available for code reviews and fixes to try to move along
>>>> this patch queue.
>>>>
>>>> Thanks,
>>>> Wes
>>>>
>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com> wrote:
>>>>
>>>>> Great to meet you all!
>>>>>
>>>>> I've recently been collaborating with the Apache Drill team to spin out
>>>>> the ValueVector columnar in-memory data structure into a new standalone
>>>>> project that will be called Arrow [1] [2]. A brief summary of
>>>>> Arrow/ValueVectors is that it permits O(1) random access on nested
>>>>> columnar structures and is efficient for projections and scans in a
>>>>> columnar SQL setting.
>>>>>
>>>>> I'm very interested in making Parquet read/write support available to
>>>>> Python programmers via C/C++ extensions, so I'm going to be working the
>>>>> next few months on a Parquet->Arrow->Python toolchain, along with some
>>>>> tools to manipulate tables of in-memory columnar data in the style of
>>>>> Python's pandas library.
>>>>>
>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>> performance and add functionality for writing Parquet files as well. The
>>>>> details of converting to/from Parquet's repetition/definition level
>>>>> representation of nested data will stay separate in the arrow-parquet
>>>>> adapter code.
>>>>>
>>>>> cheers,
>>>>> Wes
>>>>>
>>>>> [1]:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>
>>>>> [2]:
>>>>>
>>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>
>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m....@criteo.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm very interested in this subject because I would like to export
>>>>>> parquet data from HDFS to Vertica (using VSQL).
>>>>>> I'm planning to work on it next quarter, but I will be very happy to
>>>>>> help you on this subject (review, testing).
>>>>>>
>>>>>> Have a nice day,
>>>>>> --
>>>>>> Mickaël Lacour
>>>>>> Senior Software Engineer
>>>>>> Analytics Infrastructure team @Scalability
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Walkauskas, Stephen Gregory (Vertica)
>>>>>> <st...@hpe.com>
>>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
>>>>>> nongli@gmail.com; Wes McKinney
>>>>>> Subject: Re: Parquet-cpp
>>>>>>
>>>>>> Yes, thanks for the introduction Julien.
>>>>>>
>>>>>> Nong and Wes,
>>>>>>
>>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>>
>>>>>> The Vertica database already supports optimized reads of ORC files
>>>>>> (fast C++ parser, predicate pushdown, column selection, etc.). We'd like
>>>>>> to do the same for Parquet.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen
>>>>>>
>>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>>
>>>>>>> Thank you for the introduction, Julien!
>>>>>>>
>>>>>>> Hello Nong and Wes,
>>>>>>>
>>>>>>> Stephen, Deepak and I are developing a C++ library to support Parquet
>>>>>>> in Vertica RDBMS. We are using Parquet-cpp as a starting point and are
>>>>>>> expanding its functionality as well as improving it and fixing bugs.
>>>>>>> We would like to contribute these improvements back to the open-source
>>>>>>> community. We plan to do this through the usual process of creating
>>>>>>> jiras that justify and explain a code change, and then submitting pull
>>>>>>> requests. We look forward to working with you on Parquet-cpp and to
>>>>>>> your feedback and suggestions.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Aliaksei.
>>>>>>>
>>>>>>>
>>>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>>>
>>>>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>>>>> I wanted to introduce you to each other as you are all looking at
>>>>>>>> Parquet-cpp.
>>>>>>>>
>>>>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>>>>> collaborate (I see you already doing this):
>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>>>>
>>>>>>>> Nong is a committer and can merge pull requests (he also understands
>>>>>>>> that code base very well).
>>>>>>>> Other committers can too; feel free to ping us if you need help.
>>>>>>>> Obviously, you don't need to be a committer to give others reviews
>>>>>>>> (you just need one to approve and merge).
>>>>>>>>
>>>>>>>>
>>>>>
>>>
>>
>>
>
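
Wes's January 15 message quoted above mentions O(1) random access on nested
columnar structures. One common way to get that property is to store all
child values in one flat buffer alongside an offsets array. The following is
only a minimal sketch of that idea in Python: the class name `ListColumn` and
its fields are hypothetical, chosen for illustration, and are not taken from
Arrow, Drill ValueVectors, or parquet-cpp.

```python
from typing import List, Sequence


class ListColumn:
    """Hypothetical nested column: a column of variable-length integer lists."""

    def __init__(self, rows: Sequence[Sequence[int]]) -> None:
        self.values: List[int] = []     # all child values, stored contiguously
        self.offsets: List[int] = [0]   # offsets[i] is where row i starts in values
        for row in rows:
            self.values.extend(row)
            self.offsets.append(len(self.values))

    def __len__(self) -> int:
        # One more offset than rows, because the last offset marks the end.
        return len(self.offsets) - 1

    def row(self, i: int) -> List[int]:
        """Locate row i with two offset lookups, independent of the row count."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.values[start:end]


column = ListColumn([[1, 2], [], [3, 4, 5]])
print(column.offsets)  # [0, 2, 2, 5]
print(column.row(2))   # [3, 4, 5]
```

Finding where a row starts and ends costs two array reads no matter how many
rows precede it; copying the row out is proportional to that row's length
only.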

Re: Parquet-cpp

Posted by Julien Le Dem <ju...@ledem.net>.
Based on feedback on the PR and my own review, I merged #38.
Have a good weekend.
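
The quoted thread below describes a naming convention for pull requests,
"PARQUET-X: description", which the merge script is said to enforce. As a
rough illustration only, a check of that shape could look like the sketch
below. It assumes the prefix applies to the pull request title and that X is
a decimal JIRA number; the pattern, the function name `is_valid_pr_title`,
and the sample titles are hypothetical and are not taken from the actual
merge script.

```python
import re

# Hypothetical pattern for the "PARQUET-X: description" convention:
# the project key, a dash, a numeric id, a colon, a space, then some text.
TITLE_PATTERN = re.compile(r"^PARQUET-\d+: \S.*$")


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title starts with a JIRA id followed by ': '."""
    return TITLE_PATTERN.match(title) is not None


# Made-up titles, used only to exercise the check.
print(is_valid_pr_title("PARQUET-123: Example description"))  # True
print(is_valid_pr_title("Example description"))               # False
print(is_valid_pr_title("PARQUET-123 Example description"))   # False
```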

> On Feb 5, 2016, at 3:23 PM, Wes McKinney <we...@cloudera.com> wrote:
> 
> hi folks
> 
> We are making good progress on the read path in parquet-cpp, but we
> still have limited test coverage (and thus probably a bunch of
> non-working code) in a few key areas
> 
> - the file reader public API, generally
> - column reader and scanner business logic
> - decompression codecs (I'm going to pick up this patch -- this
> weekend maybe -- https://github.com/apache/parquet-cpp/pull/11)
> - parquet < 2.0 value decoders (level decoding will be in good shape once
> Deepak's level decoder patch is merged). For example, PLAIN_DICTIONARY
> decoding is not implemented
> - parquet 2.0 value encodings (unclear how urgent this is)
> - DataPageV2
> 
> The sooner we can get the schema patch
> (https://github.com/apache/parquet-cpp/pull/38) merged the better to
> proceed with filling the rest of these gaps.
> 
> AFAICT we have JIRAs tracking almost all of these items (and some
> other bugs) -- if you find gaps or something missing, please create a
> JIRA and update the roadmap doc:
> https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit
> 
> Outside of functional requirements for read support, we have plenty of
> C++ engineering tidying to do, like:
> 
> - eliminating build dependencies from being transitively included in
> public headers (i.e. Boost and Thrift)
> - defining a public API in parquet/api/*.h for 3rd-party linkers
> - cleaning up includes
> (https://github.com/include-what-you-use/include-what-you-use)
> - shared library symbol visibility (we may not need this for a while)
> 
> Since file reading is the most pressing matter overall, I'm going to
> tilt my efforts toward completing the read path by the end of the
> month at the expense of the write path (outside of test fixtures to
> generate faux serialized data). For my needs the remaining tricky bit
> is columnar nested data structure reassembly but I'll defer on that
> until the other aspects are in good shape. I estimate about 30-40% of
> the effort is writing new code and 60-70% testing existing code and /
> or refactoring to enable component-level unit testing.
> 
> Thank you all in advance for your help, patches, and code reviews.
> 
> best,
> Wes
> 
> On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <we...@cloudera.com> wrote:
>> Yeah, if the Apache build queue is clogged up with other projects' builds,
>> and you have a green build on your personal repo, I suggest posting that on
>> the PR and the reviewer can accept the patch after checking the git hash on
>> the green build. Hopefully now Travis CI has sorted out the infrastructure
>> problems so this won't happen again soon.
>> 
>> On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <ju...@dremio.com> wrote:
>>> 
>>> You can also enable Travis for your personal repo, which would have its
>>> own queue.
>>> Then you can have the build running on your branches.
>>> 
>>> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <bl...@cloudera.com> wrote:
>>>> 
>>>> I have no problem substituting local testing as long as we test all the
>>>> environments that Travis does. I've done that to get around this problem in
>>>> the past. It takes a while to run each maven test profile, but it works.
>>>> 
>>>> rb
>>>> 
>>>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>>>> 
>>>>> Also, things have been made much worse by Travis CI continuing to have
>>>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>>>> stalled by this morning so that no builds were completing; fortunately
>>>>> their support is quite responsive and they've resolved the queue
>>>>> blockage, so builds are executing again.
>>>>> 
>>>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <wes@cloudera.com> wrote:
>>>>>
>>>>>    There are 3 more patches outstanding that are causing blockage (418,
>>>>>    433, and 451/453), so I think if we get them merged today or
>>>>>    tomorrow then we should be able to proceed with some parallel
>>>>>    efforts without quite as much conflict.
>>>>> 
>>>>>    On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <nongli@gmail.com> wrote:
>>>>>
>>>>>        I'm going to try to be more active this week but I admittedly
>>>>>        don't have a lot of time to work on this. I understand we need to
>>>>>        get critical mass in committers, code, etc. to keep this going
>>>>>        but I think we're making good progress.
>>>>> 
>>>>>        On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>>>>        <julien@dremio.com> wrote:
>>>>>
>>>>>            Also, as Nong mentioned, PRs should be prefixed by the JIRA
>>>>>            id followed by a ":", as follows: "PARQUET-X: description".
>>>>>            That's just to have the reference in the git changelog. The
>>>>>            merge script enforces it.
>>>>> 
>>>>> 
>>>>>            On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>>>>            <julien@dremio.com> wrote:
>>>>>
>>>>>                I'm happy too with Aliaksei, Deepak, Wes, etc. reviewing
>>>>>                each other.
>>>>>                I see Nong (who's a committer) has been doing some
>>>>>                reviews already.
>>>>>
>>>>>                When you guys reach a consensus on a PR and want it
>>>>>                merged please mention it in the PR (+1, LGTM) and
>>>>>                mention us directly (@julienledem, ...) to have it
>>>>>                merged.
>>>>>
>>>>>                Right now I see that #19 and #21 have been committed
>>>>>                (thanks Nong) but it is not clear to me in what order
>>>>>                the others should be committed.
>>>>>
>>>>>                For example Deepak should comment directly on #22 to
>>>>>                approve it. Right now he mentioned it on another PR:
>>>>>                https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>>>                Similarly Wes could confirm on that PR whether it looks
>>>>>                good.
>>>>>
>>>>>                Tomorrow is the Parquet sync up if you want to discuss
>>>>>                further:
>>>>>                https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>> 
>>>>> 
>>>>>                On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>>>>                <blue@cloudera.com> wrote:
>>>>> 
>>>>>                    Aliaksei, thanks for being understanding here.
>>>>> 
>>>>>                    I agree with you that it is too difficult. We really
>>>>>                    want to get the cpp side bootstrapped as soon as
>>>>>                    possible. Let's go with what you suggested, to have
>>>>>                    contributors review one another's patches and then
>>>>>                    ask a committer for a final review once both
>>>>>                    contributors reach a consensus.
>>>>> 
>>>>>                    If there are issues that are easy to review, maybe
>>>>>                    some of us other than Nong can take a look.
>>>>> 
>>>>>                    rb
>>>>> 
>>>>> 
>>>>>                                                would like to contribute
>>>>>                                                these improvements back
>>>>>                                                to the open-source
>>>>>                                                community. We plan to do
>>>>>                                                this through the usual
>>>>>                                                process of creating
>>>>>                                                jiras that justify and
>>>>>                                                explain a code change,
>>>>>                                                and then submitting
>>>>>                                                pull
>>>>>                                                requests. We look
>>>>>                                                forward to working with
>>>>>                                                you on Parquet-cpp and
>>>>> to
>>>>>                                                your
>>>>>                                                feedback and
>>>>> suggestions.
>>>>> 
>>>>>                                                Best regards,
>>>>>                                                Aliaksei.
>>>>> 
>>>>> 
>>>>>                                                On 01/13/2016 02:54 PM,
>>>>>                                                Julien Le Dem wrote:
>>>>> 
>>>>>                                                    Hello Nong, Wes,
>>>>>                                                    Stephen, Deepak and
>>>>>                                                    Aliaksei
>>>>>                                                    I wanted to
>>>>>                                                    introduce you to
>>>>>                                                    each other as you
>>>>>                                                    are all looking at
>>>>>                                                    Parquet-cpp.
>>>>> 
>>>>>                                                    I'd recommend
>>>>>                                                    opening JIRAs in the
>>>>>                                                    parquet-cpp
>>>>> component to
>>>>> 
>>>>>                                            collaborate (I
>>>>> 
>>>>>                                                    see you already
>>>>>                                                    doing this):
>>>>> 
>>>>> 
>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>> 
>>>>> 
>>>>>                                                    Nong is a committer
>>>>>                                                    and can merge pull
>>>>>                                                    requests (he also
>>>>>                                                    understands
>>>>> 
>>>>>                                            that
>>>>> 
>>>>>                                                    code base very
>>>>> well).
>>>>>                                                    Other committers can
>>>>>                                                    too, feel free to
>>>>>                                                    ping us if you need
>>>>> help
>>>>>                                                    Obviously, you don't
>>>>>                                                    need to be a
>>>>>                                                    committer to give
>>>>>                                                    others reviews
>>>>>                                                    (you
>>>>>                                                    just need one to
>>>>>                                                    approve and merge).
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>                    --
>>>>>                    Ryan Blue
>>>>>                    Software Engineer
>>>>>                    Cloudera, Inc.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>                --
>>>>>                Julien
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>            --
>>>>>            Julien
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Julien
>> 
>> 


Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
hi folks

We are making good progress on the read path in parquet-cpp, but we
still have limited test coverage (and thus probably a bunch of
non-working code) in a few key areas:

- the file reader public API, generally
- column reader and scanner business logic
- decompression codecs (I'm going to pick up this patch -- this
weekend maybe -- https://github.com/apache/parquet-cpp/pull/11)
- parquet < 2.0 value decoders (level decoding will be in good shape once
Deepak's level decoder patch is merged). For example, PLAIN_DICTIONARY
decoding is not implemented
- parquet 2.0 value encodings (unclear how urgent this is)
- DataPageV2
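
To illustrate the PLAIN_DICTIONARY item in the list above, here is a
minimal sketch of the last step of dictionary decoding: looking up each
index in the dictionary page's values. It assumes the indices have already
been decoded from their RLE/bit-packed form. The function name
DecodeWithDictionary is hypothetical and is not part of parquet-cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): materializes values from a
// dictionary plus a list of already-decoded dictionary indices.
template <typename T>
std::vector<T> DecodeWithDictionary(const std::vector<T>& dictionary,
                                    const std::vector<uint32_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (uint32_t idx : indices) {
    // A corrupt or truncated page could carry an index past the dictionary.
    if (idx >= dictionary.size()) {
      throw std::out_of_range("dictionary index out of range");
    }
    out.push_back(dictionary[idx]);
  }
  // Postcondition: exactly one output value per index.
  assert(out.size() == indices.size());
  return out;
}
```

With a dictionary of {10, 20, 30} and indices {2, 0, 0, 1}, this yields
{30, 10, 10, 20}. A real decoder would also have to handle the bit width
and the RLE/bit-packed runs that precede this step.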

The sooner we can get the schema patch
(https://github.com/apache/parquet-cpp/pull/38) merged the better to
proceed with filling the rest of these gaps.

AFAICT we have JIRAs tracking almost all of these items (and some
other bugs) -- if you find gaps or something missing, please create a
JIRA and update the roadmap doc:
https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit

Outside of functional requirements for read support, we have plenty of
C++ engineering tidying to do, like:

- eliminating build dependencies from being transitively included in
public headers (i.e. Boost and Thrift)
- defining a public API in parquet/api/*.h for 3rd-party linkers
- cleaning up includes
(https://github.com/include-what-you-use/include-what-you-use)
- shared library symbol visibility (we may not need this for a while)
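
One common way to keep dependencies such as Boost and Thrift out of public
headers is the pimpl idiom. The sketch below shows the general shape in a
single file. The FileReader class, its methods, and the file names in the
comments are hypothetical and only illustrate the technique; they are not
the parquet-cpp API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// ---- public header part (hypothetical parquet/api/reader.h) ----
// Only standard headers are needed here; no Thrift or Boost includes.
class FileReader {
 public:
  explicit FileReader(std::string path);
  ~FileReader();  // defined below, where Impl is a complete type
  FileReader(FileReader&&) noexcept;
  FileReader& operator=(FileReader&&) noexcept;

  const std::string& path() const;
  int64_t num_rows() const;

 private:
  class Impl;  // forward declaration hides implementation details
  std::unique_ptr<Impl> impl_;
};

// ---- private source part (hypothetical reader.cc) ----
// A real implementation would include the Thrift-generated headers here,
// so that clients of the public header never see them.
class FileReader::Impl {
 public:
  explicit Impl(std::string path) : path_(std::move(path)), num_rows_(0) {}
  std::string path_;
  int64_t num_rows_;  // placeholder; would come from the file metadata
};

FileReader::FileReader(std::string path) : impl_(new Impl(std::move(path))) {}
FileReader::~FileReader() = default;
FileReader::FileReader(FileReader&&) noexcept = default;
FileReader& FileReader::operator=(FileReader&&) noexcept = default;

const std::string& FileReader::path() const {
  assert(impl_);  // a moved-from reader must not be used
  return impl_->path_;
}

int64_t FileReader::num_rows() const {
  assert(impl_);
  return impl_->num_rows_;
}
```

The destructor and move operations are declared in the header but defined
in the source file because std::unique_ptr needs the complete Impl type at
the point where it is destroyed.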

Since file reading is the most pressing matter overall, I'm going to
tilt my efforts toward completing the read path by the end of the
month at the expense of the write path (outside of test fixtures to
generate faux serialized data). For my needs the remaining tricky bit
is columnar nested data structure reassembly, but I'll defer on that
until the other aspects are in good shape. I estimate about 30-40% of
the effort is writing new code and 60-70% is testing existing code
and/or refactoring to enable component-level unit testing.

Thank you all in advance for your help, patches, and code reviews.

best,
Wes

On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <we...@cloudera.com> wrote:
> Yeah, if the Apache build queue is clogged up with other projects' builds,
> and you have a green build on your personal repo, I suggest posting that on
> the PR and the reviewer can accept the patch after checking the git hash on
> the green build. Hopefully now Travis CI has sorted out the infrastructure
> problems so this won't happen again soon.
>
> On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <ju...@dremio.com> wrote:
>>
>> you can also enable travis for your personal repo which would have its
>> own queue.
>> Then you can have the build running on your branches.
>>
>> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <bl...@cloudera.com> wrote:
>>>
>>> I have no problem substituting local testing as long as we test all the
>>> environments that Travis does. I've done that to get around this problem in
>>> the past. It takes a while to run each maven test profile, but it works.
>>>
>>> rb
>>>
>>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>>>
>>>> Also, things have been made much worse by Travis CI continuing to have
>>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>>> stalled by this morning so that no builds were completing; fortunately
>>>> their support is quite responsive and they've resolved the queue
>>>> blockage, so builds are executing again.
>>>>
>>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <wes@cloudera.com
>>>> <ma...@cloudera.com>> wrote:
>>>>
>>>>     There are 3 more patches outstanding that are causing blockage (418,
>>>>     433, and 451/453), so I think if we get them merged today or
>>>>     tomorrow we should be able to proceed with some parallel
>>>>     efforts without quite as much conflict.
>>>>
>>>>     On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <nongli@gmail.com
>>>>     <ma...@gmail.com>> wrote:
>>>>
>>>>         I'm going to try to be more active this week but I admittedly don't
>>>>         have a lot of
>>>>         time to work on this. I understand we need to get critical mass
>>>>         in committers,
>>>>         code, etc to keep this going but I think we're making good
>>>> progress.
>>>>
>>>>         On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>>>         <julien@dremio.com <ma...@dremio.com>> wrote:
>>>>
>>>>             Also as Nong mentioned, PRs should be prefixed by the jira
>>>>             id followed by a ":" as follows "PARQUET-X: description"
>>>>             that's just to have the reference in the git changelog. The
>>>>             merge script enforces it.
>>>>
>>>>
>>>>             On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>>>             <julien@dremio.com <ma...@dremio.com>> wrote:
>>>>
>>>>                 I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>>>>                 each other.
>>>>                 I see Nong (who's a committer) has been doing some
>>>>                 reviews already.
>>>>
>>>>                 When you guys reach a consensus on a PR and want it
>>>>                 merged please mention it in the PR (+1, LGTM) and
>>>>                 mention us directly (@julienledem, ...) to have it
>>>> merged.
>>>>
>>>>                 right now I see that #19 and #21 have been committed
>>>>                 (thanks Nong) but it is not clear to me in what order
>>>>                 the others should be committed.
>>>>
>>>>                 For example Deepak should comment directly on #22 to
>>>>                 approve it. Right now he mentioned it on another PR.
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>>                 Similarly Wes could confirm on that PR whether it looks
>>>>                 good.
>>>>
>>>>                 Tomorrow is the Parquet sync up if you want to discuss
>>>>                 further:
>>>>
>>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>
>>>>
>>>>                 On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>>>                 <blue@cloudera.com <ma...@cloudera.com>> wrote:
>>>>
>>>>                     Aliaksei, thanks for being understanding here.
>>>>
>>>>                     I agree with you that it is too difficult. We really
>>>>                     want to get the cpp side bootstrapped as soon as
>>>>                     possible. Let's go with what you suggested, to have
>>>>                     contributors review one another's patches and then
>>>>                     ask a committer for a final review once both
>>>>                     contributors reach a consensus.
>>>>
>>>>                     If there are issues that are easy to review, maybe
>>>>                     some of us other than Nong can take a look.
>>>>
>>>>                     rb
>>>>
>>>>
>>>>                     On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>
>>>>                         Hi Ryan,
>>>>
>>>>                         This sounds very reasonable. I do not argue to
>>>>                         disregard the standard
>>>>                         Apache approach to promoting contributors to
>>>>                         committers. I am just
>>>>                         pointing out that without the input from current
>>>>                         committers it is hard
>>>>                         for us to productively contribute to the
>>>>                         project. As a consequence, it
>>>>                         is hard for us to demonstrate our fit to become
>>>>                         committers in the future.
>>>>                         This leaves us in a deadlock, which can be
>>>>                         resolved either by an
>>>>                         increased feedback from existing committers or
>>>>                         by making us committers
>>>>                         sooner.
>>>>
>>>>                         I understand that most committers on the Parquet
>>>>                         project are working on
>>>>                         the Java implementation, so it can be harder for
>>>>                         them to review patches
>>>>                         for parquet-cpp. In this regard, how about the
>>>>                         following protocol for
>>>>                         parquet-cpp pull requests: After contributors
>>>>                         review and revise a pull
>>>>                         request and agree that it is in a good shape, we
>>>>                         will ask a designated
>>>>                         committer to review and commit the pull request.
>>>>                         So far we have been
>>>>                         asking Nong; if there is a better designated
>>>>                         committer for parquet-cpp,
>>>>                         please let us know.
>>>>
>>>>                         Thank you,
>>>>                         Aliaksei.
>>>>
>>>>
>>>>                         On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>
>>>>                             Hi everyone,
>>>>
>>>>                             Sorry about the current backlog on the
>>>>                             parquet-cpp side. Most of the
>>>>                             current committer base works on the Java
>>>>                             implementation so it's either
>>>>                             slow or not reliable for us to do those
>>>> reviews.
>>>>
>>>>                             I think the best way to move forward is to
>>>>                             review patches for each
>>>>                             other. That will keep those issues
>>>>                             progressing, make it easy for
>>>>                             committers to validate the commit, and --
>>>>                             most importantly -- to build
>>>>                             a trail of contributions that we can look at
>>>>                             to vote in new committers.
>>>>
>>>>                             I completely sympathize with the need for
>>>>                             committers on the CPP
>>>>                             project, but I don't think this will take a
>>>>                             long time given the
>>>>                             current level of activity. We're really just
>>>>                             trying to build
>>>>                             confidence that:
>>>>
>>>>                             1. You produce quality contributions and
>>>>                             understand the codebase
>>>>                             2. You give friendly, thoughtful reviews and
>>>>                             don't rubber-stamp
>>>>                             3. You defer judgment and ask others when
>>>>                             you don't know
>>>>                             4. You respect others and interact
>>>>                             professionally
>>>>
>>>>                             I don't think any of those are that hard to
>>>>                             demonstrate, but I'd be
>>>>                             uncomfortable not validating committers like
>>>>                             we normally do.
>>>>                             Especially in this situation, where I could
>>>>                             easily see the amount of
>>>>                             work you guys are doing adding up pretty
>>>>                             quickly!
>>>>
>>>>                             Does that sound like a reasonable path
>>>> forward?
>>>>
>>>>                             rb
>>>>
>>>>
>>>>                             On 01/25/2016 12:46 PM, Aliaksei Sandryhaila
>>>>                             wrote:
>>>>
>>>>                                 Hi Nong and Julien,
>>>>
>>>>                                 As Wes has pointed out, we have a number
>>>>                                 of patches for parquet-cpp
>>>>                                 outstanding. Wes, Deepak, and I have
>>>>                                 been reviewing each other's pull
>>>>                                 requests. At this point, the patches
>>>>                                 need to be reviewed and approved by
>>>>                                 Parquet committers in order to be
>>>>                                 committed to master.
>>>>
>>>>                                 Unfortunately, there is not much
>>>>                                 activity on this side of the project.
>>>>                                 The lack of response from current
>>>>                                 committers is holding us back, and we
>>>>                                 have to repeatedly rebase our patches,
>>>>                                 merge multiple pull requests
>>>>                                 together, and overall step on each
>>>>                                 other's toes.
>>>>
>>>>                                 Is it possible to make Wes, Deepak, and
>>>>                                 me committers on the project, so
>>>>                                 we can contribute to parquet-cpp more
>>>>                                 efficiently?
>>>>
>>>>                                 Thanks,
>>>>                                 Aliaksei.
>>>>
>>>>
>>>>                                 On 01/23/2016 06:07 PM, Wes McKinney
>>>> wrote:
>>>>
>>>>                                     Folks,
>>>>
>>>>                                     We're working on a pretty solid
>>>>                                     patch queue.
>>>>
>>>>                                     independent patches
>>>>                                     PARQUET-449:
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/21
>>>>
>>>>                                     interdependent patches (order to
>>>>                                     apply patches)
>>>>                                     PARQUET-437 (MOSTLY REVIEWED):
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>
>>>>                                     PARQUET-418:
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/18
>>>>                                     PARQUET-434:
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/20
>>>>                                     PARQUET-433:
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/22
>>>>                                     PARQUET-451 & PARQUET-453:
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>
>>>>                                     PARQUET-428 (needs to be rebased on
>>>>                                     top of PARQUET-433):
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>
>>>>                                     I'm going to take a breather and
>>>>                                     work on some other things this
>>>>                                     weekend,
>>>>                                     but I'll be available for code
>>>>                                     reviews and fixes to try to move
>>>> along
>>>>                                     this
>>>>                                     patch queue.
>>>>
>>>>                                     Thanks,
>>>>                                     Wes
>>>>
>>>>                                     On Fri, Jan 15, 2016 at 8:18 AM, Wes
>>>>                                     McKinney <wes@cloudera.com
>>>>                                     <ma...@cloudera.com>> wrote:
>>>>
>>>>                                         Great to meet you all!
>>>>
>>>>                                         I've recently been collaborating
>>>>                                         with the Apache Drill team to
>>>> spin
>>>>                                         out
>>>>                                         the ValueVector columnar
>>>>                                         in-memory data structure into a
>>>> new
>>>>                                         standalone
>>>>                                         project that will be called
>>>>                                         Arrow [1] [2]. A brief summary
>>>> of
>>>>                                         Arrow/ValueVectors is that it
>>>>                                         permits O(1) random access on
>>>> nested
>>>>                                         columnar
>>>>                                         structures and is efficient for
>>>>                                         projections and scans in a
>>>> columnar
>>>>                                         SQL
>>>>                                         setting.
>>>>
>>>>                                         I'm very interested in making
>>>>                                         Parquet read/write support
>>>>                                         available to
>>>>                                         Python programmers via C/C++
>>>>                                         extensions, so I'm going to be
>>>>                                         working
>>>>                                         the
>>>>                                         next few months on a
>>>>                                         Parquet->Arrow->Python
>>>>                                         toolchain, along with some
>>>>                                         tools to manipulate tables
>>>>                                         in-memory columnar data in the
>>>>                                         style of
>>>>                                         Python's
>>>>                                         pandas library.
>>>>
>>>>                                         I will propose patches as needed
>>>>                                         to parquet-cpp to improve its
>>>>                                         performance
>>>>                                         and add functionality for
>>>>                                         writing Parquet files as well.
>>>> The
>>>>                                         details of
>>>>                                         converting to/from Parquet's
>>>>                                         repetition/definition level
>>>>                                         representation of
>>>>                                         nested data will stay separate
>>>>                                         in the arrow-parquet adapter
>>>> code.
>>>>
>>>>                                         cheers,
>>>>                                         Wes
>>>>
>>>>                                         [1]:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>
>>>>
>>>>                                         [2]:
>>>>
>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>
>>>>
>>>>                                         On Fri, Jan 15, 2016 at 1:22 AM,
>>>>                                         Mickaël Lacour
>>>>                                         <m.lacour@criteo.com
>>>>                                         <ma...@criteo.com>>
>>>>                                         wrote:
>>>>
>>>>                                             Hi,
>>>>
>>>>                                             I'm very interested in this
>>>>                                             subject because I would like
>>>>                                             to export
>>>>                                             parquet data from HDFS to
>>>>                                             Vertica (using VSQL).
>>>>                                             I'm planning to work on it
>>>>                                             next quarter, but I will be
>>>>                                             very happy to
>>>>                                             help
>>>>                                             you on this subject (review,
>>>>                                             testing).
>>>>
>>>>                                             Have a nice day,
>>>>                                             --
>>>>                                             Mickaël Lacour
>>>>                                             Senior Software Engineer
>>>>                                             Analytics Infrastructure
>>>>                                             team @Scalability
>>>>
>>>>
>>>> ________________________________________
>>>>                                             From: Walkauskas, Stephen
>>>>                                             Gregory (Vertica)
>>>>                                             <stephen.walkauskas@hpe.com
>>>>
>>>> <ma...@hpe.com>>
>>>>                                             Sent: Thursday, January 14,
>>>>                                             2016 3:23 PM
>>>>                                             To: Sandryhaila, Aliaksei;
>>>>                                             dev@parquet.apache.org
>>>>
>>>> <ma...@parquet.apache.org>;
>>>>                                             Majeti, Deepak;
>>>>                                             nongli@gmail.com
>>>>                                             <ma...@gmail.com>;
>>>>
>>>>                                             Wes McKinney
>>>>                                             Subject: Re: Parquet-cpp
>>>>
>>>>                                             Yes, thanks for the
>>>>                                             introduction Julien.
>>>>
>>>>                                             Nong and Wes,
>>>>
>>>>                                             It'd be interesting to know
>>>>                                             your goals for parquet-cpp.
>>>>
>>>>                                             The Vertica database already
>>>>                                             supports optimized reads of
>>>>                                             ORC files
>>>>                                             (fast
>>>>                                             c++ parser, predicate
>>>>                                             pushdown, columns selection
>>>>                                             etc). We'd like
>>>>                                             to do
>>>>                                             the same for parquet.
>>>>
>>>>                                             Cheers,
>>>>                                             Stephen
>>>>
>>>>                                             On 01/13/2016 05:53 PM,
>>>>                                             Sandryhaila, Aliaksei wrote:
>>>>
>>>>                                                 Thank you for the
>>>>                                                 introduction, Julien!
>>>>
>>>>                                                 Hello Nong and Wes,
>>>>
>>>>                                                 Stephen, Deepak and I
>>>>                                                 are developing a C++
>>>>                                                 library to support
>>>>                                                 Parquet in
>>>>                                                 Vertica RDBMS. We are
>>>>                                                 using Parquet-cpp as a
>>>>                                                 starting point and are
>>>>                                                 expanding its
>>>>                                                 functionality as well as
>>>>                                                 improving it and fixing
>>>>                                                 bugs. We
>>>>                                                 would like to contribute
>>>>                                                 these improvements back
>>>>                                                 to the open-source
>>>>                                                 community. We plan to do
>>>>                                                 this through the usual
>>>>                                                 process of creating
>>>>                                                 jiras that justify and
>>>>                                                 explain a code change,
>>>>                                                 and then submitting
>>>>                                                 pull
>>>>                                                 requests. We look
>>>>                                                 forward to working with
>>>>                                                 you on Parquet-cpp and
>>>> to
>>>>                                                 your
>>>>                                                 feedback and
>>>> suggestions.
>>>>
>>>>                                                 Best regards,
>>>>                                                 Aliaksei.
>>>>
>>>>
>>>>                                                 On 01/13/2016 02:54 PM,
>>>>                                                 Julien Le Dem wrote:
>>>>
>>>>                                                     Hello Nong, Wes,
>>>>                                                     Stephen, Deepak and
>>>>                                                     Aliaksei
>>>>                                                     I wanted to
>>>>                                                     introduce you to
>>>>                                                     each other as you
>>>>                                                     are all looking at
>>>>                                                     Parquet-cpp.
>>>>
>>>>                                                     I'd recommend
>>>>                                                     opening JIRAs in the
>>>>                                                     parquet-cpp component
>>>>                                                     to collaborate (I
>>>>                                                     see you already
>>>>                                                     doing this):
>>>>
>>>>
>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>
>>>>
>>>>                                                     Nong is a committer
>>>>                                                     and can merge pull
>>>>                                                     requests (he also
>>>>                                                     understands that
>>>>                                                     code base very well).
>>>>                                                     Other committers can
>>>>                                                     too; feel free to
>>>>                                                     ping us if you need
>>>>                                                     help.
>>>>                                                     Obviously, you don't
>>>>                                                     need to be a
>>>>                                                     committer to give
>>>>                                                     others reviews (you
>>>>                                                     just need one to
>>>>                                                     approve and merge).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>                     --
>>>>                     Ryan Blue
>>>>                     Software Engineer
>>>>                     Cloudera, Inc.
>>>>
>>>>
>>>>
>>>>
>>>>                 --
>>>>                 Julien
>>>>
>>>>
>>>>
>>>>
>>>>             --
>>>>             Julien
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>
>>
>>
>>
>> --
>> Julien
>
>

Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
Yeah, if the Apache build queue is clogged up with other projects' builds
and you have a green build on your personal repo, I suggest posting a link to
it on the PR; the reviewer can then accept the patch after checking that the
git hash on the green build matches the PR. Hopefully Travis CI has now sorted
out its infrastructure problems so this won't happen again soon.
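
To illustrate the hash check described above, here is a minimal sketch in
Python. It is not part of the Parquet tooling: the function name same_commit
and the 7-character minimum for abbreviated hashes are assumptions made for
this example. It only compares two hash strings that the reviewer has copied
from the CI build page and from the PR.

```python
HEX_DIGITS = set("0123456789abcdef")


def same_commit(build_sha: str, pr_head_sha: str, min_len: int = 7) -> bool:
    """Return True if both hashes name the same commit.

    Either hash may be abbreviated, so the shorter one only has to be a
    prefix of the longer one. Hashes shorter than min_len are rejected
    because they are too ambiguous to trust.
    """
    a = build_sha.strip().lower()
    b = pr_head_sha.strip().lower()
    if min(len(a), len(b)) < min_len:
        return False
    if not (set(a) <= HEX_DIGITS and set(b) <= HEX_DIGITS):
        return False
    return a.startswith(b) or b.startswith(a)
```

A reviewer would call same_commit(hash_from_build_page, hash_from_pr) and only
accept the green build as evidence when it returns True.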

On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <ju...@dremio.com> wrote:

> You can also enable Travis for your personal repo, which would have its
> own queue.
> Then you can have the build running on your branches.
>
> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> I have no problem substituting local testing as long as we test all the
>> environments that Travis does. I've done that to get around this problem in
>> the past. It takes a while to run each maven test profile, but it works.
>>
>> rb
>>
>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>
>>> Also, things have been made much worse by Travis CI continuing to have
>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>> stalled by this morning so that no builds were completing; fortunately
>>> their support is quite responsive and they've resolved the queue
>>> blockage, so builds are executing again.
>>>
>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <wes@cloudera.com> wrote:
>>>
>>>     There are 3 more patches outstanding that are causing blockage (418,
>>>     433, and 451/453), so I think if we get them merged today or
>>>     tomorrow we should be able to proceed with some parallel
>>>     efforts without quite as much conflict.
>>>
>>>     On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <nongli@gmail.com> wrote:
>>>
>>>         I'm going to try to be more active this week, but I admittedly
>>>         don't have a lot of time to work on this. I understand we need to
>>>         get critical mass in committers, code, etc. to keep this going,
>>>         but I think we're making good progress.
>>>
>>>         On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>>         <julien@dremio.com> wrote:
>>>
>>>             Also, as Nong mentioned, PRs should be prefixed by the JIRA
>>>             id followed by a ":", as in "PARQUET-X: description".
>>>             That's just to have the reference in the git changelog. The
>>>             merge script enforces it.
>>>
>>>
>>>             On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>>             <julien@dremio.com> wrote:
>>>
>>>                 I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>>>                 each other.
>>>                 I see Nong (who's a committer) has been doing some
>>>                 reviews already.
>>>
>>>                 When you guys reach a consensus on a PR and want it
>>>                 merged, please mention it in the PR (+1, LGTM) and
>>>                 mention us directly (@julienledem, ...) to have it
>>>                 merged.
>>>
>>>                 Right now I see that #19 and #21 have been committed
>>>                 (thanks Nong), but it is not clear to me in what order
>>>                 the others should be committed.
>>>
>>>                 For example Deepak should comment directly on #22 to
>>>                 approve it. Right now he mentioned it on another PR.
>>>
>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>                 Similarly Wes could confirm on that PR whether it looks
>>>                 good.
>>>
>>>                 Tomorrow is the Parquet sync up if you want to discuss
>>>                 further:
>>>
>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>
>>>
>>>                 On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>>                 <blue@cloudera.com> wrote:
>>>
>>>                     Aliaksei, thanks for being understanding here.
>>>
>>>                     I agree with you that it is too difficult. We really
>>>                     want to get the cpp side bootstrapped as soon as
>>>                     possible. Let's go with what you suggested, to have
>>>                     contributors review one another's patches and then
>>>                     ask a committer for a final review once both
>>>                     contributors reach a consensus.
>>>
>>>                     If there are issues that are easy to review, maybe
>>>                     some of us other than Nong can take a look.
>>>
>>>                     rb
>>>
>>>
>>>                     On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>
>>>                         Hi Ryan,
>>>
>>>                         This sounds very reasonable. I do not argue to
>>>                         disregard the standard
>>>                         Apache approach to promoting contributors to
>>>                         committers. I am just
>>>                         pointing out that without the input from current
>>>                         committers it is hard
>>>                         for us to productively contribute to the
>>>                         project. As a consequence, it
>>>                         is hard for us to demonstrate our fit to become
>>>                         committers in the future.
>>>                         This leaves us in a deadlock, which can be
>>>                         resolved either by
>>>                         increased feedback from existing committers or
>>>                         by making us committers
>>>                         sooner.
>>>
>>>                         I understand that most committers on the Parquet
>>>                         project are working on
>>>                         the Java implementation, so it can be harder for
>>>                         them to review patches
>>>                         for parquet-cpp. In this regard, how about the
>>>                         following protocol for
>>>                         parquet-cpp pull requests: After contributors
>>>                         review and revise a pull
>>>                         request and agree that it is in a good shape, we
>>>                         will ask a designated
>>>                         committer to review and commit the pull request.
>>>                         So far we have been
>>>                         asking Nong; if there is a better designated
>>>                         committer for parquet-cpp,
>>>                         please let us know.
>>>
>>>                         Thank you,
>>>                         Aliaksei.
>>>
>>>
>>>                         On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>
>>>                             Hi everyone,
>>>
>>>                             Sorry about the current backlog on the
>>>                             parquet-cpp side. Most of the
>>>                             current committer base works on the Java
>>>                             implementation so it's either
>>>                             slow or not reliable for us to do those
>>>                             reviews.
>>>
>>>                             I think the best way to move forward is to
>>>                             review patches for each
>>>                             other. That will keep those issues
>>>                             progressing, make it easy for
>>>                             committers to validate the commit, and --
>>>                             most importantly -- to build
>>>                             a trail of contributions that we can look at
>>>                             to vote in new committers.
>>>
>>>                             I completely sympathize with the need for
>>>                             committers on the CPP
>>>                             project, but I don't think this will take a
>>>                             long time given the
>>>                             current level of activity. We're really just
>>>                             trying to build
>>>                             confidence that:
>>>
>>>                             1. You produce quality contributions and
>>>                             understand the codebase
>>>                             2. You give friendly, thoughtful reviews and
>>>                             don't rubber-stamp
>>>                             3. You defer judgment and ask others when
>>>                             you don't know
>>>                             4. You respect others and interact
>>>                             professionally
>>>
>>>                             I don't think any of those are that hard to
>>>                             demonstrate, but I'd be
>>>                             uncomfortable not validating committers like
>>>                             we normally do.
>>>                             Especially in this situation, where I could
>>>                             easily see the amount of
>>>                             work you guys are doing adding up pretty
>>>                             quickly!
>>>
>>>                             Does that sound like a reasonable path
>>>                             forward?
>>>
>>>                             rb
>>>
>>>
>>>                             On 01/25/2016 12:46 PM, Aliaksei Sandryhaila
>>>                             wrote:
>>>
>>>                                 Hi Nong and Julien,
>>>
>>>                                 As Wes has pointed out, we have a number
>>>                                 of patches for parquet-cpp
>>>                                 outstanding. Wes, Deepak, and I have
>>>                                 been reviewing each other's pull
>>>                                 requests. At this point, the patches
>>>                                 need to be reviewed and approved by
>>>                                 Parquet committers in order to be
>>>                                 committed to master.
>>>
>>>                                 Unfortunately, there is not much
>>>                                 activity on this side of the project.
>>>                                 The lack of response from current
>>>                                 committers is holding us back, and we
>>>                                 have to repeatedly rebase our patches,
>>>                                 merge multiple pull requests
>>>                                 together, and overall step on each
>>>                                 other's toes.
>>>
>>>                                 Is it possible to make Wes, Deepak, and
>>>                                 me committers on the project, so
>>>                                 we can contribute to parquet-cpp more
>>>                                 efficiently?
>>>
>>>                                 Thanks,
>>>                                 Aliaksei.
>>>
>>>
>>>                                 On 01/23/2016 06:07 PM, Wes McKinney
>>>                                 wrote:
>>>
>>>                                     Folks,
>>>
>>>                                     We're working on a pretty solid
>>>                                     patch queue.
>>>
>>>                                     Independent patches:
>>>                                     PARQUET-449:
>>>                                     https://github.com/apache/parquet-cpp/pull/21
>>>
>>>                                     Interdependent patches (in the order
>>>                                     to apply them):
>>>                                     PARQUET-437 (MOSTLY REVIEWED):
>>>                                     https://github.com/apache/parquet-cpp/pull/19
>>>                                     PARQUET-418:
>>>                                     https://github.com/apache/parquet-cpp/pull/18
>>>                                     PARQUET-434:
>>>                                     https://github.com/apache/parquet-cpp/pull/20
>>>                                     PARQUET-433:
>>>                                     https://github.com/apache/parquet-cpp/pull/22
>>>                                     PARQUET-451 & PARQUET-453:
>>>                                     https://github.com/apache/parquet-cpp/pull/23
>>>                                     PARQUET-428 (needs to be rebased on
>>>                                     top of PARQUET-433):
>>>                                     https://github.com/apache/parquet-cpp/pull/24
>>>
>>>                                     I'm going to take a breather and
>>>                                     work on some other things this
>>>                                     weekend, but I'll be available for
>>>                                     code reviews and fixes to try to
>>>                                     move this patch queue along.
>>>
>>>                                     Thanks,
>>>                                     Wes
>>>
>>>                                     On Fri, Jan 15, 2016 at 8:18 AM, Wes
>>>                                     McKinney <wes@cloudera.com> wrote:
>>>
>>>                                         Great to meet you all!
>>>
>>>                                         I've recently been collaborating
>>>                                         with the Apache Drill team to
>>>                                         spin out the ValueVector columnar
>>>                                         in-memory data structure into a
>>>                                         new standalone project that will
>>>                                         be called Arrow [1] [2]. A brief
>>>                                         summary of Arrow/ValueVectors is
>>>                                         that it permits O(1) random
>>>                                         access on nested columnar
>>>                                         structures and is efficient for
>>>                                         projections and scans in a
>>>                                         columnar SQL setting.
>>>
>>>                                         I'm very interested in making
>>>                                         Parquet read/write support
>>>                                         available to Python programmers
>>>                                         via C/C++ extensions, so I'm
>>>                                         going to be working the next few
>>>                                         months on a
>>>                                         Parquet->Arrow->Python
>>>                                         toolchain, along with some tools
>>>                                         to manipulate tables of in-memory
>>>                                         columnar data in the style of
>>>                                         Python's pandas library.
>>>
>>>                                         I will propose patches as needed
>>>                                         to parquet-cpp to improve its
>>>                                         performance and add functionality
>>>                                         for writing Parquet files as
>>>                                         well. The details of converting
>>>                                         to/from Parquet's
>>>                                         repetition/definition level
>>>                                         representation of nested data
>>>                                         will stay separate in the
>>>                                         arrow-parquet adapter code.
>>>
>>>                                         cheers,
>>>                                         Wes
>>>
>>>                                         [1]:
>>>
>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>
>>>
>>>                                         [2]:
>>>
>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>
>>>
>>>                                         On Fri, Jan 15, 2016 at 1:22 AM,
>>>                                         Mickaël Lacour
>>>                                         <m.lacour@criteo.com> wrote:
>>>
>>>                                             Hi,
>>>
>>>                                             I'm very interested in this
>>>                                             subject because I would like
>>>                                             to export Parquet data from
>>>                                             HDFS to Vertica (using VSQL).
>>>                                             I'm planning to work on it
>>>                                             next quarter, but I will be
>>>                                             very happy to help you on
>>>                                             this subject (review,
>>>                                             testing).
>>>
>>>                                             Have a nice day,
>>>                                             --
>>>                                             Mickaël Lacour
>>>                                             Senior Software Engineer
>>>                                             Analytics Infrastructure
>>>                                             team @Scalability
>>>
>>>
>>> ________________________________________
>>>                                             From: Walkauskas, Stephen
>>>                                             Gregory (Vertica)
>>>                                             <stephen.walkauskas@hpe.com>
>>>                                             Sent: Thursday, January 14,
>>>                                             2016 3:23 PM
>>>                                             To: Sandryhaila, Aliaksei;
>>>                                             dev@parquet.apache.org;
>>>                                             Majeti, Deepak;
>>>                                             nongli@gmail.com;
>>>                                             Wes McKinney
>>>                                             Subject: Re: Parquet-cpp
>>>
>>>                                             Yes, thanks for the
>>>                                             introduction Julien.
>>>
>>>                                             Nong and Wes,
>>>
>>>                                             It'd be interesting to know
>>>                                             your goals for parquet-cpp.
>>>
>>>                                             The Vertica database already
>>>                                             supports optimized reads of
>>>                                             ORC files (fast C++ parser,
>>>                                             predicate pushdown, column
>>>                                             selection, etc.). We'd like
>>>                                             to do the same for Parquet.
>>>
>>>                                             Cheers,
>>>                                             Stephen
>>>
>>>                                             On 01/13/2016 05:53 PM,
>>>                                             Sandryhaila, Aliaksei wrote:
>>>
>>>                                                 Thank you for the
>>>                                                 introduction, Julien!
>>>
>>>                                                 Hello Nong and Wes,
>>>
>>>                                                 Stephen, Deepak and I
>>>                                                 are developing a C++
>>>                                                 library to support
>>>                                                 Parquet in
>>>                                                 Vertica RDBMS. We are
>>>                                                 using Parquet-cpp as a
>>>                                                 starting point and are
>>>                                                 expanding its
>>>                                                 functionality as well as
>>>                                                 improving it and fixing
>>>                                                 bugs. We
>>>                                                 would like to contribute
>>>                                                 these improvements back
>>>                                                 to the open-source
>>>                                                 community. We plan to do
>>>                                                 this through the usual
>>>                                                 process of creating
>>>                                                 jiras that justify and
>>>                                                 explain a code change,
>>>                                                 and then submitting
>>>                                                 pull
>>>                                                 requests. We look
>>>                                                 forward to working with
>>>                                                 you on Parquet-cpp and to
>>>                                                 your
>>>                                                 feedback and suggestions.
>>>
>>>                                                 Best regards,
>>>                                                 Aliaksei.
>>>
>>>
>>>                                                 On 01/13/2016 02:54 PM,
>>>                                                 Julien Le Dem wrote:
>>>
>>>                                                     Hello Nong, Wes,
>>>                                                     Stephen, Deepak and
>>>                                                     Aliaksei
>>>                                                     I wanted to
>>>                                                     introduce you to
>>>                                                     each other as you
>>>                                                     are all looking at
>>>                                                     Parquet-cpp.
>>>
>>>                                                     I'd recommend
>>>                                                     opening JIRAs in the
>>>                                                     parquet-cpp component
>>>                                                     to collaborate (I
>>>                                                     see you already
>>>                                                     doing this):
>>>
>>>
>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>
>>>
>>>                                                     Nong is a committer
>>>                                                     and can merge pull
>>>                                                     requests (he also
>>>                                                     understands that
>>>                                                     code base very well).
>>>                                                     Other committers can
>>>                                                     too; feel free to
>>>                                                     ping us if you need
>>>                                                     help.
>>>                                                     Obviously, you don't
>>>                                                     need to be a
>>>                                                     committer to give
>>>                                                     others reviews (you
>>>                                                     just need one to
>>>                                                     approve and merge).
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>                     --
>>>                     Ryan Blue
>>>                     Software Engineer
>>>                     Cloudera, Inc.
>>>
>>>
>>>
>>>
>>>                 --
>>>                 Julien
>>>
>>>
>>>
>>>
>>>             --
>>>             Julien
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
>
> --
> Julien
>

Re: Parquet-cpp

Posted by Julien Le Dem <ju...@dremio.com>.
You can also enable Travis for your personal repo, which would have its own
queue.
Then you can have the build running on your branches.

On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <bl...@cloudera.com> wrote:

> I have no problem substituting local testing as long as we test all the
> environments that Travis does. I've done that to get around this problem in
> the past. It takes a while to run each maven test profile, but it works.
>
> rb
>
> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>
>> Also, things have been made much worse by Travis CI continuing to have
>> infrastructure problems. The ASF build queue on Travis CI had completely
>> stalled by this morning so that no builds were completing; fortunately
>> their support is quite responsive and they've resolved the queue
>> blockage, so builds are executing again.
>>
>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <wes@cloudera.com> wrote:
>>
>>     There are 3 more patches outstanding that are causing blockage (418,
>>     433, and 451/453), so I think if we get them merged today or
>>     tomorrow we should be able to proceed with some parallel
>>     efforts without quite as much conflict.
>>
>>     On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <nongli@gmail.com> wrote:
>>
>>         I'm going to try to be more active this week, but I admittedly
>>         don't have a lot of time to work on this. I understand we need to
>>         get critical mass in committers, code, etc. to keep this going,
>>         but I think we're making good progress.
>>
>>         On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>         <julien@dremio.com> wrote:
>>
>>             Also, as Nong mentioned, PRs should be prefixed by the JIRA
>>             id followed by a ":", as in "PARQUET-X: description".
>>             That's just to have the reference in the git changelog. The
>>             merge script enforces it.
>>
>>
>>             On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>             <julien@dremio.com> wrote:
>>
>>                 I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>>                 each other.
>>                 I see Nong (who's a committer) has been doing some
>>                 reviews already.
>>
>>                 When you guys reach a consensus on a PR and want it
>>                 merged please mention it in the PR (+1, LGTM) and
>>                 mention us directly (@julienledem, ...) to have it merged.
>>
>>                 Right now I see that #19 and #21 have been committed
>>                 (thanks Nong), but it is not clear to me in what order
>>                 the others should be committed.
>>
>>                 For example Deepak should comment directly on #22 to
>>                 approve it. Right now he mentioned it on another PR.
>>
>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>                 Similarly Wes could confirm on that PR whether it looks
>>                 good.
>>
>>                 Tomorrow is the Parquet sync up if you want to discuss
>>                 further:
>>
>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>
>>
>>                 On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>                 <blue@cloudera.com> wrote:
>>
>>                     Aliaksei, thanks for being understanding here.
>>
>>                     I agree with you that it is too difficult. We really
>>                     want to get the cpp side bootstrapped as soon as
>>                     possible. Let's go with what you suggested, to have
>>                     contributors review one another's patches and then
>>                     ask a committer for a final review once both
>>                     contributors reach a consensus.
>>
>>                     If there are issues that are easy to review, maybe
>>                     some of us other than Nong can take a look.
>>
>>                     rb
>>
>>
>>                     On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>
>>                         Hi Ryan,
>>
>>                         This sounds very reasonable. I do not argue to
>>                         disregard the standard
>>                         Apache approach to promoting contributors to
>>                         committers. I am just
>>                         pointing out that without the input from current
>>                         committers it is hard
>>                         for us to productively contribute to the
>>                         project. As a consequence, it
>>                         is hard for us to demonstrate our fit to become
>>                         committers in the future.
>>                         This leaves us in a deadlock, which can be
>>                         resolved either by
>>                         increased feedback from existing committers or
>>                         by making us committers
>>                         sooner.
>>
>>                         I understand that most committers on the Parquet
>>                         project are working on
>>                         the Java implementation, so it can be harder for
>>                         them to review patches
>>                         for parquet-cpp. In this regard, how about the
>>                         following protocol for
>>                         parquet-cpp pull requests: After contributors
>>                         review and revise a pull
>>                         request and agree that it is in good shape, we
>>                         will ask a designated
>>                         committer to review and commit the pull request.
>>                         So far we have been
>>                         asking Nong; if there is a better designated
>>                         committer for parquet-cpp,
>>                         please let us know.
>>
>>                         Thank you,
>>                         Aliaksei.
>>
>>
>>                         On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>
>>                             Hi everyone,
>>
>>                             Sorry about the current backlog on the
>>                             parquet-cpp side. Most of the
>>                             current committer base works on the Java
>>                             implementation so it's either
>>                             slow or not reliable for us to do those
>>                             reviews.
>>
>>                             I think the best way to move forward is to
>>                             review patches for each
>>                             other. That will keep those issues
>>                             progressing, make it easy for
>>                             committers to validate the commit, and --
>>                             most importantly -- to build
>>                             a trail of contributions that we can look at
>>                             to vote in new committers.
>>
>>                             I completely sympathize with the need for
>>                             committers on the CPP
>>                             project, but I don't think this will take a
>>                             long time given the
>>                             current level of activity. We're really just
>>                             trying to build
>>                             confidence that:
>>
>>                             1. You produce quality contributions and
>>                             understand the codebase
>>                             2. You give friendly, thoughtful reviews and
>>                             don't rubber-stamp
>>                             3. You defer judgment and ask others when
>>                             you don't know
>>                             4. You respect others and interact
>>                             professionally
>>
>>                             I don't think any of those are that hard to
>>                             demonstrate, but I'd be
>>                             uncomfortable not validating committers like
>>                             we normally do.
>>                             Especially in this situation, where I could
>>                             easily see the amount of
>>                             work you guys are doing adding up pretty
>>                             quickly!
>>
>>                             Does that sound like a reasonable path
>>                             forward?
>>
>>                             rb
>>
>>
>>                             On 01/25/2016 12:46 PM, Aliaksei Sandryhaila
>>                             wrote:
>>
>>                                 Hi Nong and Julien,
>>
>>                                 As Wes has pointed out, we have a number
>>                                 of patches for parquet-cpp
>>                                 outstanding. Wes, Deepak, and I have
>>                                 been reviewing each other's pull
>>                                 requests. At this point, the patches
>>                                 need to be reviewed and approved by
>>                                 Parquet committers in order to be
>>                                 committed to master.
>>
>>                                 Unfortunately, there is not much
>>                                 activity on this side of the project.
>>                                 The lack of response from current
>>                                 committers is holding us back, and we
>>                                 have to repeatedly rebase our patches,
>>                                 merge multiple pull requests
>>                                 together, and overall step on each
>>                                 other's toes.
>>
>>                                 Is it possible to make Wes, Deepak, and
>>                                 me committers on the project, so
>>                                 we can contribute to parquet-cpp more
>>                                 efficiently?
>>
>>                                 Thanks,
>>                                 Aliaksei.
>>
>>
>>                                 On 01/23/2016 06:07 PM, Wes McKinney
>>                                 wrote:
>>
>>                                     Folks,
>>
>>                                     We're working on a pretty solid
>>                                     patch queue.
>>
>>                                     independent patches
>>                                     PARQUET-449:
>>                                     https://github.com/apache/parquet-cpp/pull/21
>>
>>                                     interdependent patches (order to
>>                                     apply patches)
>>                                     PARQUET-437 (MOSTLY REVIEWED):
>>                                     https://github.com/apache/parquet-cpp/pull/19
>>
>>                                     PARQUET-418:
>>                                     https://github.com/apache/parquet-cpp/pull/18
>>                                     PARQUET-434:
>>                                     https://github.com/apache/parquet-cpp/pull/20
>>                                     PARQUET-433:
>>                                     https://github.com/apache/parquet-cpp/pull/22
>>                                     PARQUET-451 & PARQUET-453:
>>                                     https://github.com/apache/parquet-cpp/pull/23
>>
>>                                     PARQUET-428 (needs to be rebased on
>>                                     top of PARQUET-433):
>>                                     https://github.com/apache/parquet-cpp/pull/24
>>
>>                                     I'm going to take a breather and
>>                                     work on some other things this
>>                                     weekend, but I'll be available for
>>                                     code reviews and fixes to try to
>>                                     move along this patch queue.
>>
>>                                     Thanks,
>>                                     Wes
>>
>>                                     On Fri, Jan 15, 2016 at 8:18 AM, Wes
>>                                     McKinney <wes@cloudera.com
>>                                     <ma...@cloudera.com>> wrote:
>>
>>                                         Great to meet you all!
>>
>>                                         I've recently been collaborating
>>                                         with the Apache Drill team to spin
>>                                         out the ValueVector columnar
>>                                         in-memory data structure into a
>>                                         new standalone project that will
>>                                         be called Arrow [1] [2]. A brief
>>                                         summary of Arrow/ValueVectors is
>>                                         that it permits O(1) random access
>>                                         on nested columnar structures and
>>                                         is efficient for projections and
>>                                         scans in a columnar SQL setting.
>>
>>                                         I'm very interested in making
>>                                         Parquet read/write support
>>                                         available to Python programmers
>>                                         via C/C++ extensions, so I'm going
>>                                         to be working over the next few
>>                                         months on a Parquet->Arrow->Python
>>                                         toolchain, along with some tools
>>                                         to manipulate in-memory columnar
>>                                         tables in the style of Python's
>>                                         pandas library.
>>
>>                                         I will propose patches as needed
>>                                         to parquet-cpp to improve its
>>                                         performance and add functionality
>>                                         for writing Parquet files as well.
>>                                         The details of converting to/from
>>                                         Parquet's repetition/definition
>>                                         level representation of nested
>>                                         data will stay separate in the
>>                                         arrow-parquet adapter code.
>>
>>                                         cheers,
>>                                         Wes
>>
>>                                         [1]:
>>                                         http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>
>>                                         [2]:
>>                                         http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>
>>
>>                                         On Fri, Jan 15, 2016 at 1:22 AM,
>>                                         Mickaël Lacour
>>                                         <m.lacour@criteo.com
>>                                         <ma...@criteo.com>>
>>                                         wrote:
>>
>>                                             Hi,
>>
>>                                             I'm very interested in this
>>                                             subject because I would like
>>                                             to export Parquet data from
>>                                             HDFS to Vertica (using VSQL).
>>                                             I'm planning to work on it
>>                                             next quarter, but I will be
>>                                             very happy to help you on this
>>                                             subject (review, testing).
>>
>>                                             Have a nice day,
>>                                             --
>>                                             Mickaël Lacour
>>                                             Senior Software Engineer
>>                                             Analytics Infrastructure
>>                                             team @Scalability
>>
>>
>>                                             ________________________________________
>>                                             From: Walkauskas, Stephen Gregory (Vertica)
>>                                             <stephen.walkauskas@hpe.com>
>>                                             Sent: Thursday, January 14, 2016 3:23 PM
>>                                             To: Sandryhaila, Aliaksei; dev@parquet.apache.org;
>>                                             Majeti, Deepak; nongli@gmail.com; Wes McKinney
>>                                             Subject: Re: Parquet-cpp
>>
>>                                             Yes, thanks for the
>>                                             introduction Julien.
>>
>>                                             Nong and Wes,
>>
>>                                             It'd be interesting to know
>>                                             your goals for parquet-cpp.
>>
>>                                             The Vertica database already
>>                                             supports optimized reads of
>>                                             ORC files (fast C++ parser,
>>                                             predicate pushdown, column
>>                                             selection, etc.). We'd like
>>                                             to do the same for Parquet.
>>
>>                                             Cheers,
>>                                             Stephen
>>
>>                                             On 01/13/2016 05:53 PM,
>>                                             Sandryhaila, Aliaksei wrote:
>>
>>                                                 Thank you for the
>>                                                 introduction, Julien!
>>
>>                                                 Hello Nong and Wes,
>>
>>                                                 Stephen, Deepak and I
>>                                                 are developing a C++
>>                                                 library to support
>>                                                 Parquet in Vertica RDBMS.
>>                                                 We are using Parquet-cpp
>>                                                 as a starting point and
>>                                                 are expanding its
>>                                                 functionality as well as
>>                                                 improving it and fixing
>>                                                 bugs. We would like to
>>                                                 contribute these
>>                                                 improvements back to the
>>                                                 open-source community. We
>>                                                 plan to do this through
>>                                                 the usual process of
>>                                                 creating JIRAs that
>>                                                 justify and explain a
>>                                                 code change, and then
>>                                                 submitting pull requests.
>>                                                 We look forward to
>>                                                 working with you on
>>                                                 Parquet-cpp and to your
>>                                                 feedback and suggestions.
>>
>>                                                 Best regards,
>>                                                 Aliaksei.
>>
>>
>>                                                 On 01/13/2016 02:54 PM,
>>                                                 Julien Le Dem wrote:
>>
>>                                                     Hello Nong, Wes, Stephen, Deepak and Aliaksei,
>>                                                     I wanted to introduce you to each other as you
>>                                                     are all looking at Parquet-cpp.
>>
>>                                                     I'd recommend opening JIRAs in the parquet-cpp
>>                                                     component to collaborate (I see you already
>>                                                     doing this):
>>                                                     https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>
>>                                                     Nong is a committer and can merge pull requests
>>                                                     (he also understands that code base very well).
>>                                                     Other committers can too; feel free to ping us
>>                                                     if you need help.
>>                                                     Obviously, you don't need to be a committer to
>>                                                     give others reviews (you just need one to
>>                                                     approve and merge).
>>
>>
>>
>>
>>
>>
>>
>>
>>                     --
>>                     Ryan Blue
>>                     Software Engineer
>>                     Cloudera, Inc.
>>
>>
>>
>>
>>                 --
>>                 Julien
>>
>>
>>
>>
>>             --
>>             Julien
>>
>>
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>



-- 
Julien

Re: Parquet-cpp

Posted by Ryan Blue <bl...@cloudera.com>.
I have no problem substituting local testing as long as we test all the 
environments that Travis does. I've done that to get around this problem 
in the past. It takes a while to run each maven test profile, but it works.

rb

On 01/26/2016 09:44 PM, Wes McKinney wrote:
> Also, things have been made much worse by Travis CI continuing to have
> infrastructure problems. The ASF build queue on Travis CI had completely
> stalled by this morning so that no builds were completing; fortunately
> their support is quite responsive and they've resolved the queue
> blockage, so builds are executing again.
>
> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <wes@cloudera.com
> <ma...@cloudera.com>> wrote:
>
>     There are 3 more patches outstanding that are causing blockage (418,
>     433, and 451/453), so I think if we get them merged today or
>     tomorrow then we should be able to proceed with some parallel
>     efforts without quite as much conflict.
>
>     On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <nongli@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         I'm going to try to be more active this week, but I admittedly
>         don't have a lot of time to work on this. I understand we need to
>         get critical mass in committers, code, etc. to keep this going,
>         but I think we're making good progress.
>
>         On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>         <julien@dremio.com <ma...@dremio.com>> wrote:
>
>             Also, as Nong mentioned, PRs should be prefixed by the JIRA
>             ID followed by a ":", as follows: "PARQUET-X: description".
>             That's just to have the reference in the git changelog. The
>             merge script enforces it.
>
>
>             On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>             <julien@dremio.com <ma...@dremio.com>> wrote:
>
>                 I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>                 each other.
>                 I see Nong (who's a committer) has been doing some
>                 reviews already.
>
>                 When you guys reach a consensus on a PR and want it
>                 merged please mention it in the PR (+1, LGTM) and
>                 mention us directly (@julienledem, ...) to have it merged.
>
>                 Right now I see that #19 and #21 have been committed
>                 (thanks Nong), but it is not clear to me in what order
>                 the others should be committed.
>
>                 For example Deepak should comment directly on #22 to
>                 approve it. Right now he mentioned it on another PR.
>                 https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>                 Similarly Wes could confirm on that PR whether it looks
>                 good.
>
>                 Tomorrow is the Parquet sync up if you want to discuss
>                 further:
>                 https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>
>
>                 On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>                 <blue@cloudera.com <ma...@cloudera.com>> wrote:
>
>                     Aliaksei, thanks for being understanding here.
>
>                     I agree with you that it is too difficult. We really
>                     want to get the cpp side bootstrapped as soon as
>                     possible. Let's go with what you suggested, to have
>                     contributors review one another's patches and then
>                     ask a committer for a final review once both
>                     contributors reach a consensus.
>
>                     If there are issues that are easy to review, maybe
>                     some of us other than Nong can take a look.
>
>                     rb
>
>
>                     On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>
>                         Hi Ryan,
>
>                         This sounds very reasonable. I do not argue to
>                         disregard the standard
>                         Apache approach to promoting contributors to
>                         committers. I am just
>                         pointing out that without the input from current
>                         committers it is hard
>                         for us to productively contribute to the
>                         project. As a consequence, it
>                         is hard for us to demonstrate our fit to become
>                         committers in the future.
>                         This leaves us in a deadlock, which can be
>                         resolved either by
>                         increased feedback from existing committers or
>                         by making us committers
>                         sooner.
>
>                         I understand that most committers on the Parquet
>                         project are working on
>                         the Java implementation, so it can be harder for
>                         them to review patches
>                         for parquet-cpp. In this regard, how about the
>                         following protocol for
>                         parquet-cpp pull requests: After contributors
>                         review and revise a pull
>                         request and agree that it is in good shape, we
>                         will ask a designated
>                         committer to review and commit the pull request.
>                         So far we have been
>                         asking Nong; if there is a better designated
>                         committer for parquet-cpp,
>                         please let us know.
>
>                         Thank you,
>                         Aliaksei.
>
>
>                         On 01/25/2016 04:54 PM, Ryan Blue wrote:
>
>                             Hi everyone,
>
>                             Sorry about the current backlog on the
>                             parquet-cpp side. Most of the
>                             current committer base works on the Java
>                             implementation so it's either
>                             slow or not reliable for us to do those reviews.
>
>                             I think the best way to move forward is to
>                             review patches for each
>                             other. That will keep those issues
>                             progressing, make it easy for
>                             committers to validate the commit, and --
>                             most importantly -- to build
>                             a trail of contributions that we can look at
>                             to vote in new committers.
>
>                             I completely sympathize with the need for
>                             committers on the CPP
>                             project, but I don't think this will take a
>                             long time given the
>                             current level of activity. We're really just
>                             trying to build
>                             confidence that:
>
>                             1. You produce quality contributions and
>                             understand the codebase
>                             2. You give friendly, thoughtful reviews and
>                             don't rubber-stamp
>                             3. You defer judgment and ask others when
>                             you don't know
>                             4. You respect others and interact
>                             professionally
>
>                             I don't think any of those are that hard to
>                             demonstrate, but I'd be
>                             uncomfortable not validating committers like
>                             we normally do.
>                             Especially in this situation, where I could
>                             easily see the amount of
>                             work you guys are doing adding up pretty
>                             quickly!
>
>                             Does that sound like a reasonable path forward?
>
>                             rb
>
>
>                             On 01/25/2016 12:46 PM, Aliaksei Sandryhaila
>                             wrote:
>
>                                 Hi Nong and Julien,
>
>                                 As Wes has pointed out, we have a number
>                                 of patches for parquet-cpp
>                                 outstanding. Wes, Deepak, and I have
>                                 been reviewing each other's pull
>                                 requests. At this point, the patches
>                                 need to be reviewed and approved by
>                                 Parquet committers in order to be
>                                 committed to master.
>
>                                 Unfortunately, there is not much
>                                 activity on this side of the project.
>                                 The lack of response from current
>                                 committers is holding us back, and we
>                                 have to repeatedly rebase our patches,
>                                 merge multiple pull requests
>                                 together, and overall step on each
>                                 other's toes.
>
>                                 Is it possible to make Wes, Deepak, and
>                                 me committers on the project, so
>                                 we can contribute to parquet-cpp more
>                                 efficiently?
>
>                                 Thanks,
>                                 Aliaksei.
>
>
>                                 On 01/23/2016 06:07 PM, Wes McKinney wrote:
>
>                                     Folks,
>
>                                     We're working on a pretty solid
>                                     patch queue.
>
>                                     independent patches
>                                     PARQUET-449:
>                                     https://github.com/apache/parquet-cpp/pull/21
>
>                                     interdependent patches (order to
>                                     apply patches)
>                                     PARQUET-437 (MOSTLY REVIEWED):
>                                     https://github.com/apache/parquet-cpp/pull/19
>
>                                     PARQUET-418:
>                                     https://github.com/apache/parquet-cpp/pull/18
>                                     PARQUET-434:
>                                     https://github.com/apache/parquet-cpp/pull/20
>                                     PARQUET-433:
>                                     https://github.com/apache/parquet-cpp/pull/22
>                                     PARQUET-451 & PARQUET-453:
>                                     https://github.com/apache/parquet-cpp/pull/23
>
>                                     PARQUET-428 (needs to be rebased on
>                                     top of PARQUET-433):
>                                     https://github.com/apache/parquet-cpp/pull/24
>
>                                     I'm going to take a breather and
>                                     work on some other things this
>                                     weekend, but I'll be available for
>                                     code reviews and fixes to try to
>                                     move along this patch queue.
>
>                                     Thanks,
>                                     Wes
>
>                                     On Fri, Jan 15, 2016 at 8:18 AM, Wes
>                                     McKinney <wes@cloudera.com
>                                     <ma...@cloudera.com>> wrote:
>
>                                         Great to meet you all!
>
>                                         I've recently been collaborating
>                                         with the Apache Drill team to spin
>                                         out the ValueVector columnar
>                                         in-memory data structure into a
>                                         new standalone project that will
>                                         be called Arrow [1] [2]. A brief
>                                         summary of Arrow/ValueVectors is
>                                         that it permits O(1) random access
>                                         on nested columnar structures and
>                                         is efficient for projections and
>                                         scans in a columnar SQL setting.
>
>                                         I'm very interested in making
>                                         Parquet read/write support
>                                         available to
>                                         Python programmers via C/C++
>                                         extensions, so I'm going to be
>                                         working
>                                         the
>                                         next few months on a
>                                         Parquet->Arrow->Python
>                                         toolchain, along with some
>                                         tools to manipulate tables
>                                         in-memory columnar data in the
>                                         style of
>                                         Python's
>                                         pandas library.
>
>                                         I will propose patches as needed
>                                         to parquet-cpp to improve its
>                                         performance
>                                         and add functionality for
>                                         writing Parquet files as well. The
>                                         details of
>                                         converting to/from Parquet's
>                                         repetition/definition level
>                                         representation of
>                                         nested data will stay separate
>                                         in the arrow-parquet adapter code.
>
>                                         cheers,
>                                         Wes
>
>                                         [1]:
>                                         http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>
>
>                                         [2]:
>                                         http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>
>
>                                         On Fri, Jan 15, 2016 at 1:22 AM,
>                                         Mickaël Lacour
>                                         <m.lacour@criteo.com>
>                                         wrote:
>
>                                             Hi,
>
>                                             I'm very interested in this
>                                             subject because I would like
>                                             to export
>                                             parquet data from HDFS to
>                                             Vertica (using VSQL).
>                                             I'm planning to work on it
>                                             next quarter, but I will be
>                                             very happy to
>                                             help
>                                             you on this subject (review,
>                                             testing).
>
>                                             Have a nice day,
>                                             --
>                                             Mickaël Lacour
>                                             Senior Software Engineer
>                                             Analytics Infrastructure
>                                             team @Scalability
>
>                                             ________________________________________
>                                             From: Walkauskas, Stephen
>                                             Gregory (Vertica)
>                                             <stephen.walkauskas@hpe.com>
>                                             Sent: Thursday, January 14,
>                                             2016 3:23 PM
>                                             To: Sandryhaila, Aliaksei;
>                                             dev@parquet.apache.org;
>                                             Majeti, Deepak;
>                                             nongli@gmail.com;
>                                             Wes McKinney
>                                             Subject: Re: Parquet-cpp
>
>                                             Yes, thanks for the
>                                             introduction Julien.
>
>                                             Nong and Wes,
>
>                                             It'd be interesting to know
>                                             your goals for parquet-cpp.
>
>                                             The Vertica database already
>                                             supports optimized reads of
>                                             ORC files
>                                             (fast
>                                             c++ parser, predicate
>                                             pushdown, columns selection
>                                             etc). We'd like
>                                             to do
>                                             the same for parquet.
>
>                                             Cheers,
>                                             Stephen
>
>                                             On 01/13/2016 05:53 PM,
>                                             Sandryhaila, Aliaksei wrote:
>
>                                                 Thank you for the
>                                                 introduction, Julien!
>
>                                                 Hello Nong and Wes,
>
>                                                 Stephen, Deepak and I
>                                                 are developing a C++
>                                                 library to support
>                                                 Parquet in
>                                                 Vertica RDBMS. We are
>                                                 using Parquet-cpp as a
>                                                 starting point and are
>                                                 expanding its
>                                                 functionality as well as
>                                                 improving it and fixing
>                                                 bugs. We
>                                                 would like to contribute
>                                                 these improvements back
>                                                 to the open-source
>                                                 community. We plan to do
>                                                 this through the usual
>                                                 process of creating
>                                                 jiras that justify and
>                                                 explain a code change,
>                                                 and then submitting
>                                                 pull
>                                                 requests. We look
>                                                 forward to working with
>                                                 you on Parquet-cpp and to
>                                                 your
>                                                 feedback and suggestions.
>
>                                                 Best regards,
>                                                 Aliaksei.
>
>
>                                                 On 01/13/2016 02:54 PM,
>                                                 Julien Le Dem wrote:
>
>                                                     Hello Nong, Wes,
>                                                     Stephen, Deepak and
>                                                     Aliaksei
>                                                     I wanted to
>                                                     introduce you to
>                                                     each other as you
>                                                     are all looking at
>                                                     Parquet-cpp.
>
>                                                     I'd recommend
>                                                     opening JIRAs in the
>                                                     parquet-cpp component to
>                                                     collaborate (I
>                                                     see you already
>                                                     doing this):
>
>                                                     https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>
>                                                     Nong is a committer
>                                                     and can merge pull
>                                                     requests (he also
>                                                     understands that
>                                                     code base very well).
>                                                     Other committers can
>                                                     too; feel free to
>                                                     ping us if you need help.
>                                                     Obviously, you don't
>                                                     need to be a
>                                                     committer to give
>                                                     others reviews
>                                                     (you
>                                                     just need one to
>                                                     approve and merge).
>
>
>
>
>
>
>
>
>                     --
>                     Ryan Blue
>                     Software Engineer
>                     Cloudera, Inc.
>
>
>
>
>                 --
>                 Julien
>
>
>
>
>             --
>             Julien
>
>
>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
Also, things have been made much worse by Travis CI continuing to have
infrastructure problems. The ASF build queue on Travis CI had completely
stalled by this morning so that no builds were completing; fortunately
their support is quite responsive and they've resolved the queue blockage,
so builds are executing again.

On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <we...@cloudera.com> wrote:

> There are 3 more patches outstanding that are causing blockage (418, 433,
> and 451/453), so I think if we get them merged today or tomorrow then we
> should be able to proceed with some parallel efforts without quite as much
> conflict.
>
> On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <no...@gmail.com> wrote:
>
>> I'm going to try to be more active this week but I admittedly don't have a
>> lot of
>> time to work on this. I understand we need to get critical mass in
>> committers,
>> code, etc to keep this going but I think we're making good progress.
>>
>> On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem <ju...@dremio.com> wrote:
>>
>>> Also as Nong mentioned, PRs should be prefixed by the jira id followed
>>> by a ":" as follows "PARQUET-X: description" that's just to have the
>>> reference in the git changelog. The merge script enforces it.
>>>
>>>
>>> On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem <ju...@dremio.com>
>>> wrote:
>>>
>>>> I'm happy too with Aliaksei, Deepak, Wes, etc reviewing each other.
>>>> I see Nong (who's a committer) has been doing some reviews already.
>>>>
>>>> When you guys reach a consensus on a PR and want it merged please
>>>> mention it in the PR (+1, LGTM) and mention us directly (@julienledem, ...)
>>>> to have it merged.
>>>>
>>>> right now I see that #19 and #21 have been committed (thanks Nong) but
>>>> it is not clear to me in what order the others should be committed.
>>>>
>>>> For example Deepak should comment directly on #22 to approve it. Right
>>>> now he mentioned it on another PR.
>>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>> Similarly Wes could confirm on that PR whether it looks good.
>>>>
>>>> Tomorrow is the Parquet sync up if you want to discuss further:
>>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>
>>>>> Aliaksei, thanks for being understanding here.
>>>>>
>>>>> I agree with you that it is too difficult. We really want to get the
>>>>> cpp side bootstrapped as soon as possible. Let's go with what you suggested,
>>>>> to have contributors review one another's patches and then ask a committer
>>>>> for a final review once both contributors reach a consensus.
>>>>>
>>>>> If there are issues that are easy to review, maybe some of us other
>>>>> than Nong can take a look.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> This sounds very reasonable. I do not argue to disregard the standard
>>>>>> Apache approach to promoting contributors to committers. I am just
>>>>>> pointing out that without the input from current committers it is hard
>>>>>> for us to productively contribute to the project. As a consequence, it
>>>>>> is hard for us to demonstrate our fit to become committers in the future.
>>>>>> This leaves us in a deadlock, which can be resolved either by an
>>>>>> increased feedback from existing committers or by making us committers
>>>>>> sooner.
>>>>>>
>>>>>> I understand that most committers on the Parquet project are working
>>>>>> on
>>>>>> the Java implementation, so it can be harder for them to review
>>>>>> patches
>>>>>> for parquet-cpp. In this regard, how about the following protocol for
>>>>>> parquet-cpp pull requests: After contributors review and revise a pull
>>>>>> request and agree that it is in a good shape, we will ask a designated
>>>>>> committer to review and commit the pull request. So far we have been
>>>>>> asking Nong; if there is a better designated committer for
>>>>>> parquet-cpp,
>>>>>> please let us know.
>>>>>>
>>>>>> Thank you,
>>>>>> Aliaksei.
>>>>>>
>>>>>>
>>>>>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Sorry about the current backlog on the parquet-cpp side. Most of the
>>>>>>> current committer base works on the Java implementation so it's
>>>>>>> either
>>>>>>> slow or not reliable for us to do those reviews.
>>>>>>>
>>>>>>> I think the best way to move forward is to review patches for each
>>>>>>> other. That will keep those issues progressing, make it easy for
>>>>>>> committers to validate the commit, and -- most importantly -- to
>>>>>>> build
>>>>>>> a trail of contributions that we can look at to vote in new
>>>>>>> committers.
>>>>>>>
>>>>>>> I completely sympathize with the need for committers on the CPP
>>>>>>> project, but I don't think this will take a long time given the
>>>>>>> current level of activity. We're really just trying to build
>>>>>>> confidence that:
>>>>>>>
>>>>>>> 1. You produce quality contributions and understand the codebase
>>>>>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>>>>>> 3. You defer judgment and ask others when you don't know
>>>>>>> 4. You respect others and interact professionally
>>>>>>>
>>>>>>> I don't think any of those are that hard to demonstrate, but I'd be
>>>>>>> uncomfortable not validating committers like we normally do.
>>>>>>> Especially in this situation, where I could easily see the amount of
>>>>>>> work you guys are doing adding up pretty quickly!
>>>>>>>
>>>>>>> Does that sound like a reasonable path forward?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>
>>>>>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>>>>>
>>>>>>>> Hi Nong and Julien,
>>>>>>>>
>>>>>>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>>>>>>> outstanding. Wes, Deepak, and I have been reviewing each other's
>>>>>>>> pull
>>>>>>>> requests. At this point, the patches need to be reviewed and
>>>>>>>> approved by
>>>>>>>> Parquet committers in order to be committed to master.
>>>>>>>>
>>>>>>>> Unfortunately, there is not much activity on this side of the
>>>>>>>> project.
>>>>>>>> The lack of response from current committers is holding us back,
>>>>>>>> and we
>>>>>>>> have to repeatedly rebase our patches, merge multiple pull requests
>>>>>>>> together, and overall step on each other's toes.
>>>>>>>>
>>>>>>>> Is it possible to make Wes, Deepak, and me committers on the
>>>>>>>> project, so
>>>>>>>> we can contribute to parquet-cpp more efficiently?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aliaksei.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>>
>>>>>>>>> We're working on a pretty solid patch queue.
>>>>>>>>>
>>>>>>>>> independent patches
>>>>>>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>>>>>>
>>>>>>>>> interdependent patches (order to apply patches)
>>>>>>>>> PARQUET-437 (MOSTLY REVIEWED):
>>>>>>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>>>>>>
>>>>>>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>>>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>>>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>>>>>> PARQUET-451 & PARQUET-453:
>>>>>>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>>>>>>
>>>>>>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433):
>>>>>>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>>>>>>
>>>>>>>>> I'm going to take a breather and work on some other things this
>>>>>>>>> weekend,
>>>>>>>>> but I'll be available for code reviews and fixes to try to move
>>>>>>>>> along
>>>>>>>>> this
>>>>>>>>> patch queue.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wes
>>>>>>>>>
>>>>>>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Great to meet you all!
>>>>>>>>>>
>>>>>>>>>> I've recently been collaborating with the Apache Drill team to
>>>>>>>>>> spin
>>>>>>>>>> out
>>>>>>>>>> the ValueVector columnar in-memory data structure into a new
>>>>>>>>>> standalone
>>>>>>>>>> project that will be called Arrow [1] [2]. A brief summary of
>>>>>>>>>> Arrow/ValueVectors is that it permits O(1) random access on nested
>>>>>>>>>> columnar
>>>>>>>>>> structures and is efficient for projections and scans in a
>>>>>>>>>> columnar
>>>>>>>>>> SQL
>>>>>>>>>> setting.
>>>>>>>>>>
>>>>>>>>>> I'm very interested in making Parquet read/write support
>>>>>>>>>> available to
>>>>>>>>>> Python programmers via C/C++ extensions, so I'm going to be
>>>>>>>>>> working
>>>>>>>>>> the
>>>>>>>>>> next few months on a Parquet->Arrow->Python toolchain, along with
>>>>>>>>>> some
>>>>>>>>>> tools to manipulate tables in-memory columnar data in the style of
>>>>>>>>>> Python's
>>>>>>>>>> pandas library.
>>>>>>>>>>
>>>>>>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>>>>>>> performance
>>>>>>>>>> and add functionality for writing Parquet files as well. The
>>>>>>>>>> details of
>>>>>>>>>> converting to/from Parquet's repetition/definition level
>>>>>>>>>> representation of
>>>>>>>>>> nested data will stay separate in the arrow-parquet adapter code.
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> Wes
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>>
>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [2]:
>>>>>>>>>>
>>>>>>>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <
>>>>>>>>>> m.lacour@criteo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm very interested in this subject because I would like to
>>>>>>>>>>> export
>>>>>>>>>>> parquet data from HDFS to Vertica (using VSQL).
>>>>>>>>>>> I'm planning to work on it next quarter, but I will be very
>>>>>>>>>>> happy to
>>>>>>>>>>> help
>>>>>>>>>>> you on this subject (review, testing).
>>>>>>>>>>>
>>>>>>>>>>> Have a nice day,
>>>>>>>>>>> --
>>>>>>>>>>> Mickaël Lacour
>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>> Analytics Infrastructure team @Scalability
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: Walkauskas, Stephen Gregory (Vertica)
>>>>>>>>>>> <st...@hpe.com>
>>>>>>>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>>>>>>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti,
>>>>>>>>>>> Deepak;
>>>>>>>>>>> nongli@gmail.com; Wes McKinney
>>>>>>>>>>> Subject: Re: Parquet-cpp
>>>>>>>>>>>
>>>>>>>>>>> Yes, thanks for the introduction Julien.
>>>>>>>>>>>
>>>>>>>>>>> Nong and Wes,
>>>>>>>>>>>
>>>>>>>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>>>>>>>
>>>>>>>>>>> The Vertica database already supports optimized reads of ORC
>>>>>>>>>>> files
>>>>>>>>>>> (fast
>>>>>>>>>>> c++ parser, predicate pushdown, columns selection etc). We'd like
>>>>>>>>>>> to do
>>>>>>>>>>> the same for parquet.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Stephen
>>>>>>>>>>>
>>>>>>>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you for the introduction, Julien!
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Nong and Wes,
>>>>>>>>>>>>
>>>>>>>>>>>> Stephen, Deepak and I are developing a C++ library to support
>>>>>>>>>>>> Parquet in
>>>>>>>>>>>> Vertica RDBMS. We are using Parquet-cpp as a starting point and
>>>>>>>>>>>> are
>>>>>>>>>>>> expanding its functionality as well as improving it and fixing
>>>>>>>>>>>> bugs. We
>>>>>>>>>>>> would like to contribute these improvements back to the
>>>>>>>>>>>> open-source
>>>>>>>>>>>> community. We plan to do this through the usual process of
>>>>>>>>>>>> creating
>>>>>>>>>>>> jiras that justify and explain a code change, and then
>>>>>>>>>>>> submitting
>>>>>>>>>>>> pull
>>>>>>>>>>>> requests. We look forward to working with you on Parquet-cpp
>>>>>>>>>>>> and to
>>>>>>>>>>>> your
>>>>>>>>>>>> feedback and suggestions.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Aliaksei.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>>>>>>>>>> I wanted to introduce you to each other as you are all looking
>>>>>>>>>>>>> at
>>>>>>>>>>>>> Parquet-cpp.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>>>>>>>>>> collaborate (I see you already doing this):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nong is a committer and can merge pull requests (he also
>>>>>>>>>>>>> understands that code base very well).
>>>>>>>>>>>>> Other committers can too; feel free to ping us if you need help.
>>>>>>>>>>>>> Obviously, you don't need to be a committer to give others
>>>>>>>>>>>>> reviews
>>>>>>>>>>>>> (you
>>>>>>>>>>>>> just need one to approve and merge).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Julien
>>>>
>>>
>>>
>>>
>>> --
>>> Julien
>>>
>>
>>
>
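
The PR title convention quoted above ("PARQUET-X: description") can be
checked locally before asking a committer to merge. The following is only a
minimal sketch: the regular expression and the function name has_jira_prefix
are assumptions made for illustration, and the actual check performed by the
merge script is not shown in this thread.

```python
import re

# Hypothetical pattern for "PARQUET-<number>: <description>".
# This is an assumption, not the rule used by the real merge script.
TITLE_RE = re.compile(r"^PARQUET-\d+: \S.*$")


def has_jira_prefix(title: str) -> bool:
    """Return True if the title starts with a JIRA id followed by ': ' and a description."""
    return TITLE_RE.match(title) is not None


if __name__ == "__main__":
    for title in ("PARQUET-433: Add schema descriptor", "Add schema descriptor"):
        print(title, "->", has_jira_prefix(title))
```

Titles that omit the id or the colon are rejected, which keeps the JIRA
reference visible in the git changelog as described above.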

Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
There are 3 more patches outstanding that are causing blockage (418, 433, and
451/453), so I think if we get them merged today or tomorrow then we should
be able to proceed with some parallel efforts without quite as much
conflict.

On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <no...@gmail.com> wrote:

> I'm going to try to be more active this week but I admittedly don't have a
> lot of
> time to work on this. I understand we need to get critical mass in
> committers,
> code, etc to keep this going but I think we're making good progress.
>
> On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem <ju...@dremio.com> wrote:
>
>> Also as Nong mentioned, PRs should be prefixed by the jira id followed by
>> a ":" as follows "PARQUET-X: description" that's just to have the reference
>> in the git changelog. The merge script enforces it.
>>
>>
>> On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem <ju...@dremio.com> wrote:
>>
>>> I'm happy too with Aliaksei, Deepak, Wes, etc reviewing each other.
>>> I see Nong (who's a committer) has been doing some reviews already.
>>>
>>> When you guys reach a consensus on a PR and want it merged please
>>> mention it in the PR (+1, LGTM) and mention us directly (@julienledem, ...)
>>> to have it merged.
>>>
>>> right now I see that #19 and #21 have been committed (thanks Nong) but
>>> it is not clear to me in what order the others should be committed.
>>>
>>> For example Deepak should comment directly on #22 to approve it. Right
>>> now he mentioned it on another PR.
>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>> Similarly Wes could confirm on that PR whether it looks good.
>>>
>>> Tomorrow is the Parquet sync up if you want to discuss further:
>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>
>>>
>>> On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>
>>>> Aliaksei, thanks for being understanding here.
>>>>
>>>> I agree with you that it is too difficult. We really want to get the
>>>> cpp side bootstrapped as soon as possible. Let's go with what you suggested,
>>>> to have contributors review one another's patches and then ask a committer
>>>> for a final review once both contributors reach a consensus.
>>>>
>>>> If there are issues that are easy to review, maybe some of us other
>>>> than Nong can take a look.
>>>>
>>>> rb
>>>>
>>>>
>>>> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> This sounds very reasonable. I do not argue to disregard the standard
>>>>> Apache approach to promoting contributors to committers. I am just
>>>>> pointing out that without the input from current committers it is hard
>>>>> for us to productively contribute to the project. As a consequence, it
>>>>> is hard for us to demonstrate our fit to become committers in the future.
>>>>> This leaves us in a deadlock, which can be resolved either by an
>>>>> increased feedback from existing committers or by making us committers
>>>>> sooner.
>>>>>
>>>>> I understand that most committers on the Parquet project are working on
>>>>> the Java implementation, so it can be harder for them to review patches
>>>>> for parquet-cpp. In this regard, how about the following protocol for
>>>>> parquet-cpp pull requests: After contributors review and revise a pull
>>>>> request and agree that it is in a good shape, we will ask a designated
>>>>> committer to review and commit the pull request. So far we have been
>>>>> asking Nong; if there is a better designated committer for parquet-cpp,
>>>>> please let us know.
>>>>>
>>>>> Thank you,
>>>>> Aliaksei.
>>>>>
>>>>>
>>>>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Sorry about the current backlog on the parquet-cpp side. Most of the
>>>>>> current committer base works on the Java implementation so it's either
>>>>>> slow or not reliable for us to do those reviews.
>>>>>>
>>>>>> I think the best way to move forward is to review patches for each
>>>>>> other. That will keep those issues progressing, make it easy for
>>>>>> committers to validate the commit, and -- most importantly -- to build
>>>>>> a trail of contributions that we can look at to vote in new
>>>>>> committers.
>>>>>>
>>>>>> I completely sympathize with the need for committers on the CPP
>>>>>> project, but I don't think this will take a long time given the
>>>>>> current level of activity. We're really just trying to build
>>>>>> confidence that:
>>>>>>
>>>>>> 1. You produce quality contributions and understand the codebase
>>>>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>>>>> 3. You defer judgment and ask others when you don't know
>>>>>> 4. You respect others and interact professionally
>>>>>>
>>>>>> I don't think any of those are that hard to demonstrate, but I'd be
>>>>>> uncomfortable not validating committers like we normally do.
>>>>>> Especially in this situation, where I could easily see the amount of
>>>>>> work you guys are doing adding up pretty quickly!
>>>>>>
>>>>>> Does that sound like a reasonable path forward?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>>>>
>>>>>>> Hi Nong and Julien,
>>>>>>>
>>>>>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>>>>>> outstanding. Wes, Deepak, and I have been reviewing each other's pull
>>>>>>> requests. At this point, the patches need to be reviewed and
>>>>>>> approved by
>>>>>>> Parquet committers in order to be committed to master.
>>>>>>>
>>>>>>> Unfortunately, there is not much activity on this side of the
>>>>>>> project.
>>>>>>> The lack of response from current committers is holding us back, and
>>>>>>> we
>>>>>>> have to repeatedly rebase our patches, merge multiple pull requests
>>>>>>> together, and overall step on each other's toes.
>>>>>>>
>>>>>>> Is it possible to make Wes, Deepak, and me committers on the
>>>>>>> project, so
>>>>>>> we can contribute to parquet-cpp more efficiently?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aliaksei.
>>>>>>>
>>>>>>>
>>>>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>>
>>>>>>>> We're working on a pretty solid patch queue.
>>>>>>>>
>>>>>>>> independent patches
>>>>>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>>>>>
>>>>>>>> interdependent patches (order to apply patches)
>>>>>>>> PARQUET-437 (MOSTLY REVIEWED):
>>>>>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>>>>>
>>>>>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>>>>> PARQUET-451 & PARQUET-453:
>>>>>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>>>>>
>>>>>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433):
>>>>>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>>>>>
>>>>>>>> I'm going to take a breather and work on some other things this
>>>>>>>> weekend,
>>>>>>>> but I'll be available for code reviews and fixes to try to move
>>>>>>>> along
>>>>>>>> this
>>>>>>>> patch queue.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wes
>>>>>>>>
>>>>>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Great to meet you all!
>>>>>>>>>
>>>>>>>>> I've recently been collaborating with the Apache Drill team to spin
>>>>>>>>> out
>>>>>>>>> the ValueVector columnar in-memory data structure into a new
>>>>>>>>> standalone
>>>>>>>>> project that will be called Arrow [1] [2]. A brief summary of
>>>>>>>>> Arrow/ValueVectors is that it permits O(1) random access on nested
>>>>>>>>> columnar
>>>>>>>>> structures and is efficient for projections and scans in a columnar
>>>>>>>>> SQL
>>>>>>>>> setting.
>>>>>>>>>
>>>>>>>>> I'm very interested in making Parquet read/write support available
>>>>>>>>> to
>>>>>>>>> Python programmers via C/C++ extensions, so I'm going to be working
>>>>>>>>> the
>>>>>>>>> next few months on a Parquet->Arrow->Python toolchain, along with
>>>>>>>>> some
>>>>>>>>> tools to manipulate tables in-memory columnar data in the style of
>>>>>>>>> Python's
>>>>>>>>> pandas library.
>>>>>>>>>
>>>>>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>>>>>> performance
>>>>>>>>> and add functionality for writing Parquet files as well. The
>>>>>>>>> details of
>>>>>>>>> converting to/from Parquet's repetition/definition level
>>>>>>>>> representation of
>>>>>>>>> nested data will stay separate in the arrow-parquet adapter code.
>>>>>>>>>
>>>>>>>>> cheers,
>>>>>>>>> Wes
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>>
>>>>>>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [2]:
>>>>>>>>>
>>>>>>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <
>>>>>>>>> m.lacour@criteo.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm very interested in this subject because I would like to export
>>>>>>>>>> parquet data from HDFS to Vertica (using VSQL).
>>>>>>>>>> I'm planning to work on it next quarter, but I will be very happy
>>>>>>>>>> to
>>>>>>>>>> help
>>>>>>>>>> you on this subject (review, testing).
>>>>>>>>>>
>>>>>>>>>> Have a nice day,
>>>>>>>>>> --
>>>>>>>>>> Mickaël Lacour
>>>>>>>>>> Senior Software Engineer
>>>>>>>>>> Analytics Infrastructure team @Scalability
>>>>>>>>>>
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Walkauskas, Stephen Gregory (Vertica)
>>>>>>>>>> <st...@hpe.com>
>>>>>>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>>>>>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti,
>>>>>>>>>> Deepak;
>>>>>>>>>> nongli@gmail.com; Wes McKinney
>>>>>>>>>> Subject: Re: Parquet-cpp
>>>>>>>>>>
>>>>>>>>>> Yes, thanks for the introduction Julien.
>>>>>>>>>>
>>>>>>>>>> Nong and Wes,
>>>>>>>>>>
>>>>>>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>>>>>>
>>>>>>>>>> The Vertica database already supports optimized reads of ORC files
>>>>>>>>>> (fast
>>>>>>>>>> c++ parser, predicate pushdown, columns selection etc). We'd like
>>>>>>>>>> to do
>>>>>>>>>> the same for parquet.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Stephen
>>>>>>>>>>
>>>>>>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you for the introduction, Julien!
>>>>>>>>>>>
>>>>>>>>>>> Hello Nong and Wes,
>>>>>>>>>>>
>>>>>>>>>>> Stephen, Deepak and I are developing a C++ library to support
>>>>>>>>>>> Parquet in
>>>>>>>>>>> Vertica RDBMS. We are using Parquet-cpp as a starting point and
>>>>>>>>>>> are
>>>>>>>>>>> expanding its functionality as well as improving it and fixing
>>>>>>>>>>> bugs. We
>>>>>>>>>>> would like to contribute these improvements back to the
>>>>>>>>>>> open-source
>>>>>>>>>>> community. We plan to do this through the usual process of
>>>>>>>>>>> creating
>>>>>>>>>>> jiras that justify and explain a code change, and then submitting
>>>>>>>>>>> pull
>>>>>>>>>>> requests. We look forward to working with you on Parquet-cpp and
>>>>>>>>>>> to
>>>>>>>>>>> your
>>>>>>>>>>> feedback and suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Aliaksei.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>>>>>>>>> I wanted to introduce you to each other as you are all looking
>>>>>>>>>>>> at
>>>>>>>>>>>> Parquet-cpp.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>>>>>>>>>
>>>>>>>>>>> collaborate (I
>>>>>>>>>>
>>>>>>>>>>> see you already doing this):
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nong is a committer and can merged pull requests (he also
>>>>>>>>>>>> understands
>>>>>>>>>>>>
>>>>>>>>>>> that
>>>>>>>>>>
>>>>>>>>>>> code base very well).
>>>>>>>>>>>> Other committer can too, feel free to ping us if you need help
>>>>>>>>>>>> Obviously, you don't need to be a committer to give others
>>>>>>>>>>>> reviews
>>>>>>>>>>>> (you
>>>>>>>>>>>> just need one to approve and merge).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>>>
>>>
>>>
>>> --
>>> Julien
>>>
>>
>>
>>
>> --
>> Julien
>>
>
>

Re: Parquet-cpp

Posted by Nong Li <no...@gmail.com>.
I'm going to try to be more active this week, but I admittedly don't have a
lot of time to work on this. I understand we need to get critical mass in
committers, code, etc. to keep this going, but I think we're making good
progress.


Re: Parquet-cpp

Posted by Julien Le Dem <ju...@dremio.com>.
Also, as Nong mentioned, PRs should be prefixed by the JIRA ID followed by a
":", as follows: "PARQUET-X: description". That's just to have the reference
in the git changelog. The merge script enforces it.
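
For anyone who wants to check a title locally before opening a PR, here is a
minimal sketch of the convention above. It is not the actual merge script;
the pattern, the function name parse_pr_title, and the sample description are
assumptions made for illustration, and it assumes the check applies to the PR
title.

```python
import re

# Hypothetical pattern for the "PARQUET-X: description" convention:
# the JIRA ID, a colon, one space, then a non-empty description.
PR_TITLE_RE = re.compile(r"^PARQUET-(\d+): (\S.*)$")


def parse_pr_title(title):
    """Return (jira_id, description), or None if the prefix is missing."""
    match = PR_TITLE_RE.match(title)
    if match is None:
        return None
    return "PARQUET-" + match.group(1), match.group(2)


if __name__ == "__main__":
    # The description text here is made up for the example.
    print(parse_pr_title("PARQUET-433: example description"))
    print(parse_pr_title("example description"))
```

A title without the prefix yields None, so a wrapper script could refuse to
proceed in that case.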


On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem <ju...@dremio.com> wrote:

> I'm happy too with Aliaksei, Deepak, Wes, etc reviewing each other.
> I see Nong (who's a committer) has been doing some reviews already.
>
> When you guys reach a consensus on a PR and want it merged please mention
> it in the PR (+1, LGTM) and mention us directly (@julienledem, ...) to have
> it merged.
>
> Right now I see that #19 and #21 have been committed (thanks Nong), but it
> is not clear to me in what order the others should be committed.
>
> For example, Deepak should comment directly on #22 to approve it. Right now
> he has only mentioned it on another PR:
> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
> Similarly Wes could confirm on that PR whether it looks good.
>
> Tomorrow is the Parquet sync up if you want to discuss further:
> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>
>
> On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> Aliaksei, thanks for being understanding here.
>>
>> I agree with you that it is too difficult. We really want to get the cpp
>> side bootstrapped as soon as possible. Let's go with what you suggested, to
>> have contributors review one another's patches and then ask a committer for
>> a final review once both contributors reach a consensus.
>>
>> If there are issues that are easy to review, maybe some of us other than
>> Nong can take a look.
>>
>> rb
>>
>>
>> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>
>>> Hi Ryan,
>>>
>>> This sounds very reasonable. I do not argue to disregard the standard
>>> Apache approach to promoting contributors to committers. I am just
>>> pointing out that without the input from current committers it is hard
>>> for us to productively contribute to the project. As a consequence, it
>>> is hard for us to demonstrate our fit to become committers in the future.
>>> This leaves us in a deadlock, which can be resolved either by
>>> increased feedback from existing committers or by making us committers
>>> sooner.
>>>
>>> I understand that most committers on the Parquet project are working on
>>> the Java implementation, so it can be harder for them to review patches
>>> for parquet-cpp. In this regard, how about the following protocol for
>>> parquet-cpp pull requests: After contributors review and revise a pull
>>> request and agree that it is in good shape, we will ask a designated
>>> committer to review and commit the pull request. So far we have been
>>> asking Nong; if there is a better designated committer for parquet-cpp,
>>> please let us know.
>>>
>>> Thank you,
>>> Aliaksei.
>>>
>>>
>>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Sorry about the current backlog on the parquet-cpp side. Most of the
>>>> current committer base works on the Java implementation so it's either
>>>> slow or not reliable for us to do those reviews.
>>>>
>>>> I think the best way to move forward is to review patches for each
>>>> other. That will keep those issues progressing, make it easy for
>>>> committers to validate the commit, and -- most importantly -- to build
>>>> a trail of contributions that we can look at to vote in new committers.
>>>>
>>>> I completely sympathize with the need for committers on the CPP
>>>> project, but I don't think this will take a long time given the
>>>> current level of activity. We're really just trying to build
>>>> confidence that:
>>>>
>>>> 1. You produce quality contributions and understand the codebase
>>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>>> 3. You defer judgment and ask others when you don't know
>>>> 4. You respect others and interact professionally
>>>>
>>>> I don't think any of those are that hard to demonstrate, but I'd be
>>>> uncomfortable not validating committers like we normally do.
>>>> Especially in this situation, where I could easily see the amount of
>>>> work you guys are doing adding up pretty quickly!
>>>>
>>>> Does that sound like a reasonable path forward?
>>>>
>>>> rb
>>>>
>>>>
>>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>>
>>>>> Hi Nong and Julien,
>>>>>
>>>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>>>> outstanding. Wes, Deepak, and I have been reviewing each other's pull
>>>>> requests. At this point, the patches need to be reviewed and approved by
>>>>> Parquet committers in order to be committed to master.
>>>>>
>>>>> Unfortunately, there is not much activity on this side of the project.
>>>>> The lack of response from current committers is holding us back, and we
>>>>> have to repeatedly rebase our patches, merge multiple pull requests
>>>>> together, and overall step on each other's toes.
>>>>>
>>>>> Is it possible to make Wes, Deepak, and me committers on the project, so
>>>>> we can contribute to parquet-cpp more efficiently?
>>>>>
>>>>> Thanks,
>>>>> Aliaksei.
>>>>>
>>>>>
>>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> We're working on a pretty solid patch queue.
>>>>>>
>>>>>> independent patches
>>>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>>>
>>>>>> interdependent patches (order to apply patches)
>>>>>> PARQUET-437 (MOSTLY REVIEWED):
>>>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>>>
>>>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>>> PARQUET-451 & PARQUET-453:
>>>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>>>
>>>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433):
>>>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>>>
>>>>>> I'm going to take a breather and work on some other things this weekend,
>>>>>> but I'll be available for code reviews and fixes to try to move along
>>>>>> this patch queue.
>>>>>>
>>>>>> Thanks,
>>>>>> Wes
>>>>>>
>>>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Great to meet you all!
>>>>>>>
>>>>>>> I've recently been collaborating with the Apache Drill team to spin out
>>>>>>> the ValueVector columnar in-memory data structure into a new standalone
>>>>>>> project that will be called Arrow [1] [2]. A brief summary of
>>>>>>> Arrow/ValueVectors is that it permits O(1) random access on nested
>>>>>>> columnar structures and is efficient for projections and scans in a
>>>>>>> columnar SQL setting.
>>>>>>>
>>>>>>> I'm very interested in making Parquet read/write support available to
>>>>>>> Python programmers via C/C++ extensions, so I'm going to be working over
>>>>>>> the next few months on a Parquet->Arrow->Python toolchain, along with
>>>>>>> some tools to manipulate tables of in-memory columnar data in the style
>>>>>>> of Python's pandas library.
>>>>>>>
>>>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>>>> performance and add functionality for writing Parquet files as well. The
>>>>>>> details of converting to/from Parquet's repetition/definition level
>>>>>>> representation of nested data will stay separate in the arrow-parquet
>>>>>>> adapter code.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Wes
>>>>>>>
>>>>>>> [1]:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>>>
>>>>>>>
>>>>>>> [2]:
>>>>>>>
>>>>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m.lacour@criteo.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm very interested in this subject because I would like to export
>>>>>>>> parquet data from HDFS to Vertica (using VSQL).
>>>>>>>> I'm planning to work on it next quarter, but I will be very happy to
>>>>>>>> help you on this subject (review, testing).
>>>>>>>>
>>>>>>>> Have a nice day,
>>>>>>>> --
>>>>>>>> Mickaël Lacour
>>>>>>>> Senior Software Engineer
>>>>>>>> Analytics Infrastructure team @Scalability
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Walkauskas, Stephen Gregory (Vertica)
>>>>>>>> <st...@hpe.com>
>>>>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>>>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
>>>>>>>> nongli@gmail.com; Wes McKinney
>>>>>>>> Subject: Re: Parquet-cpp
>>>>>>>>
>>>>>>>> Yes, thanks for the introduction Julien.
>>>>>>>>
>>>>>>>> Nong and Wes,
>>>>>>>>
>>>>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>>>>
>>>>>>>> The Vertica database already supports optimized reads of ORC files
>>>>>>>> (fast C++ parser, predicate pushdown, column selection, etc.). We'd like
>>>>>>>> to do the same for Parquet.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Stephen
>>>>>>>>
>>>>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>>>>
>>>>>>>>> Thank you for the introduction, Julien!
>>>>>>>>>
>>>>>>>>> Hello Nong and Wes,
>>>>>>>>>
>>>>>>>>> Stephen, Deepak and I are developing a C++ library to support
>>>>>>>>> Parquet in Vertica RDBMS. We are using Parquet-cpp as a starting point
>>>>>>>>> and are expanding its functionality as well as improving it and fixing
>>>>>>>>> bugs. We would like to contribute these improvements back to the
>>>>>>>>> open-source community. We plan to do this through the usual process of
>>>>>>>>> creating jiras that justify and explain a code change, and then
>>>>>>>>> submitting pull requests. We look forward to working with you on
>>>>>>>>> Parquet-cpp and to your feedback and suggestions.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Aliaksei.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>>>>>
>>>>>>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>>>>>>> I wanted to introduce you to each other as you are all looking at
>>>>>>>>>> Parquet-cpp.
>>>>>>>>>>
>>>>>>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>>>>>>> collaborate (I see you already doing this):
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>>>>>>
>>>>>>>>>> Nong is a committer and can merge pull requests (he also understands
>>>>>>>>>> that code base very well).
>>>>>>>>>> Other committers can too; feel free to ping us if you need help.
>>>>>>>>>> Obviously, you don't need to be a committer to give others reviews
>>>>>>>>>> (you just need one to approve and merge).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
>
> --
> Julien
>



-- 
Julien

Re: Parquet-cpp

Posted by Julien Le Dem <ju...@dremio.com>.
I'm happy too with Aliaksei, Deepak, Wes, etc reviewing each other.
I see Nong (who's a committer) has been doing some reviews already.

When you guys reach a consensus on a PR and want it merged please mention
it in the PR (+1, LGTM) and mention us directly (@julienledem, ...) to have
it merged.

Right now I see that #19 and #21 have been committed (thanks Nong), but it
is not clear to me in what order the others should be committed.

For example, Deepak should comment directly on #22 to approve it. Right now
he has only mentioned it on another PR:
https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
Similarly Wes could confirm on that PR whether it looks good.

Tomorrow is the Parquet sync up if you want to discuss further:
https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo


On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Aliaksei, thanks for being understanding here.
>
> I agree with you that it is too difficult. We really want to get the cpp
> side bootstrapped as soon as possible. Let's go with what you suggested, to
> have contributors review one another's patches and then ask a committer for
> a final review once both contributors reach a consensus.
>
> If there are issues that are easy to review, maybe some of us other than
> Nong can take a look.
>
> rb
>
>
> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>
>> Hi Ryan,
>>
>> This sounds very reasonable. I do not argue to disregard the standard
>> Apache approach to promoting contributors to committers. I am just
>> pointing out that without the input from current committers it is hard
>> for us to productively contribute to the project. As a consequence, it
>> is hard for us to demonstrate our fit to become committers in the future.
>> This leaves us in a deadlock, which can be resolved either by
>> increased feedback from existing committers or by making us committers
>> sooner.
>>
>> I understand that most committers on the Parquet project are working on
>> the Java implementation, so it can be harder for them to review patches
>> for parquet-cpp. In this regard, how about the following protocol for
>> parquet-cpp pull requests: After contributors review and revise a pull
>> request and agree that it is in good shape, we will ask a designated
>> committer to review and commit the pull request. So far we have been
>> asking Nong; if there is a better designated committer for parquet-cpp,
>> please let us know.
>>
>> Thank you,
>> Aliaksei.
>>
>>
>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>
>>> Hi everyone,
>>>
>>> Sorry about the current backlog on the parquet-cpp side. Most of the
>>> current committer base works on the Java implementation so it's either
>>> slow or not reliable for us to do those reviews.
>>>
>>> I think the best way to move forward is to review patches for each
>>> other. That will keep those issues progressing, make it easy for
>>> committers to validate the commit, and -- most importantly -- to build
>>> a trail of contributions that we can look at to vote in new committers.
>>>
>>> I completely sympathize with the need for committers on the CPP
>>> project, but I don't think this will take a long time given the
>>> current level of activity. We're really just trying to build
>>> confidence that:
>>>
>>> 1. You produce quality contributions and understand the codebase
>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>> 3. You defer judgment and ask others when you don't know
>>> 4. You respect others and interact professionally
>>>
>>> I don't think any of those are that hard to demonstrate, but I'd be
>>> uncomfortable not validating committers like we normally do.
>>> Especially in this situation, where I could easily see the amount of
>>> work you guys are doing adding up pretty quickly!
>>>
>>> Does that sound like a reasonable path forward?
>>>
>>> rb
>>>
>>>
>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>
>>>> Hi Nong and Julien,
>>>>
>>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>>> outstanding. Wes, Deepak, and I have been reviewing each other's pull
>>>> requests. At this point, the patches need to be reviewed and approved by
>>>> Parquet committers in order to be committed to master.
>>>>
>>>> Unfortunately, there is not much activity on this side of the project.
>>>> The lack of response from current committers is holding us back, and we
>>>> have to repeatedly rebase our batches, merge multiple pull requests
>>>> together, and overall step on each others' toes.
>>>>
>>>> Is it possible to make Wes, Deepak, and me committers on the project, so
>>>> we can contribute to parquet-cpp more efficiently?
>>>>
>>>> Thanks,
>>>> Aliaksei.
>>>>
>>>>
>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>
>>>>> Folks,
>>>>>
>>>>> We're working on a pretty solid patch queue.
>>>>>
>>>>> independent patches
>>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>>
>>>>> interdependent patches (order to apply patches)
>>>>> PARQUET-437 (MOSTLY REVIEWED):
>>>>> https://github.com/apache/parquet-cpp/pull/19
>>>>>
>>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>> PARQUET-451 & PARQUET-453:
>>>>> https://github.com/apache/parquet-cpp/pull/23
>>>>>
>>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433):
>>>>> https://github.com/apache/parquet-cpp/pull/24
>>>>>
>>>>> I'm going to take a breather and work on some other things this
>>>>> weekend,
>>>>> but I'll be available for code reviews and fixes to try to move along
>>>>> this
>>>>> patch queue.
>>>>>
>>>>> Thanks,
>>>>> Wes
>>>>>
>>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>> Great to meet you all!
>>>>>>
>>>>>> I've recently been collaborating with the Apache Drill team to spin
>>>>>> out
>>>>>> the ValueVector columnar in-memory data structure into a new
>>>>>> standalone
>>>>>> project that will be called Arrow [1] [2]. A brief summary of
>>>>>> Arrow/ValueVectors is that it permits O(1) random access on nested
>>>>>> columnar
>>>>>> structures and is efficient for projections and scans in a columnar
>>>>>> SQL
>>>>>> setting.
>>>>>>
>>>>>> I'm very interested in making Parquet read/write support available to
>>>>>> Python programmers via C/C++ extensions, so I'm going to be working
>>>>>> the
>>>>>> next few months on a Parquet->Arrow->Python toolchain, along with some
>>>>>> tools to manipulate tables in-memory columnar data in the style of
>>>>>> Python's
>>>>>> pandas library.
>>>>>>
>>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>>> performance
>>>>>> and add functionality for writing Parquet files as well. The
>>>>>> details of
>>>>>> converting to/from Parquet's repetition/definition level
>>>>>> representation of
>>>>>> nested data will stay separate in the arrow-parquet adapter code.
>>>>>>
>>>>>> cheers,
>>>>>> Wes
>>>>>>
>>>>>> [1]:
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>>
>>>>>>
>>>>>> [2]:
>>>>>>
>>>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m....@criteo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>>
>>>>>>> I'm very interested in this subject because I would like to export
>>>>>>> parquet data from HDFS to Vertica (using VSQL).
>>>>>>> I'm planning to work on it next quarter, but I will be very happy to
>>>>>>> help
>>>>>>> you on this subject (review, testing).
>>>>>>>
>>>>>>> Have a nice day,
>>>>>>> --
>>>>>>> Mickaël Lacour
>>>>>>> Senior Software Engineer
>>>>>>> Analytics Infrastructure team @Scalability
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Walkauskas, Stephen Gregory (Vertica)
>>>>>>> <st...@hpe.com>
>>>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
>>>>>>> nongli@gmail.com; Wes McKinney
>>>>>>> Subject: Re: Parquet-cpp
>>>>>>>
>>>>>>> Yes, thanks for the introduction Julien.
>>>>>>>
>>>>>>> Nong and Wes,
>>>>>>>
>>>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>>>
>>>>>>> The Vertica database already supports optimized reads of ORC files
>>>>>>> (fast
>>>>>>> c++ parser, predicate pushdown, columns selection etc). We'd like
>>>>>>> to do
>>>>>>> the same for parquet.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephen
>>>>>>>
>>>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>>>
>>>>>>>> Thank you for the introduction, Julien!
>>>>>>>>
>>>>>>>> Hello Nong and Wes,
>>>>>>>>
>>>>>>>> Stephen, Deepak and I are developing a C++ library to support
>>>>>>>> Parquet in
>>>>>>>> Vertica RDBMS. We are using Parquet-cpp as a starting point and are
>>>>>>>> expanding its functionality as well as improving it and fixing
>>>>>>>> bugs. We
>>>>>>>> would like to contribute these improvements back to the open-source
>>>>>>>> community. We plan to do this through the usual process of creating
>>>>>>>> jiras that justify and explain a code change, and then submitting
>>>>>>>> pull
>>>>>>>> requests. We look forward to working with you on Parquet-cpp and to
>>>>>>>> your
>>>>>>>> feedback and suggestions.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Aliaksei.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>>>>
>>>>>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>>>>>> I wanted to introduce you to each other as you are all looking at
>>>>>>>>> Parquet-cpp.
>>>>>>>>>
>>>>>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>>>>>>
>>>>>>>> collaborate (I
>>>>>>>
>>>>>>>> see you already doing this):
>>>>>>>>>
>>>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>>>
>>>>>>>
>>>>>>> Nong is a committer and can merged pull requests (he also
>>>>>>>>> understands
>>>>>>>>>
>>>>>>>> that
>>>>>>>
>>>>>>>> code base very well).
>>>>>>>>> Other committer can too, feel free to ping us if you need help
>>>>>>>>> Obviously, you don't need to be a committer to give others reviews
>>>>>>>>> (you
>>>>>>>>> just need one to approve and merge).
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>



-- 
Julien

Re: Parquet-cpp

Posted by Ryan Blue <bl...@cloudera.com>.
Aliaksei, thanks for being understanding here.

I agree with you that it is too difficult. We really want to get the cpp
side bootstrapped as soon as possible. Let's go with what you suggested:
have contributors review one another's patches and then ask a
committer for a final review once both contributors reach a consensus.

If there are issues that are easy to review, maybe some of us other than 
Nong can take a look.

rb



-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Parquet-cpp

Posted by Aliaksei Sandryhaila <as...@gmail.com>.
Hi Ryan,

This sounds very reasonable. I am not arguing that we should disregard
the standard Apache approach to promoting contributors to committers. I
am just pointing out that without input from current committers it is
hard for us to contribute productively to the project. As a consequence,
it is hard for us to demonstrate our fit to become committers in the
future. This leaves us in a deadlock, which can be resolved either by
more feedback from existing committers or by making us committers
sooner.

I understand that most committers on the Parquet project are working on
the Java implementation, so it can be harder for them to review patches
for parquet-cpp. In this regard, how about the following protocol for
parquet-cpp pull requests: after contributors review and revise a pull
request and agree that it is in good shape, we will ask a designated
committer to review and commit the pull request. So far we have been
asking Nong; if there is a better designated committer for parquet-cpp,
please let us know.

Thank you,
Aliaksei.




Re: Parquet-cpp

Posted by Ryan Blue <bl...@cloudera.com>.
Hi everyone,

Sorry about the current backlog on the parquet-cpp side. Most of the
current committer base works on the Java implementation, so it's either
slow or not reliable for us to do those reviews.

I think the best way to move forward is for you to review each other's
patches. That will keep those issues progressing, make it easy for
committers to validate the commit, and -- most importantly -- build a
trail of contributions that we can look at to vote in new committers.

I completely sympathize with the need for committers on the CPP project, 
but I don't think this will take a long time given the current level of 
activity. We're really just trying to build confidence that:

1. You produce quality contributions and understand the codebase
2. You give friendly, thoughtful reviews and don't rubber-stamp
3. You defer judgment and ask others when you don't know
4. You respect others and interact professionally

I don't think any of those are that hard to demonstrate, but I'd be 
uncomfortable not validating committers like we normally do. Especially 
in this situation, where I could easily see the amount of work you guys 
are doing adding up pretty quickly!

Does that sound like a reasonable path forward?

rb




-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: Parquet-cpp

Posted by Aliaksei Sandryhaila <as...@apache.org>.
Hi Nong and Julien,

As Wes has pointed out, we have a number of patches for parquet-cpp 
outstanding. Wes, Deepak, and I have been reviewing each other's pull 
requests. At this point, the patches need to be reviewed and approved by 
Parquet committers in order to be committed to master.

Unfortunately, there is not much activity on this side of the project. 
The lack of response from current committers is holding us back, and we 
have to repeatedly rebase our patches, merge multiple pull requests 
together, and generally step on each other's toes.

Is it possible to make Wes, Deepak, and me committers on the project, so 
we can contribute to parquet-cpp more efficiently?

Thanks,
Aliaksei.


On 01/23/2016 06:07 PM, Wes McKinney wrote:
> Folks,
>
> We're working on a pretty solid patch queue.
>
> independent patches
> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>
> interdependent patches (listed in the order they should be applied)
> PARQUET-437 (MOSTLY REVIEWED): https://github.com/apache/parquet-cpp/pull/19
>
> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
> PARQUET-451 & PARQUET-453: https://github.com/apache/parquet-cpp/pull/23
>
> PARQUET-428 (needs to be rebased on top of PARQUET-433):
> https://github.com/apache/parquet-cpp/pull/24
>
> I'm going to take a breather and work on some other things this weekend,
> but I'll be available for code reviews and fixes to try to move along this
> patch queue.
>
> Thanks,
> Wes
>
> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com> wrote:
>
>> Great to meet you all!
>>
>> I've recently been collaborating with the Apache Drill team to spin out
>> the ValueVector columnar in-memory data structure into a new standalone
>> project that will be called Arrow [1] [2]. A brief summary of
>> Arrow/ValueVectors is that it permits O(1) random access on nested columnar
>> structures and is efficient for projections and scans in a columnar SQL
>> setting.
>>
>> I'm very interested in making Parquet read/write support available to
>> Python programmers via C/C++ extensions, so I'm going to be working over
>> the next few months on a Parquet->Arrow->Python toolchain, along with some
>> tools to manipulate tables of in-memory columnar data in the style of
>> Python's pandas library.
>>
>> I will propose patches as needed to parquet-cpp to improve its performance
>> and add functionality for writing Parquet files as well. The details of
>> converting to/from Parquet's repetition/definition level representation of
>> nested data will stay separate in the arrow-parquet adapter code.
>>
>> cheers,
>> Wes
>>
>> [1]:
>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>> [2]:
>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>
>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m....@criteo.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm very interested in this subject because I would like to export
>>> parquet data from HDFS to Vertica (using VSQL).
>>> I'm planning to work on it next quarter, but I will be very happy to help
>>> you on this subject (review, testing).
>>>
>>> Have a nice day,
>>> --
>>> Mickaël Lacour
>>> Senior Software Engineer
>>> Analytics Infrastructure team @Scalability
>>>
>>> ________________________________________
>>> From: Walkauskas, Stephen Gregory (Vertica) <st...@hpe.com>
>>> Sent: Thursday, January 14, 2016 3:23 PM
>>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
>>> nongli@gmail.com; Wes McKinney
>>> Subject: Re: Parquet-cpp
>>>
>>> Yes, thanks for the introduction Julien.
>>>
>>> Nong and Wes,
>>>
>>> It'd be interesting to know your goals for parquet-cpp.
>>>
>>> The Vertica database already supports optimized reads of ORC files (fast
>>> C++ parser, predicate pushdown, column selection, etc.). We'd like to do
>>> the same for Parquet.
>>>
>>> Cheers,
>>> Stephen
>>>
>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>> Thank you for the introduction, Julien!
>>>>
>>>> Hello Nong and Wes,
>>>>
>>>> Stephen, Deepak and I are developing a C++ library to support Parquet in
>>>> Vertica RDBMS. We are using Parquet-cpp as a starting point and are
>>>> expanding its functionality as well as improving it and fixing bugs. We
>>>> would like to contribute these improvements back to the open-source
>>>> community. We plan to do this through the usual process of creating
>>>> jiras that justify and explain a code change, and then submitting pull
>>>> requests. We look forward to working with you on Parquet-cpp and to your
>>>> feedback and suggestions.
>>>>
>>>> Best regards,
>>>> Aliaksei.
>>>>
>>>>
>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>> I wanted to introduce you to each other as you are all looking at
>>>>> Parquet-cpp.
>>>>>
>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>> collaborate (I see you already doing this):
>>>>>
>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>
>>>>> Nong is a committer and can merge pull requests (he also understands
>>>>> that code base very well).
>>>>> Other committers can too; feel free to ping us if you need help.
>>>>> Obviously, you don't need to be a committer to give others reviews (you
>>>>> just need one to approve and merge).
>>>>>
>>


Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
Folks,

We're working on a pretty solid patch queue.

independent patches
PARQUET-449: https://github.com/apache/parquet-cpp/pull/21

interdependent patches (listed in the order they should be applied)
PARQUET-437 (MOSTLY REVIEWED): https://github.com/apache/parquet-cpp/pull/19

PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
PARQUET-451 & PARQUET-453: https://github.com/apache/parquet-cpp/pull/23

PARQUET-428 (needs to be rebased on top of PARQUET-433):
https://github.com/apache/parquet-cpp/pull/24

I'm going to take a breather and work on some other things this weekend,
but I'll be available for code reviews and fixes to try to move along this
patch queue.

Thanks,
Wes

On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <we...@cloudera.com> wrote:

> Great to meet you all!
>
> I've recently been collaborating with the Apache Drill team to spin out
> the ValueVector columnar in-memory data structure into a new standalone
> project that will be called Arrow [1] [2]. A brief summary of
> Arrow/ValueVectors is that it permits O(1) random access on nested columnar
> structures and is efficient for projections and scans in a columnar SQL
> setting.
>
> I'm very interested in making Parquet read/write support available to
> Python programmers via C/C++ extensions, so I'm going to be working over
> the next few months on a Parquet->Arrow->Python toolchain, along with some
> tools to manipulate tables of in-memory columnar data in the style of
> Python's pandas library.
>
> I will propose patches as needed to parquet-cpp to improve its performance
> and add functionality for writing Parquet files as well. The details of
> converting to/from Parquet's repetition/definition level representation of
> nested data will stay separate in the arrow-parquet adapter code.
>
> cheers,
> Wes
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
> [2]:
> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>
> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m....@criteo.com>
> wrote:
>
>> Hi,
>>
>> I'm very interested in this subject because I would like to export
>> parquet data from HDFS to Vertica (using VSQL).
>> I'm planning to work on it next quarter, but I will be very happy to help
>> you on this subject (review, testing).
>>
>> Have a nice day,
>> --
>> Mickaël Lacour
>> Senior Software Engineer
>> Analytics Infrastructure team @Scalability
>>
>> ________________________________________
>> From: Walkauskas, Stephen Gregory (Vertica) <st...@hpe.com>
>> Sent: Thursday, January 14, 2016 3:23 PM
>> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
>> nongli@gmail.com; Wes McKinney
>> Subject: Re: Parquet-cpp
>>
>> Yes, thanks for the introduction Julien.
>>
>> Nong and Wes,
>>
>> It'd be interesting to know your goals for parquet-cpp.
>>
>> The Vertica database already supports optimized reads of ORC files (fast
>> C++ parser, predicate pushdown, column selection, etc.). We'd like to do
>> the same for Parquet.
>>
>> Cheers,
>> Stephen
>>
>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>> > Thank you for the introduction, Julien!
>> >
>> > Hello Nong and Wes,
>> >
>> > Stephen, Deepak and I are developing a C++ library to support Parquet in
>> > Vertica RDBMS. We are using Parquet-cpp as a starting point and are
>> > expanding its functionality as well as improving it and fixing bugs. We
>> > would like to contribute these improvements back to the open-source
>> > community. We plan to do this through the usual process of creating
>> > jiras that justify and explain a code change, and then submitting pull
>> > requests. We look forward to working with you on Parquet-cpp and to your
>> > feedback and suggestions.
>> >
>> > Best regards,
>> > Aliaksei.
>> >
>> >
>> > On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>> >> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>> >> I wanted to introduce you to each other as you are all looking at
>> >> Parquet-cpp.
>> >>
>> >> I'd recommend opening JIRAs in the parquet-cpp component to
>> >> collaborate (I see you already doing this):
>> >>
>> >> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>> >>
>> >> Nong is a committer and can merge pull requests (he also understands
>> >> that code base very well).
>> >> Other committers can too; feel free to ping us if you need help.
>> >> Obviously, you don't need to be a committer to give others reviews (you
>> >> just need one to approve and merge).
>> >>
>> >
>>
>
>

Re: Parquet-cpp

Posted by Wes McKinney <we...@cloudera.com>.
Great to meet you all!

I've recently been collaborating with the Apache Drill team to spin out the
ValueVector columnar in-memory data structure into a new standalone project
that will be called Arrow [1] [2]. A brief summary of Arrow/ValueVectors is
that it permits O(1) random access on nested columnar structures and is
efficient for projections and scans in a columnar SQL setting.
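
To illustrate the O(1) random-access property mentioned above, here is a
minimal sketch. It assumes a column of lists is stored as one contiguous
values array plus an offsets array; this is an illustration under that
assumption, not the Arrow or ValueVector implementation. The names
ListInt32Column, list_size, value_at and make_example_column are
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for a column of type list<int32>.
// Not an Arrow, Drill, or parquet-cpp API.
struct ListInt32Column {
  // offsets has (number of lists + 1) entries; list i occupies
  // values[offsets[i]] .. values[offsets[i + 1] - 1].
  std::vector<int32_t> offsets;
  // All child values of all lists, stored back to back.
  std::vector<int32_t> values;

  std::size_t length() const {
    return offsets.empty() ? 0 : offsets.size() - 1;
  }

  // Size of list i: two array reads, independent of the column length.
  int32_t list_size(std::size_t i) const {
    assert(i < length());
    return offsets[i + 1] - offsets[i];
  }

  // Element j of list i, located by index arithmetic with no scanning.
  int32_t value_at(std::size_t i, int32_t j) const {
    assert(i < length());
    assert(j >= 0 && j < list_size(i));
    return values[static_cast<std::size_t>(offsets[i] + j)];
  }
};

// Builds the three-row column [[1, 2], [], [3, 4, 5]].
inline ListInt32Column make_example_column() {
  return ListInt32Column{{0, 2, 2, 5}, {1, 2, 3, 4, 5}};
}
```

With this layout, reading element 1 of row 2 is a lookup of offsets[2]
followed by one read from values, regardless of how many rows precede it.
A scan over all child values is a single pass over one contiguous array.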

I'm very interested in making Parquet read/write support available to
Python programmers via C/C++ extensions, so I'm going to be working over
the next few months on a Parquet->Arrow->Python toolchain, along with some
tools to manipulate tables of in-memory columnar data in the style of
Python's pandas library.

I will propose patches as needed to parquet-cpp to improve its performance
and add functionality for writing Parquet files as well. The details of
converting to/from Parquet's repetition/definition level representation of
nested data will stay separate in the arrow-parquet adapter code.

cheers,
Wes

[1]:
http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
[2]:
http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490

On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <m....@criteo.com> wrote:

> Hi,
>
> I'm very interested in this subject because I would like to export parquet
> data from HDFS to Vertica (using VSQL).
> I'm planning to work on it next quarter, but I will be very happy to help
> you on this subject (review, testing).
>
> Have a nice day,
> --
> Mickaël Lacour
> Senior Software Engineer
> Analytics Infrastructure team @Scalability
>
> ________________________________________
> From: Walkauskas, Stephen Gregory (Vertica) <st...@hpe.com>
> Sent: Thursday, January 14, 2016 3:23 PM
> To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak;
> nongli@gmail.com; Wes McKinney
> Subject: Re: Parquet-cpp
>
> Yes, thanks for the introduction Julien.
>
> Nong and Wes,
>
> It'd be interesting to know your goals for parquet-cpp.
>
> The Vertica database already supports optimized reads of ORC files (fast
> C++ parser, predicate pushdown, column selection, etc.). We'd like to do
> the same for Parquet.
>
> Cheers,
> Stephen
>
> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
> > Thank you for the introduction, Julien!
> >
> > Hello Nong and Wes,
> >
> > Stephen, Deepak and I are developing a C++ library to support Parquet in
> > Vertica RDBMS. We are using Parquet-cpp as a starting point and are
> > expanding its functionality as well as improving it and fixing bugs. We
> > would like to contribute these improvements back to the open-source
> > community. We plan to do this through the usual process of creating
> > jiras that justify and explain a code change, and then submitting pull
> > requests. We look forward to working with you on Parquet-cpp and to your
> > feedback and suggestions.
> >
> > Best regards,
> > Aliaksei.
> >
> >
> > On 01/13/2016 02:54 PM, Julien Le Dem wrote:
> >> Hello Nong, Wes, Stephen, Deepak and Aliaksei
> >> I wanted to introduce you to each other as you are all looking at
> >> Parquet-cpp.
> >>
> >> I'd recommend opening JIRAs in the parquet-cpp component to
> >> collaborate (I see you already doing this):
> >>
> >> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
> >>
> >> Nong is a committer and can merge pull requests (he also understands
> >> that code base very well).
> >> Other committers can too; feel free to ping us if you need help.
> >> Obviously, you don't need to be a committer to give others reviews (you
> >> just need one to approve and merge).
> >>
> >
>

Re: Parquet-cpp

Posted by Mickaël Lacour <m....@criteo.com>.
Hi,

I'm very interested in this subject because I would like to export parquet data from HDFS to Vertica (using VSQL).
I'm planning to work on it next quarter, but I will be very happy to help you on this subject (review, testing).

Have a nice day,
--
Mickaël Lacour
Senior Software Engineer
Analytics Infrastructure team @Scalability

________________________________________
From: Walkauskas, Stephen Gregory (Vertica) <st...@hpe.com>
Sent: Thursday, January 14, 2016 3:23 PM
To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak; nongli@gmail.com; Wes McKinney
Subject: Re: Parquet-cpp

Yes, thanks for the introduction Julien.

Nong and Wes,

It'd be interesting to know your goals for parquet-cpp.

The Vertica database already supports optimized reads of ORC files (fast
C++ parser, predicate pushdown, column selection, etc.). We'd like to do
the same for Parquet.

Cheers,
Stephen

On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
> Thank you for the introduction, Julien!
>
> Hello Nong and Wes,
>
> Stephen, Deepak and I are developing a C++ library to support Parquet in
> Vertica RDBMS. We are using Parquet-cpp as a starting point and are
> expanding its functionality as well as improving it and fixing bugs. We
> would like to contribute these improvements back to the open-source
> community. We plan to do this through the usual process of creating
> jiras that justify and explain a code change, and then submitting pull
> requests. We look forward to working with you on Parquet-cpp and to your
> feedback and suggestions.
>
> Best regards,
> Aliaksei.
>
>
> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>> I wanted to introduce you to each other as you are all looking at
>> Parquet-cpp.
>>
>> I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
>> see you already doing this):
>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>
>> Nong is a committer and can merge pull requests (he also understands that
>> code base very well).
>> Other committers can too; feel free to ping us if you need help.
>> Obviously, you don't need to be a committer to give others reviews (you
>> just need one to approve and merge).
>>
>

Re: Parquet-cpp

Posted by "Walkauskas, Stephen Gregory (Vertica)" <st...@hpe.com>.
Yes, thanks for the introduction Julien.

Nong and Wes,

It'd be interesting to know your goals for parquet-cpp.

The Vertica database already supports optimized reads of ORC files (fast
C++ parser, predicate pushdown, column selection, etc.). We'd like to do
the same for Parquet.

Cheers,
Stephen

On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
> Thank you for the introduction, Julien!
>
> Hello Nong and Wes,
>
> Stephen, Deepak and I are developing a C++ library to support Parquet in
> Vertica RDBMS. We are using Parquet-cpp as a starting point and are
> expanding its functionality as well as improving it and fixing bugs. We
> would like to contribute these improvements back to the open-source
> community. We plan to do this through the usual process of creating
> jiras that justify and explain a code change, and then submitting pull
> requests. We look forward to working with you on Parquet-cpp and to your
> feedback and suggestions.
>
> Best regards,
> Aliaksei.
>
>
> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>> I wanted to introduce you to each other as you are all looking at
>> Parquet-cpp.
>>
>> I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
>> see you already doing this):
>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>
>> Nong is a committer and can merge pull requests (he also understands that
>> code base very well).
>> Other committers can too; feel free to ping us if you need help.
>> Obviously, you don't need to be a committer to give others reviews (you
>> just need one to approve and merge).
>>
>


Re: Parquet-cpp

Posted by "Sandryhaila, Aliaksei" <al...@hpe.com>.
Thank you for the introduction, Julien!

Hello Nong and Wes,

Stephen, Deepak and I are developing a C++ library to support Parquet in
Vertica RDBMS. We are using Parquet-cpp as a starting point and are
expanding its functionality as well as improving it and fixing bugs. We
would like to contribute these improvements back to the open-source
community. We plan to do this through the usual process of creating
jiras that justify and explain a code change, and then submitting pull
requests. We look forward to working with you on Parquet-cpp and to your
feedback and suggestions.

Best regards,
Aliaksei.


On 01/13/2016 02:54 PM, Julien Le Dem wrote:
> Hello Nong, Wes, Stephen, Deepak and Aliaksei
> I wanted to introduce you to each other as you are all looking at
> Parquet-cpp.
>
> I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
> see you already doing this):
> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>
> Nong is a committer and can merge pull requests (he also understands that
> code base very well).
> Other committers can too; feel free to ping us if you need help.
> Obviously, you don't need to be a committer to give others reviews (you
> just need one to approve and merge).
>