Posted to dev@parquet.apache.org by Xinyu Zeng <xz...@gmail.com> on 2022/04/24 15:29:38 UTC

Any doc/wiki/contribution guide?

Hi,

I previously used parquet-cpp (now integrated with Arrow) and am now
going to use parquet-mr, the Java implementation. However, I did not find
any docs or wiki on how to use the API. I am also interested in
contributing, but I could not find a contribution guide like the ones
other open source projects have. I would appreciate it if someone could
give me a short guide.

Thanks,
Xinyu

Re: Any doc/wiki/contribution guide?

Posted by "Miller, Tim" <th...@amazon.com.INVALID>.
I've been working with Trino, but I'm not that deep into it yet. I have noticed that a bunch of Parquet functionality is duplicated in Trino. This may be necessary to work around a problem with ParquetMR, where whether a given file can be read depends on which API you use. Or it may just be that some functionality is specific to the database and isn't appropriate to put into ParquetMR. Not sure. Trino certainly depends on and uses a lot of ParquetMR, but there's loads of it that Trino doesn't use.

On 4/26/22, 10:19 AM, "Xinyu Zeng" <xz...@gmail.com> wrote:


    Thanks Tim and Gamaken! They are helpful links and code.

    A follow-up (maybe stupid) question: it seems other Java-based engines
    like Presto have their own implementations of Parquet read/write. Is
    that because parquet-mr can only deserialize Parquet into specific
    formats like Avro/Thrift/Protobuf, while other engines need
    tight coupling between Parquet and their in-memory format? They also
    need different IO/buffering techniques than parquet-mr uses. If my
    understanding is correct, does that mean a unified Parquet
    implementation does not exist, and providing one is not the purpose of
    parquet-mr?

    Thanks


Re: Any doc/wiki/contribution guide?

Posted by Xinyu Zeng <xz...@gmail.com>.
Thanks Tim and Gamaken! They are helpful links and code.

A follow-up (maybe stupid) question: it seems other Java-based engines
like Presto have their own implementations of Parquet read/write. Is
that because parquet-mr can only deserialize Parquet into specific
formats like Avro/Thrift/Protobuf, while other engines need
tight coupling between Parquet and their in-memory format? They also
need different IO/buffering techniques than parquet-mr uses. If my
understanding is correct, does that mean a unified Parquet
implementation does not exist, and providing one is not the purpose of
parquet-mr?

Thanks

On Tue, Apr 26, 2022 at 9:37 PM Miller, Tim <th...@amazon.com.invalid> wrote:
>
> Also, using the API is a pain, because you have to use Hadoop. Various people have found workarounds for this; see, for example, the comments on:
> https://issues.apache.org/jira/browse/PARQUET-1822
>
> I also assembled a minimal reader myself (from code I found elsewhere on GitHub, which I should add attributions for later), which I put here:
> https://github.com/theosib-amazon/parquet-mr-minreader

Re: Any doc/wiki/contribution guide?

Posted by "Miller, Tim" <th...@amazon.com.INVALID>.
Also, using the API is a pain, because you have to use Hadoop. Various people have found workarounds for this; see, for example, the comments on:
https://issues.apache.org/jira/browse/PARQUET-1822

I also assembled a minimal reader myself (from code I found elsewhere on GitHub, which I should add attributions for later), which I put here:
https://github.com/theosib-amazon/parquet-mr-minreader 
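
As a rough illustration of what can be done with no Hadoop (or parquet-mr) dependency at all, the outer framing of a Parquet file can be checked with the JDK alone. The sketch below relies only on the Parquet file layout: a 4-byte "PAR1" magic at both ends of the file, with a 4-byte little-endian footer length just before the trailing magic. The class name ParquetFooterProbe and the method footerLength are hypothetical and are not part of parquet-mr. It does not decode the footer or any data, and it is not the approach taken by the minimal reader linked above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical helper (not part of parquet-mr): checks Parquet framing using only the JDK. */
final class ParquetFooterProbe {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    private ParquetFooterProbe() {}

    /**
     * Returns the length in bytes of the serialized footer metadata.
     * Throws IOException if the file does not start and end with "PAR1"
     * or if the recorded footer length cannot fit in the file.
     */
    static int footerLength(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // Smallest possible file: magic (4) + footer length (4) + magic (4).
            if (size < 12) {
                throw new IOException("Too small to be a Parquet file: " + size + " bytes");
            }
            byte[] headMagic = new byte[4];
            readFully(ch, 0, 4).get(headMagic);

            // The last 8 bytes are: footer length (little-endian int), then magic.
            ByteBuffer tail = readFully(ch, size - 8, 8).order(ByteOrder.LITTLE_ENDIAN);
            int footerLen = tail.getInt();
            byte[] tailMagic = new byte[4];
            tail.get(tailMagic);

            if (!Arrays.equals(headMagic, MAGIC) || !Arrays.equals(tailMagic, MAGIC)) {
                throw new IOException("Missing PAR1 magic bytes");
            }
            if (footerLen < 0 || footerLen > size - 12) {
                throw new IOException("Implausible footer length: " + footerLen);
            }
            return footerLen;
        }
    }

    /** Reads exactly 'length' bytes starting at 'position', or fails on early EOF. */
    private static ByteBuffer readFully(FileChannel ch, long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, position + buf.position());
            if (n < 0) {
                throw new IOException("Unexpected end of file");
            }
        }
        buf.flip();
        return buf;
    }
}
```

Calling ParquetFooterProbe.footerLength(Paths.get("data.parquet")) (with a file name of your own) returns the footer size or throws if the framing is wrong. Actually reading the schema and row groups still needs a Thrift decoder for the footer, which is where parquet-mr or one of the workarounds above comes in.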

On 4/25/22, 2:51 PM, "gamaken k" <ga...@gmail.com> wrote:

    > wiki on how to use the api
    +1 to this. I too think this would be very useful for getting started.
    Xinyu, you could potentially look at parquet-cli's source code to
    understand how it invokes the various APIs from parquet-mr, I think.


Re: Any doc/wiki/contribution guide?

Posted by gamaken k <ga...@gmail.com>.
> wiki on how to use the api
+1 to this. I too think this would be very useful for getting started.
Xinyu, you could potentially look at parquet-cli's source code to
understand how it invokes the various APIs from parquet-mr, I think.

On Sun, Apr 24, 2022 at 8:29 AM Xinyu Zeng <xz...@gmail.com> wrote:

> Hi,
>
> I previously used parquet-cpp (now integrated with Arrow) and am now
> going to use parquet-mr, the Java implementation. However, I did not find
> any docs or wiki on how to use the API. I am also interested in
> contributing, but I could not find a contribution guide like the ones
> other open source projects have. I would appreciate it if someone could
> give me a short guide.
>
> Thanks,
> Xinyu
>