Posted to user@pig.apache.org by Michael Doo <mi...@verve.com> on 2018/08/27 17:18:41 UTC

Reading partitioned Parquet data into Pig

Hello,

I’m trying to read partitioned Parquet data into Pig. The data is stored in S3 like s3://path/to/files/some_flag=true/part-00095-a2a6230b-9750-48e4-9cd0-b553ffc220de.c000.gz.parquet, and I’d like to load it into Pig with the partition values added as columns. I’ve read some resources suggesting the HCatLoader, but so far I haven’t had success.
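For concreteness, what I'm after is roughly the following (just a sketch; the loader and field names are placeholders for whatever ends up working):

    -- hoped-for result: the partition value some_flag shows up as a regular field
    data = LOAD 's3://path/to/files' USING org.apache.parquet.pig.ParquetLoader();
    flagged = FILTER data BY some_flag == 'true';  -- today some_flag isn't in the schema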

Any advice would be welcome.

~ Michael

Re: Reading partitioned Parquet data into Pig

Posted by Michael Doo <mi...@verve.com>.
Eyal,

The Parquet Pig loader is fine if all the data is present in the files themselves, but if I've written out from Spark using `df.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')`, the values of those two columns are moved into the output path and dropped from the data files: s3://path/to/output/colA=valA/colB=valB/part-0001.parquet. There are hacky workarounds, such as duplicating the columns in Spark before writing, which fix loading into Pig but mean the duplicate columns re-appear in the data when you read it back into Spark.
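For reference, that workaround looks roughly like this (a sketch; column names and paths are placeholders):

    # Duplicate the partition columns so their values also survive inside the
    # Parquet files themselves (partitionBy drops its columns from the data).
    from pyspark.sql import functions as F

    df_out = (df
              .withColumn('colA_part', F.col('colA'))
              .withColumn('colB_part', F.col('colB')))

    # Partition by the duplicates: colA/colB stay in the files for Pig, but the
    # *_part copies re-appear when the data is read back into Spark.
    df_out.write.partitionBy('colA_part', 'colB_part').parquet('s3://path/to/output')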

Best,
Michael 


Re: Reading partitioned Parquet data into Pig

Posted by Adam Szita <sz...@cloudera.com.INVALID>.
Hi Eyal,

For just loading Parquet files the Parquet Pig loader is okay, although I
don't think it lets you use the partition values in the dataset later.
Plain old PigStorage has a trick with its -tagFile/-tagPath options, but I'm
not sure that would be enough in Michael's case, or whether the Parquet
loader supports anything similar.
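For illustration, roughly what I mean (untested sketch; paths, delimiter and field names are placeholders):

    -- Parquet Pig loader: reads the partitioned directories fine, but the
    -- partition value (some_flag) does not show up as a column.
    raw = LOAD 's3://path/to/files' USING org.apache.parquet.pig.ParquetLoader();

    -- PigStorage trick (delimited text only): -tagPath prepends the input file
    -- path as an extra first field, which the partition value could be parsed from.
    tagged = LOAD 's3://path/to/textfiles' USING PigStorage(',', '-tagPath');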

Thanks


Re: Reading partitioned Parquet data into Pig

Posted by Eyal Allweil <ey...@yahoo.com.INVALID>.
Hi Michael,
You can also use the Parquet Pig loader (especially if you're not working with Hive). Here's a link to the Maven repository for it.

https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
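A minimal sketch of using it (jar and load paths are placeholders; untested):

    REGISTER /path/to/parquet-pig-bundle-1.10.0.jar;
    data = LOAD 's3://path/to/files'
           USING org.apache.parquet.pig.ParquetLoader();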
Regards, Eyal

Re: Reading partitioned Parquet data into Pig

Posted by Adam Szita <sz...@cloudera.com.INVALID>.
Hi Michael,

Yes, you can use HCatLoader to do this.
The requirement is that you have a Hive table defined on top of your data
(probably pointing to s3://path/to/files), with the Hive MetaStore holding
all the relevant metadata/schema information.
If you do not have a Hive table yet, you can define one in Hive by manually
specifying the schema, and after that the partitions can be added
automatically via Hive's 'msck repair' command.
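A rough sketch of the steps (table name, columns and types are placeholders):

    -- Hive: external table over the existing S3 layout, then discover partitions.
    CREATE EXTERNAL TABLE my_table (col1 STRING, col2 BIGINT)
    PARTITIONED BY (some_flag STRING)
    STORED AS PARQUET
    LOCATION 's3://path/to/files';
    MSCK REPAIR TABLE my_table;

    -- Pig (started with `pig -useHCatalog`): partition columns come back as fields.
    data = LOAD 'default.my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
    flagged = FILTER data BY some_flag == 'true';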

Hope this helps,
Adam

