Posted to user@spark.apache.org by Junfeng Chen <da...@gmail.com> on 2018/04/03 03:28:56 UTC

How to delete empty columns in df when writing to parquet?

I am trying to read data from Kafka and write it in Parquet format via
Spark Streaming.
The problem is that the data from Kafka have a variable structure. For
example, app one has columns A,B,C while app two has columns B,C,D, so the
data frame I read from Kafka has all columns A,B,C,D. When I write the
dataframe to Parquet files partitioned by app name, the Parquet file for
app one also contains column D, even though column D is empty and actually
contains no data. How can I filter out the empty columns when writing the
dataframe to Parquet?
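
One way to express this, as a rough, untested Scala sketch: compute which
columns are entirely null for each app and select only the rest before
writing. Here "df" stands for the combined dataframe from the question,
"spark" is the usual shell SparkSession, and the column name "app" and the
output path are assumptions; in a streaming job this would run inside
whatever per-batch hook the job already uses.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, count, when}

    // Drop every column that holds no non-null values at all.
    def dropAllNullColumns(df: DataFrame): DataFrame = {
      // One pass over the data: count the non-null values per column.
      val counts = df.select(df.columns.map(c =>
        count(when(col(c).isNotNull, 1)).alias(c)): _*).head()
      val keep = df.columns.filter(c => counts.getAs[Long](c) > 0)
      df.select(keep.map(col): _*)
    }

    // Write each app's slice separately, so each directory carries only
    // the columns that app actually populates ("/data/out" is made up).
    df.select("app").distinct().collect().map(_.getString(0)).foreach { app =>
      dropAllNullColumns(df.filter(col("app") === app))
        .write.mode("append").parquet(s"/data/out/app=$app")
    }

Note that this bypasses partitionBy on purpose: partitionBy writes one
unified schema across all partitions, which is exactly what puts the empty
column D into app one's files.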

Thanks!


Regards,
Junfeng Chen

Re: How to delete empty columns in df when writing to parquet?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Junfeng,

you are welcome. If your users are extremely adamant about seeing only a
few columns, try creating a view over just the selected columns and giving
that to them, in case you are using the Hive metastore.
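
A rough sketch of that via spark.sql; the table name "events", the view
name "app_one_view", and the column names are made up for illustration:

    // Expose only the columns app one actually populates.
    spark.sql("""
      CREATE VIEW IF NOT EXISTS app_one_view AS
      SELECT A, B, C FROM events WHERE app = 'app_one'
    """)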

Regards,
Gourav

On Sun, Apr 8, 2018 at 3:28 AM, Junfeng Chen <da...@gmail.com> wrote:

> Hi,
> Thanks for explaining!
>
>
> Regards,
> Junfeng Chen

Re: How to delete empty columns in df when writing to parquet?

Posted by Junfeng Chen <da...@gmail.com>.
Hi,
Thanks for explaining!


Regards,
Junfeng Chen

On Wed, Apr 4, 2018 at 7:43 PM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> I do not think that in a columnar format such as Parquet it makes much
> of a difference. The amount of data that you will be parsing will not be
> much anyway.
>
> Regards,
> Gourav Sengupta
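
On the columnar point quoted above: since Parquet is a column-oriented
format, a reader that never selects the empty column never touches its
data. A small consumer-side sketch (the path and column names are
assumptions):

    // Column pruning: only A, B and C are read from disk, so an
    // all-null column D adds almost no cost to this query.
    val appOne = spark.read.parquet("/data/out/app=app_one")
      .select("A", "B", "C")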

Re: How to delete empty columns in df when writing to parquet?

Posted by Junfeng Chen <da...@gmail.com>.
Our users ask for it....


Regards,
Junfeng Chen

On Wed, Apr 4, 2018 at 5:45 PM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi Junfeng,
>
> Can I ask why it is important to remove the empty column?
>
> Regards,
> Gourav Sengupta

Re: How to delete empty columns in df when writing to parquet?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Junfeng,

Can I ask why it is important to remove the empty column?

Regards,
Gourav Sengupta

On Tue, Apr 3, 2018 at 4:28 AM, Junfeng Chen <da...@gmail.com> wrote:

> I am trying to read data from Kafka and write it in Parquet format via
> Spark Streaming.
> The problem is that the data from Kafka have a variable structure. For
> example, app one has columns A,B,C while app two has columns B,C,D, so
> the data frame I read from Kafka has all columns A,B,C,D. When I write
> the dataframe to Parquet files partitioned by app name, the Parquet file
> for app one also contains column D, even though column D is empty and
> actually contains no data. How can I filter out the empty columns when
> writing the dataframe to Parquet?
>
> Thanks!
>
>
> Regards,
> Junfeng Chen
>

Re: How to delete empty columns in df when writing to parquet?

Posted by Junfeng Chen <da...@gmail.com>.
You mean I should start two Spark Streaming applications and read the
topics separately?


Regards,
Junfeng Chen

On Tue, Apr 3, 2018 at 10:31 PM, naresh Goud <na...@gmail.com>
wrote:

> I don’t see any option other than starting two individual queries (see
> the sketch at the end of this thread). It’s just a thought.
>
> Thank you,
> Naresh
>
> --
> Thanks,
> Naresh
> www.linkedin.com/in/naresh-dulam
> http://hadoopandspark.blogspot.com/
>
>
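
A rough sketch of the two-query idea quoted above, assuming Structured
Streaming with one Kafka topic per app; the broker address, topic names,
and output paths are all made up:

    import org.apache.spark.sql.streaming.StreamingQuery

    // Start one independent query per app/topic; parsing the Kafka value
    // into typed columns is omitted to keep the sketch short.
    def startQuery(topic: String, outPath: String): StreamingQuery =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS json")
        .writeStream
        .format("parquet")
        .option("path", outPath)
        .option("checkpointLocation", outPath + "/_checkpoint")
        .start()

    val one = startQuery("app_one_events", "/data/out/app_one")
    val two = startQuery("app_two_events", "/data/out/app_two")
    spark.streams.awaitAnyTermination()

Because each query has its own source and sink, each app's Parquet
directory only ever sees that app's schema, which avoids the empty-column
problem entirely.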