Posted to dev@parquet.apache.org by Manik Singla <sm...@gmail.com> on 2019/10/15 09:45:12 UTC

multi threading support

Hi Guys

I was looking for a task list or a list of blockers for supporting a
multi-threaded writer (Java specifically).
I did not find anything in JIRA or on the forums.

Could someone point me to a doc or link, if one exists?


Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."

Re: multi threading support

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Manik,

What I've done previously with Apache Avro is shard the workload based on
the fingerprint of the schema. You compute the fingerprint of the schema
and get a Long that identifies the canonical form of the schema. I was
using Flink, but you can implement this with any other distributed
framework as well. You partition on the fingerprint, which ensures that
messages with the same schema land on the same executor and scales the
workload out (assuming there is no data skew).
Hope this helps.
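
A minimal sketch of the partitioning idea above, in plain Java. The names
(SchemaSharding, fingerprint, shardFor) are hypothetical, and the FNV-1a
hash is only a stand-in for a real schema fingerprint; with Avro, a 64-bit
fingerprint of the canonical form should be available from
SchemaNormalization.parsingFingerprint64. In Flink or a similar framework,
the shard index would be replaced by the framework's own key-based
partitioning.

```java
import java.nio.charset.StandardCharsets;

class SchemaSharding {
    // Stand-in fingerprint (assumption): FNV-1a 64-bit over the canonical
    // schema text. This is NOT the algorithm Avro uses.
    static long fingerprint(String canonicalSchema) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : canonicalSchema.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Maps a fingerprint to a shard in [0, shardCount). floorMod keeps the
    // result non-negative even when the fingerprint is negative.
    static int shardFor(long fingerprint, int shardCount) {
        return (int) Math.floorMod(fingerprint, (long) shardCount);
    }

    public static void main(String[] args) {
        String schema = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[]}";
        long fp = fingerprint(schema);
        System.out.println("fingerprint=" + fp + " shard=" + shardFor(fp, 8));
    }
}
```

Records with the same schema always produce the same fingerprint and so
the same shard, which is what keeps one schema on one executor. A single
dominant schema still lands on a single shard, which is the data-skew
caveat mentioned above.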

Cheers, Fokko

On Mon, Oct 21, 2019 at 10:08 PM Manik Singla <sm...@gmail.com> wrote:

> In our case, all 1500 writers have different schemas, so we would need to
> increase throughput per writer.
> But currently, writer throughput is not the application bottleneck.
>
> As suggested, we will look at application-level fixes if we come to
> this.

Re: multi threading support

Posted by Manik Singla <sm...@gmail.com>.
In our case, all 1500 writers have different schemas, so we would need to
increase throughput per writer.
But currently, writer throughput is not the application bottleneck.

As suggested, we will look at application-level fixes if we come to
this.



Regards
Manik Singla


On Mon, Oct 21, 2019 at 9:26 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I agree with Fokko. Multi-threading is not the responsibility of Parquet.
> You can parallelize by writing more Parquet files in separate threads.
> Adding locks to Parquet doesn't make much sense and is unlikely to speed up
> your application without huge changes to Parquet.

Re: multi threading support

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree with Fokko. Multi-threading is not the responsibility of Parquet.
You can parallelize by writing more Parquet files in separate threads.
Adding locks to Parquet doesn't make much sense and is unlikely to speed up
your application without huge changes to Parquet.
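
A rough sketch of "more files in separate threads", assuming a thread
pool where each task owns exactly one writer and one output file. The
RecordWriter and WriterFactory interfaces are hypothetical stand-ins for
whatever opens a Parquet writer in the application; no Parquet API is
used here, and the part-N.parquet naming is only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFileWrites {
    /** Hypothetical stand-in for a single-threaded file writer. */
    interface RecordWriter extends AutoCloseable {
        void write(String record) throws Exception;
        @Override void close() throws Exception;
    }

    /** Hypothetical factory that opens one writer per output file. */
    interface WriterFactory {
        RecordWriter open(String fileName) throws Exception;
    }

    // Writes each partition to its own file on a pool thread and returns
    // the total number of records written. No writer is shared between
    // threads, so the writer itself needs no locking.
    static int writeAll(List<List<String>> partitions,
                        WriterFactory factory,
                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < partitions.size(); i++) {
                final List<String> partition = partitions.get(i);
                final String fileName = "part-" + i + ".parquet";
                results.add(pool.submit(() -> {
                    try (RecordWriter writer = factory.open(fileName)) {
                        for (String record : partition) {
                            writer.write(record);
                        }
                    }
                    return partition.size();
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // rethrows any failure from the task
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The parallelism lives entirely in the application: each writer is created,
used and closed by a single task.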

On Mon, Oct 21, 2019 at 12:14 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> I don't think multi-threading should be at the level of Parquet. But
> you could write on a different thread. For example, when one of the 1500
> writers is ready to write, you could do this on a different thread.
>
> Cheers, Fokko


-- 
Ryan Blue
Software Engineer
Netflix

Re: multi threading support

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
I don't think multi-threading should be at the level of Parquet. But
you could write on a different thread. For example, when one of the 1500
writers is ready to write, you could do this on a different thread.
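
One possible way to do the write "on a different thread" is sketched
below: each writer id is pinned to one single-threaded lane, so flushes
run off the caller's thread but never overlap for the same writer. The
WriterLanes class and the idea of passing the flush as a Runnable are
assumptions made for illustration; the actual flush call depends on the
application's writer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class WriterLanes {
    private final ExecutorService[] lanes;

    WriterLanes(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Runs the flush on a background lane. The same writerId always maps
    // to the same lane, and a lane runs its tasks one at a time in
    // submission order, so a writer is never used by two threads at once.
    Future<?> submitFlush(int writerId, Runnable flush) {
        return lanes[Math.floorMod(writerId, lanes.length)].submit(flush);
    }

    // Stops accepting work and waits for queued flushes to finish.
    void shutdownAndWait() throws InterruptedException {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        for (ExecutorService lane : lanes) {
            lane.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```

With 1500 writers and a handful of lanes, several writers share a lane,
so a slow flush can delay the others on that lane; the lane count is a
tuning choice.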

Cheers, Fokko

On Sat, Oct 19, 2019 at 12:16 PM Manik Singla <sm...@gmail.com> wrote:

> Thanks, Fokko, for the response and for correcting me on the way I
> addressed the list.
>
> We use Parquet through our internal framework, where schemas are usually
> dynamic. Because of that, we do some buffering to figure out the schema
> for the current writer.
> We open around 1500 writers at a time, but we cannot achieve the needed
> throughput at times when one particular schema accounts for most of the
> data.
> Though we can handle that by identifying such a schema and creating
> multiple writers for it, I was wondering whether we could increase
> throughput with multi-threaded support.
>
> For sure, implementing concurrent access would increase locking, but it
> would leave users carefree.

Re: multi threading support

Posted by Manik Singla <sm...@gmail.com>.
Thanks, Fokko, for the response and for correcting me on the way I
addressed the list.

We use Parquet through our internal framework, where schemas are usually
dynamic. Because of that, we do some buffering to figure out the schema
for the current writer.
We open around 1500 writers at a time, but we cannot achieve the needed
throughput at times when one particular schema accounts for most of the
data.
Though we can handle that by identifying such a schema and creating
multiple writers for it, I was wondering whether we could increase
throughput with multi-threaded support.

For sure, implementing concurrent access would increase locking, but it
would leave users carefree.
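
The application-level workaround mentioned above (several writers for a
schema identified as hot) could look roughly like this. It is only a
sketch: HotSchemaRouter, the string schema key and the per-schema writer
counts are hypothetical, and how a schema gets classified as hot is left
to the application.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class HotSchemaRouter {
    // Schema key -> number of writers opened for it. Schemas that are not
    // listed get a single writer (assumption for this sketch).
    private final Map<String, Integer> writerCounts;
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    HotSchemaRouter(Map<String, Integer> writerCounts) {
        this.writerCounts = writerCounts;
    }

    // Returns the 0-based writer slot that should take the next record of
    // this schema, spreading a hot schema round-robin over its writers.
    int nextSlot(String schemaKey) {
        int slots = writerCounts.getOrDefault(schemaKey, 1);
        if (slots <= 1) {
            return 0;
        }
        long n = counters
            .computeIfAbsent(schemaKey, k -> new AtomicLong())
            .getAndIncrement();
        return (int) (n % slots);
    }
}
```

Each slot would correspond to its own writer and output file, so the hot
schema ends up in several files instead of one.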



Regards
Manik Singla


On Thu, Oct 17, 2019 at 7:54 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Thank you for your question Manik,
>
> First of all, I think most of the people working on this project are guys,
> but I would not exclude any other gender.
>
> Secondly, Parquet is widely used in different open source projects such as
> Hive, Presto and Spark. These frameworks scale out by design. For example,
> Spark writes 200 files to the persistent store by default. I think
> multi-threading (or multi-processing) should be implemented at such a
> level. For example, you could write multiple Parquet files from your
> application. Having multiple threads writing to the same file would not
> make much sense to me. Please let me know your thoughts on how you see
> multi-threading within Parquet.
>
> Cheers, Fokko

Re: multi threading support

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thank you for your question Manik,

First of all, I think most of the people working on this project are guys,
but I would not exclude any other gender.

Secondly, Parquet is widely used in different open source projects such as
Hive, Presto and Spark. These frameworks scale out by design. For example,
Spark writes 200 files to the persistent store by default. I think
multi-threading (or multi-processing) should be implemented at such a
level. For example, you could write multiple Parquet files from your
application. Having multiple threads writing to the same file would not
make much sense to me. Please let me know your thoughts on how you see
multi-threading within Parquet.

Cheers, Fokko



On Tue, Oct 15, 2019 at 11:45 AM Manik Singla <sm...@gmail.com> wrote:

> Hi Guys
>
> I was looking for a task list or a list of blockers for supporting a
> multi-threaded writer (Java specifically).
> I did not find anything in JIRA or on the forums.
>
> Could someone point me to a doc or link, if one exists?