Posted to users@asterixdb.apache.org by Müller Ingo <in...@inf.ethz.ch> on 2021/08/09 17:05:36 UTC

Increasing degree of parallelism when reading Parquet files

Dear AsterixDB devs,

I am currently trying out the new support for Parquet files on S3 (still in the context of my High-energy Physics use case [1]). This works great so far and has generally decent performance. However, I realized that it does not use more than 16 cores, even though 96 logical cores are available and even though I run long-running queries (several minutes) on large data sets with a large number of files (I tried 128 files of 17GB each). Is this an arbitrary/artificial limitation that can be changed somehow (potentially with a small patch+recompiling) or is there more serious development required to lift it? FYI, I am currently using 03fd6d0f, which should include all S3/Parquet commits on master.

Cheers,
Ingo


[1] https://arxiv.org/abs/2104.12615
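For readers who want to reproduce the setup under discussion, here is a minimal sketch of an external collection over Parquet files in S3. The dataverse, type, dataset, bucket, and path names are placeholders chosen for illustration, and the exact adapter parameters can differ between AsterixDB versions, so the external-dataset documentation for your build is the authoritative reference:

    -- Sketch only: all names and credentials below are placeholders.
    DROP DATAVERSE hep IF EXISTS;
    CREATE DATAVERSE hep;
    USE hep;

    -- Parquet files carry their own schema, so an empty open type suffices.
    CREATE TYPE EventType AS { };

    CREATE EXTERNAL DATASET Events(EventType) USING S3 (
        ("accessKeyId"="<access-key-id>"),
        ("secretAccessKey"="<secret-access-key>"),
        ("region"="us-east-1"),
        ("container"="my-hep-bucket"),
        ("definition"="events/"),
        ("format"="parquet")
    );

    -- A long-running scan of the kind described above.
    SELECT COUNT(*) AS num_events FROM Events;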


RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Dear all,

Thanks a lot for the help over the last weeks. We have just published the updated version of our study on query languages and systems in the context of high-energy physics (HEP) here: https://arxiv.org/abs/2104.12615. In that version, AsterixDB and SQL++ are part of the study. In short, we concluded that, like JSONiq, SQL++ is a perfect fit for HEP. This isn't completely surprising, since the nested (but fully structured) data model from that domain is a subset of what both languages were originally designed for. In terms of performance, AsterixDB fares significantly better than our implementation of JSONiq (RumbleDB), but both are too slow to be useful in practice: between one and two orders of magnitude slower than what physicists use today.

We have made the complete set of scripts, query implementations, etc. publicly available here: https://github.com/RumbleDB/hep-iris-benchmark-scripts/. If anybody has any feedback on the study, the experimental set-up, or the query implementations, we'd be curious to hear it.

All the best,
Ingo


> -----Original Message-----
> From: Wail Alkowaileet <wa...@gmail.com>
> Sent: Wednesday, August 11, 2021 6:41 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Thanks for trying the Parquet connector. Your inputs were super valuable!
> Sure you can use the current change if it solves the problem.
> Please let us know if you have any questions/concerns.
>
> On Wed, Aug 11, 2021 at 1:24 AM Müller Ingo <ingo.mueller@inf.ethz.ch> wrote:
>
> Dear all,
>
> I have just tried out Wail's patch set from here:
> https://issues.apache.org/jira/browse/ASTERIXDB-2945. It seems to solve my
> problem fully: in the 96-vCPU instance with 48 I/O devices, I see about 4800%
> CPU utilization during query execution, and the queries run only marginally
> longer than if run against local files. Thanks a lot for the quick fix!
>
> Should I use this version for a full benchmark run or wait until the
> patch makes it to master?
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Wail Alkowaileet <wael.y.k@gmail.com>
> Sent: Tuesday, August 10, 2021 6:10 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Thanks Ingo for the detailed explanation and for benchmarking it! It is a great
> input for us. We will look at the issue and hopefully we can get it fixed before
> the end of the week.
>
> On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <ingo.mueller@inf.ethz.ch> wrote:
>
> Let me also say that I can still rerun the experiments for the (hopefully
> subsequent) camera-ready version if the problem takes longer to fix.
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Müller Ingo <ingo.mueller@inf.ethz.ch>
> Sent: Tuesday, August 10, 2021 5:34 PM
> To: users@asterixdb.apache.org
> Subject: RE: Increasing degree of parallelism when reading Parquet files
>
> Hey Mike!
>
> Thanks for confirming! I am happy to test any fixes that you may come up with.
> If it happens to be simple and is fixed before Friday, I can still include it in
> the revision I am currently working on ;) Otherwise, it'd be great to be able to
> mention a Jira issue or similar (maybe this mailing list thread is enough?) that
> I can refer to.
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Michael Carey <mjcarey@ics.uci.edu>
> Sent: Tuesday, August 10, 2021 4:36 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Got it!  It sounds like we indeed have a parallelism performance bug
> in the area of threading for S3, then.  Weird!  We'll look into it...
>
> Cheers,
>
> Mike
>
> On 8/9/21 11:21 PM, Müller Ingo wrote:
>
> Hey Mike,
>
> Just to clarify: "partitions" is the same thing as I/O devices, right? I have
> configured 48 of those via "[nc]\niodevices=..." and see the corresponding
> folders with content show up on the file system. When I vary the number of
> these devices, I see that all other storage formats change the degree of
> parallelism of my queries. That mechanism thus seems to work in general. It
> just doesn't seem to work for Parquet on S3. (I am not 100% sure if I tried
> other file formats on S3.)
>
> I have also tried to set compiler.parallelism to 4 for Parquet files on HDFS
> with a file:// path and did not see any effect, i.e., it used 48 threads, which
> corresponds to the number of I/O devices. However, with what Dmitry said, I
> guess that this is expected behavior and the flag should only influence the
> degree of parallelism after exchanges (which I don't have in my queries).
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Michael Carey <mjcarey@ics.uci.edu>
> Sent: Monday, August 9, 2021 10:10 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> Q: In your Parquet/S3 testing, what does your current cluster configuration
> look like?  (I.e., how many partitions have you configured it with - physical
> storage partitions, that is?)  Even though your S3 data isn't stored inside
> AsterixDB in this case, the system still uses that info to decide how many
> parallel threads to use at the base of its query plans.  (Obviously there is
> room for improvement on that behavior for use cases involving external
> storage. :-))
>
> Cheers,
>
> Mike
>
> On 8/9/21 12:28 PM, Müller Ingo wrote:
>
> Hi Dmitry,
>
> Thanks a lot for checking! Indeed, my queries do not have an exchange.
> However, the number of I/O devices has indeed worked well in many cases:
> when I tried the various VM instance sizes, I always created as many I/O
> devices as there were physical cores (i.e., half the number of logical CPUs).
> For internal storage as well as HDFS (both using the hdfs:// and the file://
> protocol), I saw the full system being utilized. However, just for the case of
> Parquet on S3, I cannot seem to make it use more than 16 cores.
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Dmitry Lychagin <dmitry.lychagin@couchbase.com>
> Sent: Monday, August 9, 2021 9:10 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Hi Ingo,
>
> I checked the code, and it seems that when scanning an external datasource
> we're using the same number of cores as there are configured storage
> partitions (I/O devices). Therefore, if you want 96 cores to be used when
> scanning Parquet files, then you need to configure 96 I/O devices.
>
> The compiler.parallelism setting is supposed to affect how many cores we use
> after the first EXCHANGE operator. However, if your query doesn't have any
> EXCHANGEs, then it'll use the number of cores assigned for the initial data
> scan operator (the number of I/O devices).
>
> Thanks,
> -- Dmitry
>
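To make Dmitry's suggestion concrete: the storage partitions come from the iodevices entry in the NC section of the cluster configuration, the same "[nc] iodevices=..." setting Ingo mentions above. A minimal sketch with placeholder paths, one scan thread per listed directory, could look like this:

    [nc]
    iodevices=/mnt/data0/asterix,/mnt/data1/asterix,/mnt/data2/asterix,/mnt/data3/asterix

Listing 96 directories there would give the 96-way scan Dmitry describes; with the ASTERIXDB-2945 change mentioned near the top of this thread, Ingo reports that the 48 configured devices are indeed fully used for Parquet-on-S3 scans.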
> On 8/9/21, 11:42 AM, "Müller Ingo" <ingo.mueller@inf.ethz.ch> wrote:
>
> Dear Dmitry,
>
> Thanks a lot for the quick reply! I had not thought of this. However, I have
> tried out both ways just now (per query and in the cluster configuration) and
> did not see any changes. Is there any way I can check that the setting was
> applied successfully? I have also tried setting compiler.parallelism to 4 and
> still observed 16 cores being utilized.
>
> Note that the observed degree of parallelism does not correspond to anything
> related to the data set (I tried every power of two between 1 and 128 files)
> or the cluster (I tried every power of two between 2 and 64 cores, as well as
> 48 and 96), and I always see 16 cores being used (or fewer, if the system has
> fewer). To me, this makes it unlikely that the system really uses the
> semantics for p=0 or p<0; it looks more like some hard-coded value.
>
> Cheers,
> Ingo
>
> -----Original Message-----
> From: Dmitry Lychagin <dmitry.lychagin@couchbase.com>
> Sent: Monday, August 9, 2021 7:25 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
>
> Ingo,
>
> We have a `compiler.parallelism` parameter that controls how many cores are
> used for query execution.
>
> See
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>
> You can either set it per query (e.g. SET `compiler.parallelism` "-1";),
> or globally in the cluster configuration:
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
>
> Thanks,
> -- Dmitry
>
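As a concrete illustration of the per-query form Dmitry shows above, a request can be prefixed like this (the dataset name is a placeholder; the exact semantics of the parameter values are described in the manual section linked above):

    -- Per-query setting, as in Dmitry's example; Events is a hypothetical dataset.
    SET `compiler.parallelism` "-1";

    SELECT COUNT(*) AS num_events
    FROM Events;

The global form goes into the cluster configuration file, as in the cc2.conf example linked above. Note, though, that per Dmitry's other reply earlier in this thread the setting only takes effect after the first EXCHANGE operator, so it does not change the parallelism of the initial scan itself.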
> From: Müller Ingo <ingo.mueller@inf.ethz.ch>
> Reply-To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
> Date: Monday, August 9, 2021 at 10:05 AM
> To: "users@asterixdb.apache.org" <users@asterixdb.apache.org>
> Subject: Increasing degree of parallelism when reading Parquet files
>
> Dear AsterixDB devs,
>
> I am currently trying out the new support for Parquet files on S3 (still in
> the context of my High-energy Physics use case [1]). This works great so far
> and has generally decent performance. However, I realized that it does not use
> more than 16 cores, even though 96 logical cores are available and even though
> I run long-running queries (several minutes) on large data sets with a large
> number of files (I tried 128 files of 17GB each). Is this an
> arbitrary/artificial limitation that can be changed somehow (potentially with
> a small patch+recompiling) or is there more serious development required to
> lift it? FYI, I am currently using 03fd6d0f, which should include all
> S3/Parquet commits on master.
>
> Cheers,
> Ingo
>
> [1] https://arxiv.org/abs/2104.12615
>
> --
> Regards,
> Wail Alkowaileet


Re: Increasing degree of parallelism when reading Parquet files

Posted by Wail Alkowaileet <wa...@gmail.com>.
Ingo,

Thanks for trying the Parquet connector. Your inputs were super valuable!
Sure you can use the current change if it solves the problem.
Please let us know if you have any questions/concerns.


--
Regards,
Wail Alkowaileet

RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Dear all,

I have just tried out Wail's patch set from here: https://issues.apache.org/jira/browse/ASTERIXDB-2945. It seems to solve my problem fully: in the 96-vCPU instance with 48 I/O devices, I see about 4800% CPU utilization during query execution, and the queries run only marginally longer than if run against local files. Thanks a lot for the quick fix!

Should I use this version for a full benchmark run or wait until the patch makes it to master?

Cheers,
Ingo
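As a quick sanity check on these numbers: 48 I/O devices translate into 48 parallel scan threads, and per-process CPU monitors report roughly 100% per busy core, so 48 × 100% = 4800%; the reported utilization therefore indicates that all 48 configured partitions are busy during the scan.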


> 	> >             m_para
> 	> >                                 >
> 	> >
> 	>
> <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_para>
> 	> >
> 	> >
> <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelis
> 	> > m_para>
> 	> >
> 	>
> <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_para>
> 	> >             meter>
> 	> >                                 >
> 	> >                                 > You can either set it per query (e.g. SET
> 	> >             `compiler.parallelism` "-1";) ,
> 	> >                                 >
> 	> >                                 > or globally in the cluster configuration:
> 	> >                                 >
> 	> >                                 >
> 	> >
> 	> >
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-
> 	> >                                 > app/src/main/resources/cc2.conf#L57
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > Thanks,
> 	> >                                 >
> 	> >                                 > -- Dmitry
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > From: Müller Ingo
> 	> > <ingo.mueller@inf.ethz.ch <ma...@inf.ethz.ch> >
> <mailto:ingo.mueller@inf.ethz.ch <ma...@inf.ethz.ch> >
> 	> >             <mailto:ingo.mueller@inf.ethz.ch
> <ma...@inf.ethz.ch> >
> 	> > <mailto:ingo.mueller@inf.ethz.ch
> <ma...@inf.ethz.ch> >
> 	> >                                 > Reply-To: "users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> "
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >             <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >   <users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >             <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >                                 > Date: Monday, August 9, 2021 at 10:05 AM
> 	> >                                 > To: "users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> "
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >             <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >   <users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >             <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> > <mailto:users@asterixdb.apache.org
> <ma...@asterixdb.apache.org> >
> 	> >                                 > Subject: Increasing degree of parallelism
> 	> > when reading
> 	> >             Parquet files
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >  EXTERNAL EMAIL:  Use caution when
> 	> > opening attachments
> 	> >             or clicking on
> 	> >                             links
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > Dear AsterixDB devs,
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > I am currently trying out the new support
> 	> > for Parquet files
> 	> >             on S3 (still in the
> 	> >                                 > context of my High-energy Physics use case
> 	> > [1]). This works
> 	> >             great so far and
> 	> >                             has
> 	> >                                 > generally decent performance. However, I
> 	> > realized that it
> 	> >             does not use more
> 	> >                                 > than 16 cores, even though 96 logical cores
> 	> > are available
> 	> >             and even though I
> 	> >                             run
> 	> >                                 > long-running queries (several minutes) on
> 	> > large data sets
> 	> >             with a large
> 	> >                             number of
> 	> >                                 > files (I tried 128 files of 17GB each). Is this
> 	> > an
> 	> >             arbitrary/artificial limitation
> 	> >                             that
> 	> >                                 > can be changed somehow (potentially with
> 	> > a small
> 	> >             patch+recompiling) or is
> 	> >                                 > there more serious development required
> 	> > to lift it? FYI, I am
> 	> >             currently using
> 	> >                                 > 03fd6d0f, which should include all
> 	> > S3/Parquet commits on
> 	> >             master.
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > Cheers,
> 	> >                                 >
> 	> >                                 > Ingo
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 >
> 	> >                                 > [1] https://arxiv.org/abs/2104.12615
> 	> >                                 >
> 	> >                                 >
> 	> >
> 	> >
> 	> >
> 	> >
> 	> >
> 
> 
> 
> 
> 
> --
> 
> 
> Regards,
> Wail Alkowaileet


Re: Increasing degree of parallelism when reading Parquet files

Posted by Wail Alkowaileet <wa...@gmail.com>.
Thanks Ingo for the detailed explanation and for benchmarking it! It is great
input for us. We will look at the issue and hopefully we can get it fixed
before the end of the week.

On Tue, Aug 10, 2021 at 8:42 AM Müller Ingo <in...@inf.ethz.ch>
wrote:

> Let me also say that I can still rerun the experiments for the (hopefully
> subsequent) camera-ready version if the problem takes longer to fix.
>
> Cheers,
> Ingo
>
>

-- 

*Regards,*
Wail Alkowaileet

RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Let me also say that I can still rerun the experiments for the (hopefully subsequent) camera-ready version if the problem takes longer to fix.

Cheers,
Ingo


> -----Original Message-----
> From: Müller Ingo <in...@inf.ethz.ch>
> Sent: Tuesday, August 10, 2021 5:34 PM
> To: users@asterixdb.apache.org
> Subject: RE: Increasing degree of parallelism when reading Parquet files
> 
> Hey Mike!
> 
> Thanks for confirming! I am happy to test any fixes that you may come up with.
> If the fix happens to be simple and is ready before Friday, I can still include it in the
> revision I am currently working on ;) Otherwise, it'd be great to be able to
> mention a Jira issue or similar (maybe this mailing list thread is enough?) that I
> can refer to.
> 
> Cheers,
> Ingo
> 
> 


RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Hey Mike!

Thanks for confirming! I am happy to test any fixes that you may come up with. If the fix happens to be simple and is ready before Friday, I can still include it in the revision I am currently working on ;) Otherwise, it'd be great to be able to mention a Jira issue or similar (maybe this mailing list thread is enough?) that I can refer to.

Cheers,
Ingo
 

> -----Original Message-----
> From: Michael Carey <mj...@ics.uci.edu>
> Sent: Tuesday, August 10, 2021 4:36 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Ingo,
> 
> Got it!  It sounds like we indeed have a parallelism performance bug in the area
> of threading for S3, then.  Weird!  We'll look into it...
> 
> 
> Cheers,
> 
> Mike
> 
> 


Re: Increasing degree of parallelism when reading Parquet files

Posted by Michael Carey <mj...@ics.uci.edu>.
Ingo,

Got it!  It sounds like we indeed have a parallelism performance bug in 
the area of threading for S3, then.  Weird!  We'll look into it...

Cheers,

Mike

On 8/9/21 11:21 PM, Müller Ingo wrote:
> Hey Mike,
>
> Just to clarify: "partitions" is the same thing as I/O devices, right? I have configured 48 of those via "[nc]\niodevices=..." and see the corresponding folders with content show up on the file system. When I vary the number of these devices, I see that the degree of parallelism of my queries changes accordingly for all other storage formats. That mechanism thus seems to work in general. It just doesn't seem to work for Parquet on S3. (I am not 100% sure if I tried other file formats on S3.)
>
> I have also tried to set compiler.parallelism to 4 for Parquet files on HDFS with a file:// path and did not see any effect, i.e., it used 48 threads, which corresponds to the number of I/O devices. However, given what Dmitry said, I guess that this is expected behavior and the flag should only influence the degree of parallelism after exchanges (which I don't have in my queries).
>
> Cheers,
> Ingo
>
>
>> -----Original Message-----
>> From: Michael Carey <mj...@ics.uci.edu>
>> Sent: Monday, August 9, 2021 10:10 PM
>> To: users@asterixdb.apache.org
>> Subject: Re: Increasing degree of parallelism when reading Parquet files
>>
>> Ingo,
>>
>> Q: In your Parquet/S3 testing, what does your current cluster configuration look
>> like?  (I.e., how many partitions have you configured it with - physical storage
>> partitions that is?)  Even though your S3 data isn't stored inside AsterixDB in this
>> case, the system still uses that info to decide how many parallel threads to use
>> at the base of its query plans.  (Obviously there is room for improvement on that
>> behavior for use cases involving external storage. :-))
>>
>>
>> Cheers,
>>
>> Mike
>>
>>
>> On 8/9/21 12:28 PM, Müller Ingo wrote:
>>
>>
>> 	Hi Dmitry,
>>
>> 	Thanks a lot for checking! Indeed, my queries do not have an exchange.
>> However, the number of I/O devices has indeed worked well in many cases:
>> when I tried the various VM instance sizes, I always created as many I/O devices
>> as there were physical cores (i.e., half the number of logical CPUs). For internal
>> storage as well as HDFS (both using the hdfs:// and the file:// protocol), I saw
>> the full system being utilized. However, just for the case of Parquet on S3, I
>> cannot seem to make it use more than 16 cores.
>>
>> 	Cheers,
>> 	Ingo
>>
>>
>>
>> 		-----Original Message-----
>> 		From: Dmitry Lychagin <dm...@couchbase.com>
>> 		Sent: Monday, August 9, 2021 9:10 PM
>> 		To: users@asterixdb.apache.org
>> 		Subject: Re: Increasing degree of parallelism when reading
>> Parquet files
>>
>> 		Hi Ingo,
>>
>> 		I checked the code and it seems that when scanning external
>> datasource we're
>> 		using the same number of cores as there are configured storage
>> partitions (I/O
>> 		devices).
>> 		Therefore, if you want 96 cores to be used when scanning
>> Parquet files then you
>> 		need to configure 96 I/O devices.
>>
>> 		Compiler.parallelism setting is supposed to affect how many
>> cores we use after
>> 		the first EXCHANGE operator. However, if your query doesn't
>> have any
>> 		EXCHANGEs then it'll use the number of cores assigned for the
>> initial data scan
>> 		operator (number of I/O devices)
>>
>> 		Thanks,
>> 		-- Dmitry
>>
>>
>> 		On 8/9/21, 11:42 AM, "Müller  Ingo"
>> <in...@inf.ethz.ch>  wrote:
>>
>> 		     EXTERNAL EMAIL:  Use caution when opening attachments
>>
>>
>>
>>
>> 		    Dear Dmitry,
>>
>> 		    Thanks a lot for the quick reply! I had not thought of this.
>> However, I have tried
>> 		out both ways just now (per query and in the cluster
>> configuration) and did not
>> 		see any changes. Is there any way I can control that the setting
>> was applied
>> 		successfully? I have also tried setting compiler.parallelism to 4
>> and still observed
>> 		16 cores being utilized.
>>
>> 		    Note that the observed degree of parallelism does not
>> correspond to anything
>> 		related to the data set (I tried with every power of two files
>> between 1 and 128)
>> 		or the cluster (I tried with every power of two cores between 2
>> and 64, as well
>> 		as 48 and 96) and I always see 16 cores being used (or fewer, if
>> the system has
>> 		fewer). To me, this makes it unlikely that the system really uses
>> the semantics
>> 		for p=0 or p<0, but looks more like some hard-coded value.
>>
>> 		    Cheers,
>> 		    Ingo
>>
>>
>> 		    > -----Original Message-----
>> 		    > From: Dmitry Lychagin <dm...@couchbase.com>
>> 		    > Sent: Monday, August 9, 2021 7:25 PM
>> 		    > To: users@asterixdb.apache.org
>> 		    > Subject: Re: Increasing degree of parallelism when reading
>> Parquet files
>> 		    >
>> 		    > Ingo,
>> 		    >
>> 		    >
>> 		    >
>> 		    > We have `compiler.parallelism` parameter that controls
>> how many cores are
>> 		    > used for query execution.
>> 		    >
>> 		    > See
>> 		    >
>> 		    > https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
>> 		    >
>> 		    > You can either set it per query (e.g. SET
>> `compiler.parallelism` "-1";) ,
>> 		    >
>> 		    > or globally in the cluster configuration:
>> 		    >
>> 		    >
>> 		    > https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
>> 		    >
>> 		    >
>> 		    >
>> 		    > Thanks,
>> 		    >
>> 		    > -- Dmitry
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    > From: Müller Ingo <in...@inf.ethz.ch>
>> 		    > Reply-To: "users@asterixdb.apache.org"
>> 		    > Date: Monday, August 9, 2021 at 10:05 AM
>> 		    > To: "users@asterixdb.apache.org"
>> 		    > Subject: Increasing degree of parallelism when reading
>> Parquet files
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    > Dear AsterixDB devs,
>> 		    >
>> 		    >
>> 		    >
>> 		    > I am currently trying out the new support for Parquet files
>> on S3 (still in the
>> 		    > context of my High-energy Physics use case [1]). This works
>> great so far and
>> 		has
>> 		    > generally decent performance. However, I realized that it
>> does not use more
>> 		    > than 16 cores, even though 96 logical cores are available
>> and even though I
>> 		run
>> 		    > long-running queries (several minutes) on large data sets
>> with a large
>> 		number of
>> 		    > files (I tried 128 files of 17GB each). Is this an
>> arbitrary/artificial limitation
>> 		that
>> 		    > can be changed somehow (potentially with a small
>> patch+recompiling) or is
>> 		    > there more serious development required to lift it? FYI, I am
>> currently using
>> 		    > 03fd6d0f, which should include all S3/Parquet commits on
>> master.
>> 		    >
>> 		    >
>> 		    >
>> 		    > Cheers,
>> 		    >
>> 		    > Ingo
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    >
>> 		    > [1] https://arxiv.org/abs/2104.12615
>> 		    >
>> 		    >
>>
>>
>>

RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Hey Mike,

Just to clarify: "partitions" is the same thing as I/O devices, right? I have configured 48 of those via "[nc]\niodevices=..." and see the corresponding folders with content show up on the file system. When I vary the number of these devices, I see that the degree of parallelism of my queries changes accordingly for all other storage formats. That mechanism thus seems to work in general. It just doesn't seem to work for Parquet on S3. (I am not 100% sure if I tried other file formats on S3.)
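
For concreteness, the kind of NC section I am talking about looks roughly like the sketch below (the mount points are placeholders here, not my actual paths); as Dmitry explained, the degree of parallelism for scanning external data currently follows the number of comma-separated entries in this list:

    [nc]
    iodevices=/mnt/vol00/asterixdb,/mnt/vol01/asterixdb,/mnt/vol02/asterixdb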

I have also tried to set compiler.parallelism to 4 for Parquet files on HDFS with a file:// path and did not see any effect, i.e., it used 48 threads, which corresponds to the number of I/O devices. However, given what Dmitry said, I guess that this is expected behavior and the flag should only influence the degree of parallelism after exchanges (which I don't have in my queries).
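
For the record, the per-query form I tried is the one Dmitry pointed to, roughly as sketched below (the dataset name is just a placeholder for my external Parquet dataset):

    SET `compiler.parallelism` "4";
    SELECT COUNT(*) FROM ParquetEvents;

As discussed, that setting should only kick in after the first EXCHANGE operator, so it leaves the parallelism of the initial scan untouched.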

Cheers,
Ingo


> -----Original Message-----
> From: Michael Carey <mj...@ics.uci.edu>
> Sent: Monday, August 9, 2021 10:10 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Ingo,
> 
> Q: In your Parquet/S3 testing, what does your current cluster configuration look
> like?  (I.e., how many partitions have you configured it with - physical storage
> partitions that is?)  Even though your S3 data isn't stored inside AsterixDB in this
> case, the system still uses that info to decide how many parallel threads to use
> at the base of its query plans.  (Obviously there is room for improvement on that
> behavior for use cases involving external storage. :-))
> 
> 
> Cheers,
> 
> Mike
> 
> 


Re: Increasing degree of parallelism when reading Parquet files

Posted by Michael Carey <mj...@ics.uci.edu>.
Ingo,

Q: In your Parquet/S3 testing, what does your current cluster 
configuration look like?  (I.e., how many partitions have you configured 
it with - physical storage partitions that is?)  Even though your S3 
data isn't stored inside AsterixDB in this case, the system still uses 
that info to decide how many parallel threads to use at the base of its 
query plans.  (Obviously there is room for improvement on that behavior 
for use cases involving external storage. :-))

Cheers,

Mike

On 8/9/21 12:28 PM, Müller Ingo wrote:
> Hi Dmitry,
>
> Thanks a lot for checking! Indeed, my queries do not have an exchange. However, the number of I/O devices has indeed worked well in many cases: when I tried the various VM instance sizes, I always created as many I/O devices as there were physical cores (i.e., half the number of logical CPUs). For internal storage as well as HDFS (both using the hdfs:// and the file:// protocol), I saw the full system being utilized. However, just for the case of Parquet on S3, I cannot seem to make it use more than 16 cores.
>
> Cheers,
> Ingo
>

RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Hey Wail,

Thanks a lot for helping! I am reading from AWS S3 from an EC2 m5d.24xlarge instance. I am pretty sure that S3 is not the problem: First, others have measured [1] up to 2.7 GB/s from S3. When I measure the network bandwidth of AsterixDB, I see on the order of 600 MB/s. (The 60-80 MB/s you mention are typical per connection, but one application can get much higher bandwidth by using multiple connections.) Indeed, with other systems in the comparison, I can read from S3 *much* faster than with AsterixDB; my current understanding is that AsterixDB is compute-bound. Also, when I scale up the instance size, I get almost perfectly linear speed-up until exactly the point where I use 16 cores (and no speed-up after that) -- this is unlikely to happen if the network were getting saturated, where you would see some slow-down before the saturation point. Finally, I see 16 threads using 100% of one core each and all other threads being completely idle -- this doesn't look like 48 threads waiting for the network either.

I have a few other things to try, then I'll report back.

Cheers,
Ingo


[1] https://github.com/dvassallo/s3-benchmark

> -----Original Message-----
> From: Wail Alkowaileet <wa...@gmail.com>
> Sent: Monday, August 9, 2021 10:04 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Hi Ingo,
> 
> Were you reading from an actual S3 bucket? or was it a local S3 mock server?
> The reason I ask is because reading from a remote bucket is slow (the fastest I
> have seen was ~60MB/s). If your HDFS server(s) are backed by NVMe drives,
> then the read speed could be in GBs/s. For the remote S3 bucket case, other
> cores would be idle as they will be waiting for the data to arrive.
> 
> --
> 
> 
> Regards,
> Wail Alkowaileet


Re: Increasing degree of parallelism when reading Parquet files

Posted by Wail Alkowaileet <wa...@gmail.com>.
Hi Ingo,

Were you reading from an actual S3 bucket, or was it a local S3 mock
server? The reason I ask is that reading from a remote bucket is slow
(the fastest I have seen was ~60 MB/s). If your HDFS server(s) are backed by
NVMe drives, then the read speed could be in the GB/s range. In the remote S3
bucket case, the other cores would be idle as they wait for the data
to arrive.


On Mon, Aug 9, 2021 at 12:28 PM Müller Ingo <in...@inf.ethz.ch>
wrote:

> Hi Dmitry,
>
> Thanks a lot for checking! Indeed, my queries do not have an exchange.
> However, the number of I/O devices has indeed worked well in many cases:
> when I tried the various VM instance sizes, I always created as many I/O
> devices as there were physical cores (i.e., half the number of logical
> CPUs). For internal storage as well as HDFS (both using the hdfs:// and the
> file:// protocol), I saw the full system being utilized. However, just for
> the case of Parquet on S3, I cannot seem to make it use more than 16 cores.
>
> Cheers,
> Ingo
>

-- 

*Regards,*
Wail Alkowaileet

RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Hi Dmitry,

Thanks a lot for checking! Indeed, my queries do not have an exchange. However, the number of I/O devices has worked well in many cases: when I tried the various VM instance sizes, I always created as many I/O devices as there were physical cores (i.e., half the number of logical CPUs). For internal storage as well as HDFS (using both the hdfs:// and the file:// protocol), I saw the full system being utilized. However, just for the case of Parquet on S3, I cannot seem to make it use more than 16 cores.

Cheers,
Ingo


> -----Original Message-----
> From: Dmitry Lychagin <dm...@couchbase.com>
> Sent: Monday, August 9, 2021 9:10 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Hi Ingo,
> 
> I checked the code and it seems that when scanning external datasource we're
> using the same number of cores as there are configured storage partitions (I/O
> devices).
> Therefore, if you want 96 cores to be used when scanning Parquet files then you
> need to configure 96 I/O devices.
> 
> Compiler.parallelism setting is supposed to affect how many cores we use after
> the first EXCHANGE operator. However, if your query doesn't have any
> EXCHANGEs then it'll use the number of cores assigned for the initial data scan
> operator (number of I/O devices)
> 
> Thanks,
> -- Dmitry
> 


Re: Increasing degree of parallelism when reading Parquet files

Posted by Dmitry Lychagin <dm...@couchbase.com>.
Hi Ingo,

I checked the code, and it seems that when scanning an external datasource we use the same number of cores as there are configured storage partitions (I/O devices).
Therefore, if you want 96 cores to be used when scanning Parquet files then you need to configure 96 I/O devices.

The compiler.parallelism setting is supposed to affect how many cores we use after the first EXCHANGE operator. However, if your query doesn't have any EXCHANGEs, then it'll use the number of cores assigned to the initial data scan operator (the number of I/O devices).
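
To make that concrete, here is a small illustration (dataset and field names are made up): a plain filter over an external dataset runs entirely at the parallelism of the scan, while a GROUP BY introduces an EXCHANGE, and only the operators after that exchange are governed by compiler.parallelism.

    -- executes at scan parallelism, i.e. the number of I/O devices
    SELECT VALUE e FROM ExternalParquet e WHERE e.someField > 0;

    -- the GROUP BY adds an EXCHANGE; compiler.parallelism applies downstream of it
    SELECT e.someKey, COUNT(*) AS cnt
    FROM ExternalParquet e
    GROUP BY e.someKey;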

Thanks,
-- Dmitry
 

On 8/9/21, 11:42 AM, "Müller  Ingo" <in...@inf.ethz.ch> wrote:


    Dear Dmitry,

    Thanks a lot for the quick reply! I had not though of this. However, I have tried out both ways just now (per query and in the cluster configuration) and did not see any changes. Is there any way I can control that the setting was applied successfully? I have also tried setting compiler.parallelism to 4 and still observed 16 cores being utilized.

    Note that the observed degree of parallelism does not correspond to anything related to the data set (I tried with every power of two files between 1 and 128) or the cluster (I tried with every power of two cores between 2 and 64, as well as 48 and 96) and I always see 16 cores being used (or fewer, if the system has fewer). To me, this makes it unlikely that the system really uses the semantics for p=0 or p<0, but looks more like some hard-coded value.

    Cheers,
    Ingo




RE: Increasing degree of parallelism when reading Parquet files

Posted by Müller Ingo <in...@inf.ethz.ch>.
Dear Dmitry,

Thanks a lot for the quick reply! I had not thought of this. However, I have tried out both ways just now (per query and in the cluster configuration) and did not see any changes. Is there any way I can verify that the setting was applied successfully? I have also tried setting compiler.parallelism to 4 and still observed 16 cores being utilized.

Note that the observed degree of parallelism does not correspond to anything related to the data set (I tried every power-of-two number of files between 1 and 128) or the cluster (I tried every power-of-two number of cores between 2 and 64, as well as 48 and 96): I always see 16 cores being used (or fewer, if the system has fewer). To me, this makes it unlikely that the system really uses the semantics for p=0 or p<0; it looks more like some hard-coded value.

Cheers,
Ingo


> -----Original Message-----
> From: Dmitry Lychagin <dm...@couchbase.com>
> Sent: Monday, August 9, 2021 7:25 PM
> To: users@asterixdb.apache.org
> Subject: Re: Increasing degree of parallelism when reading Parquet files
> 
> Ingo,
> 
> 
> 
> We have `compiler.parallelism` parameter that controls how many cores are
> used for query execution.
> 
> See
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_param
> eter
> <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_para
> meter>
> 
> You can either set it per query (e.g. SET `compiler.parallelism` "-1";) ,
> 
> or globally in the cluster configuration:
> 
> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-
> app/src/main/resources/cc2.conf#L57
> 
> 
> 
> Thanks,
> 
> -- Dmitry


Re: Increasing degree of parallelism when reading Parquet files

Posted by Dmitry Lychagin <dm...@couchbase.com>.
Ingo,

We have a `compiler.parallelism` parameter that controls how many cores are used for query execution.
See https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Parallelism_parameter
You can either set it per query (e.g., SET `compiler.parallelism` "-1";),
or globally in the cluster configuration:
https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/main/resources/cc2.conf#L57
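
For the global route, a minimal sketch of what that would look like in the config file (I'm assuming the [common] section here, as in the sample config):

    [common]
    compiler.parallelism=-1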

Thanks,
-- Dmitry


From: Müller Ingo <in...@inf.ethz.ch>
Reply-To: "users@asterixdb.apache.org" <us...@asterixdb.apache.org>
Date: Monday, August 9, 2021 at 10:05 AM
To: "users@asterixdb.apache.org" <us...@asterixdb.apache.org>
Subject: Increasing degree of parallelism when reading Parquet files


Dear AsterixDB devs,

I am currently trying out the new support for Parquet files on S3 (still in the context of my High-energy Physics use case [1]). This works great so far and has generally decent performance. However, I realized that it does not use more than 16 cores, even though 96 logical cores are available and even though I run long-running queries (several minutes) on large data sets with a large number of files (I tried 128 files of 17GB each). Is this an arbitrary/artificial limitation that can be changed somehow (potentially with a small patch+recompiling) or is there more serious development required to lift it? FYI, I am currently using 03fd6d0f, which should include all S3/Parquet commits on master.

Cheers,
Ingo


[1] https://arxiv.org/abs/2104.12615