Posted to user@drill.apache.org by John Omernik <jo...@omernik.com> on 2016/07/05 15:39:25 UTC

Initial Feedback on 1.7.0 Release

Working with 1.7.0, the feature I was most interested in was the fix for
metadata caching while using user impersonation.

I have a large table with day directories that can each contain up to 1000
Parquet files.


Planning was getting terrible on this table as I added new data, and the
metadata cache wasn't an option for me because of impersonation.

Well, now with 1.7.0 that's working, and it makes a HUGE difference. A
query that would take 120 seconds now takes 20 seconds.

Overall, this is a great feature and folks should look into it for
performance of large Parquet tables.

Some observations that I would love some help with.

1. Drill seems to know when a new subdirectory is added, and it generates
the metadata for that directory without another REFRESH TABLE METADATA
command. That works great for new directories; however, what happens if
you just copy new files into an existing directory? Will it use the
metadata cache that only lists the old files, or will things get updated?
I.e., how does it know things are in sync?

2. Pertaining to point 1, when new data was added, the first query that
touched that directory partition seemed to write the metadata file.
However, the second query ALSO rewrote the file (and ran at the speed of
an uncached directory). Only the third query ran at cached speeds (the 20
seconds vs. 120 seconds). This seems odd, but maybe there is a reason?

3. Is Drill OK with me running REFRESH TABLE METADATA for only a
subdirectory? So if I load a day, can I issue REFRESH TABLE METADATA
`mytable/2016-07-04` and have everything be in a state where Drill is
happy? I.e., does the mytable metadata need to be updated as well, or is
that wasted cycles?

4. Discussion: perhaps we could compress the metadata file? Each day (for
me) has 8.2 MB of metadata, and the file at the root of my table is 332
MB. Just using standard gzip/gunzip I got the 332 MB file down to 11 MB.
That seems like an improvement; however, not knowing how this file is
used/updated, compression may add lag.

5. Any other thoughts/suggestions?
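As a rough illustration of the gzip ratio mentioned in point 4, here is a
small sketch compressing JSON with the kind of repetitive structure a
Parquet metadata cache has (file contents and names here are hypothetical,
not Drill's actual cache format):

```python
import gzip
import json
import os
import tempfile

# Build a JSON blob that mimics a metadata cache: many entries sharing
# key names and long common path prefixes. (Hypothetical content -- the
# real cache file layout is Drill's own.)
entries = [
    {"path": f"/data/mytable/2016-07-04/part-{i:05d}.parquet",
     "rowCount": 100000, "columns": ["ts", "user", "event"]}
    for i in range(5000)
]
raw = json.dumps(entries).encode("utf-8")

with tempfile.TemporaryDirectory() as d:
    gz_path = os.path.join(d, "metadata.json.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(raw)
    compressed = os.path.getsize(gz_path)

# Highly repetitive JSON typically compresses by an order of magnitude,
# consistent with the 332 MB -> 11 MB observation above.
print(f"raw: {len(raw)} bytes, gzipped: {compressed} bytes")
```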

Re: Initial Feedback on 1.7.0 Release

Posted by Parth Chandra <pc...@maprtech.com>.
You might have run the two queries while the cache was still being built.
There is no concurrency control for the metadata cache at the moment (one
of the many improvements we need to make).
For metadata caching, the best practice with the current implementation is
to run a manual REFRESH TABLE METADATA command at the top-level directory
after adding any data.




Re: Initial Feedback on 1.7.0 Release

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Actually, I slightly misunderstood your 2nd question: so you made some
changes to a subfolder, then ran query A, which caused the cache to
refresh; then you ran another query B that also caused the cache to
refresh; then finally query C actually seemed to use the cache as is.

Is my understanding now correct? Are queries A and B exactly the same or
different?




-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Initial Feedback on 1.7.0 Release

Posted by rahul challapalli <ch...@gmail.com>.
John,

Once you add/update data in one of your sub-folders, the immediate next
query should update the metadata cache automatically, and all subsequent
queries should fetch metadata from the cache. If this is not the case,
it's a bug. Can you confirm your findings?

- Rahul


Re: Initial Feedback on 1.7.0 Release

Posted by John Omernik <jo...@omernik.com>.
Hey Abdel, thanks for the response. On questions 1 and 2: as far as I
understood, nothing was changed between queries, yet it took a third query
for the cache to take effect. I'll keep observing to determine what may be
happening.

On 3, a logical place to implement, or start implementing, incremental
refresh may be allowing a directory's refresh to automatically update the
parent's data without causing a cascading (update everything) refresh. So
if I have a structure like this:

mytable
...dir0=2016-06-06
.......dir1=23

(basically table, days, hours)

then if I update data in hour 23, it would update 2016-06-06 with the new
timestamps and update mytable with the new timestamps. The only issue
would be figuring out a way to take a lock. (Say you had multiple loads
happening; you want to ensure that one day's updates don't clobber
another's.)

Just a thought on that.
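To make the locking idea concrete, the simplest thing I can picture is an
advisory file lock held while a child's timestamps are propagated into the
parent's metadata. This is purely a hypothetical sketch (Drill has no such
lock today, the `.metadata.lock` name is made up, and flock is advisory
and POSIX-only, so whether it would behave on a distributed filesystem is
another question entirely):

```python
import fcntl
import os

def update_parent_metadata(parent_dir, updater):
    """Propagate a child's refreshed timestamps into the parent's
    metadata, holding an exclusive advisory lock so two concurrent loads
    (e.g. two different days finishing at once) don't clobber each other.
    Hypothetical sketch only -- not how Drill works today."""
    lock_path = os.path.join(parent_dir, ".metadata.lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we own it
        try:
            updater(parent_dir)  # safe: we hold the lock while updating
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```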

Yep, the incremental issue would come into play here.  Are there any design
docs or JIRAs on the incremental updates to metadata?

Thanks for your reply. I am looking forward to other devs' thoughts on
your answer to 3 as well.

Thanks!

John



Re: Initial Feedback on 1.7.0 Release

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
answers inline.

On Tue, Jul 5, 2016 at 8:39 AM, John Omernik <jo...@omernik.com> wrote:

> Working with 1.7.0, the feature I was most interested in was the fix for
> metadata caching while using user impersonation.
>
> I have a large table with day directories that can each contain up to
> 1000 Parquet files.
>
>
> Planning was getting terrible on this table as I added new data, and the
> metadata cache wasn't an option for me because of impersonation.
>
> Well, now with 1.7.0 that's working, and it makes a HUGE difference. A
> query that would take 120 seconds now takes 20 seconds.
>
> Overall, this is a great feature and folks should look into it for
> performance of large Parquet tables.
>
> Some observations that I would love some help with.
>
> 1. Drill seems to know when a new subdirectory is added, and it
> generates the metadata for that directory without another REFRESH TABLE
> METADATA command. That works great for new directories; however, what
> happens if you just copy new files into an existing directory? Will it
> use the metadata cache that only lists the old files, or will things get
> updated? I.e., how does it know things are in sync?
>

When you query folder A, which contains a metadata cache, Drill will check
all its subdirectories' last modification times to figure out if anything
changed since the last time the metadata cache was refreshed. If data was
added/removed, Drill will refresh the metadata cache for folder A.
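In pseudocode form, that staleness check amounts to something like the
following (a simplified sketch, not Drill's actual implementation; the
cache file name is an assumption on my part):

```python
import os

def cache_is_stale(table_dir, cache_name=".drill.parquet_metadata"):
    """Return True if any subdirectory of table_dir was modified after
    the metadata cache file was written. Simplified sketch -- Drill's
    real check is more involved than this."""
    cache_path = os.path.join(table_dir, cache_name)
    if not os.path.exists(cache_path):
        return True  # no cache yet -> a (full) refresh is needed
    cache_mtime = os.path.getmtime(cache_path)
    # Compare the cache's timestamp against every directory under the
    # table; any newer directory means data was added or removed there.
    for root, dirs, _files in os.walk(table_dir):
        for d in dirs:
            if os.path.getmtime(os.path.join(root, d)) > cache_mtime:
                return True
    return False
```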


> 2. Pertaining to point 1, when new data was added, the first query that
> touched that directory partition seemed to write the metadata file.
> However, the second query ALSO rewrote the file (and ran at the speed of
> an uncached directory). Only the third query ran at cached speeds (the
> 20 seconds vs. 120 seconds). This seems odd, but maybe there is a
> reason?
>

Unfortunately, the current implementation of the metadata cache doesn't
support incremental refresh, so each time Drill detects a change inside
the folder, it will run a "full" metadata cache refresh before running the
query; that's what explains why your second query took so long to finish.


> 3. Is Drill OK with me running REFRESH TABLE METADATA for only a
> subdirectory? So if I load a day, can I issue REFRESH TABLE METADATA
> `mytable/2016-07-04` and have everything be in a state where Drill is
> happy? I.e., does the mytable metadata need to be updated as well, or is
> that wasted cycles?
>

Drill keeps a metadata cache file for every subdirectory of your table, so
you'll end up with a cache file in "mytable" and another one in
"mytable/2016-07-04".
I'm not sure about the following, and other developers will correct me
soon enough, but my understanding is that you can run a refresh command on
the subfolder and it will only cause that particular cache (and any of its
subfolders') to be updated; it won't cause the cache file in "mytable" or
any of its other subfolders to be updated.
Also, as long as you only query this particular day, Drill won't detect
the change and won't try to update any other metadata cache, but as soon
as you query "mytable", Drill will figure out things have changed, and
that will cause a full refresh of the table.


> 4. Discussion: perhaps we could compress the metadata file? Each day
> (for me) has 8.2 MB of metadata, and the file at the root of my table is
> 332 MB. Just using standard gzip/gunzip I got the 332 MB file down to 11
> MB. That seems like an improvement; however, not knowing how this file
> is used/updated, compression may add lag.
>

There are definitely other ways we can store the metadata cache files;
compression is one of them, but we also want any alternative to make it
easier to run incremental metadata refresh.


> 5. Any other thoughts/suggestions?
>



-- 

Abdelhakim Deneche

Software Engineer
