Posted to dev@drill.apache.org by Timothy Farkas <tf...@mapr.com> on 2018/08/22 23:41:24 UTC

[Question] HiveStoragePlugin and NativeParquetRowGroupScan

Hi All,

I'm a bit confused and was hoping to get some clarification about how the
HiveStoragePlugin interacts with the FileSystem plugin. Currently the
HiveStoragePlugin allows the user to configure their own value for
fs.defaultFS in the plugin properties, which overrides the defaultFS used
when doing a native Parquet scan for Hive. Is this intentional? Also, what
is the high-level theory about how the Hive and FileSystem plugins
interact? Specifically, does Drill support querying Hive when Hive is using
a different FileSystem than the one specified in the file system plugin? Or
does Drill assume that Hive is using the same FileSystem as the one
defined in the Drill FileSystem plugin?
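
To make the override concrete, here is a minimal Java sketch of the layering
in question, assuming a base configuration (standing in for core-site.xml)
with the Hive plugin's properties applied on top. FsConfOverlay,
effectiveConf, and the URIs are hypothetical names for illustration only;
this is not Drill's code, which builds a Hadoop configuration object rather
than a plain map.

```java
import java.util.HashMap;
import java.util.Map;

public class FsConfOverlay {
    /** Returns a copy of base with every plugin property applied on top of it. */
    public static Map<String, String> effectiveConf(Map<String, String> base,
                                                    Map<String, String> pluginProps) {
        Map<String, String> conf = new HashMap<>(base);
        conf.putAll(pluginProps); // plugin values win over base values
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical values: the base default and a plugin-level override.
        Map<String, String> base = Map.of("fs.defaultFS", "hdfs://cluster-a:8020");
        Map<String, String> hiveProps = Map.of("fs.defaultFS", "file:///");

        // With an override, the plugin's value is the one a scan would see.
        System.out.println(effectiveConf(base, hiveProps).get("fs.defaultFS"));
        // With no plugin properties, the base value is used unchanged.
        System.out.println(effectiveConf(base, Map.of()).get("fs.defaultFS"));
    }
}
```

The base map is copied rather than modified, so an override in one plugin
cannot leak into the configuration seen by another.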

Thanks,
Tim

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Posted by Timothy Farkas <tf...@mapr.com>.

Hi Paul,

As you said, each reader uses a different file system and config. As far as
I know this happens correctly in all cases, except for one corner
case reported by a user a year ago. The corner case was that if you set
fs.defaultFS to the local file system in the HiveStoragePlugin, then
restarted a Drillbit and ran a CTAS statement, the command would fail
because an operator was using the wrong FileSystem. This corner case is no
longer reproducible in-house. So I've been trying to narrow down possible
root causes by trying to understand the theory of how Drill handles
FileSystems. Since the problem is not reproducible and the candidate root
causes have been ruled out, I am going to abandon the issue
and mark it as not reproducible.

One bit of learning that came out of the exercise was that the
DrillFileSystem should be immutable after it is created. This was not
previously enforced or documented, so a programmer could accidentally
mutate a DrillFileSystem incorrectly. I have a PR open that documents and
enforces this contract now.
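
As a rough illustration of what "immutable after creation" can mean, here is
a minimal Java sketch, assuming a handle that takes a defensive copy of its
configuration and rejects later mutation. ImmutableFsHandle and its methods
are hypothetical names; this is not the code from the PR and not the
DrillFileSystem API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class ImmutableFsHandle {
    private final Map<String, String> conf;

    public ImmutableFsHandle(Map<String, String> conf) {
        // Defensive copy: later changes to the caller's map cannot leak in.
        this.conf = Collections.unmodifiableMap(new HashMap<>(conf));
    }

    public String get(String key) {
        return conf.get(key);
    }

    /** Mutation is rejected loudly instead of silently changing shared state. */
    public void set(String key, String value) {
        throw new UnsupportedOperationException(
            "file system handle is immutable after creation");
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("fs.defaultFS", "file:///");
        ImmutableFsHandle handle = new ImmutableFsHandle(source);

        // Changing the original map afterwards does not affect the handle.
        source.put("fs.defaultFS", "hdfs://other:8020");
        System.out.println(handle.get("fs.defaultFS"));
    }
}
```

Failing fast on mutation turns an accidental change into an immediate error
at the call site rather than a wrong FileSystem showing up in some later
operator.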

Thanks,
Tim

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Tim,

Can't recall the details on this. The phrase "the filesystem configuration" might be misleading. When executing, Drill must support multiple filesystems. I can have two different DFS configs, pointing to two different HDFS clusters (say) in a single query:

SELECT ... FROM dfs1.`aFile.csv`, dfs2.`anotherFile.csv`

We'd create separate readers for each file. Each reader should have a different filesystem conf: the one appropriate for the storage plugin config used for that file.

Using that as a reference, it would seem that Hive plugin queries use the hive fs, while any DFS tables in the same query use the DFS config.
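
A minimal Java sketch of that idea, assuming each storage plugin name maps to its own file system setting and that a table reference is resolved through the plugin that prefixes it. PerPluginFsConf, register, fsForTable, and the URIs are hypothetical names for illustration, not Drill's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerPluginFsConf {
    private final Map<String, String> defaultFsByPlugin = new LinkedHashMap<>();

    public void register(String pluginName, String defaultFs) {
        defaultFsByPlugin.put(pluginName, defaultFs);
    }

    /** Resolves the file system for a reference such as "dfs1.`aFile.csv`". */
    public String fsForTable(String tableRef) {
        int dot = tableRef.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("no plugin prefix in: " + tableRef);
        }
        String plugin = tableRef.substring(0, dot);
        String fs = defaultFsByPlugin.get(plugin);
        if (fs == null) {
            throw new IllegalArgumentException("unknown storage plugin: " + plugin);
        }
        return fs;
    }

    public static void main(String[] args) {
        PerPluginFsConf registry = new PerPluginFsConf();
        // Hypothetical plugin names and cluster URIs.
        registry.register("dfs1", "hdfs://cluster-a:8020");
        registry.register("dfs2", "hdfs://cluster-b:8020");
        registry.register("myHive", "hdfs://hive-cluster:8020");

        // Each table in one query resolves through its own plugin's setting.
        System.out.println(registry.fsForTable("dfs1.`aFile.csv`"));
        System.out.println(registry.fsForTable("dfs2.`anotherFile.csv`"));
        System.out.println(registry.fsForTable("myHive.customers"));
    }
}
```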

I wonder, based on your comment, is this not happening? Are the configs getting muddled somehow?

Thanks,
- Paul

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Posted by Timothy Farkas <tf...@mapr.com>.

Hi Paul / Vitalii

Thanks for the info. I was asking about this because of
https://issues.apache.org/jira/browse/DRILL-6609 in which some strange
behavior was observed if the user defined fs.default.name in the HivePlugin
config. I also saw that the filesystem specified in the HivePlugin config
influences the FileSystem used for native scans. This happens because in
HiveDrillNativeParquetRowGroupScan.getFsConf we use the HiveStoragePlugin
to create the filesystem configuration, which is then used by
DrillFileSystem.

However, based on your feedback it looks like this is desirable behavior,
since the user may want to define a different filesystem for the HivePlugin
along with different format plugins. That means the root cause of
https://issues.apache.org/jira/browse/DRILL-6609 must be something else.
I'll probably abandon that issue at this point since it's not reproducible
and I have no further leads as to what could cause it.

Thanks,
Tim

Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Posted by Vitalii Diravka <vi...@gmail.com>.

Hi Tim,

Some comments from me.

*HiveStoragePlugin*
*fs.defaultFS* is a Hive-specific property. It is the URI the Hive
Metastore uses to locate where tables are stored. There is no need to specify
this property if the default value from *core-site.xml* is acceptable; see:
https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/core-default.xml

*Hive Native readers*
Currently Drill has two Hive native readers: Parquet and MapR JSON. Both
use the corresponding default file format plugins. This is a limitation:
for now there is no way to change the FormatPlugin config for them.
There is a Jira ticket for it:
https://issues.apache.org/jira/browse/DRILL-6621


Kind regards
Vitalii


Re: [Question] HiveStoragePlugin and NativeParquetRowGroupScan

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Tim,

I don't have an answer. But, I can point out some factors to consider.

Hive describes a set of data in a specific file system. Would make sense to associate that file system with the Hive configuration. Else, I could use a Hive metastore for FS A, with a DFS configured for FS B, and have nothing work for reasons that would be hard to figure out.

Further, isn't Hive its own storage plugin, and thus would be referenced as, say, "myHive.customers"? What would be the implied relationship between the Hive plugin config and the DFS plugin config?

Suppose I had two Hive plugin configs, Hive1 and Hive2. And, two DFS configs: DFS1 and DFS2. What is the implied relationship (if any) between Hive1 and either DFS1 or DFS2? Between Hive2 and DFS1 or DFS2?

These ambiguities would seem to explain why Hive's HDFS URL is configured with Hive and is distinct from a similar HDFS URL defined for DFS.

Can you suggest a way to avoid duplication and link the two? Perhaps, in Hive config, name a DFS config rather than duplicating the HDFS config for Hive?

Thanks,
- Paul
