Posted to users@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/07/11 16:30:56 UTC

Re: Keeping Hive in Sync

(cc-ing users@ where we should start routing user support here on)

Sorry, we kind of dropped the ball here. If you set the following config
to "true", then the data will be written under a partition path like

<basePath>/yyyy=2020/mm=07/dd=01 instead of simply <basePath>/2020/07/01

Then when you do spark.read.parquet() and have a predicate on yyyy, mm, dd,
Spark should partition prune properly. Let me know if you face issues;
happy to work with you to get this resolved.

/**
  * Flag to indicate whether to use Hive style partitioning.
  * If set true, the names of partition folders follow
  * <partition_column_name>=<partition_value> format.
  * By default false (the names of partition folders are only partition values).
  */
val HIVE_STYLE_PARTITIONING_OPT_KEY = "hoodie.datasource.write.hive_style_partitioning"
val DEFAULT_HIVE_STYLE_PARTITIONING_OPT_VAL = "false"
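
Purely as a sketch (the table name, key fields, and partition column below are
hypothetical, and a compound partition path such as yyyy/mm/dd would additionally
need a key generator that supports multiple partition fields), enabling the flag
on the write side and pruning on the read side could look roughly like this:

import org.apache.spark.sql.SaveMode

// Hypothetical write: a single date partition column plus hive-style folder
// names, i.e. <basePath>/transaction_date=2020-07-01/... instead of
// <basePath>/2020-07-01/...
df.write.format("org.apache.hudi").
  option("hoodie.table.name", "request_application").
  option("hoodie.datasource.write.recordkey.field", "request_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "transaction_date").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  mode(SaveMode.Append).
  save("/projects/cdp/data/base/request_application")

// With <column>=<value> folder names, spark.read.parquet() infers
// transaction_date as a partition column and prunes directories from the
// predicate instead of listing all of them.
spark.read.parquet("/projects/cdp/data/base/request_application").
  filter("transaction_date >= '2020-01-01'").
  count()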



On Wed, Jul 8, 2020 at 7:26 AM vbalaji@apache.org <vb...@apache.org>
wrote:

> I don't remember the root cause completely, Vinoth. I guess it was due to
> some protocol mismatch.
>
> Balaji.V
>
> On Tuesday, July 7, 2020, 10:25:48 PM PDT, Vinoth Chandar <
> vinoth@apache.org> wrote:
>
>  Hi,
>
> Yes. It can be an issue; probably good to get the table written using hive
> style partitioning. I will check on this more and get back to you.
>
> Balaji, do you know off the top of your head?
>
> Thanks
> Vinoth
>
> On Sat, Jul 4, 2020 at 11:22 PM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > To add some more info, my join condition would look for a 180-day range of
> > folders.
> >
> > On Sat, Jul 4, 2020 at 11:13 PM selvaraj periyasamy <
> > selvaraj.periyasamy1983@gmail.com> wrote:
> >
> > > Team,
> > >
> > > I have a question on keeping hive in sync. A shared Hadoop environment
> > > restricts me from using Hudi 0.5.1 or a higher version, so I ended up
> > > using 0.5.0. Currently my Hadoop cluster has Hive 1.2.x, which does not
> > > support Hudi keeping Hive in sync.
> > >
> > > So, I am not using the Hive sync feature. I am reading the data as below.
> > >
> > >
> > > sparkSession.
> > > read.
> > > format("org.apache.hudi").
> > > load("/projects/cdp/data/base/request_application/*/*").
> > > createOrReplaceTempView(s"base_request_application")
> > >
> > >
> > > I am going to store 3 years' worth of data partitioned by day/hour. When I
> > > load 3 years of data, I would have (3*365*24) = 26280 directories. Using the
> > > above approach and reading every time, I see all the directory names are
> > > indexed. Would it impact the performance during joins with other tables
> > > if I don't use the Hive way of partition pruning?
> > >
> > > Thanks,
> > > Selva
> > >
> > >
> >
>

Re: Keeping Hive in Sync

Posted by selvaraj periyasamy <se...@gmail.com>.
Thanks Vinoth.

I copied the below Java files, customized them a little bit, and it worked with
Hive 1.2.1:
HoodieHiveClient
HiveSyncConfig
HiveSyncTool
SchemaUtil

These custom files are now built from an application Maven setup with the Hive
1.2.1 dependency.

The main change I made was using Spark to implement the doesTableExist check
instead of Hive, because that was the call which failed due to versioning
issues.

sparkSession.catalog.tableExists(databaseName, tableName)
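
A rough sketch of what that swap could look like, assuming the customized sync
tool holds a SparkSession (the method signature here is hypothetical, not the
actual HoodieHiveClient one):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the Hive-based doesTableExist check: ask the
// Spark catalog instead, which avoids talking to the Hive 1.2.x metastore
// through the version-mismatched client.
def doesTableExist(spark: SparkSession, databaseName: String, tableName: String): Boolean =
  spark.catalog.tableExists(databaseName, tableName)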


hive> show partitions test_schema.test1;
OK
transaction_day=20200723/transaction_hour=00
transaction_day=20200723/transaction_hour=01
Time taken: 0.678 seconds, Fetched: 2 row(s)


scala> spark.sql("select count(1) from test_schema.test1 where transaction_day=20200723 and transaction_hour=00").show(false)
+--------+
|count(1)|
+--------+
|106     |
+--------+



Thought of sharing this with the community; it may or may not help someone.


Thanks,

Selva




>