Posted to dev@hudi.apache.org by selvaraj periyasamy <se...@gmail.com> on 2020/08/28 20:31:24 UTC

HUDI-1232

I have created this https://issues.apache.org/jira/browse/HUDI-1232 ticket
for tracking a couple of issues.

One of the concerns in my use case is this: I have a COW-type table
named TRR, and I see the logs pasted below rolling for every individual
partition even though my write touches only a couple of partitions, and it
takes up to 4 to 5 minutes. I pasted only a few of the log lines here. I am
worried that in the future, once I have 3 years' worth of data, every write
into just a couple of partitions will be very slow.

20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@fed0a8b
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/01, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@285c67a9
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/02, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@2edd9c8
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/03, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]



It seems that the more partitions we have, the longer the path-filter
listing takes. Could someone provide more insight on how to make this
faster and keep it scalable as the number of partitions grows?


Thanks,

Selva

Re: HUDI-1232

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Depending on the ordering of the jars is messy, but if it works for you as
a temporary measure, it should be OK :)
Balaji.V

Re: HUDI-1232

Posted by selvaraj periyasamy <se...@gmail.com>.
Thanks Balaji. Since an upgrade is not an immediate option in a shared
cluster, I tried a workaround: I added an
org.apache.hudi.hadoop.HoodieROTablePathFilter
class to a common project module, added the caching logic, built it into a
jar, and placed common.jar ahead of the hudi jar on the classpath. Spark
now picks up the custom class, and caching takes effect. I can manage with
this until we upgrade.

spark2-submit \
  --jars /home/selva/common.jar,/home/selva/hudi-spark-bundle-0.5.0-incubating.jar \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 10g \
  --num-executors 200 --executor-cores 1 \
  --conf spark.executor.memoryOverhead=4096 \
  --conf spark.shuffle.service.enabled=true \
  --class com.test.cdp.reporting.trr.TRREngine \
  /home/seperiya/transformation-engine.jar
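
A minimal sketch of that kind of caching wrapper (the class and package
names here are hypothetical, and it assumes the 0.5.0 filter has a no-arg
constructor and the standard PathFilter.accept contract):

// Hypothetical sketch: wrap the stock path filter and memoize its verdicts
// so repeated accept() calls for the same file skip the per-path metadata
// reload visible in the logs above.
package com.example.common;

import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hudi.hadoop.HoodieROTablePathFilter;

public class CachingHoodieROTablePathFilter implements PathFilter {
  private final HoodieROTablePathFilter delegate = new HoodieROTablePathFilter();
  // Verdict cache keyed by the file's full path string.
  private final ConcurrentHashMap<String, Boolean> verdicts = new ConcurrentHashMap<>();

  @Override
  public boolean accept(Path path) {
    return verdicts.computeIfAbsent(path.toString(), key -> delegate.accept(path));
  }
}

In the actual workaround described above, the class keeps the original
org.apache.hudi.hadoop.HoodieROTablePathFilter name, so listing common.jar
first in --jars lets it shadow the class bundled in the hudi jar.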

Thanks,
Selva


Re: HUDI-1232

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Hi Selvaraj,
Yes, you are right. Sorry for the confusion. As mentioned in the release
notes, a Spark 2.4.4 runtime is needed, although I don't remember exactly
what problem you would hit with Spark 2.3.3. I think it would be a
worthwhile exercise for you to upgrade to Spark 2.4.4 and the latest Hudi
version, as we have been and are continuing to improve performance in Hudi
:) For instance, the very next release will have consolidated metadata,
which would avoid the file listing in the first place.
Thanks,
Balaji.V

Re: HUDI-1232

Posted by selvaraj periyasamy <se...@gmail.com>.
Thanks Balaji,

I am looking into the steps to upgrade to 0.6.0. I noticed the below
content in the 0.5.1 release notes here: https://hudi.apache.org/releases.html.
It says the runtime Spark version must be 2.4+, so I am a little confused
now. Could you shed more light on this?
Release Highlights
<https://hudi.apache.org/releases.html#release-highlights-3>

   - Dependency Version Upgrades
      - Upgrade from Spark 2.1.0 to Spark 2.4.4
      - Upgrade from Avro 1.7.7 to Avro 1.8.2
      - Upgrade from Parquet 1.8.1 to Parquet 1.10.1
      - Upgrade from Kafka 0.8.2.1 to Kafka 2.0.0 as a result of updating
      spark-streaming-kafka artifact from 0.8_2.11/2.12 to 0.10_2.11/2.12.
   - *IMPORTANT* This version requires your runtime spark version to be
   upgraded to 2.4+.

Thanks,
Selva


Re: HUDI-1232

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
From the hudiLogs.txt, I see only HoodieROTablePathFilter-related logs
repeating, which suggests this is the read side. So we recommend using the
latest version. I tried Spark 2.3.3 and ran the quickstart without issues.
Give it a shot and let us know if there are any issues.
Balaji.V

Re: HUDI-1232

Posted by selvaraj periyasamy <se...@gmail.com>.
Thanks Balaji. My Hadoop environment is still running Spark 2.3. Can I
run 0.6.0 on Spark 2.3?

For issue 1: I am able to manage it with a Spark glob read instead of a
Hive read, and I am good with that approach; a sketch of the read follows
below.
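
A minimal sketch of that glob read, assuming the table base path from the
logs, the 0.5.x datasource name org.apache.hudi, and an illustrative
yyyyMMdd/HH partition glob:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiGlobRead {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-glob-read").getOrCreate();
    String basePath = "hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr";
    // Glob only the partitions of interest instead of scanning the whole
    // Hive table, so the path filter touches far fewer directories.
    Dataset<Row> trr = spark.read()
        .format("org.apache.hudi")
        .load(basePath + "/20200714/*");
    trr.show(10, false);
  }
}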
Issue 2: I see the performance issue while writing into the COW table.
This is purely a write, with no read involved. I attached the write logs
(hudiLogs.txt) to the ticket. The more partitions my target table has, the
bigger the spike in write time I see. The fix mentioned in #1919 is
applicable to writing as well.


Re: HUDI-1232

Posted by "vbalaji@apache.org" <vb...@apache.org>.
Hi Selvaraj,
We fixed a relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
queries by caching metaclient in HoodieROPathFilter (#1919)). Can you
please try 0.6.0?
Balaji.V
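
As a rough illustration of the caching idea behind that fix (not the
actual #1919 patch; the two-argument constructor shown matches the
0.5.x-era HoodieTableMetaClient API and is an assumption here):

import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;

class MetaClientCache {
  private static final ConcurrentHashMap<String, HoodieTableMetaClient> CACHE =
      new ConcurrentHashMap<>();

  // Load the table's .hoodie metadata once per base path; later lookups for
  // files under the same table become map hits instead of the repeated
  // "Loading HoodieTableMetaClient" round trips seen in the logs.
  static HoodieTableMetaClient get(Configuration conf, String basePath) {
    return CACHE.computeIfAbsent(basePath, p -> new HoodieTableMetaClient(conf, p));
  }
}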