Posted to dev@hudi.apache.org by selvaraj periyasamy <se...@gmail.com> on 2020/03/10 08:45:10 UTC

upsert on COW Takes 6 min for 150K Record

Team,

I am using Hudi 0.5.0 jars built locally. While running
upsert

20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records

20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread

20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread

20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads


[image: image.png]

Re:Re: upsert on COW Takes 6 min for 150K Record

Posted by lamberken <la...@163.com>.

In addition, we have fixed performance issues around DiskBasedMap and Kryo on the master branch.
You can also try building the Hudi jar from the master branch.


best,
lamber-ken





At 2020-03-10 17:07:58, "selvaraj periyasamy" <se...@gmail.com> wrote:

Sorry for the partial emails. My company portal doesn't allow me to add test code. I am using Hudi 0.5.0 jars built locally. While running an upsert, it takes more than 6 or 7 minutes to process 150k records.

Is there any tuning that could reduce the processing time from 6 or 7 minutes? An overwrite takes less than a minute. Each row has 100 columns.



Thanks,
Selva




Re: Re: Re: upsert on COW Takes 6 min for 150K Record

Posted by Vinoth Chandar <vi...@apache.org>.
In general, please see
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide for more tips
on tuning this in a real setting.

What lamber-ken mentioned should alleviate the serialization issue, if
that's the bottleneck. Since Hudi uses Spark caching during the upsert
operation, also ensure you have sufficient Spark executor memory.
In general, it may make more sense to run the benchmark on a real
cluster and observe the bottlenecks. Some tradeoffs we make (e.g. caching)
may seem like overhead when running with a small amount of data, but they
really come in handy when scaling up.

I have some parallel efforts going on to make the out-of-box single-node
benchmark better, but until then, if we can engage on a GitHub issue
where you can paste code snippets, the Spark UI, etc., I'm happy to work
with you to get that time down.

thanks
vinoth

On Wed, Mar 11, 2020 at 11:25 AM lamberken <la...@163.com> wrote:

>
>
> Hi,
>
>
> The unit is bytes. The value is only an example; you need to adjust it for
> your own environment.
>
>
> Best,
> Lamber-Ken
>
>
>
> At 2020-03-12 01:51:20, "selvaraj periyasamy" <
> selvaraj.periyasamy1983@gmail.com> wrote:
> >Thanks. What is this number 2004857600000? Is it in bits or bytes?
> >
> >Thanks,
> >Selva
> >
> >On Tue, Mar 10, 2020 at 2:57 AM lamberken <la...@163.com> wrote:
> >
> >>
> >>
> >> hi,
> >>
> >>
> >> IMO, when upserting 150K records with 100 columns, these records need to
> >> be serialized to disk and deserialized from disk.
> >> You can try adding < option("hoodie.memory.merge.max.size", "2004857600000") >
> >>
> >>
> >> best,
> >> lamber-ken

Re:Re: Re: upsert on COW Takes 6 min for 150K Record

Posted by lamberken <la...@163.com>.

Hi, 


The unit is bytes. The value is only an example; you need to adjust it for your own environment.
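
To put the number in perspective, here is a minimal Python sketch (not from the thread) that converts the byte value into binary units and computes a byte count for a smaller budget. The 2 GiB figure is a purely hypothetical illustration, not a recommendation, and the helper names are assumptions.

```python
def describe_bytes(n):
    """Return n bytes expressed as (GiB, TiB) using binary units."""
    gib = n / 1024**3
    tib = n / 1024**4
    return gib, tib


def gib_to_bytes(gib):
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * 1024**3)


# The value quoted in this thread, interpreted as bytes.
gib, tib = describe_bytes(2004857600000)
print(f"{gib:.1f} GiB, {tib:.2f} TiB")

# Hypothetical budget: 2 GiB expressed in bytes for the same setting.
print(gib_to_bytes(2))
```

Read as bytes, the example value is well above the memory of a typical executor, which is consistent with it being a placeholder to be tuned per environment.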


Best,
Lamber-Ken



At 2020-03-12 01:51:20, "selvaraj periyasamy" <se...@gmail.com> wrote:
>Thanks. What is this number 2004857600000? Is it in bits or bytes?
>
>Thanks,
>Selva
>

Re: Re: upsert on COW Takes 6 min for 150K Record

Posted by selvaraj periyasamy <se...@gmail.com>.
Thanks. What is this number 2004857600000? Is it in bits or bytes?

Thanks,
Selva

On Tue, Mar 10, 2020 at 2:57 AM lamberken <la...@163.com> wrote:

>
>
> hi,
>
>
> IMO, when upserting 150K records with 100 columns, these records need to
> be serialized to disk and deserialized from disk.
> You can try adding < option("hoodie.memory.merge.max.size", "2004857600000") >
>
>
> best,
> lamber-ken

Re:Re: upsert on COW Takes 6 min for 150K Record

Posted by lamberken <la...@163.com>.

hi, 


IMO, when upserting 150K records with 100 columns, these records need to be serialized to disk and deserialized from disk.
You can try adding < option("hoodie.memory.merge.max.size", "2004857600000") >
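
A minimal sketch of how such an option might be passed, written in Python for illustration. `RecordingWriter` is a hypothetical stand-in that mimics the chained `option(key, value)` call of a Spark DataFrame writer so the snippet runs without Spark; the helper names are assumptions, and only the config key and value come from the suggestion above.

```python
class RecordingWriter:
    """Hypothetical stand-in for a Spark DataFrame writer.

    It only records the options it is given, so the example can run
    without a Spark installation.
    """

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self  # allow chaining, like a real writer's option()


def hudi_merge_options(merge_max_bytes):
    """Build the write option suggested in this thread (value in bytes)."""
    return {"hoodie.memory.merge.max.size": str(merge_max_bytes)}


def apply_options(writer, options):
    """Apply each key/value pair through the writer's option() method."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer


writer = apply_options(RecordingWriter(), hudi_merge_options(2004857600000))
print(writer.options)
```

With a real Spark session, the same dictionary would presumably be applied to the DataFrame writer used for the upsert; the value should be adjusted to the memory actually available.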


best,
lamber-ken





At 2020-03-10 17:07:58, "selvaraj periyasamy" <se...@gmail.com> wrote:

Sorry for the partial emails. My company portal doesn't allow me to add test code. I am using Hudi 0.5.0 jars built locally. While running an upsert, it takes more than 6 or 7 minutes to process 150k records.

Is there any tuning that could reduce the processing time from 6 or 7 minutes? An overwrite takes less than a minute. Each row has 100 columns.



Thanks,
Selva






Re: upsert on COW Takes 6 min for 150K Record

Posted by selvaraj periyasamy <se...@gmail.com>.
Sorry for the partial emails. My company portal doesn't allow me to add test
code. I am using Hudi 0.5.0 jars built locally. While running an upsert, it
takes more than 6 or 7 minutes to process 150k records.

Is there any tuning that could reduce the processing time from 6 or 7 minutes?
An overwrite takes less than a minute. Each row has 100 columns.

Thanks,
Selva

On Tue, Mar 10, 2020 at 1:51 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> Team,
>
> I am using Hudi 0.5.0 jars built locally. While running an upsert, it takes
> more than 6 or 7 minutes to process 150k records. Below are the code and
> logs.
>
> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
>
> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
>
> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
>
>
> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
>
> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
>
>
> While running insert
>
> [image: image.png]
>
>

Re: upsert on COW Takes 6 min for 150K Record

Posted by selvaraj periyasamy <se...@gmail.com>.
Team,

I am using Hudi 0.5.0 jars built locally. While running an upsert, it takes
more than 6 or 7 minutes to process 150k records. Below are the code and
logs.

20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records

20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread

20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads


20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread

20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
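
As a rough check of where the time goes, the buffering duration can be computed from the timestamps in the log lines above. This is a minimal Python sketch, assuming each "starting" line pairs with the "finished" line that follows it in the same group, as the logs are laid out here; the function names are illustrative.

```python
from datetime import datetime

# Log lines copied from the message above (timestamp format: yy/MM/dd HH:mm:ss).
LOG_LINES = [
    "20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records",
    "20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records",
    "20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records",
    "20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records",
]


def parse_ts(line):
    """Parse the leading 17-character timestamp of a log line."""
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")


def buffering_durations(lines):
    """Pair each 'starting' line with the next 'finished' line; return seconds."""
    durations = []
    start = None
    for line in lines:
        if "starting to buffer records" in line:
            start = parse_ts(line)
        elif "finished buffering records" in line and start is not None:
            durations.append(int((parse_ts(line) - start).total_seconds()))
            start = None
    return durations


print(buffering_durations(LOG_LINES))  # [470, 443]
```

That is 7 min 50 s and 7 min 23 s spent between "starting to buffer" and "finished buffering" in the two groups, which accounts for essentially all of the reported upsert time.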


While running insert

[image: image.png]


On Tue, Mar 10, 2020 at 1:45 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> Team,
>
> I am using Hudi 0.5.0 jars built locally. While running
> upsert
>
> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer records
>
> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
>
> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering records
> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
>
>
> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer records
> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
>
> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering records
> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
>
>
> [image: image.png]
>
>