Posted to dev@hudi.apache.org by Netsanet Gebretsadkan <ne...@gmail.com> on 2019/06/03 08:00:40 UTC

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Thanks, Vinoth

It's working now, but I have 2 questions:
1. The ingestion latency when using the DataSource API with
HoodieSparkSQLWriter is high compared to using the delta streamer. Why is
it slow? Are there specific options we could set to minimize the
ingestion latency?
   For example: when I run the delta streamer it takes about 1 minute to
insert some data. If I use the DataSource API with HoodieSparkSQLWriter, it
takes 5 minutes. How can we optimize this?
2. Where do we categorize Hudi in general (is it batch processing or
streaming)? I am asking this because currently copy-on-write is the one
which is fully working, and since merge-on-read, which would enable
near-real-time analytics, is not fully done yet, can we consider Hudi a
batch job?

Kind regards,


On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> Short answer, by default any parameter you pass in using option(k,v) or
> options() beginning with "_" would be saved to the commit metadata.
> You can change "_" prefix to something else by using the
>  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> The reason you are not seeing the checkpointstr inside the commit metadata is
> that it's just supposed to be a prefix for all such commit metadata.
>
> val metaMap = parameters.filter(kv =>
> kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
>
> On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > dataframe into a hoodie modeled table. It creates everything correctly,
> > but I also want to save the checkpoint, and I couldn't, even though I am
> > passing it as an argument.
> >
> > inputDF.write()
> >   .format("com.uber.hoodie")
> >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> >   .mode(SaveMode.Append)
> >   .save(basePath);
> >
> > I am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting the
> > checkpoint while using the dataframe writer, but I couldn't add the
> > checkpoint metadata into the .hoodie metadata. Is there a way I can add
> > the checkpoint metadata while using the dataframe writer API?
> >
>
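
A minimal sketch of what that answer implies in practice, based on the options
already shown in this thread (the key name "_checkpoint_str" is just an
illustrative choice; per the explanation above, any option key starting with
the configured prefix, "_" by default, should end up in the commit metadata):

inputDF.write()
  .format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  // COMMIT_METADATA_KEYPREFIX_OPT_KEY only configures the prefix; the
  // checkpoint itself is passed as a "_"-prefixed key
  .option("_checkpoint_str", checkpointstr)
  .mode(SaveMode.Append)
  .save(basePath);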

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Dear Vinoth,

I added the recent issue to my previous performance-related issue at the
following link: https://github.com/apache/incubator-hudi/issues/714
The follow-up can be done from there.

Thanks,

On Thu, Jul 11, 2019 at 6:33 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> I see you have a pretty small 3 executor job. is that right?
> Unfortunately, the mailing list does not support images.. Mind opening a
> JIRA or a GH issue to follow up on this?
>
> /Thanks/
>
> On Thu, Jul 11, 2019 at 5:51 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > Thanks for the detailed and precise explanation. I understood the  result
> > of the benchmark very well now.
> >
> > For my specific use case, i used a splited JSON data source  and am
> > sharing you the UI of the spark job.
> > The settings i used  for a cluster with (30 GB of RAM   and  100 GB
> > available disk) are:
> > spark.driver.memory = 4096m
> > spark.executor.memory = 6144m
> > spark.executor.instances =3
> > spark.driver.cores =1
> > spark.executor.cores =1
> > hoodie.datasource.write.operation="upsert"
> > hoodie.upsert.shuffle.parallellism="1500"
> >
> > This took about 38 minutes. You can see the details from the UI provided
> > below and the schema have 20 columns.
> >
> > Thanks for your consideration.
> >
> > kind regards,
> >
> >
> >
> > On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <vi...@apache.org>
> wrote:
> >
> >> Hi,
> >>
> >> >>And also when you say bulk insert, do you mean hoodies bulk insert
> >> operation?
> >> No it does not refer to bulk_insert operation in Hudi. I think it says
> >> "bulk load" and it refers to ingesting database tables in full, unlike
> >> using Hudi upserts to do it incrementally. Simply put, its the
> difference
> >> between fully rewriting your table as you would do in the pre-Hudi world
> >> and incrementally rewriting at the file level in present day using Hudi.
> >>
> >> >>Why is it taking much  time for 500 GB of data and does the data
> include
> >> changes or its first time insert data?
> >> Hudi write performance depends on two things : indexing (which has
> gotten
> >> lot faster since that benchmark) and writing parquet files (it depends
> on
> >> your schema & cpu cores on the box). And since Hudi writing is a Spark
> >> job,
> >> speed also depends on parallelism you provide.. In a perfect world, you
> >> have as much parallelism as parquet files (file groups) and indexing
> takes
> >> 1-2 mins or so and writing takes 1-2 mins. For this specific dataset,
> the
> >> schema has 1000 columns, so parquet writing is much slower.
> >>
> >> the Hudi bulk insert or insert operation is kind of documented in the
> >> delta
> >> streamer CLI help. If you know your dataset has no updates, then you can
> >> issue insert/bulk_insert instead of upsert to completely avoid indexing
> >> step and that will gain speed. Difference between insert and bulk_insert
> >> is
> >> an implementation detail : insert() caches the input data in memory to
> do
> >> all the cool storage file sizing etc, while bulk_insert() used a sort
> >> based
> >> writing mechanism which can scale to multi terabyte initial loads ..
> >> In short, you do bulk_insert() to bootstrap the dataset, then insert or
> >> upsert depending on needs.
> >>
> >> for your specific use case, if you can share the spark UI, me or someone
> >> else here can take a look and see if there is scope to make it go
> faster.
> >>
> >> /thanks/vinoth
> >>
> >> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <
> net22geb@gmail.com
> >> >
> >> wrote:
> >>
> >> > Dear Vinoth,
> >> >
> >> > I want to try to check out the performance comparison of hudi upsert
> and
> >> > bulk insert.  In the hudi documentation, specifically performance
> >> > comparison section https://hudi.apache.org/performance.html#upserts
> ,
> >> > which tries to compare bulk insert and upsert, its showing that  it
> >> takes
> >> > about 17 min for upserting  20 TB of data and 22 min for ingesting 500
> >> GB
> >> > of data. Why is it taking much  time for 500 GB of data and does the
> >> data
> >> > include changes or its first time insert data? I assumed its data to
> be
> >> > inserted for the first time since you made the comparison with bulk
> >> insert.
> >> >
> >> >  And also when you say bulk insert, do you mean hoodies bulk insert
> >> > operation?  If so, what is the difference with hoodies upsert
> >> operation? In
> >> > addition to this, The latency of ingesting 6 GB of data is 25 minutes
> >> with
> >> > the cluster i provided. How can i enhance this?
> >> >
> >> > Thanks for your consideration.
> >> >
> >> > Kind regards,
> >> >
> >> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <
> >> net22geb@gmail.com>
> >> > wrote:
> >> >
> >> > > Thanks Vbalaji.
> >> > > I will check it out.
> >> > >
> >> > > Kind regards,
> >> > >
> >> > > On Sat, Jun 22, 2019 at 3:29 PM vbalaji@apache.org <
> >> vbalaji@apache.org>
> >> > > wrote:
> >> > >
> >> > >>
> >> > >> Here is the correct gist link :
> >> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
> >> > >>
> >> > >>
> >> > >>     On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org
> <
> >> > >> vbalaji@apache.org> wrote:
> >> > >>
> >> > >>   Hi,
> >> > >> I have given a sample command to set up and run deltastreamer in
> >> > >> continuous mode and ingest fake data in the following gist
> >> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> >> > >>
> >> > >> We will eventually get this to project wiki.
> >> > >> Balaji.V
> >> > >>
> >> > >>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet
> Gebretsadkan <
> >> > >> net22geb@gmail.com> wrote:
> >> > >>
> >> > >>  @Vinoth, Thanks , that would be great if Balaji could share it.
> >> > >>
> >> > >> Kind regards,
> >> > >>
> >> > >>
> >> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vinoth@apache.org
> >
> >> > >> wrote:
> >> > >>
> >> > >> > Hi,
> >> > >> >
> >> > >> > We usually test with our production workloads.. However, balaji recently
> >> > >> > merged a DistributedTestDataSource,
> >> > >> >
> >> > >> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >> > >> >
> >> > >> > that can generate some random data for testing..  Balaji, do you mind
> >> > >> > sharing a command that can be used to kick something off like that?
> >> > >> >
> >> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <net22geb@gmail.com>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > Dear Vinoth,
> >> > >> > >
> >> > >> > > I want to try to check out the performance comparison of upsert and bulk
> >> > >> > > insert.  But i couldn't find a clean data set more than 10 GB.
> >> > >> > > Would it be possible to get a data set from Hudi team? For example i was
> >> > >> > > using the stocks data that you provided on your demo. Hence, can i get
> >> > >> > > more GB's of that dataset for my experiment?
> >> > >> > >
> >> > >> > > Thanks for your consideration.
> >> > >> > >
> >> > >> > > Kind regards,
> >> > >> > >
> >> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vinoth@apache.org> wrote:
> >> > >> > >
> >> > >> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >> > >> > > >
> >> > >> > > > Just circling back with the resolution on the mailing list as well.
> >> > >> > > >
> >> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <net22geb@gmail.com> wrote:
> >> > >> > > >
> >> > >> > > > > Dear Vinoth,
> >> > >> > > > >
> >> > >> > > > > Thanks for your fast response.
> >> > >> > > > > I have created a new issue called Performance Comparison of
> >> > >> > > > > HoodieDeltaStreamer and DataSourceAPI #714 with the screenshots of the
> >> > >> > > > > spark UI, which can be found at the following link:
> >> > >> > > > > https://github.com/apache/incubator-hudi/issues/714.
> >> > >> > > > > In the UI, it seems that the ingestion with the data source API is
> >> > >> > > > > spending much time in the count by key of HoodieBloomIndex and workload
> >> > >> > > > > profile. Looking forward to receiving insights from you.
> >> > >> > > > >
> >> > >> > > > > Kind regards,
> >> > >> > > > >
> >> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vinoth@apache.org> wrote:
> >> > >> > > > >
> >> > >> > > > > > Hi,
> >> > >> > > > > >
> >> > >> > > > > > Both datasource and deltastreamer use the same APIs underneath. So not
> >> > >> > > > > > sure. If you can grab screenshots of spark UI for both and open a ticket,
> >> > >> > > > > > glad to take a look.
> >> > >> > > > > >
> >> > >> > > > > > On 2, well, one of the goals of Hudi is to break this dichotomy and enable
> >> > >> > > > > > streaming style (I call it incremental processing) of processing even in a
> >> > >> > > > > > batch job. MOR is in production at uber. Atm MOR is lacking just one
> >> > >> > > > > > feature (incr pull using log files) that Nishith is planning to merge soon.
> >> > >> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while managing
> >> > >> > > > > > compaction etc in the same job. I already knocked off some index
> >> > >> > > > > > performance problems and am working on indexing the log files, which should
> >> > >> > > > > > unlock near real time ingest.
> >> > >> > > > > >
> >> > >> > > > > > Putting all these together, within a month or so the near real time MOR
> >> > >> > > > > > vision should be very real. Ofc we need community help with dev and testing
> >> > >> > > > > > to speed things up. :)
> >> > >> > > > > >
> >> > >> > > > > > Hope that gives you a clearer picture.
> >> > >> > > > > >
> >> > >> > > > > > Thanks
> >> > >> > > > > > Vinoth
> >> > >> > > > > >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

I see you have a pretty small 3-executor job. Is that right?
Unfortunately, the mailing list does not support images. Mind opening a
JIRA or a GH issue to follow up on this?

/Thanks/


Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Dear Vinoth,

Thanks for the detailed and precise explanation. I understood the  result
of the benchmark very well now.

For my specific use case, I used a split JSON data source, and I am sharing
the UI of the Spark job with you.
The settings I used for a cluster with 30 GB of RAM and 100 GB of
available disk are:
spark.driver.memory = 4096m
spark.executor.memory = 6144m
spark.executor.instances =3
spark.driver.cores =1
spark.executor.cores =1
hoodie.datasource.write.operation="upsert"
hoodie.upsert.shuffle.parallellism="1500"

This took about 38 minutes. You can see the details in the Spark UI provided
below; the schema has 20 columns.

Thanks for your consideration.

kind regards,
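
A rough sketch of how the settings above could be wired up, assuming the
cluster settings go through spark-submit and the hoodie settings go onto the
DataFrame writer (the application jar and class names are placeholders; note
that the correct spelling of the config key is
hoodie.upsert.shuffle.parallelism):

spark-submit \
  --class com.example.HudiIngestJob \
  --conf spark.driver.memory=4096m \
  --conf spark.executor.memory=6144m \
  --conf spark.executor.instances=3 \
  --conf spark.driver.cores=1 \
  --conf spark.executor.cores=1 \
  hudi-ingest-job.jar

inputDF.write()
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  // plus the record key, partition path, precombine field and table name
  // options shown earlier in the thread
  .mode(SaveMode.Append)
  .save(basePath);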




Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

>>And also when you say bulk insert, do you mean hoodies bulk insert
operation?
No, it does not refer to the bulk_insert operation in Hudi. I think it says
"bulk load", and it refers to ingesting database tables in full, unlike
using Hudi upserts to do it incrementally. Simply put, it's the difference
between fully rewriting your table, as you would do in the pre-Hudi world,
and incrementally rewriting at the file level in the present day using Hudi.

>>Why is it taking much  time for 500 GB of data and does the data include
changes or its first time insert data?
Hudi write performance depends on two things: indexing (which has gotten a
lot faster since that benchmark) and writing parquet files (which depends on
your schema & CPU cores on the box). And since Hudi writing is a Spark job,
speed also depends on the parallelism you provide. In a perfect world, you
have as much parallelism as parquet files (file groups), indexing takes
1-2 mins or so, and writing takes 1-2 mins. For this specific dataset, the
schema has 1000 columns, so parquet writing is much slower.

The Hudi bulk insert or insert operation is kind of documented in the delta
streamer CLI help. If you know your dataset has no updates, then you can
issue insert/bulk_insert instead of upsert to completely avoid the indexing
step, and that will gain speed. The difference between insert and bulk_insert
is an implementation detail: insert() caches the input data in memory to do
all the cool storage file sizing etc., while bulk_insert() uses a sort-based
writing mechanism which can scale to multi-terabyte initial loads.
In short, you do bulk_insert() to bootstrap the dataset, then insert or
upsert depending on needs.

For your specific use case, if you can share the Spark UI, I or someone
else here can take a look and see if there is scope to make it go faster.

/thanks/vinoth
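
A small sketch of the bootstrap-then-incremental pattern described above,
using the string-valued operation config mentioned earlier in the thread
(the dataframe, table and path names are placeholders, and the other required
write options from the earlier snippets are elided for brevity):

// one-time initial load: bulk_insert skips indexing and uses sort-based writing
historicalDF.write()
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .mode(SaveMode.Overwrite)   // assumes a fresh dataset at basePath
  .save(basePath);

// subsequent runs: upsert (or insert, if the incoming batches carry no updates)
deltaDF.write()
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "upsert")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .mode(SaveMode.Append)
  .save(basePath);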


Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Dear Vinoth,

I want to check out the performance comparison of Hudi upsert and bulk
insert. The performance section of the Hudi documentation,
https://hudi.apache.org/performance.html#upserts, which compares bulk
insert and upsert, shows about 17 min for upserting 20 TB of data and 22
min for ingesting 500 GB of data. Why does the 500 GB case take that much
time, and does that data include changes, or is it first-time insert data?
I assumed it is first-time insert data, since the comparison is against
bulk insert.

Also, when you say bulk insert, do you mean Hudi's bulk_insert operation?
If so, what is the difference from Hudi's upsert operation? In addition,
the latency of ingesting 6 GB of data is 25 minutes on the cluster I
described earlier. How can I improve this?
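
For concreteness, this is roughly how I plan to switch between the two
operations through the DataSource API (a minimal sketch, not tested here;
I am assuming the OPERATION_OPT_KEY option and the com.uber.hoodie package
names, so please correct me if the option is named differently):

import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig
import org.apache.spark.sql.{DataFrame, SaveMode}

// Same writer call in both cases; only the write operation changes.
def writeToHoodie(inputDF: DataFrame, operation: String,
                  tableName: String, basePath: String): Unit = {
  inputDF.write
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), operation) // "upsert" or "bulk_insert"
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath)
}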

Thanks for your consideration.

Kind regards,

On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <ne...@gmail.com>
wrote:

> Thanks Vbalaji.
> I will check it out.
>
> Kind regards,
>
> On Sat, Jun 22, 2019 at 3:29 PM vbalaji@apache.org <vb...@apache.org>
> wrote:
>
>>
>> Here is the correct gist link :
>> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>>
>>
>>     On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org <
>> vbalaji@apache.org> wrote:
>>
>>   Hi,
>> I have given a sample command to set up and run deltastreamer in
>> continuous mode and ingest fake data in the following gist
>> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
>>
>> We will eventually get this to project wiki.
>> Balaji.V
>>
>>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
>> net22geb@gmail.com> wrote:
>>
>>  @Vinoth, Thanks , that would be great if Balaji could share it.
>>
>> Kind regards,
>>
>>
>> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vi...@apache.org>
>> wrote:
>>
>> > Hi,
>> >
>> > We usually test with our production workloads.. However, balaji recently
>> > merged a DistributedTestDataSource,
>> >
>> >
>> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>> >
>> >
>> > that can generate some random data for testing..  Balaji, do you mind
>> > sharing a command that can be used to kick something off like that?
>> >
>> >
>> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
>> net22geb@gmail.com>
>> > wrote:
>> >
>> > > Dear Vinoth,
>> > >
>> > > I want to try to check out the performance comparison of upsert and
>> bulk
>> > > insert.  But i couldn't find a clean data set more than 10 GB.
>> > > Would it be possible to get a data set from Hudi team? For example i
>> was
>> > > using the stocks data that you provided on your demo. Hence, can i get
>> > > more GB's of that dataset for my experiment?
>> > >
>> > > Thanks for your consideration.
>> > >
>> > > Kind regards,
>> > >
>> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org>
>> wrote:
>> > >
>> > > >
>> > >
>> >
>> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>> > > >
>> > > > Just circling back with the resolution on the mailing list as well.
>> > > >
>> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
>> > net22geb@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Dear Vinoth,
>> > > > >
>> > > > > Thanks for your fast response.
>> > > > > I have created a new issue called Performance Comparison of
>> > > > > HoodieDeltaStreamer and DataSourceAPI #714  with the screnshots of
>> > the
>> > > > > spark UI which can be found at the  following  link
>> > > > > https://github.com/apache/incubator-hudi/issues/714.
>> > > > > In the UI,  it seems that the ingestion with the data source API
>> is
>> > > > > spending  much time in the count by key of HoodieBloomIndex and
>> > > workload
>> > > > > profile.  Looking forward to receive insights from you.
>> > > > >
>> > > > > Kinde regards,
>> > > > >
>> > > > >
>> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
>> > > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > Both datasource and deltastreamer use the same APIs underneath.
>> So
>> > > not
>> > > > > > sure. If you can grab screenshots of spark UI for both and open
>> a
>> > > > ticket,
>> > > > > > glad to take a look.
>> > > > > >
>> > > > > > On 2, well one of goals of Hudi is to break this dichotomy and
>> > enable
>> > > > > > streaming style (I call it incremental processing) of processing
>> > even
>> > > > in
>> > > > > a
>> > > > > > batch job. MOR is in production at uber. Atm MOR is lacking just
>> > one
>> > > > > > feature (incr pull using log files) that Nishith is planning to
>> > merge
>> > > > > soon.
>> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
>> > > > managing
>> > > > > > compaction etc in the same job. I already knocked off some index
>> > > > > > performance problems and working on indexing the log files,
>> which
>> > > > should
>> > > > > > unlock near real time ingest.
>> > > > > >
>> > > > > > Putting all these together, within a month or so near real time
>> MOR
>> > > > > vision
>> > > > > > should be very real. Ofc we need community help with dev and
>> > testing
>> > > to
>> > > > > > speed things up. :)
>> > > > > >
>> > > > > > Hope that gives you a clearer picture.
>> > > > > >
>> > > > > > Thanks
>> > > > > > Vinoth
>> > > > > >
>> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
>> > > > net22geb@gmail.com
>> > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Thanks, Vinoth
>> > > > > > >
>> > > > > > > Its working now. But i have 2 questions:
>> > > > > > > 1. The ingestion latency of using DataSource API with
>> > > > > > > the  HoodieSparkSQLWriter  is high compared to using delta
>> > > streamer.
>> > > > > Why
>> > > > > > is
>> > > > > > > it slow? Are there specific option where we could specify to
>> > > minimize
>> > > > > the
>> > > > > > > ingestion latency.
>> > > > > > >    For example: when i run the delta streamer its talking
>> about 1
>> > > > > minute
>> > > > > > to
>> > > > > > > insert some data. If i use DataSource API with
>> > > HoodieSparkSQLWriter,
>> > > > > its
>> > > > > > > taking 5 minutes. How can we optimize this?
>> > > > > > > 2. Where do we categorize Hudi in general (Is it batch
>> processing
>> > > or
>> > > > > > > streaming)?  I am asking this because currently the copy on
>> write
>> > > is
>> > > > > the
>> > > > > > > one which is fully working and since the functionality of the
>> > merge
>> > > > on
>> > > > > > read
>> > > > > > > is not fully done which enables us to have a near real time
>> > > > analytics,
>> > > > > > can
>> > > > > > > we consider Hudi as a batch job?
>> > > > > > >
>> > > > > > > Kind regards,
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
>> > vinoth@apache.org>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > Short answer, by default any parameter you pass in using
>> > > > option(k,v)
>> > > > > or
>> > > > > > > > options() beginning with "_" would be saved to the commit
>> > > metadata.
>> > > > > > > > You can change "_" prefix to something else by using the
>> > > > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
>> > > > > > > > Reason you are not seeing the checkpointstr inside the
>> commit
>> > > > > metadata
>> > > > > > is
>> > > > > > > > because its just supposed to be a prefix for all such commit
>> > > > > metadata.
>> > > > > > > >
>> > > > > > > > val metaMap = parameters.filter(kv =>
>> > > > > > > >
>> > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
>> > > > > > > >
>> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
>> > > > > > > net22geb@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
>> > from
>> > > > any
>> > > > > > > > > dataframe into a hoodie modeled table.  Its creating
>> > everything
>> > > > > > > correctly
>> > > > > > > > > but , i also want to save the checkpoint but i couldn't
>> even
>> > > > though
>> > > > > > am
>> > > > > > > > > passing it as an argument.
>> > > > > > > > >
>> > > > > > > > > inputDF.write()
>> > > > > > > > > .format("com.uber.hoodie")
>> > > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
>> > > > > "_row_key")
>> > > > > > > > >
>> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
>> > > > > > > > "partition")
>> > > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
>> > > > > > "timestamp")
>> > > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > > > > > > > >
>> > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
>> > > > > > > > > checkpointstr)
>> > > > > > > > > .mode(SaveMode.Append)
>> > > > > > > > > .save(basePath);
>> > > > > > > > >
>> > > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
>> > inserting
>> > > > the
>> > > > > > > > > checkpoint while using the dataframe writer but i couldn't
>> > add
>> > > > the
>> > > > > > > > > checkpoint meta data in to the .hoodie meta data. Is
>> there a
>> > > way
>> > > > i
>> > > > > > can
>> > > > > > > > add
>> > > > > > > > > the checkpoint meta data while using the dataframe writer
>> > API?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Thanks Vbalaji.
I will check it out.

Kind regards,

On Sat, Jun 22, 2019 at 3:29 PM vbalaji@apache.org <vb...@apache.org>
wrote:

>
> Here is the correct gist link :
> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>
>
>     On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org <
> vbalaji@apache.org> wrote:
>
>   Hi,
> I have given a sample command to set up and run deltastreamer in
> continuous mode and ingest fake data in the following gist
> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
>
> We will eventually get this to project wiki.
> Balaji.V
>
>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
> net22geb@gmail.com> wrote:
>
>  @Vinoth, Thanks , that would be great if Balaji could share it.
>
> Kind regards,
>
>
> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi,
> >
> > We usually test with our production workloads.. However, balaji recently
> > merged a DistributedTestDataSource,
> >
> >
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >
> >
> > that can generate some random data for testing..  Balaji, do you mind
> > sharing a command that can be used to kick something off like that?
> >
> >
> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
> net22geb@gmail.com>
> > wrote:
> >
> > > Dear Vinoth,
> > >
> > > I want to try to check out the performance comparison of upsert and
> bulk
> > > insert.  But i couldn't find a clean data set more than 10 GB.
> > > Would it be possible to get a data set from Hudi team? For example i
> was
> > > using the stocks data that you provided on your demo. Hence, can i get
> > > more GB's of that dataset for my experiment?
> > >
> > > Thanks for your consideration.
> > >
> > > Kind regards,
> > >
> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org>
> wrote:
> > >
> > > >
> > >
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > > >
> > > > Just circling back with the resolution on the mailing list as well.
> > > >
> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> > net22geb@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Dear Vinoth,
> > > > >
> > > > > Thanks for your fast response.
> > > > > I have created a new issue called Performance Comparison of
> > > > > HoodieDeltaStreamer and DataSourceAPI #714  with the screnshots of
> > the
> > > > > spark UI which can be found at the  following  link
> > > > > https://github.com/apache/incubator-hudi/issues/714.
> > > > > In the UI,  it seems that the ingestion with the data source API is
> > > > > spending  much time in the count by key of HoodieBloomIndex and
> > > workload
> > > > > profile.  Looking forward to receive insights from you.
> > > > >
> > > > > Kinde regards,
> > > > >
> > > > >
> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
> > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Both datasource and deltastreamer use the same APIs underneath.
> So
> > > not
> > > > > > sure. If you can grab screenshots of spark UI for both and open a
> > > > ticket,
> > > > > > glad to take a look.
> > > > > >
> > > > > > On 2, well one of goals of Hudi is to break this dichotomy and
> > enable
> > > > > > streaming style (I call it incremental processing) of processing
> > even
> > > > in
> > > > > a
> > > > > > batch job. MOR is in production at uber. Atm MOR is lacking just
> > one
> > > > > > feature (incr pull using log files) that Nishith is planning to
> > merge
> > > > > soon.
> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> > > > managing
> > > > > > compaction etc in the same job. I already knocked off some index
> > > > > > performance problems and working on indexing the log files, which
> > > > should
> > > > > > unlock near real time ingest.
> > > > > >
> > > > > > Putting all these together, within a month or so near real time
> MOR
> > > > > vision
> > > > > > should be very real. Ofc we need community help with dev and
> > testing
> > > to
> > > > > > speed things up. :)
> > > > > >
> > > > > > Hope that gives you a clearer picture.
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > > > net22geb@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks, Vinoth
> > > > > > >
> > > > > > > Its working now. But i have 2 questions:
> > > > > > > 1. The ingestion latency of using DataSource API with
> > > > > > > the  HoodieSparkSQLWriter  is high compared to using delta
> > > streamer.
> > > > > Why
> > > > > > is
> > > > > > > it slow? Are there specific option where we could specify to
> > > minimize
> > > > > the
> > > > > > > ingestion latency.
> > > > > > >    For example: when i run the delta streamer its talking
> about 1
> > > > > minute
> > > > > > to
> > > > > > > insert some data. If i use DataSource API with
> > > HoodieSparkSQLWriter,
> > > > > its
> > > > > > > taking 5 minutes. How can we optimize this?
> > > > > > > 2. Where do we categorize Hudi in general (Is it batch
> processing
> > > or
> > > > > > > streaming)?  I am asking this because currently the copy on
> write
> > > is
> > > > > the
> > > > > > > one which is fully working and since the functionality of the
> > merge
> > > > on
> > > > > > read
> > > > > > > is not fully done which enables us to have a near real time
> > > > analytics,
> > > > > > can
> > > > > > > we consider Hudi as a batch job?
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> > vinoth@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Short answer, by default any parameter you pass in using
> > > > option(k,v)
> > > > > or
> > > > > > > > options() beginning with "_" would be saved to the commit
> > > metadata.
> > > > > > > > You can change "_" prefix to something else by using the
> > > > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > > Reason you are not seeing the checkpointstr inside the commit
> > > > > metadata
> > > > > > is
> > > > > > > > because its just supposed to be a prefix for all such commit
> > > > > metadata.
> > > > > > > >
> > > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > > >
> > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > > >
> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > > > > net22geb@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> > from
> > > > any
> > > > > > > > > dataframe into a hoodie modeled table.  Its creating
> > everything
> > > > > > > correctly
> > > > > > > > > but , i also want to save the checkpoint but i couldn't
> even
> > > > though
> > > > > > am
> > > > > > > > > passing it as an argument.
> > > > > > > > >
> > > > > > > > > inputDF.write()
> > > > > > > > > .format("com.uber.hoodie")
> > > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > > > > "_row_key")
> > > > > > > > >
> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > > > > "partition")
> > > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > > > > "timestamp")
> > > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > > >
> > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > > > > checkpointstr)
> > > > > > > > > .mode(SaveMode.Append)
> > > > > > > > > .save(basePath);
> > > > > > > > >
> > > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
> > inserting
> > > > the
> > > > > > > > > checkpoint while using the dataframe writer but i couldn't
> > add
> > > > the
> > > > > > > > > checkpoint meta data in to the .hoodie meta data. Is there
> a
> > > way
> > > > i
> > > > > > can
> > > > > > > > add
> > > > > > > > > the checkpoint meta data while using the dataframe writer
> > API?
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 
Here is the correct gist link: https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626


    On Saturday, June 22, 2019, 6:08:48 AM PDT, vbalaji@apache.org <vb...@apache.org> wrote:  
 
  Hi,
I have given a sample command to set up and run deltastreamer in continuous mode and ingest fake data in the following gist
https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7

We will eventually get this to project wiki.
Balaji.V

    On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <ne...@gmail.com> wrote:  
 
 @Vinoth, Thanks , that would be great if Balaji could share it.

Kind regards,


On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> We usually test with our production workloads.. However, balaji recently
> merged a DistributedTestDataSource,
>
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>
>
> that can generate some random data for testing..  Balaji, do you mind
> sharing a command that can be used to kick something off like that?
>
>
> On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > I want to try to check out the performance comparison of upsert and bulk
> > insert.  But i couldn't find a clean data set more than 10 GB.
> > Would it be possible to get a data set from Hudi team? For example i was
> > using the stocks data that you provided on your demo. Hence, can i get
> > more GB's of that dataset for my experiment?
> >
> > Thanks for your consideration.
> >
> > Kind regards,
> >
> > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > >
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > >
> > > Just circling back with the resolution on the mailing list as well.
> > >
> > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> net22geb@gmail.com
> > >
> > > wrote:
> > >
> > > > Dear Vinoth,
> > > >
> > > > Thanks for your fast response.
> > > > I have created a new issue called Performance Comparison of
> > > > HoodieDeltaStreamer and DataSourceAPI #714  with the screnshots of
> the
> > > > spark UI which can be found at the  following  link
> > > > https://github.com/apache/incubator-hudi/issues/714.
> > > > In the UI,  it seems that the ingestion with the data source API is
> > > > spending  much time in the count by key of HoodieBloomIndex and
> > workload
> > > > profile.  Looking forward to receive insights from you.
> > > >
> > > > Kinde regards,
> > > >
> > > >
> > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Both datasource and deltastreamer use the same APIs underneath. So
> > not
> > > > > sure. If you can grab screenshots of spark UI for both and open a
> > > ticket,
> > > > > glad to take a look.
> > > > >
> > > > > On 2, well one of goals of Hudi is to break this dichotomy and
> enable
> > > > > streaming style (I call it incremental processing) of processing
> even
> > > in
> > > > a
> > > > > batch job. MOR is in production at uber. Atm MOR is lacking just
> one
> > > > > feature (incr pull using log files) that Nishith is planning to
> merge
> > > > soon.
> > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> > > managing
> > > > > compaction etc in the same job. I already knocked off some index
> > > > > performance problems and working on indexing the log files, which
> > > should
> > > > > unlock near real time ingest.
> > > > >
> > > > > Putting all these together, within a month or so near real time MOR
> > > > vision
> > > > > should be very real. Ofc we need community help with dev and
> testing
> > to
> > > > > speed things up. :)
> > > > >
> > > > > Hope that gives you a clearer picture.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > > net22geb@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks, Vinoth
> > > > > >
> > > > > > Its working now. But i have 2 questions:
> > > > > > 1. The ingestion latency of using DataSource API with
> > > > > > the  HoodieSparkSQLWriter  is high compared to using delta
> > streamer.
> > > > Why
> > > > > is
> > > > > > it slow? Are there specific option where we could specify to
> > minimize
> > > > the
> > > > > > ingestion latency.
> > > > > >    For example: when i run the delta streamer its talking about 1
> > > > minute
> > > > > to
> > > > > > insert some data. If i use DataSource API with
> > HoodieSparkSQLWriter,
> > > > its
> > > > > > taking 5 minutes. How can we optimize this?
> > > > > > 2. Where do we categorize Hudi in general (Is it batch processing
> > or
> > > > > > streaming)?  I am asking this because currently the copy on write
> > is
> > > > the
> > > > > > one which is fully working and since the functionality of the
> merge
> > > on
> > > > > read
> > > > > > is not fully done which enables us to have a near real time
> > > analytics,
> > > > > can
> > > > > > we consider Hudi as a batch job?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > >
> > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> vinoth@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Short answer, by default any parameter you pass in using
> > > option(k,v)
> > > > or
> > > > > > > options() beginning with "_" would be saved to the commit
> > metadata.
> > > > > > > You can change "_" prefix to something else by using the
> > > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > Reason you are not seeing the checkpointstr inside the commit
> > > > metadata
> > > > > is
> > > > > > > because its just supposed to be a prefix for all such commit
> > > > metadata.
> > > > > > >
> > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > >
> kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > > > net22geb@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> from
> > > any
> > > > > > > > dataframe into a hoodie modeled table.  Its creating
> everything
> > > > > > correctly
> > > > > > > > but , i also want to save the checkpoint but i couldn't even
> > > though
> > > > > am
> > > > > > > > passing it as an argument.
> > > > > > > >
> > > > > > > > inputDF.write()
> > > > > > > > .format("com.uber.hoodie")
> > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > > > "_row_key")
> > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > > > "partition")
> > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > > > "timestamp")
> > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > >
> > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > > > checkpointstr)
> > > > > > > > .mode(SaveMode.Append)
> > > > > > > > .save(basePath);
> > > > > > > >
> > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
> inserting
> > > the
> > > > > > > > checkpoint while using the dataframe writer but i couldn't
> add
> > > the
> > > > > > > > checkpoint meta data in to the .hoodie meta data. Is there a
> > way
> > > i
> > > > > can
> > > > > > > add
> > > > > > > > the checkpoint meta data while using the dataframe writer
> API?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
    

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by "vbalaji@apache.org" <vb...@apache.org>.
 Hi,
I have given a sample command to set up and run the delta streamer in continuous mode and ingest fake data in the following gist:
https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
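
For quick reference, the invocation in the gist boils down to something
like the following (a sketch from memory, so the flag values and the test
source class name may differ slightly; the gist above is authoritative):

spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  hoodie-utilities-bundle-*.jar \
  --storage-type COPY_ON_WRITE \
  --source-class com.uber.hoodie.utilities.sources.DistributedTestDataSource \
  --source-ordering-field timestamp \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --props file:///path/to/test-source.properties \
  --target-base-path file:///tmp/hudi/test_table \
  --target-table test_table \
  --op UPSERT \
  --continuous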

We will eventually get this into the project wiki.
Balaji.V

    On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <ne...@gmail.com> wrote:  
 
 @Vinoth, Thanks , that would be great if Balaji could share it.

Kind regards,


On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> We usually test with our production workloads.. However, balaji recently
> merged a DistributedTestDataSource,
>
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>
>
> that can generate some random data for testing..  Balaji, do you mind
> sharing a command that can be used to kick something off like that?
>
>
> On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > I want to try to check out the performance comparison of upsert and bulk
> > insert.  But i couldn't find a clean data set more than 10 GB.
> > Would it be possible to get a data set from Hudi team? For example i was
> > using the stocks data that you provided on your demo. Hence, can i get
> > more GB's of that dataset for my experiment?
> >
> > Thanks for your consideration.
> >
> > Kind regards,
> >
> > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > >
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > >
> > > Just circling back with the resolution on the mailing list as well.
> > >
> > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> net22geb@gmail.com
> > >
> > > wrote:
> > >
> > > > Dear Vinoth,
> > > >
> > > > Thanks for your fast response.
> > > > I have created a new issue called Performance Comparison of
> > > > HoodieDeltaStreamer and DataSourceAPI #714  with the screnshots of
> the
> > > > spark UI which can be found at the  following  link
> > > > https://github.com/apache/incubator-hudi/issues/714.
> > > > In the UI,  it seems that the ingestion with the data source API is
> > > > spending  much time in the count by key of HoodieBloomIndex and
> > workload
> > > > profile.  Looking forward to receive insights from you.
> > > >
> > > > Kinde regards,
> > > >
> > > >
> > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Both datasource and deltastreamer use the same APIs underneath. So
> > not
> > > > > sure. If you can grab screenshots of spark UI for both and open a
> > > ticket,
> > > > > glad to take a look.
> > > > >
> > > > > On 2, well one of goals of Hudi is to break this dichotomy and
> enable
> > > > > streaming style (I call it incremental processing) of processing
> even
> > > in
> > > > a
> > > > > batch job. MOR is in production at uber. Atm MOR is lacking just
> one
> > > > > feature (incr pull using log files) that Nishith is planning to
> merge
> > > > soon.
> > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> > > managing
> > > > > compaction etc in the same job. I already knocked off some index
> > > > > performance problems and working on indexing the log files, which
> > > should
> > > > > unlock near real time ingest.
> > > > >
> > > > > Putting all these together, within a month or so near real time MOR
> > > > vision
> > > > > should be very real. Ofc we need community help with dev and
> testing
> > to
> > > > > speed things up. :)
> > > > >
> > > > > Hope that gives you a clearer picture.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > > net22geb@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks, Vinoth
> > > > > >
> > > > > > Its working now. But i have 2 questions:
> > > > > > 1. The ingestion latency of using DataSource API with
> > > > > > the  HoodieSparkSQLWriter  is high compared to using delta
> > streamer.
> > > > Why
> > > > > is
> > > > > > it slow? Are there specific option where we could specify to
> > minimize
> > > > the
> > > > > > ingestion latency.
> > > > > >    For example: when i run the delta streamer its talking about 1
> > > > minute
> > > > > to
> > > > > > insert some data. If i use DataSource API with
> > HoodieSparkSQLWriter,
> > > > its
> > > > > > taking 5 minutes. How can we optimize this?
> > > > > > 2. Where do we categorize Hudi in general (Is it batch processing
> > or
> > > > > > streaming)?  I am asking this because currently the copy on write
> > is
> > > > the
> > > > > > one which is fully working and since the functionality of the
> merge
> > > on
> > > > > read
> > > > > > is not fully done which enables us to have a near real time
> > > analytics,
> > > > > can
> > > > > > we consider Hudi as a batch job?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > >
> > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> vinoth@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Short answer, by default any parameter you pass in using
> > > option(k,v)
> > > > or
> > > > > > > options() beginning with "_" would be saved to the commit
> > metadata.
> > > > > > > You can change "_" prefix to something else by using the
> > > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > Reason you are not seeing the checkpointstr inside the commit
> > > > metadata
> > > > > is
> > > > > > > because its just supposed to be a prefix for all such commit
> > > > metadata.
> > > > > > >
> > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > >
> kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > > > net22geb@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> from
> > > any
> > > > > > > > dataframe into a hoodie modeled table.  Its creating
> everything
> > > > > > correctly
> > > > > > > > but , i also want to save the checkpoint but i couldn't even
> > > though
> > > > > am
> > > > > > > > passing it as an argument.
> > > > > > > >
> > > > > > > > inputDF.write()
> > > > > > > > .format("com.uber.hoodie")
> > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > > > "_row_key")
> > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > > > "partition")
> > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > > > "timestamp")
> > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > >
> > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > > > checkpointstr)
> > > > > > > > .mode(SaveMode.Append)
> > > > > > > > .save(basePath);
> > > > > > > >
> > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
> inserting
> > > the
> > > > > > > > checkpoint while using the dataframe writer but i couldn't
> add
> > > the
> > > > > > > > checkpoint meta data in to the .hoodie meta data. Is there a
> > way
> > > i
> > > > > can
> > > > > > > add
> > > > > > > > the checkpoint meta data while using the dataframe writer
> API?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
  

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
@Vinoth, thanks, that would be great if Balaji could share it.

Kind regards,


On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> We usually test with our production workloads.. However, balaji recently
> merged a DistributedTestDataSource,
>
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>
>
> that can generate some random data for testing..  Balaji, do you mind
> sharing a command that can be used to kick something off like that?
>
>
> On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > I want to try to check out the performance comparison of upsert and bulk
> > insert.  But i couldn't find a clean data set more than 10 GB.
> > Would it be possible to get a data set from Hudi team? For example i was
> > using the stocks data that you provided on your demo. Hence, can i get
> > more GB's of that dataset for my experiment?
> >
> > Thanks for your consideration.
> >
> > Kind regards,
> >
> > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > >
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > >
> > > Just circling back with the resolution on the mailing list as well.
> > >
> > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> net22geb@gmail.com
> > >
> > > wrote:
> > >
> > > > Dear Vinoth,
> > > >
> > > > Thanks for your fast response.
> > > > I have created a new issue called Performance Comparison of
> > > > HoodieDeltaStreamer and DataSourceAPI #714   with the screnshots of
> the
> > > > spark UI which can be found at the  following  link
> > > > https://github.com/apache/incubator-hudi/issues/714.
> > > > In the UI,  it seems that the ingestion with the data source API is
> > > > spending  much time in the count by key of HoodieBloomIndex and
> > workload
> > > > profile.  Looking forward to receive insights from you.
> > > >
> > > > Kinde regards,
> > > >
> > > >
> > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Both datasource and deltastreamer use the same APIs underneath. So
> > not
> > > > > sure. If you can grab screenshots of spark UI for both and open a
> > > ticket,
> > > > > glad to take a look.
> > > > >
> > > > > On 2, well one of goals of Hudi is to break this dichotomy and
> enable
> > > > > streaming style (I call it incremental processing) of processing
> even
> > > in
> > > > a
> > > > > batch job. MOR is in production at uber. Atm MOR is lacking just
> one
> > > > > feature (incr pull using log files) that Nishith is planning to
> merge
> > > > soon.
> > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> > > managing
> > > > > compaction etc in the same job. I already knocked off some index
> > > > > performance problems and working on indexing the log files, which
> > > should
> > > > > unlock near real time ingest.
> > > > >
> > > > > Putting all these together, within a month or so near real time MOR
> > > > vision
> > > > > should be very real. Ofc we need community help with dev and
> testing
> > to
> > > > > speed things up. :)
> > > > >
> > > > > Hope that gives you a clearer picture.
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > > net22geb@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks, Vinoth
> > > > > >
> > > > > > Its working now. But i have 2 questions:
> > > > > > 1. The ingestion latency of using DataSource API with
> > > > > > the  HoodieSparkSQLWriter  is high compared to using delta
> > streamer.
> > > > Why
> > > > > is
> > > > > > it slow? Are there specific option where we could specify to
> > minimize
> > > > the
> > > > > > ingestion latency.
> > > > > >    For example: when i run the delta streamer its talking about 1
> > > > minute
> > > > > to
> > > > > > insert some data. If i use DataSource API with
> > HoodieSparkSQLWriter,
> > > > its
> > > > > > taking 5 minutes. How can we optimize this?
> > > > > > 2. Where do we categorize Hudi in general (Is it batch processing
> > or
> > > > > > streaming)?  I am asking this because currently the copy on write
> > is
> > > > the
> > > > > > one which is fully working and since the functionality of the
> merge
> > > on
> > > > > read
> > > > > > is not fully done which enables us to have a near real time
> > > analytics,
> > > > > can
> > > > > > we consider Hudi as a batch job?
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > >
> > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> vinoth@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Short answer, by default any parameter you pass in using
> > > option(k,v)
> > > > or
> > > > > > > options() beginning with "_" would be saved to the commit
> > metadata.
> > > > > > > You can change "_" prefix to something else by using the
> > > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > Reason you are not seeing the checkpointstr inside the commit
> > > > metadata
> > > > > is
> > > > > > > because its just supposed to be a prefix for all such commit
> > > > metadata.
> > > > > > >
> > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > >
> kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > > > net22geb@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> from
> > > any
> > > > > > > > dataframe into a hoodie modeled table.  Its creating
> everything
> > > > > > correctly
> > > > > > > > but , i also want to save the checkpoint but i couldn't even
> > > though
> > > > > am
> > > > > > > > passing it as an argument.
> > > > > > > >
> > > > > > > > inputDF.write()
> > > > > > > > .format("com.uber.hoodie")
> > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > > > "_row_key")
> > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > > > "partition")
> > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > > > "timestamp")
> > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > >
> > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > > > checkpointstr)
> > > > > > > > .mode(SaveMode.Append)
> > > > > > > > .save(basePath);
> > > > > > > >
> > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
> inserting
> > > the
> > > > > > > > checkpoint while using the dataframe writer but i couldn't
> add
> > > the
> > > > > > > > checkpoint meta data in to the .hoodie meta data. Is there a
> > way
> > > i
> > > > > can
> > > > > > > add
> > > > > > > > the checkpoint meta data while using the dataframe writer
> API?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

We usually test with our production workloads. However, Balaji recently
merged a DistributedTestDataSource,
https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d


that can generate some random data for testing. Balaji, do you mind
sharing a command that can be used to kick something off like that?


On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <ne...@gmail.com>
wrote:

> Dear Vinoth,
>
> I want to try to check out the performance comparison of upsert and bulk
> insert.  But i couldn't find a clean data set more than 10 GB.
> Would it be possible to get a data set from Hudi team? For example i was
> using the stocks data that you provided on your demo. Hence, can i get
> more GB's of that dataset for my experiment?
>
> Thanks for your consideration.
>
> Kind regards,
>
> On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >
> > Just circling back with the resolution on the mailing list as well.
> >
> > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <net22geb@gmail.com
> >
> > wrote:
> >
> > > Dear Vinoth,
> > >
> > > Thanks for your fast response.
> > > I have created a new issue called Performance Comparison of
> > > HoodieDeltaStreamer and DataSourceAPI #714   with the screnshots of the
> > > spark UI which can be found at the  following  link
> > > https://github.com/apache/incubator-hudi/issues/714.
> > > In the UI,  it seems that the ingestion with the data source API is
> > > spending  much time in the count by key of HoodieBloomIndex and
> workload
> > > profile.  Looking forward to receive insights from you.
> > >
> > > Kinde regards,
> > >
> > >
> > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org>
> wrote:
> > >
> > > > Hi,
> > > >
> > > > Both datasource and deltastreamer use the same APIs underneath. So
> not
> > > > sure. If you can grab screenshots of spark UI for both and open a
> > ticket,
> > > > glad to take a look.
> > > >
> > > > On 2, well one of goals of Hudi is to break this dichotomy and enable
> > > > streaming style (I call it incremental processing) of processing even
> > in
> > > a
> > > > batch job. MOR is in production at uber. Atm MOR is lacking just one
> > > > feature (incr pull using log files) that Nishith is planning to merge
> > > soon.
> > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> > managing
> > > > compaction etc in the same job. I already knocked off some index
> > > > performance problems and working on indexing the log files, which
> > should
> > > > unlock near real time ingest.
> > > >
> > > > Putting all these together, within a month or so near real time MOR
> > > vision
> > > > should be very real. Ofc we need community help with dev and testing
> to
> > > > speed things up. :)
> > > >
> > > > Hope that gives you a clearer picture.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > net22geb@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Thanks, Vinoth
> > > > >
> > > > > Its working now. But i have 2 questions:
> > > > > 1. The ingestion latency of using DataSource API with
> > > > > the  HoodieSparkSQLWriter  is high compared to using delta
> streamer.
> > > Why
> > > > is
> > > > > it slow? Are there specific option where we could specify to
> minimize
> > > the
> > > > > ingestion latency.
> > > > >    For example: when i run the delta streamer its talking about 1
> > > minute
> > > > to
> > > > > insert some data. If i use DataSource API with
> HoodieSparkSQLWriter,
> > > its
> > > > > taking 5 minutes. How can we optimize this?
> > > > > 2. Where do we categorize Hudi in general (Is it batch processing
> or
> > > > > streaming)?  I am asking this because currently the copy on write
> is
> > > the
> > > > > one which is fully working and since the functionality of the merge
> > on
> > > > read
> > > > > is not fully done which enables us to have a near real time
> > analytics,
> > > > can
> > > > > we consider Hudi as a batch job?
> > > > >
> > > > > Kind regards,
> > > > >
> > > > >
> > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org>
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Short answer, by default any parameter you pass in using
> > option(k,v)
> > > or
> > > > > > options() beginning with "_" would be saved to the commit
> metadata.
> > > > > > You can change "_" prefix to something else by using the
> > > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > Reason you are not seeing the checkpointstr inside the commit
> > > metadata
> > > > is
> > > > > > because its just supposed to be a prefix for all such commit
> > > metadata.
> > > > > >
> > > > > > val metaMap = parameters.filter(kv =>
> > > > > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > >
> > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > > net22geb@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from
> > any
> > > > > > > dataframe into a hoodie modeled table.  Its creating everything
> > > > > correctly
> > > > > > > but , i also want to save the checkpoint but i couldn't even
> > though
> > > > am
> > > > > > > passing it as an argument.
> > > > > > >
> > > > > > > inputDF.write()
> > > > > > > .format("com.uber.hoodie")
> > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > > "_row_key")
> > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > > "partition")
> > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > > "timestamp")
> > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > >
> > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > > checkpointstr)
> > > > > > > .mode(SaveMode.Append)
> > > > > > > .save(basePath);
> > > > > > >
> > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting
> > the
> > > > > > > checkpoint while using the dataframe writer but i couldn't add
> > the
> > > > > > > checkpoint meta data in to the .hoodie meta data. Is there a
> way
> > i
> > > > can
> > > > > > add
> > > > > > > the checkpoint meta data while using the dataframe writer API?
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Dear Vinoth,

I want to check out the performance comparison of upsert and bulk insert,
but I couldn't find a clean data set larger than 10 GB.
Would it be possible to get a data set from the Hudi team? For example, I
was using the stocks data that you provided in your demo. Could I get a
larger version of that dataset for my experiment?
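
If no such dataset is available, I could also generate synthetic rows with
plain Spark and write them out as input, along these lines (a rough
sketch; the field names just mirror the writer example in this thread, and
the row count is a placeholder to tune until the output reaches the target
size):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-synthetic-input").getOrCreate()

// Around 200M rows as a starting point; increase until the written files
// reach the data volume needed for the experiment.
val numRows = 200000000L
val syntheticDF = spark.range(numRows).selectExpr(
  "uuid() as _row_key",                                            // record key
  "concat('2019/06/', cast(id % 30 + 1 as string)) as partition",  // partition path
  "current_timestamp() as timestamp",                              // precombine field
  "rand() as price")                                               // payload column

syntheticDF.write.parquet("file:///tmp/hudi-input/synthetic")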

Thanks for your consideration.

Kind regards,

On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vi...@apache.org> wrote:

> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>
> Just circling back with the resolution on the mailing list as well.
>
> On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Dear Vinoth,
> >
> > Thanks for your fast response.
> > I have created a new issue called Performance Comparison of
> > HoodieDeltaStreamer and DataSourceAPI #714   with the screnshots of the
> > spark UI which can be found at the  following  link
> > https://github.com/apache/incubator-hudi/issues/714.
> > In the UI,  it seems that the ingestion with the data source API is
> > spending  much time in the count by key of HoodieBloomIndex and workload
> > profile.  Looking forward to receive insights from you.
> >
> > Kinde regards,
> >
> >
> > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > Both datasource and deltastreamer use the same APIs underneath. So not
> > > sure. If you can grab screenshots of spark UI for both and open a
> ticket,
> > > glad to take a look.
> > >
> > > On 2, well one of goals of Hudi is to break this dichotomy and enable
> > > streaming style (I call it incremental processing) of processing even
> in
> > a
> > > batch job. MOR is in production at uber. Atm MOR is lacking just one
> > > feature (incr pull using log files) that Nishith is planning to merge
> > soon.
> > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> managing
> > > compaction etc in the same job. I already knocked off some index
> > > performance problems and working on indexing the log files, which
> should
> > > unlock near real time ingest.
> > >
> > > Putting all these together, within a month or so near real time MOR
> > vision
> > > should be very real. Ofc we need community help with dev and testing to
> > > speed things up. :)
> > >
> > > Hope that gives you a clearer picture.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> net22geb@gmail.com
> > >
> > > wrote:
> > >
> > > > Thanks, Vinoth
> > > >
> > > > Its working now. But i have 2 questions:
> > > > 1. The ingestion latency of using DataSource API with
> > > > the  HoodieSparkSQLWriter  is high compared to using delta streamer.
> > Why
> > > is
> > > > it slow? Are there specific option where we could specify to minimize
> > the
> > > > ingestion latency.
> > > >    For example: when i run the delta streamer its talking about 1
> > minute
> > > to
> > > > insert some data. If i use DataSource API with HoodieSparkSQLWriter,
> > its
> > > > taking 5 minutes. How can we optimize this?
> > > > 2. Where do we categorize Hudi in general (Is it batch processing or
> > > > streaming)?  I am asking this because currently the copy on write is
> > the
> > > > one which is fully working and since the functionality of the merge
> on
> > > read
> > > > is not fully done which enables us to have a near real time
> analytics,
> > > can
> > > > we consider Hudi as a batch job?
> > > >
> > > > Kind regards,
> > > >
> > > >
> > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Short answer, by default any parameter you pass in using
> option(k,v)
> > or
> > > > > options() beginning with "_" would be saved to the commit metadata.
> > > > > You can change "_" prefix to something else by using the
> > > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > Reason you are not seeing the checkpointstr inside the commit
> > metadata
> > > is
> > > > > because its just supposed to be a prefix for all such commit
> > metadata.
> > > > >
> > > > > val metaMap = parameters.filter(kv =>
> > > > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > >
> > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > > net22geb@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from
> any
> > > > > > dataframe into a hoodie modeled table.  Its creating everything
> > > > correctly
> > > > > > but , i also want to save the checkpoint but i couldn't even
> though
> > > am
> > > > > > passing it as an argument.
> > > > > >
> > > > > > inputDF.write()
> > > > > > .format("com.uber.hoodie")
> > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> > "_row_key")
> > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > > "partition")
> > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > > "timestamp")
> > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > >
> .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > > checkpointstr)
> > > > > > .mode(SaveMode.Append)
> > > > > > .save(basePath);
> > > > > >
> > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting
> the
> > > > > > checkpoint while using the dataframe writer but i couldn't add
> the
> > > > > > checkpoint meta data in to the .hoodie meta data. Is there a way
> i
> > > can
> > > > > add
> > > > > > the checkpoint meta data while using the dataframe writer API?
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Vinoth Chandar <vi...@apache.org>.
https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159

Just circling back with the resolution on the mailing list as well.

On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <ne...@gmail.com>
wrote:

> Dear Vinoth,
>
> Thanks for your fast response.
> I have created a new issue called Performance Comparison of
> HoodieDeltaStreamer and DataSourceAPI #714   with the screnshots of the
> spark UI which can be found at the  following  link
> https://github.com/apache/incubator-hudi/issues/714.
> In the UI,  it seems that the ingestion with the data source API is
> spending  much time in the count by key of HoodieBloomIndex and workload
> profile.  Looking forward to receive insights from you.
>
> Kinde regards,
>
>
> On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi,
> >
> > Both datasource and deltastreamer use the same APIs underneath. So not
> > sure. If you can grab screenshots of spark UI for both and open a ticket,
> > glad to take a look.
> >
> > On 2, well one of goals of Hudi is to break this dichotomy and enable
> > streaming style (I call it incremental processing) of processing even in
> a
> > batch job. MOR is in production at uber. Atm MOR is lacking just one
> > feature (incr pull using log files) that Nishith is planning to merge
> soon.
> > PR #692 enables Hudi DeltaStreamer to ingest continuously while managing
> > compaction etc in the same job. I already knocked off some index
> > performance problems and working on indexing the log files, which should
> > unlock near real time ingest.
> >
> > Putting all these together, within a month or so near real time MOR
> vision
> > should be very real. Ofc we need community help with dev and testing to
> > speed things up. :)
> >
> > Hope that gives you a clearer picture.
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <net22geb@gmail.com
> >
> > wrote:
> >
> > > Thanks, Vinoth
> > >
> > > Its working now. But i have 2 questions:
> > > 1. The ingestion latency of using DataSource API with
> > > the  HoodieSparkSQLWriter  is high compared to using delta streamer.
> Why
> > is
> > > it slow? Are there specific option where we could specify to minimize
> the
> > > ingestion latency.
> > >    For example: when i run the delta streamer its talking about 1
> minute
> > to
> > > insert some data. If i use DataSource API with HoodieSparkSQLWriter,
> its
> > > taking 5 minutes. How can we optimize this?
> > > 2. Where do we categorize Hudi in general (Is it batch processing or
> > > streaming)?  I am asking this because currently the copy on write is
> the
> > > one which is fully working and since the functionality of the merge on
> > read
> > > is not fully done which enables us to have a near real time analytics,
> > can
> > > we consider Hudi as a batch job?
> > >
> > > Kind regards,
> > >
> > >
> > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > Short answer, by default any parameter you pass in using option(k,v)
> or
> > > > options() beginning with "_" would be saved to the commit metadata.
> > > > You can change "_" prefix to something else by using the
> > > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > Reason you are not seeing the checkpointstr inside the commit
> metadata
> > is
> > > > because its just supposed to be a prefix for all such commit
> metadata.
> > > >
> > > > val metaMap = parameters.filter(kv =>
> > > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > >
> > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > > net22geb@gmail.com>
> > > > wrote:
> > > >
> > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > > > > dataframe into a hoodie modeled table.  Its creating everything
> > > correctly
> > > > > but , i also want to save the checkpoint but i couldn't even though
> > am
> > > > > passing it as an argument.
> > > > >
> > > > > inputDF.write()
> > > > > .format("com.uber.hoodie")
> > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),
> "_row_key")
> > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > > "partition")
> > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> > "timestamp")
> > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > > checkpointstr)
> > > > > .mode(SaveMode.Append)
> > > > > .save(basePath);
> > > > >
> > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting the
> > > > > checkpoint while using the dataframe writer but i couldn't add the
> > > > > checkpoint meta data in to the .hoodie meta data. Is there a way i
> > can
> > > > add
> > > > > the checkpoint meta data while using the dataframe writer API?
> > > > >
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Netsanet Gebretsadkan <ne...@gmail.com>.
Dear Vinoth,

Thanks for your fast response.
I have created a new issue, "Performance Comparison of HoodieDeltaStreamer
and DataSourceAPI" (#714), with screenshots of the Spark UI; it can be found
at the following link:
https://github.com/apache/incubator-hudi/issues/714.
In the UI, it looks like the ingestion through the DataSource API spends
most of its time in the countByKey of HoodieBloomIndex and in building the
workload profile. Looking forward to receiving your insights.
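
For reference, a sketch of the write-side settings that typically govern the
parallelism of those stages (the bloom index lookup and the upsert shuffle);
the keys are standard hoodie configs, but the values below are placeholders
rather than recommendations:

// Same writer options as earlier in the thread, plus explicit parallelism
// settings; 200 is a placeholder, not a tuning recommendation.
inputDF.write()
  .format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option("hoodie.insert.shuffle.parallelism", "200")
  .option("hoodie.upsert.shuffle.parallelism", "200")
  .option("hoodie.bloom.index.parallelism", "200")
  .mode(SaveMode.Append)
  .save(basePath);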

Kind regards,


On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <vi...@apache.org> wrote:

> Hi,
>
> Both datasource and deltastreamer use the same APIs underneath. So not
> sure. If you can grab screenshots of spark UI for both and open a ticket,
> glad to take a look.
>
> On 2, well one of goals of Hudi is to break this dichotomy and enable
> streaming style (I call it incremental processing) of processing even in a
> batch job. MOR is in production at uber. Atm MOR is lacking just one
> feature (incr pull using log files) that Nishith is planning to merge soon.
> PR #692 enables Hudi DeltaStreamer to ingest continuously while managing
> compaction etc in the same job. I already knocked off some index
> performance problems and working on indexing the log files, which should
> unlock near real time ingest.
>
> Putting all these together, within a month or so near real time MOR vision
> should be very real. Ofc we need community help with dev and testing to
> speed things up. :)
>
> Hope that gives you a clearer picture.
>
> Thanks
> Vinoth
>
> On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <ne...@gmail.com>
> wrote:
>
> > Thanks, Vinoth
> >
> > Its working now. But i have 2 questions:
> > 1. The ingestion latency of using DataSource API with
> > the  HoodieSparkSQLWriter  is high compared to using delta streamer. Why
> is
> > it slow? Are there specific option where we could specify to minimize the
> > ingestion latency.
> >    For example: when i run the delta streamer its talking about 1 minute
> to
> > insert some data. If i use DataSource API with HoodieSparkSQLWriter, its
> > taking 5 minutes. How can we optimize this?
> > 2. Where do we categorize Hudi in general (Is it batch processing or
> > streaming)?  I am asking this because currently the copy on write is the
> > one which is fully working and since the functionality of the merge on
> read
> > is not fully done which enables us to have a near real time analytics,
> can
> > we consider Hudi as a batch job?
> >
> > Kind regards,
> >
> >
> > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org>
> wrote:
> >
> > > Hi,
> > >
> > > Short answer, by default any parameter you pass in using option(k,v) or
> > > options() beginning with "_" would be saved to the commit metadata.
> > > You can change "_" prefix to something else by using the
> > >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > Reason you are not seeing the checkpointstr inside the commit metadata
> is
> > > because its just supposed to be a prefix for all such commit metadata.
> > >
> > > val metaMap = parameters.filter(kv =>
> > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > >
> > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > net22geb@gmail.com>
> > > wrote:
> > >
> > > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > > > dataframe into a hoodie modeled table.  Its creating everything
> > correctly
> > > > but , i also want to save the checkpoint but i couldn't even though
> am
> > > > passing it as an argument.
> > > >
> > > > inputDF.write()
> > > > .format("com.uber.hoodie")
> > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > > "partition")
> > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),
> "timestamp")
> > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > > checkpointstr)
> > > > .mode(SaveMode.Append)
> > > > .save(basePath);
> > > >
> > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting the
> > > > checkpoint while using the dataframe writer but i couldn't add the
> > > > checkpoint meta data in to the .hoodie meta data. Is there a way i
> can
> > > add
> > > > the checkpoint meta data while using the dataframe writer API?
> > > >
> > >
> >
>

Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

Both the datasource and the deltastreamer use the same APIs underneath, so I
am not sure. If you can grab screenshots of the Spark UI for both and open a
ticket, I'd be glad to take a look.

On 2, one of the goals of Hudi is to break this dichotomy and enable a
streaming style of processing (I call it incremental processing) even in a
batch job. MOR is in production at Uber. At the moment, MOR is lacking just
one feature (incremental pull using log files) that Nishith is planning to
merge soon. PR #692 enables the Hudi DeltaStreamer to ingest continuously
while managing compaction etc. in the same job. I have already knocked off
some index performance problems and am working on indexing the log files,
which should unlock near-real-time ingest.
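
As a concrete illustration, the incremental pull that already works on
copy-on-write tables is driven from the read side; a minimal sketch,
assuming a SparkSession named spark and the com.uber.hoodie
DataSourceReadOptions (the begin instant time below is a placeholder):

// Incremental read: only records committed after the given instant time are
// returned, which is what enables incremental processing inside a batch job.
Dataset<Row> incrementalDF = spark.read()
    .format("com.uber.hoodie")
    .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
        DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), "20190603080000")
    .load(basePath);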

Putting all of this together, within a month or so the near-real-time MOR
vision should be very real. Of course, we need community help with dev and
testing to speed things up. :)

Hope that gives you a clearer picture.

Thanks
Vinoth

On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <ne...@gmail.com>
wrote:

> Thanks, Vinoth
>
> Its working now. But i have 2 questions:
> 1. The ingestion latency of using DataSource API with
> the  HoodieSparkSQLWriter  is high compared to using delta streamer. Why is
> it slow? Are there specific option where we could specify to minimize the
> ingestion latency.
>    For example: when i run the delta streamer its talking about 1 minute to
> insert some data. If i use DataSource API with HoodieSparkSQLWriter, its
> taking 5 minutes. How can we optimize this?
> 2. Where do we categorize Hudi in general (Is it batch processing or
> streaming)?  I am asking this because currently the copy on write is the
> one which is fully working and since the functionality of the merge on read
> is not fully done which enables us to have a near real time analytics, can
> we consider Hudi as a batch job?
>
> Kind regards,
>
>
> On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi,
> >
> > Short answer, by default any parameter you pass in using option(k,v) or
> > options() beginning with "_" would be saved to the commit metadata.
> > You can change "_" prefix to something else by using the
> >  DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > Reason you are not seeing the checkpointstr inside the commit metadata is
> > because its just supposed to be a prefix for all such commit metadata.
> >
> > val metaMap = parameters.filter(kv =>
> > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> >
> > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> net22geb@gmail.com>
> > wrote:
> >
> > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > > dataframe into a hoodie modeled table.  Its creating everything
> correctly
> > > but , i also want to save the checkpoint but i couldn't even though am
> > > passing it as an argument.
> > >
> > > inputDF.write()
> > > .format("com.uber.hoodie")
> > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),
> > "partition")
> > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> > > checkpointstr)
> > > .mode(SaveMode.Append)
> > > .save(basePath);
> > >
> > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting the
> > > checkpoint while using the dataframe writer but i couldn't add the
> > > checkpoint meta data in to the .hoodie meta data. Is there a way i can
> > add
> > > the checkpoint meta data while using the dataframe writer API?
> > >
> >
>