Posted to dev@hudi.apache.org by sa...@gmail.com on 2019/04/26 10:02:17 UTC

Reading Merge_on_read table| Unable to read updated records after multiple updates

Writing the Hudi dataset as below:

// Imports assume the pre-Apache (com.uber.hoodie) package layout.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{current_timestamp, lit}
import com.uber.hoodie.DataSourceWriteOptions
import com.uber.hoodie.config.HoodieWriteConfig

ds.withColumn("emp_name", lit("upd1 Emily"))
  .withColumn("ts", current_timestamp)
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")


1st run - write record 1, "hudi_045", current_timestamp as ts
read result - 1, hudi_045
2nd run - write record 1, "hudi_046", current_timestamp as ts
read result - 1, hudi_046
3rd run - write record 1, "hoodie_123", current_timestamp as ts
read result - 1, hudi_046
4th run - write record 1, "hdie_1232324", current_timestamp as ts
read result - 1, hudi_046

After multiple updates to the same record, the generated log.1 files contain
multiple instances of that record. At this point the latest update is not
fetched by the query.

14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1 - has record that was updated in run 1
15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1 - has record that was updated in run 2 and run 3
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet


So, is there any compaction that needs to be enabled before reading, or while writing?
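
[Editor's note: a minimal sketch of one way to address this, under assumptions rather than taken from the thread. Merge-on-read keeps updates in the .log files until compaction folds them back into parquet; inline compaction can be requested from the write path, assuming the config keys hoodie.compact.inline and hoodie.compact.inline.max.delta.commits and the option key DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY exist in this release. Even without it, the realtime (_rt) table should merge the log files at query time; compaction only affects when the read-optimized view catches up.]

// Editor's sketch, reusing ds and the imports from the first snippet above.
// Inline compaction rewrites log files into parquet every N delta commits;
// the precombine field decides which of several updates to the same emp_id wins.
ds.withColumn("ts", current_timestamp)
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")         // assumed key name; default field is "ts"
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option("hoodie.compact.inline", "true")                               // assumed config key
  .option("hoodie.compact.inline.max.delta.commits", "2")                // compact every 2 delta commits
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")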


Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Satish,

Those files make sense, i.e. it seems your second update went to a log
file, then compaction was scheduled, and the third one went to a new log
file.
The only issue could be that the query is not picking up the right
InputFormat, or that the Hive table is not registered with the correct
InputFormat/RecordReader.

I know you and Nishith were chatting about this on GitHub, but did you get
the table registered using the sync tool and follow the steps here:
http://hudi.apache.org/querying_data.html#hive-rt-view ?
Otherwise, it could be that the query engine (I am assuming Hive?) is
simply ignoring all the log files, since they begin with ".", and only
reading the parquet files.

thanks
Vinoth
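
[Editor's note: a sketch of the datasource-level Hive sync referred to above, assuming the HIVE_* option keys are present in this release's DataSourceWriteOptions; the JDBC URL, database and credentials are placeholders. This registration is what gives the _rt table Hudi's realtime InputFormat/RecordReader, so the .log files are merged instead of ignored. The standalone run_sync_tool.sh from the hoodie-hive module should achieve the same registration outside the write path.]

// Editor's sketch, reusing ds and the imports from the first snippet in the thread.
// All HIVE_* values below are hypothetical placeholders for the cluster in question.
// A partition value extractor matching the yyyy/MM/dd layout may also need to be set.
ds.withColumn("ts", current_timestamp)
  .write.format("com.uber.hoodie")
  .option(HoodieWriteConfig.TABLE_NAME, "emp_mor_26")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "emp_id")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "emp_mor_26")
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000")
  .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "hive")
  .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "hive")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "part_by")
  .mode(SaveMode.Append)
  .save("/apps/hive/warehouse/emp_mor_26")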

On Mon, Apr 29, 2019 at 9:33 PM SATISH SIDNAKOPPA <
satish.sidnakoppa.it@gmail.com> wrote:

> Hi Vinoth,
>
> Missed while copying.
> PFB the list of files
>
> 14:45
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> - has record that was updated in run 1
> 15:00
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> - has record that was updated in run 2 and run 3
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>
>
> On Mon, Apr 29, 2019 at 8:26 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi Satish,
> >
> > There are no parquet files? Can you share the full listing of files in
> the
> > partition?
> >
> > Thanks
> > Vinoth
> >
> > On Mon, Apr 29, 2019 at 7:22 AM SATISH SIDNAKOPPA <
> > satish.sidnakoppa.it@gmail.com> wrote:
> >
> > > Yes,
> > > As this needed discussion ,the thread was created in google groups for
> > > inputs.
> > > I am unable to read from rt table after multiple updates.
> > >
> > > 14:45
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > > -* has record that was updated in run 1*
> > > 15:00
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > > - *has record that was updated in run 2 and run 3*
> > > 14:41
> > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > > 14:41
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > >
> > >
> > >
> > >
> > > On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <
> > > satish.sidnakoppa.it@gmail.com> wrote:
> > >
> > > > No ,the issue is faced with rt table created by sync tool .
> > > >
> > > > On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <vinoth@apache.org
> wrote:
> > > >
> > > >> once you registered the rt table, is this working now for you?
> > > >>
> > > >> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
> > > >> satish.sidnakoppa.it@gmail.com> wrote:
> > > >>
> > > >> > I am querying real time view of the table.
> > > >> > This table (emp_mor_26_rt) created after runsync tool.
> > > >> > So the first updated record are fetched from log1 file.
> > > >> >
> > > >> > Only after third update both the updates are placed in log files.
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org
> > wrote:
> > > >> >
> > > >> > > Looks like you are querying the RO table? If so, the query only
> > hits
> > > >> > > parquet file; which was probably generated during the first
> upsert
> > > and
> > > >> > all
> > > >> > > others went to the log. Unless compaction runs, it wont show up
> on
> > > ro
> > > >> > table
> > > >> > >
> > > >> > > If you want the latest merged view you need to query the RT
> table.
> > > >> > >
> > > >> > > Does that sound applicable?
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com
> <
> > > >> > > satish.sidnakoppa.it@gmail.com> wrote:
> > > >> > >
> > > >> > > > Writing hudi set as below
> > > >> > > >
> > > >> > > > ds.withColumn("emp_name",lit("upd1
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> > > >> > > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > > >> > > >
> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > > >> > > >
> > > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > > >> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
> > > >> "part_by")
> > > >> > > > .option("hoodie.upsert.shuffle.parallelism",4)
> > > >> > > > .mode(SaveMode.Append)
> > > >> > > > .save("/apps/hive/warehouse/emp_mor_26")
> > > >> > > >
> > > >> > > >
> > > >> > > > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > > >> > > > read result -- 1, hudi_045
> > > >> > > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > > >> > > > read result -- 1,hudi_046
> > > >> > > > 3rd run -- write record 1, "hoodie_123",current_timestamp as
> ts
> > > >> > > > read result --- 1,hudi_046
> > > >> > > > 4th run -- write record 1, "hdie_1232324",current_timestamp as
> > ts
> > > >> > > > read result --- 1,hudi_046
> > > >> > > >
> > > >> > > > after multiple updates to same record ,
> > > >> > > > the generated  log.1 has multiple instances of the same
> record.
> > > >> > > > At this point the updated record is not fetched.
> > > >> > > >
> > > >> > > > 14:45
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > > >> > > > - has record that was updated in run 1
> > > >> > > > 15:00
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > > >> > > > - has record that was updated in run 2 and run 3
> > > >> > > > 14:41
> > > >> > >
> > > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > > >> > > > 14:41
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > > >> > > >
> > > >> > > >
> > > >> > > > So is there any compaction to be enabled before reading or
> while
> > > >> > writing
> > > >> > > .
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
Hi Vinoth,

Missed it while copying.
PFB the list of files:

14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
- has record that was updated in run 1
15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
- has record that was updated in run 2 and run 3
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet


On Mon, Apr 29, 2019 at 8:26 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Satish,
>
> There are no parquet files? Can you share the full listing of files in the
> partition?
>
> Thanks
> Vinoth
>
> On Mon, Apr 29, 2019 at 7:22 AM SATISH SIDNAKOPPA <
> satish.sidnakoppa.it@gmail.com> wrote:
>
> > Yes,
> > As this needed discussion ,the thread was created in google groups for
> > inputs.
> > I am unable to read from rt table after multiple updates.
> >
> > 14:45
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > -* has record that was updated in run 1*
> > 15:00
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > - *has record that was updated in run 2 and run 3*
> > 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > 14:41
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> >
> >
> >
> >
> > On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <
> > satish.sidnakoppa.it@gmail.com> wrote:
> >
> > > No ,the issue is faced with rt table created by sync tool .
> > >
> > > On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <vinoth@apache.org wrote:
> > >
> > >> once you registered the rt table, is this working now for you?
> > >>
> > >> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
> > >> satish.sidnakoppa.it@gmail.com> wrote:
> > >>
> > >> > I am querying real time view of the table.
> > >> > This table (emp_mor_26_rt) created after runsync tool.
> > >> > So the first updated record are fetched from log1 file.
> > >> >
> > >> > Only after third update both the updates are placed in log files.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org
> wrote:
> > >> >
> > >> > > Looks like you are querying the RO table? If so, the query only
> hits
> > >> > > parquet file; which was probably generated during the first upsert
> > and
> > >> > all
> > >> > > others went to the log. Unless compaction runs, it wont show up on
> > ro
> > >> > table
> > >> > >
> > >> > > If you want the latest merged view you need to query the RT table.
> > >> > >
> > >> > > Does that sound applicable?
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> > >> > > satish.sidnakoppa.it@gmail.com> wrote:
> > >> > >
> > >> > > > Writing hudi set as below
> > >> > > >
> > >> > > > ds.withColumn("emp_name",lit("upd1
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> > >> > > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > >> > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > >> > > >
> > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > >> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
> > >> "part_by")
> > >> > > > .option("hoodie.upsert.shuffle.parallelism",4)
> > >> > > > .mode(SaveMode.Append)
> > >> > > > .save("/apps/hive/warehouse/emp_mor_26")
> > >> > > >
> > >> > > >
> > >> > > > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > >> > > > read result -- 1, hudi_045
> > >> > > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > >> > > > read result -- 1,hudi_046
> > >> > > > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> > >> > > > read result --- 1,hudi_046
> > >> > > > 4th run -- write record 1, "hdie_1232324",current_timestamp as
> ts
> > >> > > > read result --- 1,hudi_046
> > >> > > >
> > >> > > > after multiple updates to same record ,
> > >> > > > the generated  log.1 has multiple instances of the same record.
> > >> > > > At this point the updated record is not fetched.
> > >> > > >
> > >> > > > 14:45
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > >> > > > - has record that was updated in run 1
> > >> > > > 15:00
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > >> > > > - has record that was updated in run 2 and run 3
> > >> > > > 14:41
> > >> > >
> > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > >> > > > 14:41
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > >> > > >
> > >> > > >
> > >> > > > So is there any compaction to be enabled before reading or while
> > >> > writing
> > >> > > .
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Satish,

There are no parquet files? Can you share the full listing of files in the
partition?

Thanks
Vinoth

On Mon, Apr 29, 2019 at 7:22 AM SATISH SIDNAKOPPA <
satish.sidnakoppa.it@gmail.com> wrote:

> Yes,
> As this needed discussion ,the thread was created in google groups for
> inputs.
> I am unable to read from rt table after multiple updates.
>
> 14:45
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> -* has record that was updated in run 1*
> 15:00
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> - *has record that was updated in run 2 and run 3*
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>
>
>
>
> On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <
> satish.sidnakoppa.it@gmail.com> wrote:
>
> > No ,the issue is faced with rt table created by sync tool .
> >
> > On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <vinoth@apache.org wrote:
> >
> >> once you registered the rt table, is this working now for you?
> >>
> >> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
> >> satish.sidnakoppa.it@gmail.com> wrote:
> >>
> >> > I am querying real time view of the table.
> >> > This table (emp_mor_26_rt) created after runsync tool.
> >> > So the first updated record are fetched from log1 file.
> >> >
> >> > Only after third update both the updates are placed in log files.
> >> >
> >> >
> >> >
> >> >
> >> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org wrote:
> >> >
> >> > > Looks like you are querying the RO table? If so, the query only hits
> >> > > parquet file; which was probably generated during the first upsert
> and
> >> > all
> >> > > others went to the log. Unless compaction runs, it wont show up on
> ro
> >> > table
> >> > >
> >> > > If you want the latest merged view you need to query the RT table.
> >> > >
> >> > > Does that sound applicable?
> >> > >
> >> > >
> >> > >
> >> > > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> >> > > satish.sidnakoppa.it@gmail.com> wrote:
> >> > >
> >> > > > Writing hudi set as below
> >> > > >
> >> > > > ds.withColumn("emp_name",lit("upd1
> >> > > >
> >> > >
> >> >
> >>
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> >> > > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> >> > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> >> > > >
> .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> >> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
> >> "part_by")
> >> > > > .option("hoodie.upsert.shuffle.parallelism",4)
> >> > > > .mode(SaveMode.Append)
> >> > > > .save("/apps/hive/warehouse/emp_mor_26")
> >> > > >
> >> > > >
> >> > > > 1st run - write record 1,"hudi_045",current_timestamp as ts
> >> > > > read result -- 1, hudi_045
> >> > > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> >> > > > read result -- 1,hudi_046
> >> > > > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> >> > > > read result --- 1,hudi_046
> >> > > > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> >> > > > read result --- 1,hudi_046
> >> > > >
> >> > > > after multiple updates to same record ,
> >> > > > the generated  log.1 has multiple instances of the same record.
> >> > > > At this point the updated record is not fetched.
> >> > > >
> >> > > > 14:45
> >> > > >
> >> > >
> >> >
> >>
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> >> > > > - has record that was updated in run 1
> >> > > > 15:00
> >> > > >
> >> > >
> >> >
> >>
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> >> > > > - has record that was updated in run 2 and run 3
> >> > > > 14:41
> >> > >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> >> > > > 14:41
> >> > > >
> >> > >
> >> >
> >>
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> >> > > >
> >> > > >
> >> > > > So is there any compaction to be enabled before reading or while
> >> > writing
> >> > > .
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
Yes.
As this needed discussion, the thread was created in Google Groups for
inputs.
I am unable to read from the rt table after multiple updates.

14:45 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
- has record that was updated in run 1
15:00 /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
- has record that was updated in run 2 and run 3
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet




On Sat, Apr 27, 2019 at 7:24 PM SATISH SIDNAKOPPA <
satish.sidnakoppa.it@gmail.com> wrote:

> No ,the issue is faced with rt table created by sync tool .
>
> On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <vinoth@apache.org wrote:
>
>> once you registered the rt table, is this working now for you?
>>
>> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
>> satish.sidnakoppa.it@gmail.com> wrote:
>>
>> > I am querying real time view of the table.
>> > This table (emp_mor_26_rt) created after runsync tool.
>> > So the first updated record are fetched from log1 file.
>> >
>> > Only after third update both the updates are placed in log files.
>> >
>> >
>> >
>> >
>> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org wrote:
>> >
>> > > Looks like you are querying the RO table? If so, the query only hits
>> > > parquet file; which was probably generated during the first upsert and
>> > all
>> > > others went to the log. Unless compaction runs, it wont show up on ro
>> > table
>> > >
>> > > If you want the latest merged view you need to query the RT table.
>> > >
>> > > Does that sound applicable?
>> > >
>> > >
>> > >
>> > > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
>> > > satish.sidnakoppa.it@gmail.com> wrote:
>> > >
>> > > > Writing hudi set as below
>> > > >
>> > > > ds.withColumn("emp_name",lit("upd1
>> > > >
>> > >
>> >
>> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
>> > > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
>> > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
>> > > > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
>> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
>> "part_by")
>> > > > .option("hoodie.upsert.shuffle.parallelism",4)
>> > > > .mode(SaveMode.Append)
>> > > > .save("/apps/hive/warehouse/emp_mor_26")
>> > > >
>> > > >
>> > > > 1st run - write record 1,"hudi_045",current_timestamp as ts
>> > > > read result -- 1, hudi_045
>> > > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
>> > > > read result -- 1,hudi_046
>> > > > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
>> > > > read result --- 1,hudi_046
>> > > > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
>> > > > read result --- 1,hudi_046
>> > > >
>> > > > after multiple updates to same record ,
>> > > > the generated  log.1 has multiple instances of the same record.
>> > > > At this point the updated record is not fetched.
>> > > >
>> > > > 14:45
>> > > >
>> > >
>> >
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
>> > > > - has record that was updated in run 1
>> > > > 15:00
>> > > >
>> > >
>> >
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
>> > > > - has record that was updated in run 2 and run 3
>> > > > 14:41
>> > > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
>> > > > 14:41
>> > > >
>> > >
>> >
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>> > > >
>> > > >
>> > > > So is there any compaction to be enabled before reading or while
>> > writing
>> > > .
>> > > >
>> > > >
>> > >
>> >
>>
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
No, the issue is faced with the rt table created by the sync tool.

On Fri 26 Apr, 2019, 11:53 PM Vinoth Chandar <vinoth@apache.org wrote:

> once you registered the rt table, is this working now for you?
>
> On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
> satish.sidnakoppa.it@gmail.com> wrote:
>
> > I am querying real time view of the table.
> > This table (emp_mor_26_rt) created after runsync tool.
> > So the first updated record are fetched from log1 file.
> >
> > Only after third update both the updates are placed in log files.
> >
> >
> >
> >
> > On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org wrote:
> >
> > > Looks like you are querying the RO table? If so, the query only hits
> > > parquet file; which was probably generated during the first upsert and
> > all
> > > others went to the log. Unless compaction runs, it wont show up on ro
> > table
> > >
> > > If you want the latest merged view you need to query the RT table.
> > >
> > > Does that sound applicable?
> > >
> > >
> > >
> > > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> > > satish.sidnakoppa.it@gmail.com> wrote:
> > >
> > > > Writing hudi set as below
> > > >
> > > > ds.withColumn("emp_name",lit("upd1
> > > >
> > >
> >
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> > > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > > > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,
> "part_by")
> > > > .option("hoodie.upsert.shuffle.parallelism",4)
> > > > .mode(SaveMode.Append)
> > > > .save("/apps/hive/warehouse/emp_mor_26")
> > > >
> > > >
> > > > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > > > read result -- 1, hudi_045
> > > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > > > read result -- 1,hudi_046
> > > > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> > > > read result --- 1,hudi_046
> > > > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> > > > read result --- 1,hudi_046
> > > >
> > > > after multiple updates to same record ,
> > > > the generated  log.1 has multiple instances of the same record.
> > > > At this point the updated record is not fetched.
> > > >
> > > > 14:45
> > > >
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > > > - has record that was updated in run 1
> > > > 15:00
> > > >
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > > > - has record that was updated in run 2 and run 3
> > > > 14:41
> > > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > > > 14:41
> > > >
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > > >
> > > >
> > > > So is there any compaction to be enabled before reading or while
> > writing
> > > .
> > > >
> > > >
> > >
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by Vinoth Chandar <vi...@apache.org>.
once you registered the rt table, is this working now for you?

On Fri, Apr 26, 2019 at 9:36 AM SATISH SIDNAKOPPA <
satish.sidnakoppa.it@gmail.com> wrote:

> I am querying real time view of the table.
> This table (emp_mor_26_rt) created after runsync tool.
> So the first updated record are fetched from log1 file.
>
> Only after third update both the updates are placed in log files.
>
>
>
>
> On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org wrote:
>
> > Looks like you are querying the RO table? If so, the query only hits
> > parquet file; which was probably generated during the first upsert and
> all
> > others went to the log. Unless compaction runs, it wont show up on ro
> table
> >
> > If you want the latest merged view you need to query the RT table.
> >
> > Does that sound applicable?
> >
> >
> >
> > On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> > satish.sidnakoppa.it@gmail.com> wrote:
> >
> > > Writing hudi set as below
> > >
> > > ds.withColumn("emp_name",lit("upd1
> > >
> >
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> > > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> > > .option("hoodie.upsert.shuffle.parallelism",4)
> > > .mode(SaveMode.Append)
> > > .save("/apps/hive/warehouse/emp_mor_26")
> > >
> > >
> > > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > > read result -- 1, hudi_045
> > > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > > read result -- 1,hudi_046
> > > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> > > read result --- 1,hudi_046
> > > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> > > read result --- 1,hudi_046
> > >
> > > after multiple updates to same record ,
> > > the generated  log.1 has multiple instances of the same record.
> > > At this point the updated record is not fetched.
> > >
> > > 14:45
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > > - has record that was updated in run 1
> > > 15:00
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > > - has record that was updated in run 2 and run 3
> > > 14:41
> > /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > > 14:41
> > >
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> > >
> > >
> > > So is there any compaction to be enabled before reading or while
> writing
> > .
> > >
> > >
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by SATISH SIDNAKOPPA <sa...@gmail.com>.
I am querying the real-time view of the table.
This table (emp_mor_26_rt) was created after running the sync tool.
So the first updated record is fetched from the log.1 file.

Only after the third update are both updates placed in log files.




On Fri 26 Apr, 2019, 6:30 PM Vinoth Chandar <vinoth@apache.org wrote:

> Looks like you are querying the RO table? If so, the query only hits
> parquet file; which was probably generated during the first upsert and all
> others went to the log. Unless compaction runs, it wont show up on ro table
>
> If you want the latest merged view you need to query the RT table.
>
> Does that sound applicable?
>
>
>
> On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> satish.sidnakoppa.it@gmail.com> wrote:
>
> > Writing hudi set as below
> >
> > ds.withColumn("emp_name",lit("upd1
> >
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> > .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> > .option("hoodie.upsert.shuffle.parallelism",4)
> > .mode(SaveMode.Append)
> > .save("/apps/hive/warehouse/emp_mor_26")
> >
> >
> > 1st run - write record 1,"hudi_045",current_timestamp as ts
> > read result -- 1, hudi_045
> > 2nd run - write record 1,"hudi_046",current_timestamp as ts
> > read result -- 1,hudi_046
> > 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> > read result --- 1,hudi_046
> > 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> > read result --- 1,hudi_046
> >
> > after multiple updates to same record ,
> > the generated  log.1 has multiple instances of the same record.
> > At this point the updated record is not fetched.
> >
> > 14:45
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> > - has record that was updated in run 1
> > 15:00
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> > - has record that was updated in run 2 and run 3
> > 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> > 14:41
> >
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
> >
> >
> > So is there any compaction to be enabled before reading or while writing
> .
> >
> >
>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by Vinoth Chandar <vi...@apache.org>.
https://github.com/apache/incubator-hudi/issues/652#issuecomment-487016906
Looks like Nishith and you were chatting about this here.

On Fri, Apr 26, 2019 at 6:00 AM Vinoth Chandar <vi...@apache.org> wrote:

> Looks like you are querying the RO table? If so, the query only hits
> parquet file; which was probably generated during the first upsert and all
> others went to the log. Unless compaction runs, it wont show up on ro table
>
> If you want the latest merged view you need to query the RT table.
>
> Does that sound applicable?
>
>
>
> On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
> satish.sidnakoppa.it@gmail.com> wrote:
>
>> Writing hudi set as below
>>
>> ds.withColumn("emp_name",lit("upd1
>> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
>> .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
>> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
>> .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
>> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
>> .option("hoodie.upsert.shuffle.parallelism",4)
>> .mode(SaveMode.Append)
>> .save("/apps/hive/warehouse/emp_mor_26")
>>
>>
>> 1st run - write record 1,"hudi_045",current_timestamp as ts
>> read result -- 1, hudi_045
>> 2nd run - write record 1,"hudi_046",current_timestamp as ts
>> read result -- 1,hudi_046
>> 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
>> read result --- 1,hudi_046
>> 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
>> read result --- 1,hudi_046
>>
>> after multiple updates to same record ,
>> the generated  log.1 has multiple instances of the same record.
>> At this point the updated record is not fetched.
>>
>> 14:45
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
>> - has record that was updated in run 1
>> 15:00
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
>> - has record that was updated in run 2 and run 3
>> 14:41
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
>> 14:41
>> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>>
>>
>> So is there any compaction to be enabled before reading or while writing .
>>
>>

Re: Reading Merge_on_read table| Unable to read updated records after multiple updates

Posted by Vinoth Chandar <vi...@apache.org>.
Looks like you are querying the RO table? If so, the query only hits the
parquet file, which was probably generated during the first upsert, while
all the other updates went to the log. Unless compaction runs, they won't
show up in the RO table.

If you want the latest merged view, you need to query the RT table.

Does that sound applicable?
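
[Editor's note: a hedged sketch of comparing the two views from Spark SQL, assuming the sync tool registered the read-optimized table as emp_mor_26 and the realtime table as emp_mor_26_rt, and that the Hudi bundle is on the Spark classpath. Disabling spark.sql.hive.convertMetastoreParquet forces Spark through the Hive InputFormat instead of its built-in parquet reader, which would otherwise skip the .log files. If the realtime path is not wired up for Spark in this release, running the same two selects from the Hive CLI shows the same contrast.]

// Editor's sketch; table names and the conf setting are assumptions noted above.
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")

// Read-optimized view: parquet only, i.e. the value as of the last compaction.
spark.sql("select emp_id, emp_name, ts from emp_mor_26 where emp_id = 1").show(false)

// Realtime view: parquet merged with the .log files, i.e. the latest update.
spark.sql("select emp_id, emp_name, ts from emp_mor_26_rt where emp_id = 1").show(false)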



On Fri, Apr 26, 2019 at 3:02 AM satish.sidnakoppa.it@gmail.com <
satish.sidnakoppa.it@gmail.com> wrote:

> Writing hudi set as below
>
> ds.withColumn("emp_name",lit("upd1
> Emily")).withColumn("ts",current_timestamp).write.format("com.uber.hoodie")
> .option(HoodieWriteConfig.TABLE_NAME,"emp_mor_26")
> .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,"emp_id")
> .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,"MERGE_ON_READ")
> .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "part_by")
> .option("hoodie.upsert.shuffle.parallelism",4)
> .mode(SaveMode.Append)
> .save("/apps/hive/warehouse/emp_mor_26")
>
>
> 1st run - write record 1,"hudi_045",current_timestamp as ts
> read result -- 1, hudi_045
> 2nd run - write record 1,"hudi_046",current_timestamp as ts
> read result -- 1,hudi_046
> 3rd run -- write record 1, "hoodie_123",current_timestamp as ts
> read result --- 1,hudi_046
> 4th run -- write record 1, "hdie_1232324",current_timestamp as ts
> read result --- 1,hudi_046
>
> after multiple updates to same record ,
> the generated  log.1 has multiple instances of the same record.
> At this point the updated record is not fetched.
>
> 14:45
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144153.log.1
> - has record that was updated in run 1
> 15:00
> /apps/hive/warehouse/emp_mor_26/2019/09/22/.278a46f9--87a_20190426144540.log.1
> - has record that was updated in run 2 and run 3
> 14:41 /apps/hive/warehouse/emp_mor_26/2019/09/22/.hoodie_partition_metadata
> 14:41
> /apps/hive/warehouse/emp_mor_26/2019/09/22/278a46f9--87a_0_20190426144153.parquet
>
>
> So is there any compaction to be enabled before reading or while writing .
>
>