Posted to dev@hudi.apache.org by ra...@gmail.com on 2019/04/04 08:33:52 UTC

Hudi with duplicate key

Dear All

I am using a COW table with INSERT/BULK_INSERT.
I am loading the data from JSON files.

If an existing key in the Hudi dataset is loaded again, only the new data with that key is shown. Can I show both records? (with INSERT)

If the same key appears multiple times in a source JSON file, only one record is loaded. Can I load duplicate keys from the same file? (with both INSERT and BULK_INSERT)


Thanks & Regards
Rahul

Re: Hudi with duplicate key

Posted by Vinoth Chandar <vi...@apache.org>.
https://github.com/apache/incubator-hudi/pull/634

Rahul, I do see what you mean. The documented defaults are for the write client, and they are correct. I think it makes sense to change the defaults for DataSource and DeltaStreamer inserts.
We can discuss the pros/cons on the PR?


Re: Hudi with duplicate key

Posted by Vinoth Chandar <ma...@gmail.com>.
Hi Rahul,

+1 to Kabeer's suggestion. You can just generate a UUID as a new key and issue upserts. It will also help you identify duplicates eventually.

>> @vinod For this use case I don't want key-based updates; I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.
For your use case, I think you should just use the insert operation (which totally bypasses the index lookup/update), not upsert, and set combine-on-insert to false (please see the docs for the exact property name).

On changing defaults, I still think the defaults make sense:

  private static final String COMBINE_BEFORE_INSERT_PROP = "hoodie.combine.before.insert";
  private static final String DEFAULT_COMBINE_BEFORE_INSERT = "false";

We turn off combining by default for the insert operation. Please raise a JIRA if that's not working for you out of the box.

Thanks
Vinoth
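
[Editor's note: a rough sketch of the above through the Spark datasource, assuming the JSON files are read into a DataFrame first. The option keys are standard Hudi write options (the 2019 incubating releases used the com.uber.hoodie format name rather than org.apache.hudi); the table name, key/precombine fields, and paths are made-up placeholders.]

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class InsertWithDuplicates {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-insert-dupes").getOrCreate();

    // Source records; duplicates in here should survive the write.
    Dataset<Row> input = spark.read().json("/data/incoming/*.json");

    input.write()
        .format("org.apache.hudi")
        // insert (not upsert) bypasses the index lookup/update entirely
        .option("hoodie.datasource.write.operation", "insert")
        // do not pre-combine records sharing a key within the batch
        .option("hoodie.combine.before.insert", "false")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.table.name", "user_transactions")
        .mode(SaveMode.Append)
        .save("/data/hudi/user_transactions");
  }
}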


Re: Hudi with duplicate key

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Hi Rahul,

Thank you for specifying the example. Wouldn't your example be easy if you really switched the primary_key to the names of the keys, like kabeer/rahul etc.? I am sure your example must be a much simplified version of the actual task you have at hand.
If that is really not possible, then I would consider something like below:
name, amount, ID, IDwithName (compound key)
rahul,15,0,0rahul
kabeer,17,0,0kabeer
vinod,18,0,....
nishith,16,0,....

This will help you with all further inserts and updates. A further update would then be based on 0rahul, 0kabeer etc., and since you will have provided Hudi with unique keys, you get the desired results. This is a common way we achieve these results should a need arise similar to yours.
Thanks
Kabeer.
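
[Editor's note: for illustration only, a sketch of deriving that compound key in Spark before the write, reusing the input DataFrame from the earlier sketch. Column names follow the example above; the cast guards against the numeric ID.]

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Build IDwithName = ID + name and use it as the Hudi record key.
Dataset<Row> keyed = input.withColumn(
    "IDwithName", concat(col("ID").cast("string"), col("name")));
// then write with .option("hoodie.datasource.write.recordkey.field", "IDwithName")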



Re: Hudi with duplicate key

Posted by ra...@gmail.com.

Dear Vinod/Kabeer

With the hoodie.combine.before.upsert property = true I am able to insert duplicate records into the Hudi dataset. But if I use the same key in the next load, only the new data with that key is showing.

eg: after the 1st insert

rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0

I loaded the same file again (I am using INSERT for this).

The records are showing like below:

rahul,15,0
rahul,15,0
rahul,15,0
rahul,15,0

How can I avoid the second case? I need the output as:
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0

Please assist with this.

Note: I am also using the hoodie.parquet.small.file.limit feature, so while my data is small it will append data into the same parquet file.

Thanks & Regards
Rahul
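
[Editor's note: following Vinoth's earlier UUID suggestion, one hedged way to keep every load's records distinct is to mint a fresh key per record, again continuing the earlier sketch's input DataFrame. The uuid() SQL function requires Spark 2.3+; the column name is a placeholder.]

import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Each record gets its own UUID, so re-loading the same file can never
// collide with previously written record keys.
Dataset<Row> keyed = input.withColumn("record_uuid", expr("uuid()"));
// write with .option("hoodie.datasource.write.recordkey.field", "record_uuid")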




Re: Hudi with duplicate key

Posted by ra...@gmail.com.

Dear Vinod

As per your suggestion I checked the hoodie.combine.before.upsert property:

combineInput(on_insert = false, on_update = true)
Property: hoodie.combine.before.insert, hoodie.combine.before.upsert
Flag which first combines the input RDD and merges multiple partial records into a single record before inserting or updating in DFS.

But since it is documented that the default is already false, at first I thought not to try this. Anyway, I tried it with false and now it is inserting duplicate records.

After this, while searching, I found an already raised issue for this:

HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().combineInput(true, true)
    .withPath(basePath).withAutoCommit(false)

It says the default values of true, true need to be changed. I can see that in the latest code this is not yet updated. Please check this.


Thanks & Regards
Rahul P
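
[Editor's note: for write-client users, a sketch of pinning the behavior explicitly rather than depending on the defaults under discussion. The builder calls are the ones quoted above; the trailing build() and basePath variable are assumptions.]

HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .combineInput(false, true) // don't combine on insert (keep dupes), do combine on upsert
    .withPath(basePath)        // basePath: placeholder dataset path
    .withAutoCommit(false)
    .build();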




Re: Hudi with duplicate key

Posted by ra...@gmail.com.


Dear Kabeer/Vinod


For example, I have a file which contains UserName, Transaction_Amount, and ID fields.
In this JSON file I am putting the same value for the ID every time, and I mapped it as the Hudi dataset key field.
(Currently all incoming records are new records, and I don't have an auto-increment ID in the files I am getting.)

Suppose I have 4 entries in a JSON file, eg:
rahul,15,0
kabeer,17,0
vinod,18,0
nishith,16,0

Currently, if I load it normally, only 1 record will be in the Hudi dataset, since all the keys are 0 (when selecting from the Hive table).

I want all 4 entries to be loaded.

@vinod For this use case I don't want key-based updates; I just want to control small files in Hadoop using Hudi. I want to use only Hudi's small file size control feature and incremental pull.


Thanks & Regards
Rahul




Re: Hudi with duplicate key

Posted by Vinoth Chandar <vi...@apache.org>.
Good discussion. Sorry to jump in late (been having some downtime last week).

insert/bulk_insert operations will in fact introduce duplicates if your input has duplicates. I would also like to understand what feature of Hudi is useful to you in general, since you seem to want duplicates.

Only two things I can think of could filter out duplicate records, and both apply to duplicates within the same batch only (i.e. you load both json files that contain duplicates in the same run):

 - Either you pass the -filter-dupes option to the DeltaStreamer tool
 - Or you have precombining on for inserts:
   http://hudi.apache.org/configurations.html#combineInput

Do any of these apply to you?

Thanks
Vinoth
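
[Editor's note: for reference, a sketch of a DeltaStreamer run with dedupe filtering turned on. Flag and class names vary across Hudi releases (the 2019 incubating builds used the com.uber.hoodie package and a --storage-type flag that was later renamed), and the jar, paths, table name, and ordering field below are placeholders.]

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --storage-type COPY_ON_WRITE \
  --op INSERT \
  --filter-dupes \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --target-base-path /data/hudi/user_transactions \
  --target-table user_transactions \
  --props /path/to/dfs-source.properties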



On Fri, Apr 5, 2019 at 9:10 AM Kabeer Ahmed <ka...@linuxmail.org> wrote:

> Hi Rahul,
>
> I am sorry, I didnt understand the use case properly. Can you please
> explain with an example? Let me put my version of understanding based on
> your email.
> > In json file, every time i will pass a fixed value for a key field.
> Are you saying that you will always have only one value for every key?
> Example: Rahul -> "Some Value"
>
> > Currently if i load data like this only 1 entry per file only load.
> What do you mean by this line? Do you mean currently you are loading data
> like this and only 1 entry per file is loading. Isnt that what you are
> trying to achieve in the line above?
>
> > I don't want same key's values to be skipped while inserting.
> All you are saying that you want to have same values also repeated in your
> keys eg: if Rahul primary_key has "Some Value" insert 5 times, then you
> would want that to appear 5 times in your store?
>
> In summary: it appears that what you want is if someone enters 5 values
> even if they are same. So you need something as below:
> > | primary_key | Values |
> > | Rahul | "Some Value", "Some Value", ..... |
>
> Let me know if my understanding is correct.
> Thanks
> Kabeer.
>
> > Dear Omar/Kabeer
> > In one of my usecasetthink like i don't want update at all. In json
> file, every time i will pass a fixed value for a key field. Currently if i
> load data like this only 1 entry per file only load. I don't want same
> key's values to be skipped while inserting.
> > Thanks & Regards
> > Rahul
>
> On Apr 5 2019, at 9:11 am, Unknown wrote:
> >
> >
> > On 2019/04/04 19:48:39, Kabeer Ahmed <ka...@linuxmail.org> wrote:
> > > Omkar - there might be various reasons to have duplicates eg: handle
> trades in a given day from a single client, track visitor click data to the
> website etc.
> > >
> > > Rahul - If you can give more details about your requirements, then we
> can come up with a solution.
> > > I have never used INSERT & BULK_INSERT at all and I am not sure if
> these options (insert and bulk_insert) do allow user to specify the logic
> that you are seeking. Without knowing your exact requirement, I can still
> give a suggestion to look into the option of implementing your own
> combineAndGetUpdateValue() logic.
> > > Lets say all your values for a particular key are strings. You could
> append the string values to existing values and store them as:
> > >
> > > key | Value
> > > Rahul | Nice
> > > // when another entry arrives, append the new value to the existing
> one with a comma separator, say.
> > >
> > > key | Value
> > > Rahul | Nice, Person
> > > When you retrieve the key's values you could then decide how to ship
> them back to the user - which is something you would know based on your
> requirements - since your json anyway has multiple values inserted
> for a key.
> > >
> > > Feel free to reach out if you need help and I will help you as much as
> I can.
> > > On Apr 4 2019, at 6:35 pm, Omkar Joshi <om...@uber.com.INVALID> wrote:
> > > > Hi Rahul,
> > > >
> > > > Thanks for trying out Hudi!!
> > > > Any reason why you need to have duplicates in HUDI dataset? Will you
> ever
> > > > be updating it later?
> > > >
> > > > Thanks,
> > > > Omkar
> > > >
> > > > On Thu, Apr 4, 2019 at 1:33 AM rahuledavalath@gmail.com <
> > > > rahuledavalath@gmail.com> wrote:
> > > >
> > > > > Dear All
> > > > > I am using cow table with INSERT/BULK_INSERT.
> > > > > I am loading the data from json files.
> > > > >
> > > > > If an existing key in the hudi dataset is loaded again, then only
> > > > > the new data with that key is shown. Can I show both records? (In INSERT)
> > > > >
> > > > > If the same key appears multiple times in a source json file, then
> > > > > only one record is loaded. Can I load duplicate keys from the same
> > > > > file? (both insert/bulk_insert)
> > > > >
> > > > >
> > > > > Thanks & Regards
> > > > > Rahul
> > > >
> > > >
> > > >
> > >
> > >
> > Dear Omkar/Kabeer
> > In one of my use cases, think of it like I don't want updates at all. In
> the json file, every time I will pass a fixed value for the key field.
> Currently, if I load data like this, only 1 entry per file is loaded. I
> don't want records with the same key to be skipped while inserting.
> > Thanks & Regards
> > Rahul
> >
>
>

Re: Hudi with duplicate key

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Hi Rahul,

I am sorry, I didn't understand the use case properly. Can you please explain with an example? Let me put my version of understanding based on your email.
> In the json file, every time I will pass a fixed value for the key field.
Are you saying that you will always have only one value for every key? Example: Rahul -> "Some Value"

> Currently, if I load data like this, only 1 entry per file is loaded.
What do you mean by this line? Do you mean you are currently loading data like this and only 1 entry per file is loading? Isn't that what you are trying to achieve in the line above?

> I don't want records with the same key to be skipped while inserting.
So you are saying that you want the same values repeated under your keys, e.g. if the Rahul primary_key has "Some Value" inserted 5 times, you would want it to appear 5 times in your store?

In summary: it appears that you want all 5 values kept even if someone enters the same value 5 times. So you need something like the below:
> | primary_key | Values |
> | Rahul | "Some Value", "Some Value", ..... |

Let me know if my understanding is correct.
Thanks
Kabeer.

> Dear Omkar/Kabeer
> In one of my use cases, think of it like I don't want updates at all. In the json file, every time I will pass a fixed value for the key field. Currently, if I load data like this, only 1 entry per file is loaded. I don't want records with the same key to be skipped while inserting.
> Thanks & Regards
> Rahul

On Apr 5 2019, at 9:11 am, Unknown wrote:
>
>
> On 2019/04/04 19:48:39, Kabeer Ahmed <ka...@linuxmail.org> wrote:
> > Omkar - there might be various reasons to have duplicates, e.g. handling trades in a given day from a single client, tracking visitor click data on a website, etc.
> >
> > Rahul - If you can give more details about your requirements, then we can come up with a solution.
> > I have never used INSERT & BULK_INSERT at all and I am not sure if these options (insert and bulk_insert) allow the user to specify the logic that you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
> > Let's say all your values for a particular key are strings. You could append new string values to the existing value and store them as:
> >
> > key | Value
> > Rahul | Nice
> > // when another entry arrives, append the new value to the existing one with a comma separator, say.
> >
> > key | Value
> > Rahul | Nice, Person
> > When you retrieve the key's values you could then decide how to ship them back to the user - which is something you would know based on your requirements - since your json anyway has multiple values inserted for a key.
> >
> > Feel free to reach out if you need help and I will help you as much as I can.
> > On Apr 4 2019, at 6:35 pm, Omkar Joshi <om...@uber.com.INVALID> wrote:
> > > Hi Rahul,
> > >
> > > Thanks for trying out Hudi!!
> > > Any reason why you need to have duplicates in HUDI dataset? Will you ever
> > > be updating it later?
> > >
> > > Thanks,
> > > Omkar
> > >
> > > On Thu, Apr 4, 2019 at 1:33 AM rahuledavalath@gmail.com <
> > > rahuledavalath@gmail.com> wrote:
> > >
> > > > Dear All
> > > > I am using cow table with INSERT/BULK_INSERT.
> > > > I am loading the data from json files.
> > > >
> > > > If an existing key in the hudi dataset is loaded again, then only the new data with
> > > > that key is shown. Can I show both records? (In INSERT)
> > > >
> > > > If the same key appears multiple times in a source json file, then only
> > > > one record is loaded. Can I load duplicate keys from the same
> > > > file? (both insert/bulk_insert)
> > > >
> > > >
> > > > Thanks & Regards
> > > > Rahul
> > >
> > >
> > >
> >
> >
> Dear Omkar/Kabeer
> In one of my use cases, think of it like I don't want updates at all. In the json file, every time I will pass a fixed value for the key field. Currently, if I load data like this, only 1 entry per file is loaded. I don't want records with the same key to be skipped while inserting.
> Thanks & Regards
> Rahul
>


Re: Hudi with duplicate key

Posted by ra...@gmail.com, ra...@gmail.com.

On 2019/04/04 19:48:39, Kabeer Ahmed <ka...@linuxmail.org> wrote: 
> Omkar - there might be various reasons to have duplicates, e.g. handling trades in a given day from a single client, tracking visitor click data on a website, etc.
> 
> Rahul - If you can give more details about your requirements, then we can come up with a solution.
> I have never used INSERT & BULK_INSERT at all and I am not sure if these options (insert and bulk_insert) allow the user to specify the logic that you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
> Let's say all your values for a particular key are strings. You could append new string values to the existing value and store them as:
> 
> key | Value
> Rahul | Nice
> // when another entry arrives, append the new value to the existing one with a comma separator, say.
> 
> key | Value
> Rahul | Nice, Person
> When you retrieve the key's values you could then decide how to ship them back to the user - which is something you would know based on your requirements - since your json anyway has multiple values inserted for a key.
> 
> Feel free to reach out if you need help and I will help you as much as I can.
> On Apr 4 2019, at 6:35 pm, Omkar Joshi <om...@uber.com.INVALID> wrote:
> > Hi Rahul,
> >
> > Thanks for trying out Hudi!!
> > Any reason why you need to have duplicates in HUDI dataset? Will you ever
> > be updating it later?
> >
> > Thanks,
> > Omkar
> >
> > On Thu, Apr 4, 2019 at 1:33 AM rahuledavalath@gmail.com <
> > rahuledavalath@gmail.com> wrote:
> >
> > > Dear All
> > > I am using cow table with INSERT/BULK_INSERT.
> > > I am loading the data from json files.
> > >
> > > If an existing key in the hudi dataset is loaded again, then only the new data with
> > > that key is shown. Can I show both records? (In INSERT)
> > >
> > > If the same key appears multiple times in a source json file, then only
> > > one record is loaded. Can I load duplicate keys from the same
> > > file? (both insert/bulk_insert)
> > >
> > >
> > > Thanks & Regards
> > > Rahul
> >
> >
> 
> 
Dear Omkar/Kabeer

In one of my use cases, think of it like I don't want updates at all. In the json file, every time I will pass a fixed value for the key field. Currently, if I load data like this, only 1 entry per file is loaded. I don't want records with the same key to be skipped while inserting.

Thanks & Regards
Rahul

Re: Hudi with duplicate key

Posted by Kabeer Ahmed <ka...@linuxmail.org>.
Omkar - there might be various reasons to have duplicates, e.g. handling trades in a given day from a single client, tracking visitor click data on a website, etc.

Rahul - If you can give more details about your requirements, then we can come up with a solution.
I have never used INSERT & BULK_INSERT at all and I am not sure if these options (insert and bulk_insert) allow the user to specify the logic that you are seeking. Without knowing your exact requirement, I can still suggest looking into implementing your own combineAndGetUpdateValue() logic.
Let's say all your values for a particular key are strings. You could append new string values to the existing value and store them as:

key | Value
Rahul | Nice
// when another entry arrives, append the new value to the existing one with a comma separator, say.

key | Value
Rahul | Nice, Person
When you retrieve the key's values you could then decide how to ship them back to the user - which is something you would know based on your requirements - since your json anyway has multiple values inserted for a key.
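For what it's worth, here is a rough Scala sketch of the append step such a custom payload could carry. The HoodieRecordPayload wiring is deliberately omitted since its exact signature varies across Hudi versions, and the "value" field name is hypothetical:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord, IndexedRecord}

// Merge step for a hypothetical combineAndGetUpdateValue(): copy the
// incoming record, then append its "value" string onto the stored one.
def combineByAppending(stored: IndexedRecord, incoming: IndexedRecord,
                       schema: Schema): IndexedRecord = {
  val merged = new GenericData.Record(schema)
  val storedRec = stored.asInstanceOf[GenericRecord]
  val incomingRec = incoming.asInstanceOf[GenericRecord]
  // Start from the incoming record's fields...
  schema.getFields.forEach(f => merged.put(f.name(), incomingRec.get(f.name())))
  // ...then concatenate the stored value in front: "Nice" -> "Nice, Person".
  merged.put("value", s"${storedRec.get("value")}, ${incomingRec.get("value")}")
  merged
}

This is only the merge step; plugging it into a payload class and telling Hudi to use that payload is version-specific configuration.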

Feel free to reach out if you need help and I will help you as much as I can.
On Apr 4 2019, at 6:35 pm, Omkar Joshi <om...@uber.com.INVALID> wrote:
> Hi Rahul,
>
> Thanks for trying out Hudi!!
> Any reason why you need to have duplicates in HUDI dataset? Will you ever
> be updating it later?
>
> Thanks,
> Omkar
>
> On Thu, Apr 4, 2019 at 1:33 AM rahuledavalath@gmail.com <
> rahuledavalath@gmail.com> wrote:
>
> > Dear All
> > I am using cow table with INSERT/BULK_INSERT.
> > I am loading the data from json files.
> >
> > If existing key in hudi dataset is loading again, then only new data with
> > that key only showing. Can i able to show both data? (In INSERT)
> >
> > If same key is there in multiple times in a source json file, then only
> > one key is getting loaded. Can i able to load duplicates keys from same
> > file. (both insert/bulk_insert)
> >
> >
> > Thanks & Regards
> > Rahul
>
>


Re: Hudi with duplicate key

Posted by Omkar Joshi <om...@uber.com.INVALID>.
Hi Rahul,

Thanks for trying out Hudi!!

Any reason why you need to have duplicates in the HUDI dataset? Will you ever
be updating it later?

Thanks,
Omkar

On Thu, Apr 4, 2019 at 1:33 AM rahuledavalath@gmail.com <
rahuledavalath@gmail.com> wrote:

> Dear All
>
> I am using cow table with INSERT/BULK_INSERT.
> I am loading the data from json files.
>
> If an existing key in the hudi dataset is loaded again, then only the new data with
> that key is shown. Can I show both records? (In INSERT)
>
> If the same key appears multiple times in a source json file, then only
> one record is loaded. Can I load duplicate keys from the same
> file? (both insert/bulk_insert)
>
>
> Thanks & Regards
> Rahul
>