You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@hop.apache.org by po...@gmx.com on 2022/08/30 11:01:05 UTC

Dimension lookup/update


Hello!



I try to use 'Dimension lookup/update' hop and I do not understand the meaning
of two fields in properties:



"Date range start field / Table date range end"



My data has no date column. I do not need it. So I cannot specify any date.
What this field is for and how to ignore it?





What is exactly 'Version field'? This is to keep several datasets in one
table? I.e same record can have multiple copies with different 'Version'?

What if I do no not need (most of scenarios I think) any 'versions', I just
need single copy of data?

If I have 'version number' I can leave 'key fields' empty?



Regards,



Mike








Re: Dimension lookup/update

Posted by po...@gmx.com.
That's a good idea - thank you!





**Sent:**  Tuesday, August 30, 2022 at 1:53 PM  
**From:**  "Hans Van Akelyen" <ha...@gmail.com>  
**To:**  users@hop.apache.org  
**Subject:**  Re: Dimension lookup/update

Another solution to avoid all the lookups is to use a table input and load the
source table.

Then you can use "Merge Rows (diff)" to do a comparison between the old set
and the new set.

It will mark the rows as new,identical,changed,deleted after that you can use
"Synchronize after merge" and it will do the insert or update depending on the
flag field.



This way you do not bash your database with all the lookup queries, you do
have to load and sort the entire source table to do the comparison against.



Cheers,

Hans



On Tue, 30 Aug 2022 at 13:47, <[podunk@gmx.com](mailto:podunk@gmx.com)> wrote:

>  
>
> Thank you both for quick reaction.
>
>  
>
> I'm searching for some way to quickly update table with millions of records.
>
>  
>
> I know 'Insert / update' but this is terrible slow - so may queries.
>
>  
>
> I know I can do some workaround like instert into temp table and execute SQL
> query (update table taking data from another table).
>
> I was thinking that dimension table is something that could work here.
>
>  
>
>  
>
>  
>
> **Sent:**  Tuesday, August 30, 2022 at 1:09 PM  
>  **From:**  "Bart Maertens"
> <[bartmaer@apache.org](mailto:bartmaer@apache.org)>  
>  **To:**  [users@hop.apache.org](mailto:users@hop.apache.org)  
>  **Subject:**  Re: Dimension lookup/update
>
> Hi Mike,  
>  
>
> The Dimension Lookup/update transform populates Slowly Changing Dimensions,
> typically type 2 [1] with some additional functionality.
>
>  
>
> The "date range start", "date range end" and "version" fields are technical
> fields that are used to determine the validity period and version number for
> a dimension record.
>
> If you don't need any of the versioning information, you're probably not
> maintaining a slowly changing dimension.
>
> Hop comes with other transforms like "Table output", "Insert/update",
> "Database lookup" etc that are better suited to maintain regular tables or
> type 1 dimensions.
>
>  
>
> [1]
> <https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row>
>
>  
>
> Regards,
>
> Bart
>
>  
>
> On Tue, Aug 30, 2022 at 1:01 PM <[podunk@gmx.com](mailto:podunk@gmx.com)>
> wrote:
>

>>  
>>

>> Hello!

>>

>>  
>>

>> I try to use 'Dimension lookup/update' hop and I do not understand the
meaning of two fields in properties:

>>

>>  
>>

>> "Date range start field / Table date range end"

>>

>>  
>>

>> My data has no date column. I do not need it. So I cannot specify any date.
What this field is for and how to ignore it?

>>

>>  
>>

>>  
>>

>> What is exactly 'Version field'? This is to keep several datasets in one
table? I.e same record can have multiple copies with different 'Version'?

>>

>> What if I do no not need (most of scenarios I think) any 'versions', I just
need single copy of data?

>>

>> If I have 'version number' I can leave 'key fields' empty?

>>

>>  
>>

>> Regards,

>>

>>  
>>

>> Mike

>>

>>  
>>

>>  
>>

>>  
>
>  
>
>  
>
>  






Re: Dimension lookup/update

Posted by Hans Van Akelyen <ha...@gmail.com>.
Did you enable the "Use batch update" option?

On Wed, 31 Aug 2022 at 11:45, <po...@gmx.com> wrote:

> HI,
>
> I see "Synchronize after merge" does nothing but executes separate query
> for each row.
> So it will be terrible slow with hundreds of thousands of records.
>
>
> *Sent:* Tuesday, August 30, 2022 at 1:53 PM
> *From:* "Hans Van Akelyen" <ha...@gmail.com>
> *To:* users@hop.apache.org
> *Subject:* Re: Dimension lookup/update
> Another solution to avoid all the lookups is to use a table input and load
> the source table.
> Then you can use "Merge Rows (diff)" to do a comparison between the old
> set and the new set.
> It will mark the rows as new,identical,changed,deleted after that you can
> use "Synchronize after merge" and it will do the insert or update depending
> on the flag field.
>
> This way you do not bash your database with all the lookup queries, you do
> have to load and sort the entire source table to do the comparison against.
>
> Cheers,
> Hans
>
> On Tue, 30 Aug 2022 at 13:47, <po...@gmx.com> wrote:
>
>>
>> Thank you both for quick reaction.
>>
>> I'm searching for some way to quickly update table with millions of
>> records.
>>
>> I know 'Insert / update' but this is terrible slow - so may queries.
>>
>> I know I can do some workaround like instert into temp table and execute
>> SQL query (update table taking data from another table).
>> I was thinking that dimension table is something that could work here.
>>
>>
>>
>> *Sent:* Tuesday, August 30, 2022 at 1:09 PM
>> *From:* "Bart Maertens" <ba...@apache.org>
>> *To:* users@hop.apache.org
>> *Subject:* Re: Dimension lookup/update
>> Hi Mike,
>>
>> The Dimension Lookup/update transform populates Slowly Changing
>> Dimensions, typically type 2 [1] with some additional functionality.
>>
>> The "date range start", "date range end" and "version" fields are
>> technical fields that are used to determine the validity period and version
>> number for a dimension record.
>> If you don't need any of the versioning information, you're probably not
>> maintaining a slowly changing dimension.
>> Hop comes with other transforms like "Table output", "Insert/update",
>> "Database lookup" etc that are better suited to maintain regular tables or
>> type 1 dimensions.
>>
>> [1]
>> https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row
>>
>> Regards,
>> Bart
>>
>> On Tue, Aug 30, 2022 at 1:01 PM <po...@gmx.com> wrote:
>>
>>>
>>> Hello!
>>>
>>> I try to use 'Dimension lookup/update' hop and I do not understand the
>>> meaning of two fields in properties:
>>>
>>> "Date range start field / Table date range end"
>>>
>>> My data has no date column. I do not need it. So I cannot specify any
>>> date. What this field is for and how to ignore it?
>>>
>>>
>>> What is exactly 'Version field'? This is to keep several datasets in one
>>> table? I.e same record can have multiple copies with different 'Version'?
>>> What if I do no not need (most of scenarios I think) any 'versions', I
>>> just need single copy of data?
>>> If I have 'version number' I can leave 'key fields' empty?
>>>
>>> Regards,
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>

Re: Dimension lookup/update

Posted by po...@gmx.com.
HI,



I see "Synchronize after merge" does nothing but executes separate query for
each row.

So it will be terrible slow with hundreds of thousands of records.





**Sent:**  Tuesday, August 30, 2022 at 1:53 PM  
**From:**  "Hans Van Akelyen" <ha...@gmail.com>  
**To:**  users@hop.apache.org  
**Subject:**  Re: Dimension lookup/update

Another solution to avoid all the lookups is to use a table input and load the
source table.

Then you can use "Merge Rows (diff)" to do a comparison between the old set
and the new set.

It will mark the rows as new,identical,changed,deleted after that you can use
"Synchronize after merge" and it will do the insert or update depending on the
flag field.



This way you do not bash your database with all the lookup queries, you do
have to load and sort the entire source table to do the comparison against.



Cheers,

Hans



On Tue, 30 Aug 2022 at 13:47, <[podunk@gmx.com](mailto:podunk@gmx.com)> wrote:

>  
>
> Thank you both for quick reaction.
>
>  
>
> I'm searching for some way to quickly update table with millions of records.
>
>  
>
> I know 'Insert / update' but this is terrible slow - so may queries.
>
>  
>
> I know I can do some workaround like instert into temp table and execute SQL
> query (update table taking data from another table).
>
> I was thinking that dimension table is something that could work here.
>
>  
>
>  
>
>  
>
> **Sent:**  Tuesday, August 30, 2022 at 1:09 PM  
>  **From:**  "Bart Maertens"
> <[bartmaer@apache.org](mailto:bartmaer@apache.org)>  
>  **To:**  [users@hop.apache.org](mailto:users@hop.apache.org)  
>  **Subject:**  Re: Dimension lookup/update
>
> Hi Mike,  
>  
>
> The Dimension Lookup/update transform populates Slowly Changing Dimensions,
> typically type 2 [1] with some additional functionality.
>
>  
>
> The "date range start", "date range end" and "version" fields are technical
> fields that are used to determine the validity period and version number for
> a dimension record.
>
> If you don't need any of the versioning information, you're probably not
> maintaining a slowly changing dimension.
>
> Hop comes with other transforms like "Table output", "Insert/update",
> "Database lookup" etc that are better suited to maintain regular tables or
> type 1 dimensions.
>
>  
>
> [1]
> <https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row>
>
>  
>
> Regards,
>
> Bart
>
>  
>
> On Tue, Aug 30, 2022 at 1:01 PM <[podunk@gmx.com](mailto:podunk@gmx.com)>
> wrote:
>

>>  
>>

>> Hello!

>>

>>  
>>

>> I try to use 'Dimension lookup/update' hop and I do not understand the
meaning of two fields in properties:

>>

>>  
>>

>> "Date range start field / Table date range end"

>>

>>  
>>

>> My data has no date column. I do not need it. So I cannot specify any date.
What this field is for and how to ignore it?

>>

>>  
>>

>>  
>>

>> What is exactly 'Version field'? This is to keep several datasets in one
table? I.e same record can have multiple copies with different 'Version'?

>>

>> What if I do no not need (most of scenarios I think) any 'versions', I just
need single copy of data?

>>

>> If I have 'version number' I can leave 'key fields' empty?

>>

>>  
>>

>> Regards,

>>

>>  
>>

>> Mike

>>

>>  
>>

>>  
>>

>>  
>
>  
>
>  
>
>  






Re: Dimension lookup/update

Posted by Hans Van Akelyen <ha...@gmail.com>.
Another solution to avoid all the lookups is to use a table input and load
the source table.
Then you can use "Merge Rows (diff)" to do a comparison between the old set
and the new set.
It will mark the rows as new,identical,changed,deleted after that you can
use "Synchronize after merge" and it will do the insert or update depending
on the flag field.

This way you do not bash your database with all the lookup queries, you do
have to load and sort the entire source table to do the comparison against.

Cheers,
Hans

On Tue, 30 Aug 2022 at 13:47, <po...@gmx.com> wrote:

>
> Thank you both for quick reaction.
>
> I'm searching for some way to quickly update table with millions of
> records.
>
> I know 'Insert / update' but this is terrible slow - so may queries.
>
> I know I can do some workaround like instert into temp table and execute
> SQL query (update table taking data from another table).
> I was thinking that dimension table is something that could work here.
>
>
>
> *Sent:* Tuesday, August 30, 2022 at 1:09 PM
> *From:* "Bart Maertens" <ba...@apache.org>
> *To:* users@hop.apache.org
> *Subject:* Re: Dimension lookup/update
> Hi Mike,
>
> The Dimension Lookup/update transform populates Slowly Changing
> Dimensions, typically type 2 [1] with some additional functionality.
>
> The "date range start", "date range end" and "version" fields are
> technical fields that are used to determine the validity period and version
> number for a dimension record.
> If you don't need any of the versioning information, you're probably not
> maintaining a slowly changing dimension.
> Hop comes with other transforms like "Table output", "Insert/update",
> "Database lookup" etc that are better suited to maintain regular tables or
> type 1 dimensions.
>
> [1]
> https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row
>
> Regards,
> Bart
>
> On Tue, Aug 30, 2022 at 1:01 PM <po...@gmx.com> wrote:
>
>>
>> Hello!
>>
>> I try to use 'Dimension lookup/update' hop and I do not understand the
>> meaning of two fields in properties:
>>
>> "Date range start field / Table date range end"
>>
>> My data has no date column. I do not need it. So I cannot specify any
>> date. What this field is for and how to ignore it?
>>
>>
>> What is exactly 'Version field'? This is to keep several datasets in one
>> table? I.e same record can have multiple copies with different 'Version'?
>> What if I do no not need (most of scenarios I think) any 'versions', I
>> just need single copy of data?
>> If I have 'version number' I can leave 'key fields' empty?
>>
>> Regards,
>>
>> Mike
>>
>>
>>
>>
>
>
>
>

Re: Dimension lookup/update

Posted by po...@gmx.com.

Thank you both for quick reaction.



I'm searching for some way to quickly update table with millions of records.



I know 'Insert / update' but this is terrible slow - so may queries.



I know I can do some workaround like instert into temp table and execute SQL
query (update table taking data from another table).

I was thinking that dimension table is something that could work here.







**Sent:**  Tuesday, August 30, 2022 at 1:09 PM  
**From:**  "Bart Maertens" <ba...@apache.org>  
**To:**  users@hop.apache.org  
**Subject:**  Re: Dimension lookup/update

Hi Mike,  


The Dimension Lookup/update transform populates Slowly Changing Dimensions,
typically type 2 [1] with some additional functionality.



The "date range start", "date range end" and "version" fields are technical
fields that are used to determine the validity period and version number for a
dimension record.

If you don't need any of the versioning information, you're probably not
maintaining a slowly changing dimension.

Hop comes with other transforms like "Table output", "Insert/update",
"Database lookup" etc that are better suited to maintain regular tables or
type 1 dimensions.



[1]
<https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row>



Regards,

Bart



On Tue, Aug 30, 2022 at 1:01 PM <[podunk@gmx.com](mailto:podunk@gmx.com)>
wrote:

>  
>
> Hello!
>
>  
>
> I try to use 'Dimension lookup/update' hop and I do not understand the
> meaning of two fields in properties:
>
>  
>
> "Date range start field / Table date range end"
>
>  
>
> My data has no date column. I do not need it. So I cannot specify any date.
> What this field is for and how to ignore it?
>
>  
>
>  
>
> What is exactly 'Version field'? This is to keep several datasets in one
> table? I.e same record can have multiple copies with different 'Version'?
>
> What if I do no not need (most of scenarios I think) any 'versions', I just
> need single copy of data?
>
> If I have 'version number' I can leave 'key fields' empty?
>
>  
>
> Regards,
>
>  
>
> Mike
>
>  
>
>  
>
>  








Re: Dimension lookup/update

Posted by Bart Maertens <ba...@apache.org>.
Hi Mike,

The Dimension Lookup/update transform populates Slowly Changing Dimensions,
typically type 2 [1] with some additional functionality.

The "date range start", "date range end" and "version" fields are technical
fields that are used to determine the validity period and version number
for a dimension record.
If you don't need any of the versioning information, you're probably not
maintaining a slowly changing dimension.
Hop comes with other transforms like "Table output", "Insert/update",
"Database lookup" etc that are better suited to maintain regular tables or
type 1 dimensions.

[1]
https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row

Regards,
Bart

On Tue, Aug 30, 2022 at 1:01 PM <po...@gmx.com> wrote:

>
> Hello!
>
> I try to use 'Dimension lookup/update' hop and I do not understand the
> meaning of two fields in properties:
>
> "Date range start field / Table date range end"
>
> My data has no date column. I do not need it. So I cannot specify any
> date. What this field is for and how to ignore it?
>
>
> What is exactly 'Version field'? This is to keep several datasets in one
> table? I.e same record can have multiple copies with different 'Version'?
> What if I do no not need (most of scenarios I think) any 'versions', I
> just need single copy of data?
> If I have 'version number' I can leave 'key fields' empty?
>
> Regards,
>
> Mike
>
>
>
>

Re: Dimension lookup/update

Posted by Hans Van Akelyen <ha...@gmail.com>.
Hi Mike,

The Dimension Lookup/update transform has been created to support Type 2
Slowly changing dimensions explanation about slowly changing dimensions can
be found on wikipedia
<https://en.wikipedia.org/wiki/Slowly_changing_dimension>.

It indeed means you can have multiple rows for the same key, for example
you would like to keep a history of address information.
The date range is then also used to specify from when until when a specific
row is valid.

If you do not need this information you can use the "Combination
Lookup/Update"
<https://hop.apache.org/manual/latest/pipeline/transforms/combinationlookup.html>,
this Transform will do Type 1 dimensions.

Cheers,
Hans


On Tue, 30 Aug 2022 at 13:01, <po...@gmx.com> wrote:

>
> Hello!
>
> I try to use 'Dimension lookup/update' hop and I do not understand the
> meaning of two fields in properties:
>
> "Date range start field / Table date range end"
>
> My data has no date column. I do not need it. So I cannot specify any
> date. What this field is for and how to ignore it?
>
>
> What is exactly 'Version field'? This is to keep several datasets in one
> table? I.e same record can have multiple copies with different 'Version'?
> What if I do no not need (most of scenarios I think) any 'versions', I
> just need single copy of data?
> If I have 'version number' I can leave 'key fields' empty?
>
> Regards,
>
> Mike
>
>
>
>