Posted to user@atlas.apache.org by "Sotelo, Javier" <Ja...@viasat.com> on 2019/10/21 21:25:55 UTC

Create/Update Large Dual-referenced Set of Entities

Hello,

What is the proper way to bulk-insert/bulk-update a large set (~75K) of dual-referenced entities in Atlas 2.0?

TL;DR:

  *   We can’t POST (/v2/entity/bulk) an rdbms_instance with 15 rdbms_dbs, each with 100 rdbms_tables, each with 50 rdbms_columns, all with parent/child relationships to each other, in one API call.
  *   Splitting it bottom-up doesn’t work because column entities require table entities to exist.
  *   Splitting it top-down doesn’t work because the process creates false deletes/updates on the second synchronization cycle.
  *   Other than fetching all existing entities, comparing them field-by-field, and splitting up each request so that there are no “dangling” references across API calls, is there another/better way?

Details:
We are trying to harvest metadata from a large RDBMS instance. Suppose we have an RDBMS instance with 15 databases, each with 100 tables, and each table with 50 columns, producing ~75K entities. Including them all in one API call times out (or causes a “broken pipe” error), so we need to split the load across multiple API calls. But since the parent/child references need GUIDs, and the referenced entities may not exist yet (i.e., they only have negative placeholder GUIDs), we have to be very careful about how we split it up.
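
For context, a single bulk request of this shape looks roughly like the sketch below (illustration only: the host, credentials, and qualifiedName convention are made up, and the relationship attribute names "tables"/"db"/"columns"/"table" are taken from the stock Atlas RDBMS model). Negative GUIDs are request-local placeholders that Atlas resolves to real GUIDs on create, which is why references can’t simply span separate API calls:

    import json
    import requests

    # Hypothetical host and placeholder credentials -- adjust for your deployment.
    ATLAS_BULK = "http://atlas-host:21000/api/atlas/v2/entity/bulk"
    AUTH = ("admin", "admin")

    # Negative GUIDs are placeholders local to this request.
    db = {
        "typeName": "rdbms_db",
        "guid": "-1",
        "attributes": {
            "name": "sales",
            "qualifiedName": "sales@my_instance",   # made-up naming convention
            "tables": [{"typeName": "rdbms_table", "guid": "-2"}],
        },
    }
    table = {
        "typeName": "rdbms_table",
        "guid": "-2",
        "attributes": {
            "name": "orders",
            "qualifiedName": "sales.orders@my_instance",
            "db": {"typeName": "rdbms_db", "guid": "-1"},
            "columns": [{"typeName": "rdbms_column", "guid": "-3"}],
        },
    }
    column = {
        "typeName": "rdbms_column",
        "guid": "-3",
        "attributes": {
            "name": "order_id",
            "qualifiedName": "sales.orders.order_id@my_instance",
            "table": {"typeName": "rdbms_table", "guid": "-2"},
        },
    }

    resp = requests.post(ATLAS_BULK, auth=AUTH,
                         headers={"Content-Type": "application/json"},
                         data=json.dumps({"entities": [db, table, column]}))
    resp.raise_for_status()

Scaling this single payload up to ~75K entities is what times out for us.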

We can’t create all the columns first, because column entities require valid table entities to reference (and the same applies to the database-to-table case). Creating all the databases first, then tables, then columns works on the first go-around. However, when our script runs a second time, as soon as we POST all the rdbms_dbs to the /v2/entity/bulk endpoint, Atlas deletes all the rdbms_tables (which makes sense, since the tables can’t exist without databases and we just removed all the db-table relationships). By the end of the script our relationship tree is built correctly; however, we end up with many Atlas deletes along the way.

One solution would be to read all pre-existing entities for each entity type, compare them (previous vs. current), determine which entities are new and which are unchanged, and hope that the actual diff/update isn’t bigger than the request limit. But that seems like a lot of work to end up with a solution that could still fail.
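
For what it’s worth, a rough sketch of that read-and-diff approach is below. It assumes basic search by type with limit/offset paging, that the returned entity headers carry qualifiedName in their attributes, and made-up host, credentials, and helper names:

    import requests

    ATLAS_API = "http://atlas-host:21000/api/atlas/v2"   # hypothetical host
    AUTH = ("admin", "admin")                            # placeholder credentials

    def fetch_existing(type_name, page_size=1000):
        """Page through basic-search results for one entity type, keyed by qualifiedName."""
        offset, existing = 0, {}
        while True:
            resp = requests.get(f"{ATLAS_API}/search/basic", auth=AUTH,
                                params={"typeName": type_name,
                                        "limit": page_size,
                                        "offset": offset,
                                        "excludeDeletedEntities": "true"})
            resp.raise_for_status()
            headers = resp.json().get("entities") or []
            for h in headers:
                qn = (h.get("attributes") or {}).get("qualifiedName")
                if qn:
                    existing[qn] = h
            if len(headers) < page_size:
                return existing
            offset += page_size

    def split_creates_and_updates(harvested, type_name):
        """harvested: dict of qualifiedName -> entity JSON built from the RDBMS scan."""
        existing = fetch_existing(type_name)
        creates = [e for qn, e in harvested.items() if qn not in existing]
        # A field-by-field comparison against existing[qn] would decide real updates here.
        updates = [e for qn, e in harvested.items() if qn in existing]
        return creates, updates

Even then, each creates/updates list still has to be chunked so that no chunk references an entity that doesn’t exist yet.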

We’ve looked at https://atlas.apache.org/#/ImportAPIOptions, but that appears to be designed for use together with the Export API (from Atlas), which doesn’t apply to our scenario.

Is there a better way?

Thank you for your time!
Javier

Re: Create/Update Large Dual-referenced Set of Entities

Posted by Madhan Neethiraj <ma...@apache.org>.
Javier,

 

> as soon as we do a POST on the /v2/entity/bulk endpoint with all the rdbms_dbs first, Atlas deletes all the rdbms_tables

It is possible to update an rdbms_db entity without Atlas deleting the rdbms_table entities associated with it: simply do not include the “tables” attribute in the rdbms_db entity instance.
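
For example, the second-run batches could look roughly like this (a sketch only: the host, credentials, and qualifiedName values are made up, and it assumes the child-side relationship attribute “db” from the RDBMS model plus object references by uniqueAttributes):

    import json
    import requests

    BULK = "http://atlas-host:21000/api/atlas/v2/entity/bulk"  # hypothetical host
    AUTH = ("admin", "admin")                                  # placeholder credentials

    def post_bulk(entities):
        resp = requests.post(BULK, auth=AUTH,
                             headers={"Content-Type": "application/json"},
                             data=json.dumps({"entities": entities}))
        resp.raise_for_status()
        return resp.json()

    # Db batch on re-run: the "tables" attribute is omitted entirely, so Atlas
    # leaves the existing db-to-table relationships untouched.
    post_bulk([{
        "typeName": "rdbms_db",
        "attributes": {"name": "sales", "qualifiedName": "sales@my_instance"},
    }])

    # Table batches carry the relationship on the child side, referencing the
    # parent db by its unique attribute instead of a GUID.
    post_bulk([{
        "typeName": "rdbms_table",
        "attributes": {
            "name": "orders",
            "qualifiedName": "sales.orders@my_instance",
            "db": {"typeName": "rdbms_db",
                   "uniqueAttributes": {"qualifiedName": "sales@my_instance"}},
        },
    }])

The same pattern applies to the rdbms_column batches referencing their rdbms_table parents.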

 

Please make sure to use RDBMS models (and Atlas bits) updated in ATLAS-3056.

 

Hope this helps.

 

Madhan

 
