Posted to dev@atlas.apache.org by "Ashutosh Mestry (Jira)" <ji...@apache.org> on 2021/09/20 17:36:00 UTC

[jira] [Commented] (ATLAS-4389) Best practice or a way to bring in large number of entities on a regular basis.

    [ https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417764#comment-17417764 ] 

Ashutosh Mestry commented on ATLAS-4389:
----------------------------------------

Sorry for the delay in replying.

Background: The existing ingest implementation has linear complexity. This is necessary to be able to deal with the create/update/delete message types and the temporal nature of these operations.

Here are a few approaches that I have tried and that have worked as solutions for some of our customers:

*Approach 1*

Prerequisite: Entity creation is under your control.

Solution: 
 * Create entities in topologically sorted order: parent entities are created before child entities. 
 * Create lineage entities only after the participating parent entities have been created.
 * Use the REST APIs to concurrently create entities of one type. Start on a new type only after all entities of the current type are exhausted.

This has the advantage that entities can be created concurrently, since their dependencies are already in place. The approach gives high throughput while continuing to maintain consistency of the data.

This needs some amount of book-keeping. That may not be a lot if you are creating Hive entities and follow a consistent pattern for generating the _qualifiedName_ unique attribute.
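For illustration only, here is a rough client-side sketch of the approach above, assuming Python with the requests library and HTTP basic auth. The host, credentials, type order, batch size, worker count and the load shape of entities_by_type are placeholders to adapt to your setup; only the /api/atlas/v2/entity/bulk endpoint and the \{"entities": [...]\} payload shape come from Atlas itself.

{code:python}
# Sketch: create entities type by type, in topological order, posting
# batches of a single type concurrently. Hosts, credentials, type order
# and batch sizes below are placeholders, not recommendations.
import requests
from concurrent.futures import ThreadPoolExecutor

ATLAS_BULK_URL = "http://atlas-host:21000/api/atlas/v2/entity/bulk"  # placeholder host/port
AUTH = ("admin", "admin")                                            # placeholder credentials
BATCH_SIZE = 100                                                     # tune to stay under request timeouts
MAX_WORKERS = 8                                                      # tune to your server capacity

# Parents before children; lineage (process) types last. Example order for Hive entities.
TYPE_ORDER = ["hive_db", "hive_table", "hive_column", "hive_process"]

def post_batch(batch):
    """POST one batch of entities (all of the same type) to the bulk endpoint."""
    resp = requests.post(ATLAS_BULK_URL, json={"entities": batch}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

def ingest(entities_by_type):
    """entities_by_type: dict mapping typeName -> list of AtlasEntity JSON dicts."""
    for type_name in TYPE_ORDER:
        entities = entities_by_type.get(type_name, [])
        batches = [entities[i:i + BATCH_SIZE]
                   for i in range(0, len(entities), BATCH_SIZE)]
        # Entities of the same type can be created concurrently because
        # everything they reference (their parents) already exists.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            list(pool.map(post_batch, batches))
        # Move on to the next type only after this type is exhausted.
{code}

Batch size and worker count are the knobs to balance throughput against request timeouts; keeping each request to a single type preserves the ordering guarantee.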

About code paths: Ingest via the Kafka queue, entity creation via the REST APIs and ingest via the Import API all follow the same code path.

 

> Best practice or a way to bring in large number of entities on a regular basis.
> -------------------------------------------------------------------------------
>
>                 Key: ATLAS-4389
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4389
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Saad
>            Assignee: Ashutosh Mestry
>            Priority: Major
>              Labels: documentation, newbie, performance
>         Attachments: image-2021-08-05-11-22-29-259.png, image-2021-08-05-11-23-05-440.png
>
>
> Would you be so kind as to let us know if there is any best practice or a way to bring in a large number of entities on a regular basis.
> *Our use case:*
> We will be bringing in around 12,000  datasets, 12,000 jobs and 70,000 columns. We want to do this as part of our deployment pipeline for other upstream projects.
> At every deploy we want to do the following:
>  - Add the jobs, datasets and columns that are not in Atlas
>  - Update the jobs, datasets and columns that are in Atlas
>  - Delete the jobs from Atlas that are deleted from the upstream systems.
> So far we have considered using the bulk API endpoint (/v2/entity/bulk). This has its own issues. We found that if the payload is too big, in our case bigger than 300-500 entities, it times out. The deeper the relationships, the fewer entities you can send through the bulk endpoint.
> Inspecting some of the code, we feel that both REST and streaming data through Kafka follow the same code path and ultimately yield the same performance.
> Further, we found that when creating entities the type registry becomes the bottleneck. We discovered this by profiling the JVM. We found that only one core processes the entities and their relationships.
> *Questions:*
> 1- What is the best practice for bulk loading lots of entities in a reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns in less than 10 minutes.
> 2- Where should we start if we want to scale the API? Is there any known way to horizontally scale Atlas?
> Here are some of the stats for the load testing we did:
>  
> !image-2021-08-05-11-23-05-440.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)