Posted to dev@atlas.apache.org by Nanne <na...@mycel.nl> on 2019/08/18 10:47:19 UTC

Search & Data Discovery: how to achieve a more effective search?

Dear all,

How can we help our users find the entities they are searching for more
effectively in Atlas? We would like to have an open discussion, with tips,
hints, or suggestions, on how to improve Atlas so that data becomes more
discoverable.

Atlas has several search endpoints: attribute search, basic search, full
text search, DSL-based search, and relationship search. However, each of
these search methods has its limitations, and preferably there would be a
single endpoint that supports all functionality. In particular, we need the
combination of:

   1. *full text search*: useful for typos or words that are spelled
      differently
   2. *multiple attributes (also via relationships) and term weighting*: to
      get the best results, we should also search on, for example, the
      column names of tables
   3. *custom ordering*: our entities have extra attributes that we would
      like to use to weight the results (for example, how often did a user
      look at the table, or how often is a table queried by a user?)
   4. *filtering on types*: because of lineage entities, we want to focus on
      Table entities, for example
   5. *deleted item filtering*: otherwise we get the same table 1000 times
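
To make the combination concrete, here is a minimal client-side sketch in
Python of how the five requirements could interact. It is not Atlas code:
the Entity class, its field names, the weights, and the log2 usage boost
are all assumptions chosen for illustration, and fuzzy matching is
approximated with difflib from the standard library.

```python
import math
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Entity:
    # Hypothetical, simplified stand-in for an Atlas entity.
    name: str
    type_name: str
    status: str = "ACTIVE"
    column_names: list = field(default_factory=list)
    total_usage: int = 0


def fuzzy(term, text):
    # Similarity in [0, 1]; tolerates typos and spelling variants (req. 1).
    return SequenceMatcher(None, term.lower(), text.lower()).ratio()


def score(term, entity, name_weight=3.0, column_weight=1.0):
    # Weighted match over several attributes, including column names that
    # would be reached via a relationship (req. 2).
    name_score = fuzzy(term, entity.name)
    column_score = max((fuzzy(term, c) for c in entity.column_names), default=0.0)
    text_score = name_weight * name_score + column_weight * column_score
    # Custom ordering: boost by usage, dampened with a logarithm (req. 3).
    return text_score * math.log2(entity.total_usage + 2)


def search(term, entities, type_name="Table"):
    # Keep only the requested type (req. 4) and drop deleted items (req. 5).
    candidates = [
        e for e in entities
        if e.type_name == type_name and e.status == "ACTIVE"
    ]
    return sorted(candidates, key=lambda e: score(term, e), reverse=True)
```

Calling search("custmer_orders", entities) on a mixed list would return
only active Table entities, with frequently used tables whose name or
column names resemble the misspelled term ranked first.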

Our team's focus is on Amundsen[1] as a data discovery frontend for Apache
Atlas. We are adding an Atlas backend to Amundsen in order to provide
efficient data discovery for our data scientists and data engineers.
Currently we are struggling to get search as good as it is in Amundsen.
Note that the Amundsen code base contains code specific to Apache Atlas, so
it would be nice if we could make improvements to, or even add a special
endpoint in, Apache Atlas. A mention in the documentation, with help on how
to set it up, would also be welcome.

Amundsen was started by Lyft to let data scientists discover which data is
available within their company. They explain the goals of Amundsen on their
blog[2]. The roadmap[3] and especially the design[4] illustrate an
effective way for users to discover data.

Search is essential for good data discovery: without good search, our data
scientists and data engineers cannot find the data. Amundsen uses
Elasticsearch by default, and for tables it uses the Elasticsearch query
at [5], which essentially captures our requirements for tables. We have
added the total_usage attribute via Ranger audit logs and will also include
an analytics attribute in the future.
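
For readers unfamiliar with this style of query, the following Python
sketch builds an Elasticsearch function_score request body of the general
kind described here. It is not a copy of the Amundsen query at [5]: the
field names other than total_usage, the boost values, and the function
name build_table_query are assumptions made for illustration.

```python
def build_table_query(term, size=10):
    """Return an Elasticsearch request body (a plain dict) for table search."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [
                            {
                                "multi_match": {
                                    "query": term,
                                    # Per-field boosts give term weighting;
                                    # these fields and weights are hypothetical.
                                    "fields": [
                                        "name^5",
                                        "description^3",
                                        "column_names^2",
                                    ],
                                    # Tolerate typos and spelling variants.
                                    "fuzziness": "AUTO",
                                }
                            }
                        ]
                    }
                },
                # Multiply the text score by a dampened usage count, so that
                # frequently used tables rank higher.
                "field_value_factor": {
                    "field": "total_usage",
                    "modifier": "log2p",
                    "missing": 0,
                },
                "boost_mode": "multiply",
            }
        },
    }
```

The resulting dict can be serialized with json.dumps and sent as the body
of a search request against an index that contains only non-deleted table
documents, which would cover the type and deleted-item filtering.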

Kind regards,

Nanne Wielinga

[1] https://github.com/lyft/amundsen
[2] https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
[3] https://github.com/lyft/amundsen/blob/master/docs/roadmap.md
[4] https://drive.google.com/drive/folders/12oBrcXUsDtOsuU_QvO93LTvs4Dehx6az
[5] https://github.com/lyft/amundsensearchlibrary/blob/9a5425d6efe76c3b28213cd1347f32d163a1118e/search_service/proxy/elasticsearch.py#L279