Posted to user@spark.apache.org by rajat kumar <ku...@gmail.com> on 2022/12/27 19:24:11 UTC

Profiling data quality with Spark

Hi Folks
Hoping you are doing well. I want to implement data quality checks to detect
issues in data in advance. I have heard about a few frameworks like GE/Deequ.
Can anyone please suggest which one is good and how to get started with it?

Regards
Rajat

Re: Profiling data quality with Spark

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
Rajat,

You might want to read about Data Sentinel, a Spark-based data validation
tool developed at LinkedIn.

https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation

The project is not open source, but the blog post might give you insights
about how such a system could be built.

Thanks,
Walaa.


Re: Profiling data quality with Spark

Posted by Chitral Verma <ch...@gmail.com>.
Hi Rajat,
I have worked for years on democratizing data quality at some of the top
organizations, and I'm also an Apache Griffin contributor and PMC member - so
I know this space well. :)

Coming back to your original question, there are a lot of data quality
options available in the market today. I'm listing some of my top
recommendations below, with additional comments.

*Proprietary Solutions*

   - MonteCarlo <https://www.montecarlodata.com/>
      - Pros: State-of-the-art DQ solution with multiple deployment models,
      lots of connectors, SOC 2 compliant, and handles the complete DQ
      lifecycle including monitoring and alerting.
      - Cons: Not open source; cannot be a completely on-prem solution.
   - Anomalo <https://www.anomalo.com/>
      - Pros: One of the best UIs for DQ management and operations.
      - Cons: Same as Monte Carlo - not open source, cannot be a completely
      on-prem solution.
   - Collibra
   <https://www.collibra.com/us/en/products/data-quality-and-observability>
      - Pros: Predominantly a data cataloging solution, Collibra now offers
      full data governance with its DQ offerings.
      - Cons: In my opinion, connectors can get a little pricey over time
      with usage. The Monte Carlo cons apply to Collibra as well.
   - IBM Solutions <https://www.ibm.com/in-en/data-quality>
      - Pros: Lots of offerings in the DQ space; comes with a UI; has
      profiling and other features built in. It's a solution for complete DQ
      management.
      - Cons: Proprietary solution, which can result in vendor lock-in.
      Customizations and extensions may be difficult.
   - Informatica Data Quality tool
   <https://www.informatica.com/in/products/data-quality.html>
      - Pros: Comes with a UI; has profiling and other features built in.
      It's a solution for complete DQ management.
      - Cons: Proprietary solution, which can result in vendor lock-in.
      Customizations and extensions may be difficult.

*Open Source Solutions*

   - Great Expectations <https://greatexpectations.io/>
      - Pros: Built for technical users who want to code DQ to their
      requirements; easy to extend via code; lots of connectors and
      "expectations" or checks are available out of the box. Fits nicely in a
      Python environment with or without PySpark. Can be made to fit in most
      stacks.
      - Cons: No UI, no alerting or monitoring. However, see the
      recommendation section below for more info on how to get around this.
      - Note: They are coming up with a cloud offering as well in 2023.
   - Amazon Deequ <https://github.com/awslabs/deequ>
      - Pros: Actively maintained project that allows technical users to
      code checks using this project as a base library. Contains a profiler,
      anomaly detection, etc. Runs checks using Spark. PyDeequ is available
      for Python users (see the sketch after this list).
      - Cons: Like Great Expectations, it's a library, not a whole end-to-end
      DQ platform.
   - Apache Griffin <https://github.com/apache/griffin/>
      - Pros: Aims to be a complete open source DQ platform with support
      for lots of streaming and batch datasets. Runs checks using Spark.
      - Cons: Not actively maintained these days due to a lack of
      contributors.
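
For a quick feel of the Deequ side, here is a minimal PyDeequ profiling
sketch; spark, the DataFrame df, and its columns are placeholders, and the
Deequ jar must be on the Spark classpath (see the PyDeequ README):

    from pydeequ.profiles import ColumnProfilerRunner

    # Profile every column of an existing Spark DataFrame df:
    # completeness, approximate distinct counts, inferred types, and
    # min/max/mean/stddev for numeric columns.
    result = ColumnProfilerRunner(spark) \
        .onData(df) \
        .run()

    for col, profile in result.profiles.items():
        print(col, profile)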

*Recommendation*

   - Make some choices like the ones below to narrow down the offerings:
      - Buy or build the solution?
      - Cloud dominant, mostly on-prem, or hybrid?
      - For technical users, non-technical users, or both?
      - Automated workflows or manual custom workflows?
   - For a buy + cloud-dominant + hybrid-users + automation kind of choice,
   my recommendation would be to go with Monte Carlo or Anomalo. Otherwise,
   one of the open source offerings.
   - For Great Expectations, there is a guide available to push DQ results
   to the open source Datahub <https://datahubproject.io/> catalog. This
   combination vastly extends the reach of Great Expectations as a tool: you
   get a UI, and for the missing pieces you can connect with other solutions.
   This Great Expectations + Datahub combination delivers solid value and is
   basically equivalent to a lot of proprietary offerings like Collibra.
   However, this requires some engineering.

*Other Notable mentions*

   - https://www.bigeye.com/
   - https://www.soda.io/

Hope this long note clarifies things for you. :)


Re: Profiling data quality with Spark

Posted by infa elance <in...@gmail.com>.
You can also look at Informatica Data Quality, which runs on Spark. Of course
it's not free, but you can sign up for a 30-day free trial. They have both
profiling and prebuilt data quality rules and accelerators.

Sent from my iPhone


Re: Profiling data quality with Spark

Posted by vaquar khan <va...@gmail.com>.
@Gourav Sengupta, why are you sending unnecessary emails? If you think
Snowflake is good, please use it. The question here was different, and you
are talking about a totally different topic.

Please respect the group guidelines.


Regards,
Vaquar khan


Re: Profiling data quality with Spark

Posted by vaquar khan <va...@gmail.com>.
Here you can find all the details. You just need to pass a Spark DataFrame,
and Deequ also generates rule recommendations; you can write custom complex
rules as well.

https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
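
As a rough illustration of that flow, a minimal PyDeequ sketch (the
DataFrame df and its columns are hypothetical, and the Deequ jar must be on
the Spark classpath, per the PyDeequ README):

    from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    # 1. Let Deequ suggest constraint rules from the data itself.
    suggestions = ConstraintSuggestionRunner(spark) \
        .onData(df) \
        .addConstraintRule(DEFAULT()) \
        .run()

    # 2. Run hand-written checks against the same DataFrame.
    check = Check(spark, CheckLevel.Error, "orders checks") \
        .isComplete("order_id") \
        .isUnique("order_id") \
        .isNonNegative("amount")

    result = VerificationSuite(spark) \
        .onData(df) \
        .addCheck(check) \
        .run()

    VerificationResult.checkResultsAsDataFrame(spark, result).show()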

Regards,
Vaquar khan


Re: Profiling data quality with Spark

Posted by rajat kumar <ku...@gmail.com>.
Thanks for the input folks.

Hi Vaquar ,

I saw that we have various types of checks in GE and Deequ. Could you
please suggest what types of checks you used for metric-based columns?


Regards
Rajat
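
For concreteness, by metric-based checks I mean computing column metrics
with analyzers and asserting on them; a rough PyDeequ sketch (the DataFrame
df and the amount column are hypothetical):

    from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext,
                                   Completeness, Mean, StandardDeviation)

    # Compute a few metrics over a numeric column of a DataFrame df.
    result = AnalysisRunner(spark) \
        .onData(df) \
        .addAnalyzer(Completeness("amount")) \
        .addAnalyzer(Mean("amount")) \
        .addAnalyzer(StandardDeviation("amount")) \
        .run()

    AnalyzerContext.successMetricsAsDataFrame(spark, result).show()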


Re: Profiling data quality with Spark

Posted by vaquar khan <va...@gmail.com>.
I would suggest Deequ; I have implemented it many times, and it is easy and
effective.


Regards,
Vaquar khan


Re: Profiling data quality with Spark

Posted by ayan guha <gu...@gmail.com>.
The way I would approach this is to evaluate GE, Deequ (there is a Python
binding called PyDeequ) and others like Delta Live Tables with expectations,
from a data quality feature perspective. All these tools have their pros and
cons, and all of them are compatible with Spark as a compute engine.

Also, you may want to look at dbt-based DQ toolsets if SQL is your thing.
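
For instance, Delta Live Tables expectations look roughly like this - a
sketch that assumes a Databricks DLT pipeline and a hypothetical raw_orders
source table:

    import dlt

    @dlt.table(comment="Orders that passed basic quality gates")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop bad rows
    @dlt.expect("non_negative_amount", "amount >= 0")  # log violations, keep rows
    def clean_orders():
        # spark is provided by the DLT runtime
        return spark.read.table("raw_orders")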

--
Best Regards,
Ayan Guha

Re: Profiling data quality with Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Sean,

the entire narrative of Spark being a unified analytics tool falls flat:
what should have been an engine on Spark has been deliberately floated off
as a separate company, Ray, and the whole unified narrative rings hollow.

Spark is little more than a SQL engine; per Spark's own conference, around
95% or more of users (in case I am not wrong) use the SQL interface :)

I have seen engineers split hairs over simple operations that take minutes
in Snowflake or Redshift, just because of Spark configurations, shuffles,
and other operations.

This matters in the industry today, sadly, because people are being laid
off from their jobs while one-dollar simple solutions have to be run like
rocket science.

So my suggestion is to decouple your data quality solution from Spark;
sooner or later everyone is going to see that saving jobs and saving money
makes sense :)

Regards,
Gourav Sengupta


Re: Profiling data quality with Spark

Posted by Sean Owen <sr...@gmail.com>.
I think this is kind of mixed up. Data warehouses are simple SQL creatures;
Spark is (also) a distributed compute framework. It's kind of like comparing
a web server to Java.
Are you thinking of Spark SQL? Then, sure, you may well find it more
complicated, but it's also just a data-warehousey SQL surface.

But none of that relates to the question of data quality tools. You could
use GE with Redshift, or indeed with Spark - are you familiar with it? It's
probably one of the most common tools people use with Spark for this, in
fact. It's just a Python lib at heart and you can apply it with Spark, but
_not_ with a data warehouse, so I'm not sure what you're getting at.

Deequ is also commonly seen. It's actually built on Spark, so again, I'm
confused about this "use Redshift or Snowflake not Spark".
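
As a sketch of applying GE directly to a Spark DataFrame, using the dataset
API GE shipped at the time (the DataFrame df and its columns are
hypothetical):

    from great_expectations.dataset import SparkDFDataset

    # Wrap an existing Spark DataFrame and attach expectations to it.
    gdf = SparkDFDataset(df)
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)

    # Evaluate all expectations and inspect the aggregate result.
    results = gdf.validate()
    print(results.success)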


Re: Profiling data quality with Spark

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Spark is just another querying engine with a lot of hype.

I would highly suggest using Redshift (in storage-and-compute-decoupled mode)
or Snowflake, without all this super-complicated understanding of containers/
disk space, mind-numbing variables, rocket-science tuning, hair-splitting
failure scenarios, etc. After that, try solutions like Athena or Trino/
Presto, and then come to Spark.

Try out solutions like "great expectations" if you are looking for data
quality, are not entirely sucked into the world of Spark, and want to keep
your options open.

Don't get me wrong: Spark used to be great in 2016-2017, but there are
superb alternatives now, and the industry, in this recession, should focus
on getting more value for every single dollar it spends.

Best of luck.

Regards,
Gourav Sengupta


Re: Profiling data quality with Spark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well, you need to qualify your statement on data quality. Are you talking
about data lineage here?

HTH



view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



