Posted to dev@spark.apache.org by Kapil Malik <ka...@snapdeal.com> on 2016/09/03 12:19:51 UTC

Catalog, SessionCatalog and ExternalCatalog in spark 2.0

Hi all,

I have a Spark SQL 1.6 application in production which does the following
on executing sqlContext.sql(...):
1. Identify the table name mentioned in the query
2. Use an external database to decide where the data is located, in which
format (Parquet, CSV or JDBC), etc.
3. Load the DataFrame
4. Register it as a temp table (for future calls to this table)

This is achieved by extending HiveContext and, correspondingly, HiveCatalog.
I have my own implementation of the trait "Catalog", which overrides the
"lookupRelation" method to do the magic behind the scenes.
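For context, the 1.6 override pattern described above looks roughly like
the following sketch. It assumes the Spark 1.6 catalyst `Catalog` trait
(`tableExists`, `registerTable`, `lookupRelation`); `loadFromExternalStore`
is a hypothetical placeholder for the external metadata-DB lookup:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.Catalog
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait ExternalLookupCatalog extends Catalog {
  // Hypothetical: consult the external DB for location/format and load
  // the data as a DataFrame (parquet / csv / jdbc).
  def loadFromExternalStore(tableIdent: TableIdentifier): DataFrame

  abstract override def lookupRelation(
      tableIdent: TableIdentifier,
      alias: Option[String] = None): LogicalPlan = {
    if (!tableExists(tableIdent)) {
      val df = loadFromExternalStore(tableIdent)
      // Step 4: register as a temp table so future calls find it directly
      registerTable(tableIdent, df.queryExecution.analyzed)
    }
    super.lookupRelation(tableIdent, alias)
  }
}
```

As a stackable trait (abstract override), this would be mixed into a
concrete Catalog implementation, much like 1.6's own OverrideCatalog.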

However, in Spark 2.0, I can see the following:
SessionCatalog - contains a lookupRelation method, but has no interface /
abstract class behind it.
ExternalCatalog - deals with CatalogTable instead of a DataFrame /
LogicalPlan.
Catalog - also doesn't expose any method to look up a DataFrame /
LogicalPlan.

So it looks like I need to extend SessionCatalog only. However, I just
wanted feedback on whether there is a better / recommended approach to
achieve this.
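A minimal sketch of that direction, for the record. It subclasses the
(internal, non-stable) 2.0 SessionCatalog via one of its auxiliary
constructors; `loadPlanFromExternalStore` is a hypothetical placeholder
for the external metadata-DB lookup:

```scala
import org.apache.spark.sql.catalyst.{CatalystConf, TableIdentifier}
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.catalog.{ExternalCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

class ExternalLookupSessionCatalog(
    externalCatalog: ExternalCatalog,
    functionRegistry: FunctionRegistry,
    conf: CatalystConf)
  extends SessionCatalog(externalCatalog, functionRegistry, conf) {

  // Hypothetical: build a LogicalPlan for the table from the external
  // metadata DB (location + format).
  private def loadPlanFromExternalStore(name: TableIdentifier): LogicalPlan = ???

  override def lookupRelation(
      name: TableIdentifier,
      alias: Option[String] = None): LogicalPlan = {
    if (!tableExists(name)) {
      // Register the loaded plan as a temp view for future calls
      createTempView(name.table, loadPlanFromExternalStore(name),
        overrideIfExists = true)
    }
    super.lookupRelation(name, alias)
  }
}
```

Caveat: SessionCatalog carries no compatibility guarantees, and wiring
the subclass in means building your own SessionState, which is likewise
internal in 2.0.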


Thanks and regards,


Kapil Malik
*Sr. Principal Engineer | Data Platform, Technology*
M: +91 8800836581 | T: 0124-4330000 | EXT: 20910
ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
Gurgaon | Haryana | India

*Disclaimer:* This communication is for the sole use of the addressee and
is confidential and privileged information. If you are not the intended
recipient of this communication, you are prohibited from disclosing it and
are required to delete it forthwith. Please note that the contents of this
communication do not necessarily represent the views of Jasper Infotech
Private Limited ("Company"). E-mail transmission cannot be guaranteed to be
secure or error-free as information could be intercepted, corrupted, lost,
destroyed, arrive late or incomplete, or contain viruses. The Company,
therefore, does not accept liability for any loss caused due to this
communication. *Jasper Infotech Private Limited, Registered Office: 1st
Floor, Plot 238, Okhla Industrial Estate, New Delhi - 110020 INDIA CIN:
U72300DL2007PTC168097*

Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

Posted by Kapil Malik <ka...@snapdeal.com>.
Thanks Raghavendra :)
Will look into Analyzer as well.




On Sat, Sep 3, 2016 at 7:27 PM, Raghavendra Pandey <
raghavendra.pandey@gmail.com> wrote:

> Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the
> ResolveRelations rule depends on it. To keep the design consistent, you
> may also want to implement ExternalCatalog.
> You can also look at plugging in your own Analyzer class to give you more
> flexibility; ultimately, that is where all relations get resolved from the
> SessionCatalog.

Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

Posted by Raghavendra Pandey <ra...@gmail.com>.
Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the
ResolveRelations rule depends on it. To keep the design consistent, you
may also want to implement ExternalCatalog.
You can also look at plugging in your own Analyzer class to give you more
flexibility; ultimately, that is where all relations get resolved from the
SessionCatalog.
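One hedged sketch of that Analyzer route, assuming the 2.0 catalyst
Rule / UnresolvedRelation APIs: an extra resolution rule that intercepts
unresolved relations for externally-managed tables. `ResolveExternalTables`,
`isExternallyManaged` and `loadPlanFromExternalStore` are invented names,
placeholders for the external metadata-DB lookup:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object ResolveExternalTables extends Rule[LogicalPlan] {
  // Hypothetical placeholders for the external metadata-DB lookup.
  private def isExternallyManaged(name: TableIdentifier): Boolean = ???
  private def loadPlanFromExternalStore(name: TableIdentifier): LogicalPlan = ???

  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    case u: UnresolvedRelation if isExternallyManaged(u.tableIdentifier) =>
      // Substitute the plan built from the external metadata DB
      loadPlanFromExternalStore(u.tableIdentifier)
  }
}
```

Such a rule could be returned from an Analyzer subclass's
extendedResolutionRules; note that the Analyzer / SessionState plumbing is
internal in 2.0, so this can break between releases.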
