You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2020/05/02 13:00:00 UTC

Modularising Spark/Scala program

Hi,

I have a Spark Scala program created and compiled with Maven. It works
fine. It basically does the following:

1. Reads an xml file from HDFS location
2. Creates a DF on top of what it reads
3. Creates a new DF with some columns renamed etc
4. Creates a new DF for rejected rows (incorrect value for a column)
5. Puts rejected data into Hive exception table
6. Puts valid rows into Hive main table
7. Nullifies the invalid rows by setting the invalid column to NULL and
puts the rows into the main Hive table

These are currently performed in one method. Ideally I want to break this
down as follows:

1. A method to read the XML file and creates DF and a new DF on top of
previous DF
2. A method to create a DF on top of rejected rows using t
3. A method to put invalid rows into the exception table using tmp table
4. A method to put the correct rows into the main table again using tmp
table

I was wondering if this is correct approach?

Thanks,

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

http://talebzadehmich.wordpress.com

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Modularising Spark/Scala program

Posted by Stephen Boesch <ja...@gmail.com>.

I neglected to include the rationale: the assumption is this will be a
repeatedly needed process thus a reusable method were helpful.  The
predicate/input rules that are supported will need to be flexible enough to
support the range of input data domains and use cases .  For my workflows
the predicates are typically sql's.

Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch <ja...@gmail.com>:

> Hi Mich!
>    I think you can combine the good/rejected into one method that
> internally:
>
>    - Create good/rejected df's given an input df and input
>    rules/predicates to apply to the df.
>    - Create a third df containing the good rows and the rejected rows
>    with the bad columns nulled out
>    - Append/insert the two dfs into their respective hive good/exception
>    tables
>    - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
>    or maybe just the (combinedDf,exceptionsDf)
>
>
> Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
> mich.talebzadeh@gmail.com>:
>
>>
>> Hi,
>>
>> I have a Spark Scala program created and compiled with Maven. It works
>> fine. It basically does the following:
>>
>>
>>    1. Reads an xml file from HDFS location
>>    2. Creates a DF on top of what it reads
>>    3. Creates a new DF with some columns renamed etc
>>    4. Creates a new DF for rejected rows (incorrect value for a column)
>>    5. Puts rejected data into Hive exception table
>>    6. Puts valid rows into Hive main table
>>    7. Nullifies the invalid rows by setting the invalid column to NULL
>>    and puts the rows into the main Hive table
>>
>> These are currently performed in one method. Ideally I want to break this
>> down as follows:
>>
>>
>>    1. A method to read the XML file and creates DF and a new DF on top
>>    of previous DF
>>    2. A method to create a DF on top of rejected rows using t
>>    3. A method to put invalid rows into the exception table using tmp
>>    table
>>    4. A method to put the correct rows into the main table again using
>>    tmp table
>>
>> I was wondering if this is correct approach?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>

Re: Modularising Spark/Scala program

Posted by Stephen Boesch <ja...@gmail.com>.

Hi Mich!
   I think you can combine the good/rejected into one method that
internally:

   - Create good/rejected df's given an input df and input rules/predicates
   to apply to the df.
   - Create a third df containing the good rows and the rejected rows with
   the bad columns nulled out
   - Append/insert the two dfs into their respective hive good/exception
   tables
   - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
   or maybe just the (combinedDf,exceptionsDf)


Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
mich.talebzadeh@gmail.com>:

>
> Hi,
>
> I have a Spark Scala program created and compiled with Maven. It works
> fine. It basically does the following:
>
>
>    1. Reads an xml file from HDFS location
>    2. Creates a DF on top of what it reads
>    3. Creates a new DF with some columns renamed etc
>    4. Creates a new DF for rejected rows (incorrect value for a column)
>    5. Puts rejected data into Hive exception table
>    6. Puts valid rows into Hive main table
>    7. Nullifies the invalid rows by setting the invalid column to NULL
>    and puts the rows into the main Hive table
>
> These are currently performed in one method. Ideally I want to break this
> down as follows:
>
>
>    1. A method to read the XML file and creates DF and a new DF on top of
>    previous DF
>    2. A method to create a DF on top of rejected rows using t
>    3. A method to put invalid rows into the exception table using tmp
>    table
>    4. A method to put the correct rows into the main table again using
>    tmp table
>
> I was wondering if this is correct approach?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>