Posted to user@spark.apache.org by Aakash Basu <ra...@gmail.com> on 2016/07/19 19:27:57 UTC

Little idea needed

Hi all,

I'm trying to pull a full table from Oracle, which is huge, with some 10 million records; this will be the initial load to HDFS.

Then I will do delta loads every day into the same folder in HDFS.

Now, my query here is:

DAY 0 - I did the initial load (full dump).

DAY 1 - I'll load only that day's data, which has, suppose, 10 records (5 old rows with some column values altered and 5 new).

Here, my question is: how do I push this file to HDFS through Spark code? If I do an append, it will create duplicates (which I don't want). If I keep separate files, then when using the data in another program I give the path of the folder which contains all the files; but in this case too, the registerTempTable will have duplicates for those 5 old rows.
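To make the scenario concrete, here is a rough sketch of the pattern I mean, on Spark 1.x; the paths, table name and customer_id key below are just illustrative, and I'm assuming Parquet files:

import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumes an existing SparkContext `sc`; all paths and column names are illustrative.
val sqlContext = new SQLContext(sc)

// Day 0: the full dump already sits in /data/customers as Parquet.
// Day 1: append that day's 10-record delta into the same folder.
val delta = sqlContext.read.parquet("/staging/customers_day1")
delta.write.mode(SaveMode.Append).parquet("/data/customers")

// A downstream program reads the whole folder and registers it as a temp table.
val all = sqlContext.read.parquet("/data/customers")
all.registerTempTable("customers")

// The 5 updated rows now appear twice: once from the full dump, once from the delta.
sqlContext.sql(
  "SELECT customer_id, COUNT(*) AS copies FROM customers " +
  "GROUP BY customer_id HAVING COUNT(*) > 1").show()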

What is the BEST logic to be applied here?

I tried to resolve this by searching that file for matching records and, where they match, loading the new ones after deleting the old; but that will be time-consuming for such a huge data set, right?

Please help!

Thanks,
Aakash.

Re: Little idea needed

Posted by Aakash Basu <ra...@gmail.com>.
Thanks for the detailed description, buddy. But this will actually be done through NiFi (end to end), so we need to add the delta logic inside NiFi to automate the whole process.

That's why we need a good (ideally the best) solution to this problem, since this is a classic issue we can face at any company we work with.

Re: Little idea needed

Posted by Mich Talebzadeh <mi...@gmail.com>.
In reality, true real-time analytics will require interrogating the transaction (redo) log of the RDBMS database to look for changes.

An RDBMS will only keep one current record (the most recent), so if a record has been deleted since the last import into HDFS, that record will no longer exist.

If a record has been updated since the last import, there could have been multiple updates, and it is almost impossible to see which records have changed since then.

So it is going to be tedious without an automated mechanism that reads the transaction log of the RDBMS database, converts it into SQL statements (insert/update/delete) and sends the data to the Big Data side (Hive, Spark, whatever).

The commercial tools work, but as yet I have not seen anything open source that can hook into the redo log of an Oracle database, convert it into SQL and send it to Hive etc.

Maybe Hortonworks Data Flow (HDF) in a recent edition has such a thing.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Little idea needed

Posted by ayan guha <gu...@gmail.com>.
Well, this one keeps cropping up in every project, especially when Hadoop is implemented alongside an MPP database.
In fact, there is no reliable out-of-the-box update operation available in HDFS, Hive or Spark.
Hence, one approach is what Mich suggested: do not update; rather, just keep all source records, timestamping their arrival.
Another way is, if you think of it as a data warehouse with open and closed records, to create a partition in Hive only for the open records. You can then refresh that partition in every run.
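As a rough sketch of that idea (the table, column and staging names here are only illustrative, and it assumes Hive with ORC and an existing SparkContext `sc`):

import org.apache.spark.sql.hive.HiveContext

// Target table partitioned by a status flag: 'open' = current, 'closed' = history.
val hiveContext = new HiveContext(sc)

hiveContext.sql("""
  CREATE TABLE IF NOT EXISTS customers_dw (
    customer_id BIGINT,
    name        STRING,
    arrival_ts  TIMESTAMP)
  PARTITIONED BY (status STRING)
  STORED AS ORC""")

// Each run: rebuild only the 'open' partition with the latest arrival per key,
// reading from a staging table that keeps every record with its arrival timestamp.
hiveContext.sql("""
  INSERT OVERWRITE TABLE customers_dw PARTITION (status = 'open')
  SELECT customer_id, name, arrival_ts
  FROM (SELECT customer_id, name, arrival_ts,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY arrival_ts DESC) AS rn
        FROM customers_staging) t
  WHERE rn = 1""")

The closed records keep accumulating as history, so nothing is lost by overwriting only the open partition.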

Re: Little idea needed

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well this is a classic.

The initial load can be done through Sqoop (outside of Spark) or through a JDBC connection in Spark; 10 million rows is nothing.
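As a rough sketch of the JDBC route (the connection details, table and key names below are only illustrative, and the Oracle JDBC driver is assumed to be on the classpath), the one-off full load could look like this:

import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc`.
val sqlContext = new SQLContext(sc)

val fullDump = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.CUSTOMERS")
  .option("user", "scott")
  .option("password", "tiger")
  // Split the ~10M-row read across executors on the unique key.
  .option("partitionColumn", "CUSTOMER_ID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

fullDump.write.parquet("/data/customers")   // the Day-0 full dump on HDFS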

Then you have to think of updates and deletes in addition to new rows.

With Sqoop you can load from the last ID in the source table, assuming that you have a unique key in your Oracle table.

If you have 10 new rows, I assume you know how to load these rows from Oracle.

I suggest that you add two additional columns to your HDFS/target table:

,op_type                 int
,op_time                 timestamp

These two columns will specify the row type (op_type = 1, 2, 3 for INSERT/UPDATE/DELETE) and when the record was added (op_time = cast(from_unixtime(unix_timestamp()) AS timestamp)).

So you will end up with two additional columns in your HDFS table compared to the Oracle table, and that will be your staging table.

Of course, you can do real-time analytics through Oracle GoldenGate, which reads the redo log of the source table in Oracle, or better, SAP Replication Server (SRS). That way you will achieve real-time integration between RDBMS tables and Big Data.

Once you have the staging table (immutable), the rest is pretty easy. You have the full Entity Life History for the records in this case, and you can run your queries on them.
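As a rough sketch of such a query (the path and key/column names are illustrative, and it assumes a HiveContext `hiveContext`, which window functions need on Spark 1.x), the current view of the data could be derived like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep the most recent operation per key and drop keys whose latest op_type is 3 (DELETE).
val staging = hiveContext.read.parquet("/data/customers_staging")

val latestFirst = Window.partitionBy("customer_id").orderBy(col("op_time").desc)

val currentView = staging
  .withColumn("rn", row_number().over(latestFirst))
  .filter("rn = 1 AND op_type <> 3")   // newest row per key, unless it was deleted
  .drop("rn")

currentView.registerTempTable("customers_current")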

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Little idea needed

Posted by Aakash Basu <ra...@gmail.com>.
Regarding your second point: that's going to be a bottleneck for all the programs which fetch the data from that folder and then have to add extra filters on the DataFrame. I want to take care of that there itself.

And that merge logic is weak when one table is huge and the other is very small (which is the case here); it literally gulps memory and time.

And the business won't allow Hive and the rest to be used AT ALL, since we may shift to EMR, where Hive perhaps has compatibility issues (need to check).

Re: Little idea needed

Posted by Jörn Franke <jo...@gmail.com>.
Well, as far as I know there is an update statement planned for Spark, but I am not sure for which release. You could alternatively use Hive + ORC.
Another alternative would be to add the deltas in separate files and, when accessing the table, filter out the duplicate entries. From time to time you could have a merge process creating one file out of all the deltas.
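A rough sketch of what that could look like in Spark (the paths, key and timestamp columns are illustrative, and a HiveContext `hiveContext` is assumed for the window function on Spark 1.x):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val base   = hiveContext.read.parquet("/data/customers/base")
val deltas = hiveContext.read.parquet("/data/customers/deltas")   // the small daily delta files

// On access: union everything and keep only the newest version of each key.
val newestFirst = Window.partitionBy("customer_id").orderBy(col("load_ts").desc)
val deduped = base.unionAll(deltas)
  .withColumn("rn", row_number().over(newestFirst))
  .filter("rn = 1")
  .drop("rn")

deduped.registerTempTable("customers")   // downstream code sees no duplicates

// From time to time: compact into a fresh base, then swap it in and clear the delta folder.
deduped.write.mode(SaveMode.Overwrite).parquet("/data/customers/base_compacted")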
