Posted to user@spark.apache.org by Jeetendra Gangele <ga...@gmail.com> on 2015/08/04 07:14:00 UTC

Re: Data from PostgreSQL to Spark

Here is the solution; this looks perfect for me.
Thanks for all your help.

http://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
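
A minimal sketch of consuming such a change topic from Spark Streaming with the
direct Kafka API (Scala, Spark 1.x). The broker address and topic name below are
placeholders, and the Avro decoding of Bottled Water's messages is left out:

import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ChangeStreamConsumer {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("pg-change-stream"), Seconds(1))

    // Placeholders: point these at your Kafka brokers and table topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("my_table")

    // Direct stream of raw (key, value) byte arrays; Bottled Water encodes
    // the row changes as Avro, so real code would decode them here.
    val changes = KafkaUtils.createDirectStream[
      Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, kafkaParams, topics)

    changes.count().print()   // number of change records per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}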

On 28 July 2015 at 23:27, Jörn Franke <jo...@gmail.com> wrote:

> Can you put some transparent cache in front of the database? Or some JDBC
> proxy?
>
> On Tue, 28 Jul 2015 at 19:34, Jeetendra Gangele <ga...@gmail.com>
> wrote:
>
>> Can the source write to Kafka/Flume/HBase in addition to Postgres? No, it
>> can't; there are many applications producing this PostgreSQL data, and I
>> can't realistically ask all of those teams to start writing to some other
>> source as well.
>>
>>
>> Also, the velocity of the data is too high.
>>
>> On 28 July 2015 at 21:50, <sa...@gmail.com> wrote:
>>
>>> Sqoop's incremental data fetch will reduce the amount of data you need to
>>> pull from the source, but by the time that incremental fetch completes,
>>> won't the data already be stale again if its velocity is high?
>>>
>>> Maybe you can put a trigger in Postgres to send data to the big data
>>> cluster as soon as changes are made. Or, as I was saying in another email,
>>> can the source write to Kafka/Flume/HBase in addition to Postgres?
>>>
>>> Sent from Windows Mail
>>>
>>> From: Jeetendra Gangele <ga...@gmail.com>
>>> Sent: Tuesday, July 28, 2015 5:43 AM
>>> To: santoshv98@gmail.com
>>> Cc: ayan guha <gu...@gmail.com>, felixcheung_m@hotmail.com,
>>> user@spark.apache.org
>>>
>>> I am trying to do that, but there will always be a data mismatch, since by
>>> the time Sqoop is fetching, the main database will have received many more
>>> updates. There is an incremental data fetch in Sqoop, but it hits the
>>> database rather than reading the WAL.
>>>
>>>
>>>
>>> On 28 July 2015 at 02:52, <sa...@gmail.com> wrote:
>>>
>>>> Why can't you bulk pre-fetch the data to HDFS (for example using Sqoop)
>>>> instead of hitting Postgres multiple times?
>>>>
>>>> Sent from Windows Mail
>>>>
>>>> From: ayan guha <gu...@gmail.com>
>>>> Sent: Monday, July 27, 2015 4:41 PM
>>>> To: Jeetendra Gangele <ga...@gmail.com>
>>>> Cc: felixcheung_m@hotmail.com, user@spark.apache.org
>>>>
>>>> You can open the DB connection once per partition. Please have a look at
>>>> the design patterns for the foreachRDD construct in the documentation.
>>>> How big is your data in the DB? How often does it change? You would be
>>>> better off if the data were already in Spark.
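>>>>
>>>> A rough sketch of that pattern in Scala (one JDBC connection per
>>>> partition, inside foreachRDD; the URL, credentials, query and event
>>>> fields below are placeholders):
>>>>
>>>> import java.sql.DriverManager
>>>>
>>>> eventStream.foreachRDD { rdd =>
>>>>   rdd.foreachPartition { events =>
>>>>     // One connection per partition, not one per record.
>>>>     val conn = DriverManager.getConnection(
>>>>       "jdbc:postgresql://db-host:5432/mydb", "user", "password")
>>>>     try {
>>>>       val stmt = conn.prepareStatement(
>>>>         "SELECT * FROM lookup_table WHERE id = ?")
>>>>       events.foreach { event =>
>>>>         stmt.setLong(1, event.id)   // assumes the event carries an id
>>>>         val rs = stmt.executeQuery()
>>>>         // ... combine rs with the event ...
>>>>       }
>>>>     } finally {
>>>>       conn.close()
>>>>     }
>>>>   }
>>>> }
>>>>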
>>>> On 28 Jul 2015 04:48, "Jeetendra Gangele" <ga...@gmail.com> wrote:
>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> In parallel I will be hitting PostgreSQL with around 6,000 calls, which
>>>>> is not good; my database will die. These calls to the database will keep
>>>>> increasing. Handling millions of requests is not an issue with
>>>>> HBase/NoSQL.
>>>>>
>>>>> Any other alternative?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 July 2015 at 23:18, <fe...@hotmail.com> wrote:
>>>>>
>>>>>> You can have Spark read from PostgreSQL through the data sources API.
>>>>>> Do you have any concerns with that approach, since you mention copying
>>>>>> that data into HBase?
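>>>>>>
>>>>>> For example, with the JDBC data source (Spark 1.4+), something like the
>>>>>> sketch below loads a PostgreSQL table as a DataFrame; the connection
>>>>>> details and table name are placeholders, and the PostgreSQL JDBC driver
>>>>>> has to be on the classpath:
>>>>>>
>>>>>> val pgDF = sqlContext.read
>>>>>>   .format("jdbc")
>>>>>>   .option("url", "jdbc:postgresql://db-host:5432/mydb")
>>>>>>   .option("dbtable", "public.reference_table")
>>>>>>   .option("user", "user")
>>>>>>   .option("password", "password")
>>>>>>   .load()
>>>>>>
>>>>>> // Register it so it can also be queried from Spark SQL.
>>>>>> pgDF.registerTempTable("reference")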
>>>>>>
>>>>>> From: Jeetendra Gangele
>>>>>> Sent: Monday, July 27, 6:00 AM
>>>>>> Subject: Data from PostgreSQL to Spark
>>>>>> To: user
>>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> I have a use case where I am consuming events from RabbitMQ using Spark
>>>>>> Streaming. Each event has some fields on which I want to query
>>>>>> PostgreSQL, bring back the matching data, join it with the event data,
>>>>>> and put the aggregated result into HDFS, so that I can run analytics
>>>>>> queries over it using Spark SQL.
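>>>>>>
>>>>>> Roughly, the pipeline I have in mind looks like the sketch below (the
>>>>>> Event case class, the join key and the output path are placeholders, and
>>>>>> pgDF stands for the PostgreSQL data already loaded as a DataFrame):
>>>>>>
>>>>>> case class Event(key: String, value: Double)
>>>>>>
>>>>>> eventStream.foreachRDD { rdd =>
>>>>>>   import sqlContext.implicits._      // sqlContext assumed in scope
>>>>>>   val eventsDF = rdd.toDF()          // rdd is an RDD[Event]
>>>>>>
>>>>>>   // Join each micro-batch of events with the PostgreSQL reference data.
>>>>>>   val joined = eventsDF.join(pgDF, "key")
>>>>>>
>>>>>>   // Append the enriched result to HDFS so Spark SQL can query it later.
>>>>>>   joined.write.mode("append").parquet("hdfs:///data/enriched_events")
>>>>>> }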
>>>>>>
>>>>>> My concern is that this PostgreSQL database is a production system, so
>>>>>> I don't want to hit it too many times.
>>>>>>
>>>>>> In any given second I may have 3,000 events, which means I would need
>>>>>> to fire 3,000 parallel queries against my PostgreSQL database, and this
>>>>>> volume keeps growing, so my database would go down.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I can't migrate this PostgreSQL data since lots of systems are using
>>>>>> it, but I could copy the data into a NoSQL store such as HBase and query
>>>>>> HBase instead; the issue there is, how can I make sure HBase has
>>>>>> up-to-date data?
>>>>>>
>>>>>> Can anyone suggest the best approach/method to handle this case?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Jeetendra
>>>>>>
>>>>>>