Posted to user@spark.apache.org by Jeetendra Gangele <ga...@gmail.com> on 2015/07/22 14:47:41 UTC

Need help in SparkSQL

Hi all,

I have a few TB of data in MongoDB that I want to migrate to HDFS so that I
can run complex analytical queries on it, for example AND queries that
involve multiple fields.

So my question is: in which format should I store the data in HDFS so that
processing is fast for this kind of query?


Regards
Jeetendra

RE: Need help in SparkSQL

Posted by Mohammed Guller <mo...@glassbeam.com>.

Parquet

Mohammed

Re: Need help in SparkSQL

Posted by ayan guha <gu...@gmail.com>.

Another typical solution is to build a search index using Elasticsearch and
use it as a secondary index for HBase.
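
To illustrate the secondary-index idea, here is a minimal sketch in plain Python. It is not Elasticsearch or HBase client code: a dict stands in for the HBase table (row key to record), and an inverted index mapping (field, value) to a set of row keys stands in for the search index. All field names and sample rows are hypothetical.

```python
from collections import defaultdict

# Stand-in for the HBase table: row key -> record (hypothetical sample data).
table = {
    "row1": {"area": "north", "flat_type": "1BHK", "event": "visit"},
    "row2": {"area": "north", "flat_type": "2BHK", "event": "visit"},
    "row3": {"area": "south", "flat_type": "1BHK", "event": "purchase"},
    "row4": {"area": "north", "flat_type": "1BHK", "event": "purchase"},
}

def build_index(rows):
    """Build an inverted index: (field, value) -> set of row keys."""
    index = defaultdict(set)
    for row_key, record in rows.items():
        for field, value in record.items():
            index[(field, value)].add(row_key)
    return index

def and_query(index, rows, **conditions):
    """Answer an AND query by intersecting key sets, then fetching rows by key."""
    key_sets = [index.get((f, v), set()) for f, v in conditions.items()]
    matching = set.intersection(*key_sets) if key_sets else set(rows)
    return {k: rows[k] for k in sorted(matching)}

index = build_index(table)
result = and_query(index, table, area="north", flat_type="1BHK")
print(sorted(result))  # ['row1', 'row4']
```

The point of the design is that the multi-field filtering happens in the index, and the main table is only touched for point look-ups by row key.
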
On 23 Jul 2015 15:50, "Jörn Franke" <jo...@gmail.com> wrote:

> I do not think you can put all your queries into the row key without
> duplicating the data for each query. However, that would be more of a
> last resort.
>
> Have you checked out Phoenix for HBase? This might suit your needs. It
> makes things much simpler, because it provides SQL on top of HBase.
>
> Nevertheless, Hive could also be a viable alternative, depending on how
> often you run queries, etc.

Re: Need help in SparkSQL

Posted by Jörn Franke <jo...@gmail.com>.

I do not think you can put all your queries into the row key without
duplicating the data for each query. However, that would be more of a
last resort.

Have you checked out Phoenix for HBase? This might suit your needs. It
makes things much simpler, because it provides SQL on top of HBase.

Nevertheless, Hive could also be a viable alternative, depending on how
often you run queries, etc.
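
A rough sketch of why one row key cannot serve every query: HBase keeps rows sorted by key, so a scan is only efficient on a key prefix, and a second access pattern needs a second copy of the data under a different key. The code below is plain Python with a sorted list standing in for the table; the key layouts, field names, and sample events are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical visit events: (area, flat_type, user).
events = [
    ("north", "1BHK", "u1"),
    ("north", "2BHK", "u2"),
    ("south", "1BHK", "u3"),
    ("south", "1BHK", "u1"),
]

def make_table(key_fields_order):
    """Store every event under a composite key, sorted by key like HBase rows."""
    names = ("area", "flat_type", "user")
    rows = []
    for event in events:
        record = dict(zip(names, event))
        key = "|".join(record[f] for f in key_fields_order)
        rows.append((key, record))
    rows.sort(key=lambda kv: kv[0])
    return rows

def prefix_scan(rows, prefix):
    """Range scan: jump to the first key >= prefix, read while keys share it."""
    keys = [k for k, _ in rows]
    start = bisect_left(keys, prefix)
    out = []
    for key, record in rows[start:]:
        if not key.startswith(prefix):
            break
        out.append(record)
    return out

# Copy 1 is keyed area-first: efficient for "all events in an area".
by_area = make_table(("area", "flat_type", "user"))
# Copy 2 duplicates the same data keyed user-first: "all events of a user".
by_user = make_table(("user", "area", "flat_type"))

print(len(prefix_scan(by_area, "south|")))  # 2
print(len(prefix_scan(by_user, "u1|")))     # 2
```

Asking the area-first table for a user would match no key prefix, which is exactly the case where a full scan or a duplicated table becomes necessary.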

On Thu, 23 Jul 2015 at 7:14, Jeetendra Gangele <ga...@gmail.com> wrote:

> The queries will be something like this:
>
> 1. How many users visited a 1 BHK flat in the last hour in a given area?
> 2. How many visitors were there for flats in a given area?
> 3. List all users who bought a given property in the last 30 days.
>
> Beyond that, the queries may get more complex and involve multiple
> parameters.
>
> The problem with HBase is designing a row key that retrieves this data
> efficiently.
>
> Since I have multiple fields to query on, HBase may not be a good choice?
>
> I don't want to iterate over the result set that HBase returns and compute
> the result myself, because that will kill performance.

Re: Need help in SparkSQL

Posted by Jeetendra Gangele <ga...@gmail.com>.

The queries will be something like this:

1. How many users visited a 1 BHK flat in the last hour in a given area?
2. How many visitors were there for flats in a given area?
3. List all users who bought a given property in the last 30 days.

Beyond that, the queries may get more complex and involve multiple
parameters.

The problem with HBase is designing a row key that retrieves this data
efficiently.

Since I have multiple fields to query on, HBase may not be a good choice?

I don't want to iterate over the result set that HBase returns and compute
the result myself, because that will kill performance.
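
For concreteness, queries 1 and 3 above might look like the following as SQL. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in engine, and the `events` table, its columns, and the sample rows are assumptions, since the thread does not give a schema. Counting distinct users for query 1 is also one interpretation of the question. The same WHERE ... AND ... shape carries over to Spark SQL, though date handling would differ there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, event TEXT, flat_type TEXT, "
    "area TEXT, property_id TEXT, ts INTEGER)"  # ts = Unix epoch seconds
)

now = 1_437_600_000  # fixed reference time, so the example is reproducible
hour, day = 3600, 86400
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("u1", "visit", "1BHK", "north", "p1", now - 600),
        ("u1", "visit", "1BHK", "north", "p1", now - 1200),      # same user again
        ("u2", "visit", "1BHK", "north", "p2", now - 2 * hour),  # too old
        ("u3", "visit", "2BHK", "north", "p3", now - 300),       # other flat type
        ("u4", "purchase", "1BHK", "south", "p1", now - 10 * day),
        ("u5", "purchase", "1BHK", "south", "p1", now - 40 * day),  # too old
    ],
)

# Query 1: distinct users who visited a 1 BHK flat in an area in the last hour.
visitors = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events "
    "WHERE event = 'visit' AND flat_type = '1BHK' AND area = ? AND ts >= ?",
    ("north", now - hour),
).fetchone()[0]

# Query 3: users who bought a given property in the last 30 days.
buyers = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_id FROM events "
        "WHERE event = 'purchase' AND property_id = ? AND ts >= ? "
        "ORDER BY user_id",
        ("p1", now - 30 * day),
    )
]

print(visitors, buyers)  # 1 ['u4']
```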

On 23 July 2015 at 01:02, Jörn Franke <jo...@gmail.com> wrote:

> Can you provide an example of an AND query? If you only do look-ups, you
> should try HBase/Phoenix; otherwise you can try ORC with storage indexes
> and/or compression, but this depends on what your queries look like.

Re: Need help in SparkSQL

Posted by Jörn Franke <jo...@gmail.com>.

Can you provide an example of an AND query? If you only do look-ups, you
should try HBase/Phoenix; otherwise you can try ORC with storage indexes
and/or compression, but this depends on what your queries look like.
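
The idea behind a storage index, roughly, is that the file keeps min/max statistics for each block of rows, and a reader skips any block whose range cannot satisfy the predicate. The sketch below illustrates that idea in plain Python; it is not the ORC file format or its API, and the block size and data are made up.

```python
def build_blocks(values, block_size):
    """Split values into blocks and record min/max statistics for each block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks

def scan_range(blocks, low, high):
    """Return values in [low, high], reading only blocks whose range overlaps."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # statistics prove no row can match: skip without reading
        blocks_read += 1
        hits.extend(v for v in block["rows"] if low <= v <= high)
    return hits, blocks_read

# Hypothetical timestamps, written in sorted order so block ranges do not overlap.
timestamps = list(range(100))
blocks = build_blocks(timestamps, block_size=10)

hits, blocks_read = scan_range(blocks, 42, 57)
print(len(hits), blocks_read, len(blocks))  # 16 2 10
```

Skipping of this kind pays off most when the data is sorted or clustered on the filtered column; on randomly ordered data almost every block's range overlaps the predicate and little is skipped.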
