Posted to user@spark.apache.org by Jem Tucker <je...@gmail.com> on 2015/07/16 10:00:30 UTC

Indexed Store for lookup table

Hello,

I have been using IndexedRDD as a large lookup table (1 billion records) to
join with small tables (1 million rows). The performance of IndexedRDD is
great until it has to be persisted to disk. Are there any alternatives to
IndexedRDD, or any changes to how I use it, that would improve performance
with big data volumes?

Kindest Regards,

Jem

Re: Indexed Store for lookup table

Posted by Jem Tucker <je...@gmail.com>.
Thanks!

On Thu, Jul 16, 2015 at 1:59 PM Vetle Leinonen-Roeim <ve...@roeim.net>
wrote:

> By the way - if you're going this route, see
> https://github.com/datastax/spark-cassandra-connector

Re: Indexed Store for lookup table

Posted by Vetle Leinonen-Roeim <ve...@roeim.net>.
By the way - if you're going this route, see
https://github.com/datastax/spark-cassandra-connector

On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim <ve...@roeim.net>
wrote:

> You'll probably have to install it separately.

Re: Indexed Store for lookup table

Posted by Vetle Leinonen-Roeim <ve...@roeim.net>.
You'll probably have to install Cassandra separately.

On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker <je...@gmail.com> wrote:

> Hi Vetle,
>
> IndexedRDD is persisted in the same way RDDs are, as far as I am aware. Do
> you know whether Cassandra can be built into my application, or does it have
> to be a standalone database that is installed separately?
>
> Thanks,
>
> Jem

Re: Indexed Store for lookup table

Posted by Jem Tucker <je...@gmail.com>.
Hi Vetle,

IndexedRDD is persisted in the same way RDDs are, as far as I am aware. Do
you know whether Cassandra can be built into my application, or does it have
to be a standalone database that is installed separately?

Thanks,

Jem

On Thu, Jul 16, 2015 at 12:59 PM Vetle Leinonen-Roeim <ve...@roeim.net>
wrote:

> Hi,
>
> Not sure how IndexedRDD is persisted, but perhaps you're better off using
> a NoSQL database for lookups (for example Cassandra, with the Cassandra
> connector)? That should give you good performance on lookups, but
> persisting those billion records sounds like something that will take some
> time in any case.
>
> Regards,
> Vetle

Re: Indexed Store for lookup table

Posted by Vetle Leinonen-Roeim <ve...@roeim.net>.
Hi,

Not sure how IndexedRDD is persisted, but perhaps you're better off using a
NoSQL database for lookups (for example Cassandra, with the Cassandra
connector)? That should give you good performance on lookups, but
persisting those billion records sounds like something that will take some
time in any case.

Regards,
Vetle


On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker <je...@gmail.com> wrote:

> Hello,
>
> I have been using IndexedRDD as a large lookup table (1 billion records) to
> join with small tables (1 million rows). The performance of IndexedRDD is
> great until it has to be persisted to disk. Are there any alternatives to
> IndexedRDD, or any changes to how I use it, that would improve performance
> with big data volumes?
>
> Kindest Regards,
>
> Jem
>
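
The suggestion in this thread, keeping the large lookup table in an external
indexed store and probing it with the keys of the small table, can be
sketched roughly as follows. This is only an illustration of the access
pattern, not Spark or Cassandra code: sqlite3 from the Python standard
library stands in for the external store, and the names build_lookup and
lookup_join, the table layout, and the batch size are all assumptions made
for the example.

```python
import os
import sqlite3
import tempfile


def build_lookup(path, records):
    """Persist (key, value) pairs on disk. The PRIMARY KEY provides the index.

    sqlite3 is a stand-in for an external store such as Cassandra (assumption).
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lookup (key INTEGER PRIMARY KEY, value TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO lookup VALUES (?, ?)", records)
    conn.commit()
    return conn


def lookup_join(conn, small_rows, batch_size=500):
    """Inner-join small_rows [(key, payload)] against the on-disk lookup table.

    Only the keys present in the small table are fetched, in batches, so the
    large table is never scanned or loaded in full.
    """
    joined = []
    for start in range(0, len(small_rows), batch_size):
        batch = small_rows[start:start + batch_size]
        keys = [key for key, _ in batch]
        placeholders = ",".join("?" * len(keys))
        found = dict(
            conn.execute(
                f"SELECT key, value FROM lookup WHERE key IN ({placeholders})",
                keys,
            )
        )
        # Keys missing from the lookup table are dropped (inner join).
        joined.extend(
            (key, payload, found[key]) for key, payload in batch if key in found
        )
    return joined


# Hypothetical data: a 10,000-row "large" table and a 3-row "small" table.
path = os.path.join(tempfile.mkdtemp(), "lookup.db")
conn = build_lookup(path, ((i, f"value-{i}") for i in range(10_000)))
small = [(3, "a"), (42, "b"), (99_999, "missing")]
result = lookup_join(conn, small)
print(result)
```

The cost of the join here scales with the size of the small table rather
than the large one. As noted in the thread, the initial load of the large
table into the store is still likely to take some time.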