Posted to user@spark.apache.org by Gonzalo Zarza <go...@globant.com> on 2014/07/11 22:35:38 UTC

Spark Questions

Hi all,

We've been evaluating Spark for a long-term project. Although we've been
reading several threads in the forum, any hints on the following topics would
be extremely welcome:

1. What data partitioning strategies are available in Spark? How
configurable are these strategies?

2. What would be the best way to use Spark if queries touch only 3-5
entries/records? Which strategy is best if we want to perform a full
scan of the entries?

3. Is Spark capable of interacting with an RDBMS?

Thanks a lot!

Best regards,

--
*Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist |
*GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494
 | [image: Facebook] <https://www.facebook.com/Globant> [image: Twitter]
<http://www.twitter.com/globant> [image: Youtube]
<http://www.youtube.com/Globant> [image: Linkedin]
<http://www.linkedin.com/company/globant> [image: Pinterest]
<http://pinterest.com/globant/> [image: Globant] <http://www.globant.com/>

Re: Spark Questions

Posted by Gonzalo Zarza <go...@globant.com>.
Thanks for your answers Shuo Xiang and Aaron Davidson!

Regards,




On Sat, Jul 12, 2014 at 9:02 PM, Aaron Davidson <il...@gmail.com> wrote:

> [quoted reply trimmed]

Re: Spark Questions

Posted by Aaron Davidson <il...@gmail.com>.
I am not entirely certain I understand your questions, but let me assume
you are mostly interested in SparkSQL and are thinking about your problem
in terms of SQL-like tables.

1. Shuo Xiang mentioned Spark partitioning strategies, but in case you are
talking about data partitioning or sharding as exists in Hive, SparkSQL does
not currently support this, though it is on the roadmap. We can read from
partitioned Hive tables, however.

2. If by entries/record you mean something like columns/row, SparkSQL does
allow you to project out the columns you want, or select all columns. The
efficiency of such a projection is determined by how the data is
stored, however: if your data is stored in an inherently row-based format,
this projection will be no faster than doing an initial map() over the data
to select only the desired columns. If it's stored in something like
Parquet, or cached in memory, however, we would avoid ever looking at the
unused columns.
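The row-based vs. columnar distinction can be sketched in plain Python (a
toy illustration of the access patterns, not Spark code): with row-oriented
storage, projecting two columns still touches every full record, whereas
column-oriented storage lets you read only the arrays you need.

```python
# Toy illustration of row-based vs. column-based storage (not Spark code).

# Row-based: a list of records; projecting columns touches every full record.
rows = [
    {"id": 1, "name": "a", "score": 10.0},
    {"id": 2, "name": "b", "score": 20.0},
    {"id": 3, "name": "c", "score": 30.0},
]
projected = [(r["id"], r["score"]) for r in rows]  # like a map() over the data

# Column-based: one array per column; the unused "name" column is never read.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "score": [10.0, 20.0, 30.0],
}
projected_columnar = list(zip(columns["id"], columns["score"]))

assert projected == projected_columnar  # same result, different access pattern
```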

3. Spark has a very generalized data source API, so it is capable of
interacting with virtually any data source. However, I don't think we
currently have any SparkSQL connectors to RDBMSes that would support column
pruning or other push-downs. All of this is very much viable, though.
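As a minimal, Spark-free sketch of what "interacting with an RDBMS" looks
like, here is Python's built-in sqlite3 module standing in for a real
database and JDBC connector (the table and column names are made up for the
example); note that without push-down support, column pruning has to be done
by hand in the SELECT list:

```python
import sqlite3

# In-memory database standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)],
)

# "Column pruning" done explicitly: select only the columns you need,
# then hand the rows to whatever distributed processing comes next.
pruned = conn.execute("SELECT id, score FROM users WHERE score > 15").fetchall()
print(pruned)  # [(2, 20.0), (3, 30.0)]
conn.close()
```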


On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza <go...@globant.com>
wrote:

> [quoted original message trimmed]

Re: Spark Questions

Posted by Shuo Xiang <sh...@gmail.com>.
For your first question, the partitioning strategy can be tuned by applying
a different partitioner. You can use existing ones such as HashPartitioner or
write your own. See this link (
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf)
for some instructions.
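Conceptually, a partitioner is just a function from key to partition index.
The sketch below shows the idea behind HashPartitioner and a custom
range-based partitioner in plain Python (hypothetical helper names, not
Spark's API):

```python
# Toy sketch of hash vs. custom (range) partitioning; not Spark's API.

def hash_partition(key, num_partitions):
    """What HashPartitioner does conceptually: key -> hash(key) mod n."""
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    """A custom partitioner: route keys to partitions by range boundaries."""
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

pairs = [(k, k * k) for k in range(10)]

# Group the pairs into partitions using the range partitioner with
# boundaries [3, 7], i.e. keys <3, 3..6, and >=7 land in partitions 0, 1, 2.
partitions = {}
for key, value in pairs:
    partitions.setdefault(range_partition(key, [3, 7]), []).append((key, value))

print(sorted(partitions))  # [0, 1, 2]
```

Co-partitioning related datasets with the same partitioner is what lets
Spark avoid shuffles on subsequent joins over those keys.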



On Fri, Jul 11, 2014 at 1:35 PM, Gonzalo Zarza <go...@globant.com>
wrote:

> [quoted original message trimmed]