Posted to dev@spark.apache.org by Pedro Rodriguez <sk...@gmail.com> on 2016/06/18 04:13:42 UTC

Spark 2.0 Dataset Documentation

Hi All,

At my workplace we are starting to use Datasets in 1.6.1, and even more with
Spark 2.0, in place of DataFrames. I looked at the 1.6.1 documentation, then
the 2.0 documentation, and it looks like not much time has been spent
writing a Dataset guide/tutorial.

Preview Docs:
https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
Spark master docs:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md

I would like to spend the time to contribute an improvement to those docs
with more in-depth examples of creating and using Datasets (e.g. using $ to
select columns). Is this of value, and if so, what should my next step be to
get this going (create a JIRA, etc.)?

-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
R&D Data Science Intern at Oracle Data Cloud
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: Spark 2.0 Dataset Documentation

Posted by Reynold Xin <rx...@databricks.com>.
Please go for it!

On Friday, June 17, 2016, Pedro Rodriguez <sk...@gmail.com> wrote:

> I would be open to working on Dataset documentation if no one else is
> already working on it. Thoughts?

Re: Spark 2.0 Dataset Documentation

Posted by Pedro Rodriguez <sk...@gmail.com>.
I would be open to working on Dataset documentation if no one else is
already working on it. Thoughts?

On Fri, Jun 17, 2016 at 11:44 PM, Cheng Lian <li...@gmail.com> wrote:

> As mentioned in the PR description, this is just an initial PR to bring
> existing contents up to date, so that people can add more contents
> incrementally.
>
> We should definitely cover more about Dataset.
>
>
> Cheng



Re: Spark 2.0 Dataset Documentation

Posted by Cheng Lian <li...@gmail.com>.
As mentioned in the PR description, this is just an initial PR to bring 
existing contents up to date, so that people can add more contents 
incrementally.

We should definitely cover more about Dataset.


Cheng


On 6/17/16 10:28 PM, Pedro Rodriguez wrote:
> The updates look great!
>
> Looks like many places are updated to the new APIs, but there still
> isn't a section for working with Datasets (most of the docs work with
> DataFrames). Are you planning on adding more? I am thinking of something
> that would address common questions like the one I posted on the user
> email list earlier today.
>
> Should I take the discussion to your PR?
>
> Pedro


Re: Spark 2.0 Dataset Documentation

Posted by Pedro Rodriguez <sk...@gmail.com>.
The updates look great!

Looks like many places are updated to the new APIs, but there still isn't a
section for working with Datasets (most of the docs work with DataFrames).
Are you planning on adding more? I am thinking of something that would address
common questions like the one I posted on the user email list earlier today.

Should I take the discussion to your PR?

Pedro

On Fri, Jun 17, 2016 at 11:12 PM, Cheng Lian <li...@gmail.com> wrote:

> Hey Pedro,
>
> SQL programming guide is being updated. Here's the PR, but not merged yet:
> https://github.com/apache/spark/pull/13592
>
> Cheng



Re: Spark 2.0 Dataset Documentation

Posted by Cheng Lian <li...@gmail.com>.
Hey Pedro,

SQL programming guide is being updated. Here's the PR, but not merged 
yet: https://github.com/apache/spark/pull/13592

Cheng

On 6/17/16 9:13 PM, Pedro Rodriguez wrote:
> Hi All,
>
> At my workplace we are starting to use Datasets in 1.6.1, and even more
> with Spark 2.0, in place of DataFrames. I looked at the 1.6.1
> documentation, then the 2.0 documentation, and it looks like not much
> time has been spent writing a Dataset guide/tutorial.
>
> Preview Docs:
> https://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/sql-programming-guide.html#creating-datasets
> Spark master docs:
> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>
> I would like to spend the time to contribute an improvement to those
> docs with more in-depth examples of creating and using Datasets (e.g.
> using $ to select columns). Is this of value, and if so, what should my
> next step be to get this going (create a JIRA, etc.)?
>
> -- 
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> R&D Data Science Intern at Oracle Data Cloud
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>


Re: Spark 2.0 Dataset Documentation

Posted by Pedro Rodriguez <sk...@gmail.com>.
Going to go ahead and start working on the docs, assuming this gets
merged: https://github.com/apache/spark/pull/13592. Opened a JIRA:
https://issues.apache.org/jira/browse/SPARK-16046

Having some issues building docs. The Java docs fail to build. Output when
it fails is here:
https://gist.github.com/EntilZha/9c585662ef7cda820c311d1c7eb16e42

This might be causing an issue where loading the API docs fails due to some
JavaScript errors (it doesn't seem to switch pages correctly). The main one,
repeated several times, is:
main.js:2 Uncaught SyntaxError: Unexpected token <

Pedro

On Sat, Jun 18, 2016 at 6:03 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Sat, Jun 18, 2016 at 6:13 AM, Pedro Rodriguez
> <sk...@gmail.com> wrote:
>
> > using Datasets (eg using $ to select columns).
>
> Or even my favourite one - the tick ` :-)
>
> Jacek
>




Re: Spark 2.0 Dataset Documentation

Posted by Jacek Laskowski <ja...@japila.pl>.
On Sat, Jun 18, 2016 at 6:13 AM, Pedro Rodriguez
<sk...@gmail.com> wrote:

> using Datasets (eg using $ to select columns).

Or even my favourite one - the tick ` :-)

Jacek
