Posted to reviews@spark.apache.org by dwmclary <gi...@git.apache.org> on 2015/02/06 09:12:27 UTC

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

GitHub user dwmclary opened a pull request:

    https://github.com/apache/spark/pull/4421

    Spark-2789: Apply names to RDD to create DataFrame

    This seemed like a reasonably useful function to add to Spark SQL.  However, unlike the [JIRA](https://issues.apache.org/jira/browse/SPARK-2789), this implementation does not parse type characters (e.g., brackets and braces).  Instead, it creates a DataFrame whose column names map onto the existing types in the RDD.  This seems the more broadly useful behavior, since users likely want a quick way to apply names to existing collections.
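
    A minimal sketch of the intended call pattern, adapted from the doctest
    quoted later in this thread (it assumes the PySpark shell's sc and sqlCtx
    bindings; the reserved-word check on column names is elided here):

        # Build an RDD of equal-length rows: [int, str, bool].
        rdd = sc.parallelize(["1,A1,true", "2,B2,false"]) \
                .map(lambda s: s.split(",")) \
                .map(lambda x: [int(x[0]), x[1], x[2] == "true"])
        # applyNames pairs each position with a column name and infers types.
        df = sqlCtx.applyNames("a b c", rdd)
        df.registerTempTable("t")
        sqlCtx.sql("select a from t").collect()  # [Row(a=1), Row(a=2)]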
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dwmclary/spark SPARK-2789

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4421
    
----
commit df8b01528519ebe0c480daedcc5099306e690a5e
Author: Dan McClary <da...@gmail.com>
Date:   2015-02-05T18:56:14Z

    basic apply names functionality

commit 15eb351e2a1c43191193bca768607cc56ce3aede
Author: Dan McClary <da...@gmail.com>
Date:   2015-02-05T23:31:04Z

    working for map type

commit aa38d7618a9cd069f73cf8673bfdef4ecc0fe339
Author: Dan McClary <da...@gmail.com>
Date:   2015-02-06T02:43:30Z

    added array and list types, struct types don't seem relevant

commit 29d8ffa58b6faa9f20b9c36b5afe649d523e2eb8
Author: Dan McClary <da...@gmail.com>
Date:   2015-02-06T05:14:34Z

    added applyNames to pyspark

commit 8c773b372c122c4b90f375933e83816ec99ace1d
Author: Dan McClary <da...@gmail.com>
Date:   2015-02-06T07:41:24Z

    added pyspark method and tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73625791
  
    Types need to exist, but names don't. They can just be random column names like _1, _2, _3. 
    
    In Scala, if you import sqlContext.implicits._, then any RDD[Product] (which includes RDD of case classes and RDD of tuples) can be implicitly turned into a DataFrame.
    
    In Python, I think we can add an explicit method that turns an RDD of tuple into a DataFrame, if that doesn't exist yet.
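
    (For context, a sketch of that explicit Python path, assuming the
    createDataFrame API that @davies' follow-up PR, mentioned later in this
    thread, eventually added; the name and signature are an assumption here,
    not part of this patch:)

        rdd = sc.parallelize([(1, "a"), (2, "b")])
        # No names supplied: columns get default names _1, _2.
        df1 = sqlCtx.createDataFrame(rdd)
        # Names supplied explicitly as a list.
        df2 = sqlCtx.createDataFrame(rdd, ["id", "label"])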




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4421#discussion_r24253601
  
    --- Diff: python/pyspark/sql.py ---
    @@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema):
             df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
             return DataFrame(df, self)
     
    +    def applyNames(self, nameString, plainRdd):
    +        """
    +        Builds a DataFrame from an RDD based on column names.
    +
    +        Assumes RDD contains iterables of equal length.
    +        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
    +        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
    +        >>> df1 = sqlCtx.applyNames("a b c", input)
    +        >>> df1.registerTempTable("df1")
    +        >>> sqlCtx.sql("select a from df1").collect()
    +        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
    +        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k":int(x[0]), "v":2*int(x[0])}, x])
    +        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
    +        >>> df2.registerTempTable("df2")
    +        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
    +        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
    +        >>> sqlCtx.sql("select b, e[1] from df2").collect()
    +        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
    +        """
    +        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
    +        reservedWords = set(map(string.lower,["ABS","ALL","AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
    --- End diff --
    
    Seems like a reasonable request to me.  I couldn't decide if it was better
    to have to pickle and ship a list of words or just to have it instantiated
    in both places.
    
    On Fri, Feb 6, 2015 at 7:31 AM, Josh Rosen <no...@github.com> wrote:
    
    > In python/pyspark/sql.py
    > <https://github.com/apache/spark/pull/4421#discussion_r24247234>:
    >
    > > +        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
    > > +        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
    > > +        >>> df1 = sqlCtx.applyNames("a b c", input)
    > > +        >>> df1.registerTempTable("df1")
    > > +        >>> sqlCtx.sql("select a from df1").collect()
    > > +        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
    > > +        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k":int(x[0]), "v":2*int(x[0])}, x])
    > > +        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
    > > +        >>> df2.registerTempTable("df2")
    > > +        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
    > > +        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
    > > +        >>> sqlCtx.sql("select b, e[1] from df2").collect()
    > > +        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
    > > +        """
    > > +        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
    > > +        reservedWords = set(map(string.lower,["ABS","ALL","AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
    >
    > I can't really speak to this patch in general, since I don't know much
    > about this part of Spark SQL, but to avoid duplication it probably makes
    > sense to keep the list of reserved words in the JVM and fetch it into
    > Python from there.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421/files#r24247234>.
    >





[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73771478
  
    So, we'll allow a column named SELECT regardless of whether it's been
    called out as `SELECT`?  It just seems to me that it invites a lot of
    potentially erroneous user behavior at DDL time.
    
    On Tue, Feb 10, 2015 at 11:53 AM, Michael Armbrust <notifications@github.com
    > wrote:
    
    > Why do you need to check reserved words. In SQL you can use backticks to
    > access columns that are named after reserved words.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421#issuecomment-73770871>.
    >





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73623236
  
    Reynold,
    
      It is similar, but I think the distinction here is that toDataFrame
    appears to require that old names (and a schema) exist.  Or, at least
    that's what DataFrameImpl.scala suggests:
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala,
    line 93.
    
      I think there's a benefit to having a quick way to get a DataFrame from a
    "plain" RDD.  If we don't want to do @davies' applyNames idea, then maybe we
    can change the behavior of toDataFrame.
    
    Cheers,
    Dan
    
    On Mon, Feb 9, 2015 at 4:33 PM, Reynold Xin <no...@github.com>
    wrote:
    
    > In particular, I'm talking about
    > https://github.com/apache/spark/blob/68b25cf695e0fce9e465288d5a053e540a3fccb4/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L105
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421#issuecomment-73621532>.
    >





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73632890
  
    Sounds like a plan -- I'll do it on top of #4479.
    
    Thought: I've added a getReservedWords private method to SQLContext.scala.
    I feel like leaving that there isn't a bad idea: other methods may need to
    check reserved words in the future.
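
    (A hedged sketch of the Python side of that arrangement, assuming
    getReservedWords is reachable through the py4j gateway the way
    self._ssql_ctx is used elsewhere in sql.py, and that it returns a Java
    collection of strings:)

        # Fetch the reserved-word list from the JVM once, rather than
        # duplicating the list in Python.
        reserved = set(w.lower() for w in self._ssql_ctx.getReservedWords())
        clashes = [f for f in fieldNames if f.lower() in reserved]
        if clashes:
            raise ValueError("reserved words used as column names: %s" % clashes)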
    
    On Mon, Feb 9, 2015 at 5:27 PM, Reynold Xin <no...@github.com>
    wrote:
    
    > Adding toDataFrame to Python DataFrame is a great idea. You can do it in
    > this PR if you want (make sure you update the title).
    >
    > Also - you might want to do it on top of #4479
    > <https://github.com/apache/spark/pull/4479> otherwise it will conflict.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421#issuecomment-73627424>.
    >





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73325875
  
    Updated to keep reserved words in the JVM.




[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73772116
  
    Believe it or not that is valid SQL ...





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73621476
  
    @dwmclary thanks for submitting this. I think this is similar to the toDataFrame method that supports renaming, isn't it?





[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73770007
  
    OK, I've updated this to use #4479 as a reference.  One thing we may want to take from this PR is that toDataFrame and createDataFrame absolutely need to check reserved words in column names.  I've added that behavior in Scala and in the DataFrame suite.
    
    Perhaps I should just open a new PR with the reserved words checking?
    
    I'll take a look at @davies PR when it shows up.




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73631311
  
    I just talked to @davies offline. He is going to submit a PR that adds createDataFrame with named columns. I think we can roll this into that one and close this PR. It would be great, @dwmclary, if you could take a look once that is submitted.




[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73783731
  
    I've been thinking of it as equivalent to a CREATE TABLE, in which case I
    think it's dialect-specific.  Perhaps ANSI and pgSQL allow it, but, for
    example, Oracle disallows:
    
    SQL> create table dumb_name (select varchar2(10), from varchar2(10));
    create table dumb_name (select varchar2(10), from varchar2(10))
                            *
    ERROR at line 1:
    ORA-00904: : invalid identifier
    
    
    SQL> create table dumb_name ("select" varchar2(10), "from" varchar2(10));
    
    Table created.
    
    
    Either way, I'm fine to just close out this PR.  We should close SPARK-2789
    too.
    
    Cheers,
    Dan
    
    
    
    On Tue, Feb 10, 2015 at 11:59 AM, Reynold Xin <no...@github.com>
    wrote:
    
    > Believe it or not that is valid SQL ...
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421#issuecomment-73772116>.
    >





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73626452
  
    Ah, yes, I see that now.
    
    Python doesn't seem to have a toDataFrame, so maybe the logical thing to do
    here is to just open a new PR with a Python implementation of toDataFrame --
    it'd take a little from my current PR and then call into the Scala method.
    
    What do you think?
    
    On Mon, Feb 9, 2015 at 5:12 PM, Reynold Xin <no...@github.com>
    wrote:
    
    > Types need to exist, but names don't. They can just be random column names
    > like _1, _2, _3.
    >
    > In Scala, if you import sqlContext.implicits._, then any RDD[Product]
    > (which includes RDD of case classes and RDD of tuples)
    > can be implicitly turned into a DataFrame.
    >
    > In Python, I think we can add an explicit method that turns an RDD of tuple
    > into a DataFrame, if that doesn't exist yet.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4421#issuecomment-73625791>.
    >





[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4421#discussion_r24247234
  
    --- Diff: python/pyspark/sql.py ---
    @@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema):
             df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
             return DataFrame(df, self)
     
    +    def applyNames(self, nameString, plainRdd):
    +        """
    +        Builds a DataFrame from an RDD based on column names.
    +
    +        Assumes RDD contains iterables of equal length.
    +        >>> unparsedStrings = sc.parallelize(["1, A1, true", "2, B2, false", "3, C3, true", "4, D4, false"])
    +        >>> input = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2])])
    +        >>> df1 = sqlCtx.applyNames("a b c", input)
    +        >>> df1.registerTempTable("df1")
    +        >>> sqlCtx.sql("select a from df1").collect()
    +        [Row(a=1), Row(a=2), Row(a=3), Row(a=4)]
    +        >>> input2 = unparsedStrings.map(lambda x: x.split(",")).map(lambda x: [int(x[0]), x[1], bool(x[2]), {"k":int(x[0]), "v":2*int(x[0])}, x])
    +        >>> df2 = sqlCtx.applyNames("a b c d e", input2)
    +        >>> df2.registerTempTable("df2")
    +        >>> sqlCtx.sql("select d['k']+d['v'] from df2").collect()
    +        [Row(c0=3), Row(c0=6), Row(c0=9), Row(c0=12)]
    +        >>> sqlCtx.sql("select b, e[1] from df2").collect()
    +        [Row(b=u' A1', c1=u' A1'), Row(b=u' B2', c1=u' B2'), Row(b=u' C3', c1=u' C3'), Row(b=u' D4', c1=u' D4')]
    +        """
    +        fieldNames = [f for f in re.split("( |\\\".*?\\\"|'.*?')", nameString) if f.strip()]
    +        reservedWords = set(map(string.lower,["ABS","ALL","AND", "APPROXIMATE", "AS", "ASC", "AVG", "BETWEEN", "BY", \
    --- End diff --
    
    I can't really speak to this patch in general, since I don't know much about this part of Spark SQL, but to avoid duplication it probably makes sense to keep the list of reserved words in the JVM and fetch it into Python from there.




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73621532
  
    In particular, I'm talking about https://github.com/apache/spark/blob/68b25cf695e0fce9e465288d5a053e540a3fccb4/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L105




[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73770871
  
    Why do you need to check reserved words.  In SQL you can use backticks to access columns that are named after reserved words.
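
    (A small illustration of the backtick escape, assuming the
    createDataFrame(rdd, names) form discussed in this thread; the table name
    t is made up:)

        # Columns named after reserved words are legal at creation time...
        df = sqlCtx.createDataFrame(sc.parallelize([(1, 2)]), ["select", "from"])
        df.registerTempTable("t")
        # ...and backticks let them be referenced in Spark SQL queries.
        sqlCtx.sql("SELECT `select`, `from` FROM t").collect()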




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73627424
  
    Adding toDataFrame to Python DataFrame is a great idea. You can do it in this PR if you want (make sure you update the title). 
    
    Also - you might want to do it on top of https://github.com/apache/spark/pull/4479 otherwise it will conflict.




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73626542
  
    Or, I guess I can just do it in this PR if you don't mind it changing a
    bunch.
    
    On Mon, Feb 9, 2015 at 5:18 PM, Dan McClary <da...@gmail.com> wrote:
    
    > Ah, yes, I see that now.
    >
    > Python doesn't seem to have a toDataFrame, so maybe the logical thing to
    > do here is to just do a new PR with a Python implementation of toDataFrame
    > -- it'd be a little bit from my current PR and then call into the Scala
    > method.
    >
    > What do you think?
    >
    > On Mon, Feb 9, 2015 at 5:12 PM, Reynold Xin <no...@github.com>
    > wrote:
    >
    >> Types need to exist, but names don't. They can just be random column
    >> names like _1, _2, _3.
    >>
    >> In Scala, if you import sqlContext.implicits._, then any RDD[Product]
    >> (which includes RDD of case classes and RDD of tuples)
    >> can be implicitly turned into a DataFrame.
    >>
    >> In Python, I think we can add an explicit method that turns an RDD of
    >> tuple into a DataFrame, if that doesn't exist yet.
    >>
    >> —
    >> Reply to this email directly or view it on GitHub
    >> <https://github.com/apache/spark/pull/4421#issuecomment-73625791>.
    >>
    >
    >





[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by dwmclary <gi...@git.apache.org>.
Github user dwmclary closed the pull request at:

    https://github.com/apache/spark/pull/4421




[GitHub] spark pull request: Add toDataFrame to PySpark SQL

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73771328
  
    @dwmclary It's almost ready: https://github.com/apache/spark/pull/4498




[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4421#issuecomment-73201347
  
    Can one of the admins verify this patch?

