You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Yi Zhou (JIRA)" <ji...@apache.org> on 2015/09/07 03:49:45 UTC

[jira] [Commented] (SPARK-10310) [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimiter

    [ https://issues.apache.org/jira/browse/SPARK-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733195#comment-14733195 ] 

Yi Zhou commented on SPARK-10310:
---------------------------------

Hi [~marmbrus]
Could you please help to review and evaluate this critical Spark SQL issue to see if it can be fixed in Spark 1.5.0 (I saw the code is ready) ?  The issue caused to fail to extract correct record due to missing line/filed delimiter and it blocked the conformance validation. Thanks in advance !

> [Spark SQL] All result records will be popluated into ONE line during the script transform due to missing the correct line/filed delimiter
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10310
>                 URL: https://issues.apache.org/jira/browse/SPARK-10310
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yi Zhou
>            Priority: Critical
>
> There is real case using python stream script in Spark SQL query. We found that all result records were wroten in ONE line as input from "select" pipeline for python script and so it caused script will not identify each record.Other, filed separator in spark sql will be '^A' or '\001' which is inconsistent/incompatible the '\t' in Hive implementation.
> Key query:
> {code:sql}
> CREATE VIEW temp1 AS
> SELECT *
> FROM
> (
>   FROM
>   (
>     SELECT
>       c.wcs_user_sk,
>       w.wp_type,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams c, web_page w
>     WHERE c.wcs_web_page_sk = w.wp_web_page_sk
>     AND   c.wcs_web_page_sk IS NOT NULL
>     AND   c.wcs_user_sk     IS NOT NULL
>     AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
>     DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wp_type
>   USING 'python sessionize.py 3600'
>   AS (
>     wp_type STRING,
>     tstamp BIGINT, 
>     sessionid STRING)
> ) sessionized
> {code}
> Key Python script:
> {noformat}
> for line in sys.stdin:
>      user_sk,  tstamp_str, value  = line.strip().split("\t")
> {noformat}
> Sample SELECT result:
> {noformat}
> ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
> {noformat}
> Expected result:
> {noformat}
> 31   3237764860   feedback
> 31   3237769106   dynamic
> 31   3237779027   review
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org