Posted to user@spark.apache.org by Kevin Jung <it...@samsung.com> on 2015/08/19 08:03:15 UTC

SaveAsTable changes the order of rows

I have a simple RDD with Key/Value and do

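import org.apache.spark.HashPartitioner

val partitioned = rdd.partitionBy(new HashPartitioner(400))  // hash-partition by key into 400 partitions
val row = partitioned.first  // first element of the first non-empty partition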

The returned row has the key "G2726". This first row sits in partition 0 because "G2726".hashCode is 67114000 and 67114000 % 400 is 0. However, the order of keys changes when I save the RDD as a table with saveAsTable. After saving and then calling sqlContext.table, the key of the first row is "G265". Does the DataFrame forget its parent's partitioner, or does the Parquet format always rearrange the order of the original data? In my case the order is not important, but some users may want to keep their keys ordered.
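
The partition arithmetic above can be checked without a cluster. The following is a minimal sketch in plain Scala, assuming the partition index is the key's hashCode reduced by a non-negative modulo, which is how Spark's HashPartitioner is understood to behave. The names PartitionCheck, nonNegativeMod and partitionFor are made up for this illustration and are not part of the Spark API.

```scala
object PartitionCheck {
  // Non-negative modulo: keeps the result in [0, mod) even for negative hash codes.
  // Assumption: this mirrors what HashPartitioner does with key.hashCode.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical helper: partition index a key would get with numPartitions partitions.
  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    println("G2726".hashCode)            // 67114000
    println(partitionFor("G2726", 400))  // 0
    println(partitionFor("G265", 400))   // 138
  }
}
```

Under this rule "G265" would not land in partition 0, so the first row read back from the table appears to come from a different partition than the first row of the partitioned RDD.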

Kevin
