Posted to user@spark.apache.org by Николай Ижиков <ni...@gmail.com> on 2017/11/28 15:40:12 UTC

Spark Data Frame. PreSorted partitions

Hello, guys!

I am working on an implementation of a custom DataSource for the Spark Data Frame API and have a question:

For a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside each partition in my data source.

Is there a built-in option to tell Spark that the data in each partition is already sorted?

It seems that Spark could benefit from already-sorted partitions, for example by using a distributed merge-sort algorithm.

Does this make sense to you?
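The idea can be sketched without Spark at all: if each partition is already sorted, a global order only needs a k-way merge rather than a full re-sort. A minimal Python sketch (the partition contents are made up for illustration):

```python
import heapq

# Three "partitions", each already sorted by the data source --
# stand-ins for per-partition output under `ORDER BY some_column`.
partitions = [
    [1, 4, 9],
    [2, 3, 10],
    [5, 6, 7],
]

# A k-way merge of n rows over k sorted runs is O(n log k) and never
# builds an unsorted intermediate; that is the saving a source could
# unlock by telling the engine its partitions are pre-sorted.
globally_sorted = list(heapq.merge(*partitions))
print(globally_sorted)  # [1, 2, 3, 4, 5, 6, 7, 9, 10]
```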

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark Data Frame. PreSorted partitions

Posted by Jörn Franke <jo...@gmail.com>.
Well, usually you sort only on a certain column and not on all columns, so most of the columns will always be unsorted; Spark may then still need to sort if you, for example, join (for some joins) on an unsorted column.

That being said, depending on the data you may not want to sort it, but rather cluster different column values together so that they are close to each other. Maybe this clustering information could also be part of the Data Source API V2.
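Jörn's first point can be made concrete: rows sorted on one column are, in general, unsorted on every other column, so a sort-based operation (e.g. a sort-merge join) on another column gains nothing from the reported order. A tiny illustration with toy data:

```python
# Rows sorted by the first field (think "some_column") ...
rows = [(1, "x"), (2, "q"), (3, "a")]

first = [r[0] for r in rows]
second = [r[1] for r in rows]

print(first == sorted(first))    # True  -> sorted on the first column
print(second == sorted(second))  # False -> unsorted on the second, so a
                                 # sort-merge join on it would still need
                                 # its own sort
```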


Re: Spark Data Frame. PreSorted partitions

Posted by Li Jin <ic...@gmail.com>.
Sorry, s/ordered distributed/ordered distribution/g


Re: Spark Data Frame. PreSorted partitions

Posted by Li Jin <ic...@gmail.com>.
Just to give another data point: most of the data we use with Spark are
sorted on disk, having a way to allow data source to pass ordered
distributed to DataFrames is really useful for us.


Re: Spark Data Frame. PreSorted partitions

Posted by Wenchen Fan <cl...@gmail.com>.
Data Source V2 is still under development. Ordering reporting is one of the planned features, but it is not done yet; we are still thinking about what the API should be, e.g. we need to include sort order, nulls first/last, and other sorting-related properties.
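To make the design space concrete, the properties Wenchen lists (direction, null placement, per column) could be carried by a small descriptor. Everything below — names, fields, the `satisfies` helper — is hypothetical illustration, not the actual Data Source V2 API:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class SortOrderSpec:
    """Hypothetical ordering descriptor a data source could report:
    column name, sort direction, and null placement."""
    column: str
    ascending: bool = True
    nulls_first: bool = True

def satisfies(values: Iterable[Optional[int]], spec: SortOrderSpec) -> bool:
    """Check whether a partition's column values honour the spec."""
    def rank(v):
        if v is None:
            # Nulls sort before everything (rank 0) or after (rank 2).
            return (0 if spec.nulls_first else 2, 0)
        return (1, v if spec.ascending else -v)
    ranked = [rank(v) for v in values]
    return ranked == sorted(ranked)

spec = SortOrderSpec("some_column", ascending=True, nulls_first=True)
print(satisfies([None, 1, 2, 5], spec))   # True
print(satisfies([1, None, 2, 5], spec))   # False -- null not first
```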

>

Re: Spark Data Frame. PreSorted partitions

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, guys.

Thank you for the answers!

 > I think pushing down a sort .... could make a big difference.
 > You can however proposes to the data source api 2 to be included.

Jörn, are you talking about this JIRA issue? - https://issues.apache.org/jira/browse/SPARK-15689
Is there any additional documentation I have to read before making a proposal?





Re: Spark Data Frame. PreSorted partitions

Posted by Holden Karau <ho...@pigscanfly.ca>.
I think pushing down a sort (or, really, the case where the data is already naturally returned in sorted order on some column) could make a big difference. Probably the simplest argument that a lot of time is spent sorting (in some use cases) is the fact that it's still one of the standard benchmarks.
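Why reporting order helps a planner can be sketched in a few lines. This is a toy model with made-up names, not Spark's optimizer: a physical sort step is only emitted when the source does not already guarantee the requested order.

```python
def plan_order_by(source_sorted_on, order_by_column):
    """Hypothetical planner: choose physical steps for an ORDER BY,
    skipping the full sort when the source reports matching order."""
    steps = ["scan"]
    if source_sorted_on == order_by_column:
        # Pre-sorted partitions only need a k-way merge, not a sort.
        steps.append("merge_sorted_partitions")
    else:
        steps.append(f"sort({order_by_column})")
    return steps

print(plan_order_by("some_column", "some_column"))
# ['scan', 'merge_sorted_partitions']
print(plan_order_by(None, "some_column"))
# ['scan', 'sort(some_column)']
```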


-- 
Twitter: https://twitter.com/holdenkarau

Re: Spark Data Frame. PreSorted partitions

Posted by Jörn Franke <jo...@gmail.com>.
I do not think that the Data Source API exposes such a thing. You can, however, propose it for inclusion in the Data Source API 2.

However, there are some caveats, because "sorted" can mean two different things (weak vs. strict order).

Then, is a lot of time really lost because of sorting? The best thing is to not read data that is not needed at all (see min/max indexes in ORC/Parquet, or bloom filters in ORC). What is not read does not need to be sorted. See also predicate pushdown.
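The weak-vs-strict caveat can be shown with toy data: sorting on a single key yields a weak order in which "equal" rows keep an arbitrary (here: input) relative position, while a total (strict) order breaks every tie deterministically. A source and an engine must agree on which guarantee is meant:

```python
rows = [("b", 2), ("a", 1), ("b", 1)]

# Weak order: sorted only on the first field; the two ("b", _) rows
# compare equal under this order, so their relative position is just
# whatever the stable sort preserved from the input.
weak = sorted(rows, key=lambda r: r[0])
print(weak)    # [('a', 1), ('b', 2), ('b', 1)]

# Strict (total) order: ties on the first field are broken by the
# second, so every pair of distinct rows is ordered.
strict = sorted(rows, key=lambda r: (r[0], r[1]))
print(strict)  # [('a', 1), ('b', 1), ('b', 2)]
```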



Spark Data Frame. PreSorted partitions

Posted by Николай Ижиков <ni...@gmail.com>.
Cross-posting from @user.

Hello, guys!

I am working on an implementation of a custom DataSource for the Spark Data Frame API and have a question:

For a `SELECT * FROM table1 ORDER BY some_column` query, I can sort the data inside each partition in my data source.

Is there a built-in option to tell Spark that the data in each partition is already sorted?

It seems that Spark could benefit from already-sorted partitions, for example by using a distributed merge-sort algorithm.

Does this make sense to you?




Re: Spark Data Frame. PreSorted partitions

Posted by Michael Artz <mi...@gmail.com>.
I'm not sure, other than retrieving from a Hive table that is already sorted. This sounds cool, though; I would be interested to know this as well.
