You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Moein Hosseini <mo...@gmail.com> on 2019/02/02 06:12:10 UTC

Feature request: split dataset based on condition

I've seen many application need to split dataset to multiple datasets based
on some conditions. As there is no method to do it in one place, developers
use *filter *method multiple times. I think it can be useful to have method
to split dataset based on condition in one iteration, something like
*partition* method of scala (of-course scala partition just split list into
two list, but something more general can be more useful).
If you think it can be helpful, I can create Jira issue and work on it to
send PR.

Best Regards
Moein

-- 

Moein Hosseini
Data Engineer
mobile: +98 912 468 1859 <+98+912+468+1859>
site: www.moein.xyz
email: moein7tl@gmail.com
[image: linkedin] <https://www.linkedin.com/in/moeinhm>
[image: twitter] <https://twitter.com/moein7tl>

Re: Feature request: split dataset based on condition

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Just wondering if this is what you are implying Ryan (example only):

val data = (dataset to be partitionned)

val splitCondition =
s"""
CASE
           WHEN …. THEN ….
           WHEN …. THEN …..
  END partition_condition
"""
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

In this case there might be a need to cache/persist the partitionedData dataset to avoid recomputation as each "partition" is processed (e.g. saved, etc.) later on, correct?

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: <rb...@netflix.com>
Date: Monday, February 4, 2019 at 12:16 PM
To: Andrew Melo <an...@gmail.com>
Cc: Moein Hosseini <mo...@gmail.com>, dev <de...@spark.apache.org>
Subject: Re: Feature request: split dataset based on condition

To partition by a condition, you would need to create a column with the result of that condition. Then you would partition by that column. The sort option would also work here.

I don't think that there is much of a use case for this. You have a set of conditions on which to partition your data, and partitioning is already supported. The idea to use conditions to create separate data frames would actually make that harder because you'd need to create and name tables for each one.

On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <an...@gmail.com>> wrote:
Hello Ryan,

On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com>> wrote:
>
> Andrew, can you give us more information about why partitioning the output data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and B, then you would automatically get the divisions you want. If what you're looking for is a way to scale the number of combinations then you can use formats that support more partitions, or you could sort by the fields and rely on Parquet row group pruning to filter out data you don't want.
>

TBH, I don't understand what that would look like in pyspark and what
the consequences would be. Looking at the docs, it doesn't appear to
be the syntax for partitioning on a condition (most of our conditions
are of the form 'X > 30'). The use of Spark is still somewhat new in
our field, so it's possible we're not using it correctly.

Cheers
Andrew

> rb
>
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <an...@gmail.com>> wrote:
>>
>> Hello
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com>> wrote:
>> >
>> > I've seen many application need to split dataset to multiple datasets based on some conditions. As there is no method to do it in one place, developers use filter method multiple times. I think it can be useful to have method to split dataset based on condition in one iteration, something like partition method of scala (of-course scala partition just split list into two list, but something more general can be more useful).
>> > If you think it can be helpful, I can create Jira issue and work on it to send PR.
>>
>> This would be a really useful feature for our use case (processing
>> collision data from the LHC). We typically want to take some sort of
>> input and split into multiple disjoint outputs based on some
>> conditions. E.g. if we have two conditions A and B, we'll end up with
>> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> combinatorics explode like n^2, when we could produce them all up
>> front with this "multi filter" (or however it would be called).
>>
>> Cheers
>> Andrew
>>
>> >
>> > Best Regards
>> > Moein
>> >
>> > --
>> >
>> > Moein Hosseini
>> > Data Engineer
>> > mobile: +98 912 468 1859
>> > site: www.moein.xyz<http://www.moein.xyz>
>> > email: moein7tl@gmail.com<ma...@gmail.com>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix


--
Ryan Blue
Software Engineer
Netflix

Re: Feature request: split dataset based on condition

Posted by Andrew Melo <an...@gmail.com>.
Hi Ryan,

On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue <rb...@netflix.com> wrote:
>
> To partition by a condition, you would need to create a column with the result of that condition. Then you would partition by that column. The sort option would also work here.

We actually do something similar to filter based on physics properties
by running a python UDF to create a column then filtering on that
column. Doing something similar to sort/partition would also require a
shuffle though, right?

>
> I don't think that there is much of a use case for this. You have a set of conditions on which to partition your data, and partitioning is already supported. The idea to use conditions to create separate data frames would actually make that harder because you'd need to create and name tables for each one.

At the end, however, we do need separate dataframes for each of these
subsamples, unless there's something basic I'm missing in how the
partitioning works. After the input datasets are split into signal and
background regions, we still need to perform further (different)
computations on each of the subsamples. e.g. for subsamples with
exactly 2 electrons, we'll need to calculate the sum of their 4-d
momenta, while samples with <2 electrons will need subtract two
different physical quantities -- several more steps before we get to
the point where we'll histogram the different subsamples for the
outputs.

Cheers
Andrew


>
> On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <an...@gmail.com> wrote:
>>
>> Hello Ryan,
>>
>> On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com> wrote:
>> >
>> > Andrew, can you give us more information about why partitioning the output data doesn't work for your use case?
>> >
>> > It sounds like all you need to do is to create a table partitioned by A and B, then you would automatically get the divisions you want. If what you're looking for is a way to scale the number of combinations then you can use formats that support more partitions, or you could sort by the fields and rely on Parquet row group pruning to filter out data you don't want.
>> >
>>
>> TBH, I don't understand what that would look like in pyspark and what
>> the consequences would be. Looking at the docs, it doesn't appear to
>> be the syntax for partitioning on a condition (most of our conditions
>> are of the form 'X > 30'). The use of Spark is still somewhat new in
>> our field, so it's possible we're not using it correctly.
>>
>> Cheers
>> Andrew
>>
>> > rb
>> >
>> > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <an...@gmail.com> wrote:
>> >>
>> >> Hello
>> >>
>> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:
>> >> >
>> >> > I've seen many application need to split dataset to multiple datasets based on some conditions. As there is no method to do it in one place, developers use filter method multiple times. I think it can be useful to have method to split dataset based on condition in one iteration, something like partition method of scala (of-course scala partition just split list into two list, but something more general can be more useful).
>> >> > If you think it can be helpful, I can create Jira issue and work on it to send PR.
>> >>
>> >> This would be a really useful feature for our use case (processing
>> >> collision data from the LHC). We typically want to take some sort of
>> >> input and split into multiple disjoint outputs based on some
>> >> conditions. E.g. if we have two conditions A and B, we'll end up with
>> >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> >> combinatorics explode like n^2, when we could produce them all up
>> >> front with this "multi filter" (or however it would be called).
>> >>
>> >> Cheers
>> >> Andrew
>> >>
>> >> >
>> >> > Best Regards
>> >> > Moein
>> >> >
>> >> > --
>> >> >
>> >> > Moein Hosseini
>> >> > Data Engineer
>> >> > mobile: +98 912 468 1859
>> >> > site: www.moein.xyz
>> >> > email: moein7tl@gmail.com
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Feature request: split dataset based on condition

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
To partition by a condition, you would need to create a column with the
result of that condition. Then you would partition by that column. The sort
option would also work here.

I don't think that there is much of a use case for this. You have a set of
conditions on which to partition your data, and partitioning is already
supported. The idea to use conditions to create separate data frames would
actually make that harder because you'd need to create and name tables for
each one.

On Mon, Feb 4, 2019 at 9:16 AM Andrew Melo <an...@gmail.com> wrote:

> Hello Ryan,
>
> On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com> wrote:
> >
> > Andrew, can you give us more information about why partitioning the
> output data doesn't work for your use case?
> >
> > It sounds like all you need to do is to create a table partitioned by A
> and B, then you would automatically get the divisions you want. If what
> you're looking for is a way to scale the number of combinations then you
> can use formats that support more partitions, or you could sort by the
> fields and rely on Parquet row group pruning to filter out data you don't
> want.
> >
>
> TBH, I don't understand what that would look like in pyspark and what
> the consequences would be. Looking at the docs, it doesn't appear to
> be the syntax for partitioning on a condition (most of our conditions
> are of the form 'X > 30'). The use of Spark is still somewhat new in
> our field, so it's possible we're not using it correctly.
>
> Cheers
> Andrew
>
> > rb
> >
> > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <an...@gmail.com>
> wrote:
> >>
> >> Hello
> >>
> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com>
> wrote:
> >> >
> >> > I've seen many application need to split dataset to multiple datasets
> based on some conditions. As there is no method to do it in one place,
> developers use filter method multiple times. I think it can be useful to
> have method to split dataset based on condition in one iteration, something
> like partition method of scala (of-course scala partition just split list
> into two list, but something more general can be more useful).
> >> > If you think it can be helpful, I can create Jira issue and work on
> it to send PR.
> >>
> >> This would be a really useful feature for our use case (processing
> >> collision data from the LHC). We typically want to take some sort of
> >> input and split into multiple disjoint outputs based on some
> >> conditions. E.g. if we have two conditions A and B, we'll end up with
> >> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
> >> combinatorics explode like n^2, when we could produce them all up
> >> front with this "multi filter" (or however it would be called).
> >>
> >> Cheers
> >> Andrew
> >>
> >> >
> >> > Best Regards
> >> > Moein
> >> >
> >> > --
> >> >
> >> > Moein Hosseini
> >> > Data Engineer
> >> > mobile: +98 912 468 1859
> >> > site: www.moein.xyz
> >> > email: moein7tl@gmail.com
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Feature request: split dataset based on condition

Posted by Andrew Melo <an...@gmail.com>.
Hello Ryan,

On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue <rb...@netflix.com> wrote:
>
> Andrew, can you give us more information about why partitioning the output data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and B, then you would automatically get the divisions you want. If what you're looking for is a way to scale the number of combinations then you can use formats that support more partitions, or you could sort by the fields and rely on Parquet row group pruning to filter out data you don't want.
>

TBH, I don't understand what that would look like in pyspark and what
the consequences would be. Looking at the docs, it doesn't appear to
be the syntax for partitioning on a condition (most of our conditions
are of the form 'X > 30'). The use of Spark is still somewhat new in
our field, so it's possible we're not using it correctly.

Cheers
Andrew

> rb
>
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <an...@gmail.com> wrote:
>>
>> Hello
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:
>> >
>> > I've seen many application need to split dataset to multiple datasets based on some conditions. As there is no method to do it in one place, developers use filter method multiple times. I think it can be useful to have method to split dataset based on condition in one iteration, something like partition method of scala (of-course scala partition just split list into two list, but something more general can be more useful).
>> > If you think it can be helpful, I can create Jira issue and work on it to send PR.
>>
>> This would be a really useful feature for our use case (processing
>> collision data from the LHC). We typically want to take some sort of
>> input and split into multiple disjoint outputs based on some
>> conditions. E.g. if we have two conditions A and B, we'll end up with
>> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
>> combinatorics explode like n^2, when we could produce them all up
>> front with this "multi filter" (or however it would be called).
>>
>> Cheers
>> Andrew
>>
>> >
>> > Best Regards
>> > Moein
>> >
>> > --
>> >
>> > Moein Hosseini
>> > Data Engineer
>> > mobile: +98 912 468 1859
>> > site: www.moein.xyz
>> > email: moein7tl@gmail.com
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Feature request: split dataset based on condition

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Andrew, can you give us more information about why partitioning the output
data doesn't work for your use case?

It sounds like all you need to do is to create a table partitioned by A and
B, then you would automatically get the divisions you want. If what you're
looking for is a way to scale the number of combinations then you can use
formats that support more partitions, or you could sort by the fields and
rely on Parquet row group pruning to filter out data you don't want.

rb

On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <an...@gmail.com> wrote:

> Hello
>
> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:
> >
> > I've seen many application need to split dataset to multiple datasets
> based on some conditions. As there is no method to do it in one place,
> developers use filter method multiple times. I think it can be useful to
> have method to split dataset based on condition in one iteration, something
> like partition method of scala (of-course scala partition just split list
> into two list, but something more general can be more useful).
> > If you think it can be helpful, I can create Jira issue and work on it
> to send PR.
>
> This would be a really useful feature for our use case (processing
> collision data from the LHC). We typically want to take some sort of
> input and split into multiple disjoint outputs based on some
> conditions. E.g. if we have two conditions A and B, we'll end up with
> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
> combinatorics explode like n^2, when we could produce them all up
> front with this "multi filter" (or however it would be called).
>
> Cheers
> Andrew
>
> >
> > Best Regards
> > Moein
> >
> > --
> >
> > Moein Hosseini
> > Data Engineer
> > mobile: +98 912 468 1859
> > site: www.moein.xyz
> > email: moein7tl@gmail.com
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Feature request: split dataset based on condition

Posted by Andrew Melo <an...@gmail.com>.
Hello

On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:
>
> I've seen many application need to split dataset to multiple datasets based on some conditions. As there is no method to do it in one place, developers use filter method multiple times. I think it can be useful to have method to split dataset based on condition in one iteration, something like partition method of scala (of-course scala partition just split list into two list, but something more general can be more useful).
> If you think it can be helpful, I can create Jira issue and work on it to send PR.

This would be a really useful feature for our use case (processing
collision data from the LHC). We typically want to take some sort of
input and split into multiple disjoint outputs based on some
conditions. E.g. if we have two conditions A and B, we'll end up with
4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
combinatorics explode like n^2, when we could produce them all up
front with this "multi filter" (or however it would be called).

Cheers
Andrew

>
> Best Regards
> Moein
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859
> site: www.moein.xyz
> email: moein7tl@gmail.com
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Feature request: split dataset based on condition

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
If the goal is to split the output, then `DataFrameWriter.partitionBy`
should do what you need, and no additional methods are required. If not you
can also check Silex's implementation muxPartitions (see
https://stackoverflow.com/a/37956034), but the applications are rather
limited, due to high resource usage.

On Sun, 3 Feb 2019 at 15:41, Sean Owen <sr...@gmail.com> wrote:

> I don't think Spark supports this model, where N inputs depending on
> parent are computed once at the same time. You can of course cache the
> parent and filter N times and do the same amount of work. One problem is,
> where would the N inputs live? they'd have to be stored if not used
> immediately, and presumably in any use case, only one of them would be used
> immediately. If you have a job that needs to split records of a parent into
> N subsets, and then all N subsets are used, you can do that -- you are just
> transforming the parent to one child that has rows with those N splits of
> each input row, and then consume that. See randomSplit() for maybe the best
> case, where it still produce N Datasets but can do so efficiently because
> it's just a random sample.
>
> On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <mo...@gmail.com> wrote:
>
>> I don't consider it as method to apply filtering multiple time, instead
>> use it as semi-action not just transformation. Let's think that we have
>> something like map-partition which accept multiple lambda that each one
>> collect their ROW for their dataset (or something like it). Is it possible?
>>
>> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <sr...@gmail.com> wrote:
>>
>>> I think the problem is that can't produce multiple Datasets from one
>>> source in one operation - consider that reproducing one of them would mean
>>> reproducing all of them. You can write a method that would do the filtering
>>> multiple times but it wouldn't be faster. What do you have in mind that's
>>> different?
>>>
>>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com>
>>> wrote:
>>>
>>>> I've seen many application need to split dataset to multiple datasets
>>>> based on some conditions. As there is no method to do it in one place,
>>>> developers use *filter *method multiple times. I think it can be
>>>> useful to have method to split dataset based on condition in one iteration,
>>>> something like *partition* method of scala (of-course scala partition
>>>> just split list into two list, but something more general can be more
>>>> useful).
>>>> If you think it can be helpful, I can create Jira issue and work on it
>>>> to send PR.
>>>>
>>>> Best Regards
>>>> Moein
>>>>
>>>> --
>>>>
>>>> Moein Hosseini
>>>> Data Engineer
>>>> mobile: +98 912 468 1859 <+98+912+468+1859>
>>>> site: www.moein.xyz
>>>> email: moein7tl@gmail.com
>>>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>>>> [image: twitter] <https://twitter.com/moein7tl>
>>>>
>>>>
>>
>> --
>>
>> Moein Hosseini
>> Data Engineer
>> mobile: +98 912 468 1859 <+98+912+468+1859>
>> site: www.moein.xyz
>> email: moein7tl@gmail.com
>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>> [image: twitter] <https://twitter.com/moein7tl>
>>
>>

-- 

Regards,
Maciej

Re: Feature request: split dataset based on condition

Posted by Sean Owen <sr...@gmail.com>.
I don't think Spark supports this model, where N inputs depending on parent
are computed once at the same time. You can of course cache the parent and
filter N times and do the same amount of work. One problem is, where would
the N inputs live? they'd have to be stored if not used immediately, and
presumably in any use case, only one of them would be used immediately. If
you have a job that needs to split records of a parent into N subsets, and
then all N subsets are used, you can do that -- you are just transforming
the parent to one child that has rows with those N splits of each input
row, and then consume that. See randomSplit() for maybe the best case,
where it still produce N Datasets but can do so efficiently because it's
just a random sample.

On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini <mo...@gmail.com> wrote:

> I don't consider it as method to apply filtering multiple time, instead
> use it as semi-action not just transformation. Let's think that we have
> something like map-partition which accept multiple lambda that each one
> collect their ROW for their dataset (or something like it). Is it possible?
>
> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <sr...@gmail.com> wrote:
>
>> I think the problem is that can't produce multiple Datasets from one
>> source in one operation - consider that reproducing one of them would mean
>> reproducing all of them. You can write a method that would do the filtering
>> multiple times but it wouldn't be faster. What do you have in mind that's
>> different?
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com>
>> wrote:
>>
>>> I've seen many application need to split dataset to multiple datasets
>>> based on some conditions. As there is no method to do it in one place,
>>> developers use *filter *method multiple times. I think it can be useful
>>> to have method to split dataset based on condition in one iteration,
>>> something like *partition* method of scala (of-course scala partition
>>> just split list into two list, but something more general can be more
>>> useful).
>>> If you think it can be helpful, I can create Jira issue and work on it
>>> to send PR.
>>>
>>> Best Regards
>>> Moein
>>>
>>> --
>>>
>>> Moein Hosseini
>>> Data Engineer
>>> mobile: +98 912 468 1859 <+98+912+468+1859>
>>> site: www.moein.xyz
>>> email: moein7tl@gmail.com
>>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>>> [image: twitter] <https://twitter.com/moein7tl>
>>>
>>>
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859 <+98+912+468+1859>
> site: www.moein.xyz
> email: moein7tl@gmail.com
> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
> [image: twitter] <https://twitter.com/moein7tl>
>
>

Re: Feature request: split dataset based on condition

Posted by Moein Hosseini <mo...@gmail.com>.
I don't consider it as method to apply filtering multiple time, instead use
it as semi-action not just transformation. Let's think that we have
something like map-partition which accept multiple lambda that each one
collect their ROW for their dataset (or something like it). Is it possible?

On Sat, Feb 2, 2019 at 5:59 PM Sean Owen <sr...@gmail.com> wrote:

> I think the problem is that can't produce multiple Datasets from one
> source in one operation - consider that reproducing one of them would mean
> reproducing all of them. You can write a method that would do the filtering
> multiple times but it wouldn't be faster. What do you have in mind that's
> different?
>
> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:
>
>> I've seen many application need to split dataset to multiple datasets
>> based on some conditions. As there is no method to do it in one place,
>> developers use *filter *method multiple times. I think it can be useful
>> to have method to split dataset based on condition in one iteration,
>> something like *partition* method of scala (of-course scala partition
>> just split list into two list, but something more general can be more
>> useful).
>> If you think it can be helpful, I can create Jira issue and work on it to
>> send PR.
>>
>> Best Regards
>> Moein
>>
>> --
>>
>> Moein Hosseini
>> Data Engineer
>> mobile: +98 912 468 1859 <+98+912+468+1859>
>> site: www.moein.xyz
>> email: moein7tl@gmail.com
>> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
>> [image: twitter] <https://twitter.com/moein7tl>
>>
>>

-- 

Moein Hosseini
Data Engineer
mobile: +98 912 468 1859 <+98+912+468+1859>
site: www.moein.xyz
email: moein7tl@gmail.com
[image: linkedin] <https://www.linkedin.com/in/moeinhm>
[image: twitter] <https://twitter.com/moein7tl>

Re: Feature request: split dataset based on condition

Posted by Sean Owen <sr...@gmail.com>.
I think the problem is that can't produce multiple Datasets from one source
in one operation - consider that reproducing one of them would mean
reproducing all of them. You can write a method that would do the filtering
multiple times but it wouldn't be faster. What do you have in mind that's
different?

On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <mo...@gmail.com> wrote:

> I've seen many application need to split dataset to multiple datasets
> based on some conditions. As there is no method to do it in one place,
> developers use *filter *method multiple times. I think it can be useful
> to have method to split dataset based on condition in one iteration,
> something like *partition* method of scala (of-course scala partition
> just split list into two list, but something more general can be more
> useful).
> If you think it can be helpful, I can create Jira issue and work on it to
> send PR.
>
> Best Regards
> Moein
>
> --
>
> Moein Hosseini
> Data Engineer
> mobile: +98 912 468 1859 <+98+912+468+1859>
> site: www.moein.xyz
> email: moein7tl@gmail.com
> [image: linkedin] <https://www.linkedin.com/in/moeinhm>
> [image: twitter] <https://twitter.com/moein7tl>
>
>