Posted to user@spark.apache.org by Mungeol Heo <mu...@gmail.com> on 2016/10/17 01:50:56 UTC

Is spark a right tool for updating a dataframe repeatedly

Hello, everyone.

As mentioned in the title, I am wondering whether Spark is the right tool for
updating a data frame repeatedly until there is no more data to
update.

For example:

while (there was an update) {
  update data frame A
}
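Stripped of Spark specifics, the loop above is a fixed-point iteration: apply an update step until the result stops changing. A minimal plain-Python sketch (no Spark; `apply_updates` is a hypothetical stand-in for whatever transformation produces the "updated" frame):

```python
def apply_updates(rows):
    # Hypothetical update step: cap every value at 10.
    # Returns a NEW list, mirroring how Spark transformations
    # return a new DataFrame rather than mutating the old one.
    return [min(r, 10) for r in rows]

def update_until_stable(rows):
    while True:
        updated = apply_updates(rows)
        if updated == rows:  # no more data to update
            return rows
        rows = updated

print(update_until_stable([3, 25, 11]))  # [3, 10, 10]
```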

If it is the right tool, then what is the best practice for this kind of work?
Thank you.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Is spark a right tool for updating a dataframe repeatedly

Posted by Mike Metzger <mi...@flexiblecreations.com>.
I've not done this in Scala yet, but in PySpark I've run into a similar
issue where having too many DataFrames cached does cause memory problems.
Unpersist by itself did not clear the memory usage; rather, setting the
variable to None allowed all the references to be cleared, and the
memory issues went away.

I do not fully understand Scala yet, but you may be able to set one of your
dataframes to null to accomplish the same.
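To illustrate the reference-clearing point in plain Python (no Spark involved): rebinding the variable drops the last strong reference, so the object itself becomes collectible — which is the effect being described above for cached DataFrames.

```python
import gc
import weakref

class Frame:  # stand-in for a cached DataFrame
    pass

df = Frame()
probe = weakref.ref(df)  # lets us observe when the object is gone

df = None     # drop the only strong reference, as suggested above
gc.collect()  # force a collection (CPython usually refcounts it away immediately)

print(probe() is None)  # True: the object has been reclaimed
```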

Mike



Re: Is spark a right tool for updating a dataframe repeatedly

Posted by Mungeol Heo <mu...@gmail.com>.
First of all, thank you for your comments.
Actually, what I mean by "update" is generating a new data frame with modified data.
In more detail, the while loop will be something like below.

var continue = 1
var dfA = "a data frame"
dfA.persist

while (continue > 0) {
  val temp = "modified dfA"
  temp.persist
  temp.count
  dfA.unpersist

  dfA = "modified temp"
  dfA.persist
  dfA.count
  temp.unpersist

  if ("dfA is not modified") {
    continue = 0
  }
}
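Spark aside, the alternation above is a double-buffer pattern: build the next version, compare, swap, and let the old version go. A plain-Python analogue (hypothetical `modify` step, not Spark code) of that structure:

```python
def modify(rows):
    # Hypothetical "modified dfA" step: halve every even value.
    return [r // 2 if r % 2 == 0 else r for r in rows]

df_a = [8, 3, 6]
while True:
    temp = modify(df_a)   # "val temp = modified dfA"
    if temp == df_a:      # "dfA is not modified" -> stop
        break
    df_a = temp           # old df_a is now unreferenced (cf. unpersist)

print(df_a)  # [1, 3, 3]
```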

The problem is that it will eventually cause an OOM.
Also, the number of skipped stages increases every time, although I am
not sure whether that is what causes the OOM.
Maybe I need to check the source code of one of the Spark ML algorithms.
Again, thank you all.





Re: Is spark a right tool for updating a dataframe repeatedly

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Yes, iterating over a dataframe and making changes is not uncommon.
Of course, RDDs, dataframes and datasets are immutable, but there is some optimization in the optimizer that can potentially dampen the impact of creating a new RDD, DF or DS.
Also, the use case you cited is similar to what is done in regression, clustering and other algorithms, i.e. you iterate, making a change to a dataframe/dataset, until the desired condition is met.
E.g. see https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression and the setting of the iteration ceiling:

// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(params.maxIter)
  .setTol(params.tol)
  .setFitIntercept(params.fitIntercept)

Now the impact of that depends on a variety of things.
E.g. if the data is completely contained in memory and there is no spillover to disk, it might not be a big issue (of course there will still be memory, CPU and network overhead/latency).
If you are looking at storing the data on disk (e.g. as part of a checkpoint or explicit storage), then there can be substantial I/O activity.
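The maxIter/tol pair above amounts to: stop when the change falls below a tolerance, or when the iteration ceiling is hit. A minimal plain-Python sketch of that stopping rule (illustrative only, not Spark's actual implementation):

```python
def iterate(x0, step, max_iter=100, tol=1e-6):
    """Apply step() repeatedly until the change is below tol
    or max_iter (the 'iteration ceiling') is reached."""
    x = x0
    for i in range(max_iter):
        nxt = step(x)
        if abs(nxt - x) < tol:   # converged: cf. setTol
            return nxt, i + 1
        x = nxt
    return x, max_iter           # ceiling hit: cf. setMaxIter

# Example: the fixed-point map x -> (x + 2/x) / 2 converges to sqrt(2).
root, iters = iterate(1.0, lambda x: (x + 2.0 / x) / 2.0)
print(round(root, 6))  # 1.414214
```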




Re: Is spark a right tool for updating a dataframe repeatedly

Posted by Xi Shen <da...@gmail.com>.
I think most of the "big data" tools, like Spark and Hive, are not designed
to edit data; they are designed only to query it. I wonder in what
scenario you need to update a large volume of data repeatedly.




Thanks,
David S.

Re: Is spark a right tool for updating a dataframe repeatedly

Posted by Divya Gehlot <di...@gmail.com>.
If my understanding of your query is correct: in Spark, DataFrames are immutable; you can't update a dataframe.
You have to create a new dataframe to "update" the current one.


Thanks,
Divya

