Posted to dev@spark.apache.org by Sandeep Giri <sa...@knowbigdata.com> on 2015/08/05 17:49:53 UTC

(Unknown)

Yes, but in the take() approach we will be bringing the data to the driver
and it is no longer distributed.

Also, take() only accepts a count as its argument, which means that every
time we would be transferring redundant elements.
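
As a minimal sketch of the take()-based check under discussion (assuming
data: RDD[String], a qualifying_function: String => Boolean and a threshold n;
note that only the up-to-n matching elements come back to the driver, as Sean
points out below):

val matches: Array[String] = data.filter(qualifying_function).take(n)  // at most n matching elements on the driver
val atLeastN: Boolean = matches.length >= n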


Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. <http://KnowBigData.com.>
Phone: +1-253-397-1945 (Office)



On Wed, Aug 5, 2015 at 3:09 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think countApprox is appropriate here unless approximation is OK.
> But more generally, counting everything matching a filter requires applying
> the filter to the whole data set, which seems like the thing to be avoided
> here.
>
> The take approach is better since it would stop after finding n matching
> elements (it might do a little extra work given partitioning and
> buffering). It would not filter the whole data set.
>
> The only downside there is that it would copy n elements to the driver.
>
>
> On Wed, Aug 5, 2015 at 10:34 AM, Sandeep Giri <sa...@knowbigdata.com>
> wrote:
>
>> Hi Jonathan,
>>
>> Does that guarantee a result? I do not see that it is really optimized.
>>
>> Hi Carsten,
>>
>>
>> How does the following code work:
>>
>> data.filter(qualifying_function).take(n).length >= n
>>
>>
>> Also, as per my understanding, in both the approaches you mentioned the
>> qualifying function will be executed on the whole dataset even if the value
>> was already found in the first element of the RDD:
>>
>>
>>    - data.filter(qualifying_function).take(n).length >= n
>>    - val contains1MatchingElement = !(data.filter(qualifying_function).isEmpty())
>>
>> Isn't it? Am I missing something?
>>
>>
>> Regards,
>> Sandeep Giri,
>> +1 347 781 4573 (US)
>> +91-953-899-8962 (IN)
>>
>> www.KnowBigData.com. <http://KnowBigData.com.>
>> Phone: +1-253-397-1945 (Office)
>>
>>
>>
>> On Fri, Jul 31, 2015 at 3:37 PM, Jonathan Winandy <jonathan.winandy@gmail.com> wrote:
>>
>>> Hello !
>>>
>>> You could try something like that :
>>>
>>> def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Int):Boolean = {
>>>   rdd.filter(f).countApprox(timeout = 10000).getFinalValue().low > n
>>> }
>>>
>>> It would work for large datasets and large values of n.
>>>
>>> Have a nice day,
>>>
>>> Jonathan
>>>
>>>
>>>
>>> On 31 July 2015 at 11:29, Carsten Schnober <schnober@ukp.informatik.tu-darmstadt.de> wrote:
>>>
>>>> Hi,
>>>> the RDD class does not have an exists() method (in the Scala API), but
>>>> the functionality you need seems easy to reproduce with the existing
>>>> methods:
>>>>
>>>> val containsNMatchingElements =
>>>> data.filter(qualifying_function).take(n).length >= n
>>>>
>>>> Note: I am not sure whether the intermediate take(n) really increases
>>>> performance, but the idea is to arbitrarily reduce the number of
>>>> elements in the RDD before counting because we are not interested in the
>>>> full count.
>>>>
>>>> If you need to check specifically whether there is at least one matching
>>>> occurrence, it is probably preferable to use isEmpty() rather than
>>>> counting, and check whether the result is false:
>>>>
>>>> val contains1MatchingElement =
>>>> !(data.filter(qualifying_function).isEmpty())
>>>>
>>>> Best,
>>>> Carsten
>>>>
>>>>
>>>>
>>>> Am 31.07.2015 um 11:11 schrieb Sandeep Giri:
>>>> > Dear Spark Dev Community,
>>>> >
>>>> > I am wondering if there is already a function to solve my problem. If
>>>> > not, then should I work on this?
>>>> >
>>>> > Say you just want to check if a word exists in a huge text file. I could
>>>> > not find better ways than those mentioned here
>>>> > <http://www.knowbigdata.com/blog/interview-questions-apache-spark-part-2#q6>.
>>>> >
>>>> > So, I was proposing that we have a function called exists in RDD with
>>>> > the following signature:
>>>> >
>>>> > #returns true if n elements exist which qualify our criteria.
>>>> > #the qualifying function would receive the element and its index and
>>>> > return true or false.
>>>> > def exists(qualifying_function, n):
>>>> >      ....
>>>> >
>>>> >
>>>> > Regards,
>>>> > Sandeep Giri,
>>>> > +1 347 781 4573 (US)
>>>> > +91-953-899-8962 (IN)
>>>> >
>>>> > www.KnowBigData.com. <http://KnowBigData.com.>
>>>> > Phone: +1-253-397-1945 (Office)
>>>> >
>>>> >
>>>>
>>>> --
>>>> Carsten Schnober
>>>> Doctoral Researcher
>>>> Ubiquitous Knowledge Processing (UKP) Lab
>>>> FB 20 / Computer Science Department
>>>> Technische Universität Darmstadt
>>>> Hochschulstr. 10, D-64289 Darmstadt, Germany
>>>> phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
>>>> schnober@ukp.informatik.tu-darmstadt.de
>>>> www.ukp.tu-darmstadt.de
>>>>
>>>> Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
>>>> GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
>>>> (AIPHES): www.aiphes.tu-darmstadt.de
>>>> PhD program: Knowledge Discovery in Scientific Literature (KDSL)
>>>> www.kdsl.tu-darmstadt.de
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>>
>>>
>>
>

Re:

Posted by Sandeep Giri <sa...@knowbigdata.com>.
Looks good.

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. <http://KnowBigData.com.>
Phone: +1-253-397-1945 (Office)



On Thu, Aug 6, 2015 at 3:02 PM, Jonathan Winandy <jonathan.winandy@gmail.com> wrote:

> Hello !
>
> I think I found a performant and nice solution based on take's source code:
>
> def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
>   if (num == 0) {
>     true
>   } else {
>     var count: Int = 0
>     val totalParts: Int = rdd.partitions.length
>     var partsScanned: Int = 0
>     while (count < num && partsScanned < totalParts) {
>       var numPartsToTry: Int = 1
>       if (partsScanned > 0) {
>         if (count == 0) {
>           numPartsToTry = partsScanned * 4
>         } else {
>           numPartsToTry = Math.max((1.5 * num * partsScanned / count).toInt - partsScanned, 1)
>           numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
>         }
>       }
>
>       val left: Int = num - count
>       val p: Range = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
>       val res: Array[Int] = rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.filter(qualif).take(left).size, p, allowLocal = true)
>
>       count = count + res.sum
>       partsScanned += numPartsToTry
>     }
>
>     count >= num
>   }
> }
>
> //val all:RDD[Any]
> println(exists(all)(_ => {println(".") ; true}, 10))
>
> It's super fast for small values of n and I think it parallelises nicely for large values.
>
> Please tell me what you think.
>
>
> Have a nice day,
>
> Jonathan
>
>
>
>
>
> On 5 August 2015 at 19:18, Jonathan Winandy <jo...@gmail.com>
> wrote:
>
>> Hello !
>>
>> You could try something like that :
>>
>> def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boolean = {
>>
>>   val context: SparkContext = rdd.sparkContext
>>   val grp: String = Random.alphanumeric.take(10).mkString
>>   context.setJobGroup(grp, "exist")
>>   val count: Accumulator[Long] = context.accumulator(0L)
>>
>>   val iteratorToInt: (Iterator[T]) => Int = {
>>     iterator =>
>>       val i: Int = iterator.count(f)
>>       count += i
>>       i
>>   }
>>
>>   val t = new Thread {
>>     override def run {
>>       while (count.value < n) {}
>>       context.cancelJobGroup(grp)
>>     }
>>   }
>>   t.start()
>>   try {
>>     context.runJob(rdd, iteratorToInt).sum > n
>>   } catch  {
>>     case e:SparkException => {
>>       count.value > n
>>     }
>>   } finally {
>>     t.stop()
>>   }
>>
>> }
>>
>>
>>
>> It stops the computation if enough elements satisfying the condition are
>> witnessed.
>>
>> It is performant if the RDD is well partitioned. If this is a problem,
>> you could change iteratorToInt to :
>>
>> val iteratorToInt: (Iterator[T]) => Int = {
>>   iterator =>
>>     val i: Int = iterator.count(x => {
>>       if(f(x)) {
>>         count += 1
>>         true
>>       } else false
>>     })
>>     i
>>
>> }
>>
>>
>> I am interested in a safer way to perform partial computation in spark.
>>
>> Cheers,
>> Jonathan
>>
>> On 5 August 2015 at 18:54, Feynman Liang <fl...@databricks.com> wrote:
>>
>>> qualifying_function() will be executed on each partition in parallel;
>>> stopping all parallel execution after the first instance satisfying
>>> qualifying_function() would mean that you would have to effectively make
>>> the computation sequential.
>>>
>>> On Wed, Aug 5, 2015 at 9:05 AM, Sandeep Giri <sa...@knowbigdata.com>
>>> wrote:
>>>
>>>> Okay. I think I got it now. Yes, take() does not need to be called more
>>>> than once. I got the impression that we wanted to bring elements to the
>>>> driver node and then run our qualifying_function on the driver node.
>>>>
>>>> Now, I am back to my question which I started with: Could there be an
>>>> approach where the qualifying_function() does not get called after an
>>>> element has been found?
>>>>
>>>>
>>>> Regards,
>>>> Sandeep Giri,
>>>> +1 347 781 4573 (US)
>>>> +91-953-899-8962 (IN)
>>>>
>>>> www.KnowBigData.com. <http://KnowBigData.com.>
>>>> Phone: +1-253-397-1945 (Office)
>>>>
>>>>
>>>>
>>>> On Wed, Aug 5, 2015 at 9:21 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> take only brings n elements to the driver, which is probably still a
>>>>> win if n is small. I'm not sure what you mean by only taking a count
>>>>> argument -- what else would be an arg to take?
>>>>>
>>>>> On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, but in the take() approach we will be bringing the data to the
>>>>>> driver and it is no longer distributed.
>>>>>>
>>>>>> Also, take() only accepts a count as its argument, which means that every
>>>>>> time we would be transferring redundant elements.
>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re:

Posted by Jonathan Winandy <jo...@gmail.com>.
Hello !

I think I found a performant and nice solution based on take's source code:

def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
  if (num == 0) {
    true
  } else {
    var count: Int = 0
    val totalParts: Int = rdd.partitions.length
    var partsScanned: Int = 0
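    // scan partitions in growing batches (the same strategy RDD.take uses): start with one
    // partition, then, guided by the matches seen so far, try up to 4x more per pass,
    // stopping as soon as num matching elements have been counted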
    while (count < num && partsScanned < totalParts) {
      var numPartsToTry: Int = 1
      if (partsScanned > 0) {
        if (count == 0) {
          numPartsToTry = partsScanned * 4
        } else {
          numPartsToTry = Math.max((1.5 * num * partsScanned / count).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }

      val left: Int = num - count
      val p: Range = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res: Array[Int] = rdd.sparkContext.runJob(rdd, (it: Iterator[T]) => it.filter(qualif).take(left).size, p, allowLocal = true)

      count = count + res.sum
      partsScanned += numPartsToTry
    }

    count >= num
  }
}

//val all:RDD[Any]
println(exists(all)(_ => {println(".") ; true}, 10))

It's super fast for small values of n and I think it parallelises
nicely for large values.

Please tell me what you think.


Have a nice day,

Jonathan





On 5 August 2015 at 19:18, Jonathan Winandy <jo...@gmail.com>
wrote:

> Hello !
>
> You could try something like that :
>
> def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boolean = {
>
>   val context: SparkContext = rdd.sparkContext
>   val grp: String = Random.alphanumeric.take(10).mkString
>   context.setJobGroup(grp, "exist")
>   val count: Accumulator[Long] = context.accumulator(0L)
>
>   val iteratorToInt: (Iterator[T]) => Int = {
>     iterator =>
>       val i: Int = iterator.count(f)
>       count += i
>       i
>   }
>
>   val t = new Thread {
>     override def run {
>       while (count.value < n) {}
>       context.cancelJobGroup(grp)
>     }
>   }
>   t.start()
>   try {
>     context.runJob(rdd, iteratorToInt).sum > n
>   } catch  {
>     case e:SparkException => {
>       count.value > n
>     }
>   } finally {
>     t.stop()
>   }
>
> }
>
>
>
> It stops the computation if enough elements satisfying the condition are
> witnessed.
>
> It is performant if the RDD is well partitioned. If this is a problem, you
> could change iteratorToInt to :
>
> val iteratorToInt: (Iterator[T]) => Int = {
>   iterator =>
>     val i: Int = iterator.count(x => {
>       if(f(x)) {
>         count += 1
>         true
>       } else false
>     })
>     i
>
> }
>
>
> I am interested in a safer way to perform partial computation in spark.
>
> Cheers,
> Jonathan
>
> On 5 August 2015 at 18:54, Feynman Liang <fl...@databricks.com> wrote:
>
>> qualifying_function() will be executed on each partition in parallel;
>> stopping all parallel execution after the first instance satisfying
>> qualifying_function() would mean that you would have to effectively make
>> the computation sequential.
>>
>> On Wed, Aug 5, 2015 at 9:05 AM, Sandeep Giri <sa...@knowbigdata.com>
>> wrote:
>>
>>> Okay. I think I got it now. Yes, take() does not need to be called more
>>> than once. I got the impression that we wanted to bring elements to the
>>> driver node and then run our qualifying_function on the driver node.
>>>
>>> Now, I am back to my question which I started with: Could there be an
>>> approach where the qualifying_function() does not get called after an
>>> element has been found?
>>>
>>>
>>> Regards,
>>> Sandeep Giri,
>>> +1 347 781 4573 (US)
>>> +91-953-899-8962 (IN)
>>>
>>> www.KnowBigData.com. <http://KnowBigData.com.>
>>> Phone: +1-253-397-1945 (Office)
>>>
>>>
>>>
>>> On Wed, Aug 5, 2015 at 9:21 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> take only brings n elements to the driver, which is probably still a
>>>> win if n is small. I'm not sure what you mean by only taking a count
>>>> argument -- what else would be an arg to take?
>>>>
>>>> On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
>>>> wrote:
>>>>
>>>>> Yes, but in the take() approach we will be bringing the data to the
>>>>> driver and it is no longer distributed.
>>>>>
>>>>> Also, take() only accepts a count as its argument, which means that every
>>>>> time we would be transferring redundant elements.
>>>>>
>>>>>
>>>
>>
>

Re:

Posted by Jonathan Winandy <jo...@gmail.com>.
Hello !

You could try something like that :

def exists[T](rdd:RDD[T])(f:T=>Boolean, n:Long):Boolean = {

  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  context.setJobGroup(grp, "exist")
  val count: Accumulator[Long] = context.accumulator(0L)

  val iteratorToInt: (Iterator[T]) => Int = {
    iterator =>
      val i: Int = iterator.count(f)
      count += i
      i
  }
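
  // watcher thread: busy-waits on the accumulator and cancels the job group once at least
  // n matches have been observed; the cancelled runJob below then throws a SparkException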

  val t = new Thread {
    override def run {
      while (count.value < n) {}
      context.cancelJobGroup(grp)
    }
  }
  t.start()
  try {
    context.runJob(rdd, iteratorToInt).sum > n
  } catch  {
    case e:SparkException => {
      count.value > n
    }
  } finally {
    t.stop()
  }

}
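
A hypothetical usage sketch (the file path and predicate below are assumptions,
not from the thread), going back to the original word-in-a-huge-text-file example:

// assumed: sc is an existing SparkContext
val lines: RDD[String] = sc.textFile("hdfs:///path/to/huge.txt")
val found: Boolean = exists(lines)(_.contains("spark"), 10L)  // more than 10 matching lines?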



It stops the computation if enough elements satisfying the condition are
witnessed.

It is performant if the RDD is well partitioned. If this is a problem, you
could change iteratorToInt to :

val iteratorToInt: (Iterator[T]) => Int = {
  iterator =>
    val i: Int = iterator.count(x => {
      if(f(x)) {
        count += 1
        true
      } else false
    })
    i

}


I am interested in a safer way to perform partial computation in spark.

Cheers,
Jonathan

On 5 August 2015 at 18:54, Feynman Liang <fl...@databricks.com> wrote:

> qualifying_function() will be executed on each partition in parallel;
> stopping all parallel execution after the first instance satisfying
> qualifying_function() would mean that you would have to effectively make
> the computation sequential.
>
> On Wed, Aug 5, 2015 at 9:05 AM, Sandeep Giri <sa...@knowbigdata.com>
> wrote:
>
>> Okay. I think I got it now. Yes, take() does not need to be called more
>> than once. I got the impression that we wanted to bring elements to the
>> driver node and then run our qualifying_function on the driver node.
>>
>> Now, I am back to my question which I started with: Could there be an
>> approach where the qualifying_function() does not get called after an
>> element has been found?
>>
>>
>> Regards,
>> Sandeep Giri,
>> +1 347 781 4573 (US)
>> +91-953-899-8962 (IN)
>>
>> www.KnowBigData.com. <http://KnowBigData.com.>
>> Phone: +1-253-397-1945 (Office)
>>
>>
>>
>> On Wed, Aug 5, 2015 at 9:21 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> take only brings n elements to the driver, which is probably still a win
>>> if n is small. I'm not sure what you mean by only taking a count argument
>>> -- what else would be an arg to take?
>>>
>>> On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
>>> wrote:
>>>
>>>> Yes, but in the take() approach we will be bringing the data to the
>>>> driver and it is no longer distributed.
>>>>
>>>> Also, take() only accepts a count as its argument, which means that every
>>>> time we would be transferring redundant elements.
>>>>
>>>>
>>
>

Re:

Posted by Feynman Liang <fl...@databricks.com>.
qualifying_function() will be executed on each partition in parallel;
stopping all parallel execution after the first instance satisfying
qualifying_function() would mean that you would have to effectively make
the computation sequential.
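
A minimal sketch of that per-partition behaviour (the RDD and predicate names
are assumptions carried over from earlier in the thread): each partition counts
its own matches independently and in parallel, and only the small per-partition
counts are combined into a total.

// assumed: data is an existing RDD[String], qualifying_function: String => Boolean
val perPartitionCounts = data.mapPartitions(it => Iterator(it.count(qualifying_function)))
val totalMatches = perPartitionCounts.map(_.toLong).reduce(_ + _)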

On Wed, Aug 5, 2015 at 9:05 AM, Sandeep Giri <sa...@knowbigdata.com>
wrote:

> Okay. I think I got it now. Yes, take() does not need to be called more
> than once. I got the impression that we wanted to bring elements to the
> driver node and then run our qualifying_function on the driver node.
>
> Now, I am back to my question which I started with: Could there be an
> approach where the qualifying_function() does not get called after an
> element has been found?
>
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. <http://KnowBigData.com.>
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Wed, Aug 5, 2015 at 9:21 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> take only brings n elements to the driver, which is probably still a win
>> if n is small. I'm not sure what you mean by only taking a count argument
>> -- what else would be an arg to take?
>>
>> On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
>> wrote:
>>
>>> Yes, but in the take() approach we will be bringing the data to the
>>> driver and it is no longer distributed.
>>>
>>> Also, take() only accepts a count as its argument, which means that every
>>> time we would be transferring redundant elements.
>>>
>>>
>

Re:

Posted by Sandeep Giri <sa...@knowbigdata.com>.
Okay. I think I got it now. Yes, take() does not need to be called more than
once. I got the impression that we wanted to bring elements to the driver
node and then run our qualifying_function on the driver node.

Now, I am back to my question which I started with: Could there be an
approach where the qualifying_function() does not get called after an
element has been found?
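
For the single-element case, one minimal sketch (an editorial assumption, not
something proposed in the thread) comes close: iterators are lazy, so within a
scanned partition the predicate stops being applied after its first match, and
take(1) stops scanning further partitions once a match has surfaced; partitions
scanned in the same pass still run in parallel, as Feynman notes.

// assumed: data is an existing RDD[String], qualifying_function: String => Boolean
val anyMatch: Boolean = data
  .mapPartitions(it => Iterator(it.exists(qualifying_function)))  // Iterator.exists short-circuits
  .filter(identity)
  .take(1)
  .nonEmpty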


Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. <http://KnowBigData.com.>
Phone: +1-253-397-1945 (Office)



On Wed, Aug 5, 2015 at 9:21 PM, Sean Owen <so...@cloudera.com> wrote:

> take only brings n elements to the driver, which is probably still a win
> if n is small. I'm not sure what you mean by only taking a count argument
> -- what else would be an arg to take?
>
> On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
> wrote:
>
>> Yes, but in the take() approach we will be bringing the data to the
>> driver and it is no longer distributed.
>>
>> Also, take() only accepts a count as its argument, which means that every
>> time we would be transferring redundant elements.
>>
>>

Re:

Posted by Sean Owen <so...@cloudera.com>.
take only brings n elements to the driver, which is probably still a win if
n is small. I'm not sure what you mean by only taking a count argument --
what else would be an arg to take?
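
For reference, the method in the Scala RDD API is simply

  def take(num: Int): Array[T]

so the only argument is the number of elements to fetch, and the result is a
local array of at most num elements on the driver.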

On Wed, Aug 5, 2015 at 4:49 PM, Sandeep Giri <sa...@knowbigdata.com>
wrote:

> Yes, but in the take() approach we will be bringing the data to the driver
> and it is no longer distributed.
>
> Also, take() only accepts a count as its argument, which means that every
> time we would be transferring redundant elements.
>
>