Posted to user@spark.apache.org by Eugen Cepoi <ce...@gmail.com> on 2013/11/19 17:33:06 UTC

HttpBroadcast strange behaviour, bug?

Maybe a bug with HttpBroadcast, or maybe my fault, but I can't find where :)

The problem:
  At the beginning, a job computes a TreeMap[String, SomeObject] with a
custom ordering (a simple case-insensitive one); this TreeMap is broadcast.
  Then I use this map to do some matching against an input RDD (excluding
the entries that don't exist in the map).
  What happens? In local mode (where the broadcast is not used), or when
passing the whole TreeMap without broadcasting it, I get more than 3M
matches; after broadcasting, it falls to 20K.

 Replacing HttpBroadcastFactory with TreeBroadcastFactory solves the
problem (I obtain the expected results). I am trying to implement a test
case to reproduce it, but it is quite tricky in this case...

BTW, is there a way to reproduce the broadcast mechanism in local mode (I
see that the SparkEnv instance is shared as a static, so I guess there is no
easy way)?

Thanks,
Eugen
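
One way to approximate, without a cluster, what a non-local broadcast does
to a value is to push it through a serialization round trip and compare the
copy with the original. The following is a minimal sketch under stated
assumptions, not Spark's actual broadcast path: it uses plain Java
serialization (the serializer and compression Spark applies may differ), and
the names IgnoreCaseOrdering, BroadcastRoundTrip and roundTrip are
hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.TreeMap

// Illustrative case-insensitive ordering (hypothetical name).
object IgnoreCaseOrdering extends Ordering[String] {
  def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
}

object BroadcastRoundTrip {
  // Serialize a value to bytes with Java serialization and read it back.
  // Assumption: this only approximates what happens to a broadcast value
  // between the driver and a remote executor.
  def roundTrip[T](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data, not the dictionary from the job.
    val original = TreeMap("Engineer" -> 1, "Doctor" -> 2)(IgnoreCaseOrdering)
    val copy = roundTrip(original)
    // If the ordering survived the round trip, a differently cased key
    // is found in both maps.
    println(s"before: ${original.contains("ENGINEER")}, after: ${copy.contains("ENGINEER")}")
  }
}
```

Running the same check with another serializer in place of roundTrip would
show whether the custom ordering is preserved by that serializer as well.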

Re: HttpBroadcast strange behaviour, bug?

Posted by Eugen Cepoi <ce...@gmail.com>.
This is the code creating the treemap:

object CaseInsensitiveOrdered extends Ordering[String] {
    def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
}

TreeMap[String, JobTitle](dico.toArray:_*)(CaseInsensitiveOrdered)

This is the map that is broadcast.
BTW, *if I remove the ordering I get coherent results* (close to the 3M);
with the ordering I fall down to 20K.
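
To make concrete how lookups depend on the ordering a TreeMap carries, here
is a small sketch. It does not diagnose the issue; it only shows that the
same keys match a different number of inputs under a case-insensitive
ordering than under the default one. The sample entries, the input strings
and the names OrderingLookupSketch, IgnoreCase and countMatches are
hypothetical.

```scala
import scala.collection.immutable.TreeMap

object OrderingLookupSketch {
  // Illustrative stand-in for the CaseInsensitiveOrdered object above.
  object IgnoreCase extends Ordering[String] {
    def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
  }

  // Hypothetical sample data; the real dictionary is not shown in the thread.
  val entries: Seq[(String, String)] = Seq("Engineer" -> "eng", "Doctor" -> "doc")

  val caseInsensitive: TreeMap[String, String] = TreeMap(entries: _*)(IgnoreCase)
  val caseSensitive: TreeMap[String, String] = TreeMap(entries: _*) // default Ordering[String]

  // Count how many input strings have a key in the given map.
  def countMatches(input: Seq[String], dict: TreeMap[String, String]): Int =
    input.count(s => dict.contains(s))

  def main(args: Array[String]): Unit = {
    val input = Seq("engineer", "DOCTOR", "Doctor", "pilot")
    println(s"case-insensitive matches: ${countMatches(input, caseInsensitive)}")
    println(s"case-sensitive matches:   ${countMatches(input, caseSensitive)}")
  }
}
```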



Re: HttpBroadcast strange behaviour, bug?

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Aah, yes. I missed that. I looked into the code. In local mode, neither
TreeBroadcast nor HttpBroadcast does the send or the write, respectively.
Will wait for other inputs.




-- 
It's just about how deep your longing is!

Re: HttpBroadcast strange behaviour, bug?

Posted by Eugen Cepoi <ce...@gmail.com>.
Yes, sure, for usual tests it is fine, but the broadcast is only done if we
are not in local mode (at least it seems so).

In SparkContext we have
def broadcast[T](value: T) =
  env.broadcastManager.newBroadcast[T](value, isLocal)
where isLocal is computed from the master name ("local" or "local[...").
Now if we look into HttpBroadcast we see
if (!isLocal) {
  HttpBroadcast.write(id, value_)
}

The broadcast is not done in local mode. I guess this is an optimization
for the case where multiple threads share the same broadcast variable. But
perhaps I am missing something?
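
The behaviour described above can be modelled with a toy sketch. This is
not Spark's code: LocalBroadcastSketch, isLocal and ToyBroadcast are
hypothetical names, and the master-string check is an assumption based only
on the description in this message ("local" or a name starting with
"local["). It shows why a test run with a local master never exercises the
write path.

```scala
import scala.collection.mutable

object LocalBroadcastSketch {
  // Assumption: local mode is derived from the master name as described
  // in this message; the real check in Spark may differ.
  def isLocal(master: String): Boolean =
    master == "local" || master.startsWith("local[")

  // Toy stand-in for a broadcast implementation. It only records which
  // broadcast ids would have been written out for remote executors.
  final class ToyBroadcast(master: String) {
    private val written = mutable.Set.empty[Long]

    def newBroadcast[T](id: Long, value: T): T = {
      // Stands in for the guarded HttpBroadcast.write(id, value_) call.
      if (!isLocal(master)) written += id
      value
    }

    def wasWritten(id: Long): Boolean = written.contains(id)
  }

  def main(args: Array[String]): Unit = {
    // "spark://host:7077" is a made-up cluster master URL for illustration.
    for (master <- Seq("local", "local[2]", "spark://host:7077")) {
      val bc = new ToyBroadcast(master)
      bc.newBroadcast(0L, "payload")
      println(s"$master -> written: ${bc.wasWritten(0L)}")
    }
  }
}
```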



Re: HttpBroadcast strange behaviour, bug?

Posted by Sriram Ramachandrasekaran <sr...@gmail.com>.
Try local[m], where m is the number of worker threads. For tests, local[2]
should be ideal. This is generally the way to write tests for Spark code.





-- 
It's just about how deep your longing is!