You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Pedro Tuero <tu...@gmail.com> on 2017/05/22 19:42:25 UTC

Broadcasted Object is empty in executors.

Hi,
I'm using spark 2.1.0 in aws emr. Kryo Serializer.

I'm broadcasting a java class :

public class NameMatcher {

    private static final Logger LOG =
LoggerFactory.getLogger(NameMatcher.class);
    private final Splitter splitter;
    private final SetMultimap<String, IdNamed> itemsByWord;
    private final Multiset<IdNamed> wordCount;

    private NameMatcher(Builder builder) {
        splitter = builder.splitter;
        itemsByWord = cloneMultiMap(builder.itemsByWord);
        wordCount = cloneMultiSet(builder.wordCount);
        LOG.info("Matcher itemsByWorld size: {}", itemsByWord.size());
        LOG.info("Matcher wordCount size: {}", wordCount.size());
    }

    private <T> Multiset<T> cloneMultiSet(Multiset<T> multiset) {
        Multiset<T> result = HashMultiset.create();
        result.addAll(multiset);
        return result;
    }

    private <T, U> SetMultimap<T, U> cloneMultiMap(Multimap<T, U> multimap)
{
        SetMultimap<T, U> result = HashMultimap.create();
        result.putAll(multimap);
        return result;
    }

    public Set<IdNamed> match(CharSequence text) {
        LOG.info("itemsByWorld Keys {}", itemsByWord.keys());
        LOG.info("QueryMatching: {}", text);
        Multiset<IdNamed> counter = HashMultiset.create();
        Set<IdNamed> result = Sets.newHashSet();
        for (String word : Sets.newHashSet(splitter.split(text))) {
            if (itemsByWord.containsKey(word)) {
                for (IdNamed item : itemsByWord.get(word)) {
                    counter.add(item);
                    if (wordCount.count(item) == counter.count(item)) {
                        result.add(item);
                    }
                }
            }
        }
        return result;
    }
}

So the logs in the constructor are ok:
LOG.info("Matcher itemsByWorld size: {}", itemsByWord.size());
prints itemsByWorld sizes and it's as expected. But when calling:
nameMatcher.getValue().match(...
in a RDD transformation, the log line in match method:
 LOG.info("itemsByWorld Keys {}", itemsByWord.keys());
Prints an empty list.

This works alright running locally  in my computer, but fail with no match
running in aws emr.
I usually broadcast objects and map with no problems.
Can anyone give me a clue about what's happening here?
Thanks you very much,
Pedro.