Posted to user@spark.apache.org by Pedro Tuero <tu...@gmail.com> on 2017/05/22 19:42:25 UTC
Broadcasted Object is empty in executors.
Hi,
I'm using Spark 2.1.0 on AWS EMR, with the Kryo serializer.
I'm broadcasting an instance of this Java class:
public class NameMatcher {
    private static final Logger LOG = LoggerFactory.getLogger(NameMatcher.class);
    private final Splitter splitter;
    private final SetMultimap<String, IdNamed> itemsByWord;
    private final Multiset<IdNamed> wordCount;

    private NameMatcher(Builder builder) {
        splitter = builder.splitter;
        itemsByWord = cloneMultiMap(builder.itemsByWord);
        wordCount = cloneMultiSet(builder.wordCount);
        LOG.info("Matcher itemsByWorld size: {}", itemsByWord.size());
        LOG.info("Matcher wordCount size: {}", wordCount.size());
    }

    private <T> Multiset<T> cloneMultiSet(Multiset<T> multiset) {
        Multiset<T> result = HashMultiset.create();
        result.addAll(multiset);
        return result;
    }

    private <T, U> SetMultimap<T, U> cloneMultiMap(Multimap<T, U> multimap) {
        SetMultimap<T, U> result = HashMultimap.create();
        result.putAll(multimap);
        return result;
    }

    public Set<IdNamed> match(CharSequence text) {
        LOG.info("itemsByWorld Keys {}", itemsByWord.keys());
        LOG.info("QueryMatching: {}", text);
        Multiset<IdNamed> counter = HashMultiset.create();
        Set<IdNamed> result = Sets.newHashSet();
        for (String word : Sets.newHashSet(splitter.split(text))) {
            if (itemsByWord.containsKey(word)) {
                for (IdNamed item : itemsByWord.get(word)) {
                    counter.add(item);
                    if (wordCount.count(item) == counter.count(item)) {
                        result.add(item);
                    }
                }
            }
        }
        return result;
    }
}
The log statements in the constructor behave as expected:
    LOG.info("Matcher itemsByWorld size: {}", itemsByWord.size());
prints the itemsByWord size, and it is the size I expect. But when I call:
    nameMatcher.getValue().match(...
inside an RDD transformation, the log line in the match method:
    LOG.info("itemsByWorld Keys {}", itemsByWord.keys());
prints an empty list.
This works fine when running locally on my computer, but on AWS EMR it fails:
match finds no matches.
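The symptom (populated in the constructor, empty when used later) looks like what happens when an object's state does not survive a serialization round trip. The sketch below is only an illustration of that effect, under stated assumptions: it uses plain Java serialization rather than Kryo or Spark, and RoundTripCheck and Holder are hypothetical stand-ins, not the real NameMatcher. The field is marked transient on purpose to imitate state that a serializer does not capture. Whether this is what happens to the broadcast on EMR is not confirmed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;

public class RoundTripCheck {

    /** Hypothetical stand-in for a broadcast value; not the real NameMatcher. */
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;

        // 'transient' tells Java serialization to skip this field. It is used
        // here deliberately to imitate state that a serializer fails to capture.
        private transient Set<String> words = new HashSet<>();

        Holder(Set<String> source) {
            words.addAll(source); // populated in the constructor, like the matcher
        }

        int size() {
            // After deserialization the transient field is null.
            return words == null ? 0 : words.size();
        }
    }

    /** Serializes a value to an in-memory buffer and reads it back. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> source = new HashSet<>();
        source.add("alpha");
        source.add("beta");

        Holder original = new Holder(source);
        Holder copy = roundTrip(original);

        System.out.println("original size: " + original.size()); // 2
        System.out.println("copy size: " + copy.size());         // 0
    }
}
```

A similar round trip of the real object through the serializer actually configured for the job, run locally, might show whether the collections arrive empty before Spark is involved at all.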
I usually broadcast objects and maps with no problems.
Can anyone give me a clue about what's happening here?
Thank you very much,
Pedro.