Posted to dev@streams.apache.org by Robert Douglas <ro...@gmail.com> on 2014/06/30 18:50:59 UTC

Desired behavior for DataSift user_mention serialization

Hi all,

I’m currently working on cleaning up the implementation of the DataSift
serializer and have come upon an issue. The data that we get back in a
DataSift Interaction object contains two fields, mentions (which has all
the handles for mentioned users) and mention_ids (which has all the Ids for
mentioned users). Problem is, there is no guarantee that these two lists
will be the same size. My current solution is to merge together the handles
and Ids into individual UserMention objects whenever the mentions and
mention_ids lists are the same size. In the event that those lists are not
the same size, I create UserMention objects for every entry in both lists.
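
For illustration, here is a minimal sketch of the behavior described above. It is not the actual serializer code: the UserMention class, its field types, and the merge method are hypothetical stand-ins, and ids are assumed to be numeric.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MentionMerge {
    // Hypothetical stand-in for the serializer's UserMention type;
    // a null field means that value is unknown for this mention.
    static final class UserMention {
        public final String handle;
        public final Long id;

        UserMention(String handle, Long id) {
            this.handle = handle;
            this.id = id;
        }

        @Override
        public String toString() {
            return "UserMention{handle=" + handle + ", id=" + id + "}";
        }
    }

    static List<UserMention> merge(List<String> mentions, List<Long> mentionIds) {
        List<UserMention> out = new ArrayList<>();
        if (mentions.size() == mentionIds.size()) {
            // Same size: assume position i in both lists refers to the same user.
            for (int i = 0; i < mentions.size(); i++) {
                out.add(new UserMention(mentions.get(i), mentionIds.get(i)));
            }
        } else {
            // Different sizes: no pairing is attempted, one object per entry.
            for (String handle : mentions) {
                out.add(new UserMention(handle, null));
            }
            for (Long id : mentionIds) {
                out.add(new UserMention(null, id));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("alice", "bob"), Arrays.asList(1L, 2L)));
        System.out.println(merge(Arrays.asList("alice", "bob"), Arrays.asList(1L)));
    }
}
```

Note that the equal-size branch relies on the two lists being in the same order, which the inbound data does not guarantee.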

Does anyone have a different opinion on how this should be handled?

— Robert

Re: Desired behavior for DataSift user_mention serialization

Posted by Steve Blackmon <st...@blackmon.org>.
While that approach may result in mention objects with ids/names that were originally paired, we can't guarantee that without making an API lookup to Twitter.

In general I’m in favor of streams maintaining and attempting to improve data accuracy. In this scenario, however, it seems the inbound document has been degraded in a way that goes beyond the current scope and authority (API-wise) of the module to resolve. Given that, I’m wary of setting the extension fields in a way that could recombine fields incorrectly and thus make the problem worse.

So my vote would be to create a separate object for each id and each name in every case, maintaining all of the original information, and to leave it to a downstream processor to improve the metadata if there is value in doing so.
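
A rough sketch of what that could look like, assuming a hypothetical UserMention class with nullable handle and id fields and numeric ids (these names are illustrative only, not the module's actual types):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SeparateMentions {
    // Hypothetical stand-in for the serializer's UserMention type;
    // exactly one of the two fields is set on each object produced here.
    static final class UserMention {
        public final String handle;
        public final Long id;

        UserMention(String handle, Long id) {
            this.handle = handle;
            this.id = id;
        }

        @Override
        public String toString() {
            return "UserMention{handle=" + handle + ", id=" + id + "}";
        }
    }

    static List<UserMention> toSeparateMentions(List<String> mentions, List<Long> mentionIds) {
        List<UserMention> out = new ArrayList<>();
        // Never pair a handle with an id, even when the lists are the same size,
        // so no handle/id combination is asserted that the source did not assert.
        for (String handle : mentions) {
            out.add(new UserMention(handle, null));
        }
        for (Long id : mentionIds) {
            out.add(new UserMention(null, id));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toSeparateMentions(Arrays.asList("alice", "bob"), Arrays.asList(1L, 2L)));
    }
}
```

Every original value survives in the output, so a downstream processor with API access could still pair handles with ids later.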

Steve Blackmon
steve@blackmon.org


