Posted to user@spark.apache.org by Hamish Whittal <ha...@cloud-fundis.co.za> on 2020/08/24 14:06:38 UTC

Stream to Stream joins

Hi folks,

I've got a stream coming from Kafka. It has the following schema:

 userdetail : { id: INT, acctid: INT, uid: STRING, logintm: TIMESTAMP }
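
For context, this is roughly how the stream is read and parsed (the
bootstrap servers, topic name, and JSON payload format below are
placeholders, not the real setup):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, TimestampType)

spark = SparkSession.builder.appName("login-counts").getOrCreate()

# Schema matching the userdetail struct above.
login_schema = StructType([
    StructField("id", IntegerType()),
    StructField("acctid", IntegerType()),
    StructField("uid", StringType()),
    StructField("logintm", TimestampType()),
])

# Illustrative Kafka source; servers and topic are placeholders.
raw = spark.readStream\
           .format("kafka")\
           .option("kafka.bootstrap.servers", "localhost:9092")\
           .option("subscribe", "logins")\
           .load()

# Each Kafka value is assumed to be one JSON document per login event.
uDF2 = raw.select(F.from_json(F.col("value").cast("string"),
                              login_schema).alias("userdetail"))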

I'm trying to count the number of logins by acctid.

I can do the count fine, but the resulting table has only the acctid and
the count. Now I want to get all the other columns back as well.

I have tried to do this:
origdata = uDF2.select("userdetail.*")
# This gives me all the columns.
# +---+-------+---------+--------------------+
# | id| acctid|      uid|             logintm|
# +---+-------+---------+--------------------+
# |014|1075627|curtis.ga|2020-08-24 13:58:...|
# +---+-------+---------+--------------------+

logins = origdata\
               .withWatermark('logintm', '10 minutes')\
               .groupBy("acctid", "logintm")\
               .count()\
               .select('acctid', 'count')
# This gives me the count for each acctid.
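
(As an aside, since the groupBy keys on the raw logintm, each
(acctid, logintm) pair mostly counts once. A windowed variant would
group nearby logins together; the 10-minute tumbling window here is
only an example:

logins_windowed = origdata\
               .withWatermark('logintm', '10 minutes')\
               .groupBy('acctid', F.window('logintm', '10 minutes'))\
               .count()
# Counts logins per acctid per 10-minute window instead of per
# exact timestamp.
)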

Now I thought I could do this:
   alldata = logins.join(origdata, "acctid")

But I get no rows returned. Am I just completely missing the mark here? The
original dataframe has all the columns as expected (including the join
column), and the aggregated dataframe has the join column and the count.

If I can't do this, how else does one get all the other data WITH
the count?
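
One untested idea I've been considering: instead of joining the
aggregate back onto the stream, carry the extra columns through the
aggregation itself, e.g. with collect_list (or first) next to the count:

alldata = origdata\
    .withWatermark('logintm', '10 minutes')\
    .groupBy('acctid')\
    .agg(F.count('*').alias('count'),
         F.collect_list('uid').alias('uids'),
         F.collect_list('logintm').alias('logintms'))
# Keeps the per-account count plus the other columns gathered into
# lists, avoiding the stream-to-stream join entirely.

Is that the right direction?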

Thanks for your help in advance.

Hamish