Posted to user@spark.apache.org by durga <du...@gmail.com> on 2014/07/22 02:41:08 UTC

Joining by timestamp.

Hi

I have a peculiar problem.

I have two data sets (large ones).
Dataset 1:

((timestamp),iterable[Any]) => {
(2014-07-10T00:02:45.045+0000,ArrayBuffer((2014-07-10T00:02:45.045+0000,98.4859,22)))
(2014-07-10T00:07:32.618+0000,ArrayBuffer((2014-07-10T00:07:32.618+0000,75.4737,22)))
}

Dataset 2:
((timestamp),iterable[Any]) =>{
(2014-07-10T00:03:16.952+0000,ArrayBuffer((2014-07-10T00:03:16.952+0000,99.6148,23)))
(2014-07-10T00:08:11.329+0000,ArrayBuffer((2014-07-10T00:08:11.329+0000,80.9017,23)))
}

I need to join them, but the catch is that the timestamps are not the same;
they can differ by approximately 4 minutes either way.

Records within that window need to be joined.

Any idea is very much appreciated.

What I am thinking right now:

Open a file descriptor for the sorted dataset 2.
Read the sorted records of dataset 1.
     For each record, check for any dataset 2 record matching the criteria:
        if it matches, emit (record1, record2);
        if it does not match, continue reading dataset 2 records until one matches.

I know this only works for very small files; that's the reason I need help.
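
Roughly, a single-machine sketch of that loop (just to make the idea concrete; the
Record type and the 240-second window are made-up placeholders, and both inputs are
assumed to be already sorted by timestamp):

import java.time.Instant
import scala.collection.mutable.ArrayBuffer

case class Record(ts: Instant, payload: Seq[Any])

def mergeByWindow(ds1: IndexedSeq[Record],
                  ds2: IndexedSeq[Record],
                  toleranceSeconds: Long = 240): Seq[(Record, Record)] = {
  val out = ArrayBuffer.empty[(Record, Record)]
  var j = 0
  for (r1 <- ds1) {
    // Skip dataset2 records more than the tolerance behind r1; since both sides
    // are sorted, they can never match any later dataset1 record either.
    while (j < ds2.length && ds2(j).ts.isBefore(r1.ts.minusSeconds(toleranceSeconds)))
      j += 1
    // Emit every dataset2 record that falls inside the +/- tolerance window.
    var k = j
    while (k < ds2.length && !ds2(k).ts.isAfter(r1.ts.plusSeconds(toleranceSeconds))) {
      out += ((r1, ds2(k)))
      k += 1
    }
  }
  out.toSeq
}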

Thanks,
D.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-timestamp-tp10367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Joining by timestamp.

Posted by durga <du...@gmail.com>.
Thanks Chen




RE: Joining by timestamp.

Posted by "Cheng, Hao" <ha...@intel.com>.
Durga, you can start from these documents:
  http://spark.apache.org/docs/latest/quick-start.html 
  http://spark.apache.org/docs/latest/programming-guide.html



RE: Joining by timestamp.

Posted by durga <du...@gmail.com>.
Hi Chen,

Thank you very much for your reply. I do not think I understand how I can do
the join using the Spark API. If you have time, could you please write some
code?

Thanks again,
D.




RE: Joining by timestamp.

Posted by "Cheng, Hao" <ha...@intel.com>.
Actually, it's just a pseudo-algorithm I described; you can implement it with the Spark API. I hope the algorithm is helpful.
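
For example, a rough sketch of creating tables and running the query (illustrative
only: it uses the current SparkSession/DataFrame API rather than the API contemporary
with this thread, the Reading case class and sample rows are made up, and it can be
pasted into spark-shell):

import org.apache.spark.sql.SparkSession

case class Reading(ds: java.sql.Timestamp, value: Double, sensor: Int)

val spark = SparkSession.builder().appName("TimestampJoin").getOrCreate()
import spark.implicits._

def ts(s: String) = java.sql.Timestamp.from(java.time.Instant.parse(s))

val dataset1 = Seq(Reading(ts("2014-07-10T00:02:45.045Z"), 98.4859, 22)).toDS()
val dataset2 = Seq(Reading(ts("2014-07-10T00:03:16.952Z"), 99.6148, 23)).toDS()

// Register the datasets as temporary views so SQL can treat them as tables.
dataset1.createOrReplaceTempView("dataset1")
dataset2.createOrReplaceTempView("dataset2")

// Join rows whose timestamps are within 240 seconds (4 minutes) of each other;
// casting a timestamp to bigint yields epoch seconds in Spark SQL.
val joined = spark.sql(
  """SELECT *
    |FROM dataset1 d1 JOIN dataset2 d2
    |  ON abs(cast(d1.ds AS bigint) - cast(d2.ds AS bigint)) < 240""".stripMargin)

joined.show(truncate = false)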


RE: Joining by timestamp.

Posted by durga <du...@gmail.com>.
Hi Chen,
I am new to Spark as well as Spark SQL; could you please explain how I would
create a table and run a query on top of it? That would be super helpful.

Thanks,
D.




RE: Joining by timestamp.

Posted by "Cheng, Hao" <ha...@intel.com>.
This is a very interesting problem. Spark SQL supports non-equi joins, but they are very inefficient on large tables.

One possible solution is to partition both tables with the partition key (cast(ds as bigint) / 240); then, for each partition in dataset1, you can write SQL like "select * from dataset1 inner join dataset2 on abs(cast(dataset1.ds as bigint) - cast(dataset2.ds as bigint)) < 240".

We can assume that, for each dataset1 partition, the matching dataset2 records always lie in the 3 adjacent partitions with keys key - 1, key, and key + 1 (i.e., dataset2 records whose "ds" is at most 240 seconds greater than, or at most 240 seconds less than, dataset1's).

After iterating over every partition in dataset1, you have the full result.
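
A rough RDD-level sketch of that bucketing scheme (illustrative only: the record shape
and helper names are made up, core map/flatMap/join/filter are the only Spark APIs
assumed, and the sample timestamps use a trailing "Z" so java.time.Instant can parse
them):

import org.apache.spark.{SparkConf, SparkContext}
import java.time.Instant

object ApproxJoinSketch {
  val ToleranceSeconds = 240L  // the +/- 4 minute window

  def epochSeconds(ts: String): Long = Instant.parse(ts).getEpochSecond

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ApproxJoinSketch"))

    // (timestamp, payload) pairs standing in for the two real datasets.
    val ds1 = sc.parallelize(Seq(("2014-07-10T00:02:45.045Z", "98.4859,22")))
    val ds2 = sc.parallelize(Seq(("2014-07-10T00:03:16.952Z", "99.6148,23")))

    // Key dataset1 by its bucket: epoch seconds divided by the tolerance.
    val bucketed1 = ds1.map { case (ts, v) => (epochSeconds(ts) / ToleranceSeconds, (ts, v)) }

    // Replicate each dataset2 record into its own bucket and the two neighbours,
    // so any dataset1 record within the tolerance can find it via an equi-join.
    val bucketed2 = ds2.flatMap { case (ts, v) =>
      val k = epochSeconds(ts) / ToleranceSeconds
      Seq(k - 1, k, k + 1).map(b => (b, (ts, v)))
    }

    // Equi-join on the bucket key, then keep only pairs truly inside the window.
    val joined = bucketed1.join(bucketed2).values.filter { case ((t1, _), (t2, _)) =>
      math.abs(epochSeconds(t1) - epochSeconds(t2)) < ToleranceSeconds
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}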

BTW, not sure if you really want the stream sql: https://github.com/thunderain-project/StreamSQL

