Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2016/09/27 07:52:12 UTC
read multiple files
Hi,
The input data files for my Spark job are generated every five minutes, and
the file names follow the epoch-time convention below:
InputFolder/batch-1474959600000
InputFolder/batch-1474959900000
InputFolder/batch-1474960200000
InputFolder/batch-1474960500000
InputFolder/batch-1474960800000
InputFolder/batch-1474961100000
InputFolder/batch-1474961400000
InputFolder/batch-1474961700000
InputFolder/batch-1474962000000
InputFolder/batch-1474962300000
Per the requirement, I need to read one month of data back from the current timestamp.
I would really appreciate it if anybody could help me.
Thanks,
Divya
Re: read multiple files
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Divya,
There are a number of ways you can do this.
First, get today's date in epoch format. These are my package imports:
import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat
// Get epoch time now
scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292
//get thirty days ago in epoch time
scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292
// note the L for Long at the end
// Define a function to convert epoch millis to a date string, to double-check that it is indeed 30 days ago
scala> def timeToStr(epochMillis: Long): String = {
| DateTimeFormat.forPattern("YYYY-MM-dd HH:mm:ss").print(epochMillis)}
timeToStr: (epochMillis: Long)String
scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52
So you need to pick files whose batch-<epoch> suffix is >= thirtydaysago, up to epoch.
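A minimal sketch of that selection, assuming the file names always carry the
epoch-millis suffix shown above (the helper names batchEpoch and selectBatches
are mine, not a standard API):

```scala
// Sketch: keep only batch files whose epoch-millis suffix falls in the window.
object BatchFilter {

  // Extract the epoch suffix from a path like "InputFolder/batch-1474959600000"
  def batchEpoch(path: String): Option[Long] = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    if (name.startsWith("batch-"))
      scala.util.Try(name.stripPrefix("batch-").toLong).toOption
    else None
  }

  // Keep paths whose epoch lies in [fromMillis, toMillis]
  def selectBatches(paths: Seq[String], fromMillis: Long, toMillis: Long): Seq[String] =
    paths.filter(p => batchEpoch(p).exists(e => e >= fromMillis && e <= toMillis))
}
```

You could then list InputFolder (e.g. via Hadoop's FileSystem.listStatus),
filter the paths with selectBatches(paths, thirtydaysago, epoch), and hand the
survivors to sc.textFile as one comma-separated string, since textFile accepts
a comma-separated list of paths.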
Regardless, I think you can do better by partitioning the directories. With a
file created every 5 minutes you will have 288 files generated daily (12 * 24).
Just partition into a sub-directory per day; Flume can do that for you, or you
can do it in a shell script.
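One way to sketch that daily layout in Scala. The InputFolder/yyyy-MM-dd/...
layout is just an assumed convention, not something Spark or Flume mandates,
and the zone is fixed to UTC for illustration:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Sketch of a daily partition layout: <root>/<yyyy-MM-dd>/batch-<epochMillis>
object DailyPartition {
  private val dayFmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Build the partitioned path for a batch created at epochMillis
  def partitionPath(root: String, epochMillis: Long): String =
    s"$root/${dayFmt.format(Instant.ofEpochMilli(epochMillis))}/batch-$epochMillis"
}
```

With that layout, a month of data becomes roughly 30 directory globs instead of
a scan over ~8,640 individual file names.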
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On 27 September 2016 at 15:53, Peter Figliozzi <pe...@gmail.com>
wrote:
> If you're up for a fancy but excellent solution:
>
> - Store your data in Cassandra.
> - Use the expiring data feature (TTL)
> <https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html> so
> data will automatically be removed a month later.
> - Now in your Spark process, just read from the database and you don't
> have to worry about the timestamp.
> - You'll still have all your old files if you need to refer back them.
>
> Pete
>
> On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot <di...@gmail.com>
> wrote:
>
>> Hi,
>> The input data files for my spark job generated at every five minutes
>> file name follows epoch time convention as below :
>>
>> InputFolder/batch-1474959600000
>> InputFolder/batch-1474959900000
>> InputFolder/batch-1474960200000
>> InputFolder/batch-1474960500000
>> InputFolder/batch-1474960800000
>> InputFolder/batch-1474961100000
>> InputFolder/batch-1474961400000
>> InputFolder/batch-1474961700000
>> InputFolder/batch-1474962000000
>> InputFolder/batch-1474962300000
>>
>> As per requirement I need to read one month of data from current
>> timestamp.
>>
>> Would really appreciate if anybody could help me .
>>
>> Thanks,
>> Divya
>>
>
>
Re: read multiple files
Posted by Peter Figliozzi <pe...@gmail.com>.
If you're up for a fancy but excellent solution:
- Store your data in Cassandra.
- Use the expiring data feature (TTL)
<https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html> so
data will automatically be removed a month later.
- Now in your Spark process, just read from the database and you don't
have to worry about the timestamp.
- You'll still have all your old files if you need to refer back to them.
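The TTL can also be set once at table-creation time rather than per insert; a
hedged CQL sketch, where the keyspace, table, and column names are made up for
illustration:

```sql
-- Hypothetical schema: rows expire 30 days after insertion.
-- 2592000 seconds = 30 * 24 * 60 * 60.
CREATE TABLE sensor_data.batches (
    batch_ts  timestamp,
    id        uuid,
    payload   text,
    PRIMARY KEY (batch_ts, id)
) WITH default_time_to_live = 2592000;
```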
Pete
On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot <di...@gmail.com>
wrote:
> Hi,
> The input data files for my spark job generated at every five minutes file
> name follows epoch time convention as below :
>
> InputFolder/batch-1474959600000
> InputFolder/batch-1474959900000
> InputFolder/batch-1474960200000
> InputFolder/batch-1474960500000
> InputFolder/batch-1474960800000
> InputFolder/batch-1474961100000
> InputFolder/batch-1474961400000
> InputFolder/batch-1474961700000
> InputFolder/batch-1474962000000
> InputFolder/batch-1474962300000
>
> As per requirement I need to read one month of data from current timestamp.
>
> Would really appreciate if anybody could help me .
>
> Thanks,
> Divya
>