Posted to user@spark.apache.org by Divya Gehlot <di...@gmail.com> on 2016/09/27 07:52:12 UTC

read multiple files

Hi,
The input data files for my Spark job are generated every five minutes, and
the file names follow the epoch time convention, as below:

InputFolder/batch-1474959600000
InputFolder/batch-1474959900000
InputFolder/batch-1474960200000
InputFolder/batch-1474960500000
InputFolder/batch-1474960800000
InputFolder/batch-1474961100000
InputFolder/batch-1474961400000
InputFolder/batch-1474961700000
InputFolder/batch-1474962000000
InputFolder/batch-1474962300000

As per the requirement, I need to read one month of data back from the current timestamp.

I would really appreciate it if anybody could help me.

Thanks,
Divya

Re: read multiple files

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Divya,

There are a number of ways you can do this.

Get today's date in epoch format. These are my package imports:

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now

scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292

// get thirty days ago in epoch time

scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292

// note the trailing L: it forces Long arithmetic (30 * 24 * 60 * 60 * 1000 overflows Int)

// Define a function to convert epoch millis to a date string, to double-check
// that it is indeed 30 days ago

scala> def timeToStr(epochMillis: Long): String = {
     | DateTimeFormat.forPattern("YYYY-MM-dd HH:mm:ss").print(epochMillis)}
timeToStr: (epochMillis: Long)String


scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52

So you need to pick the files whose embedded epoch is >= thirtydaysago and <= epoch, as in the sketch below.
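
Putting that together, a minimal sketch, assuming the files sit on HDFS (or
another Hadoop-compatible filesystem) and that every file name really is
batch-<epochMillis>:

import org.apache.hadoop.fs.{FileSystem, Path}

val epoch = System.currentTimeMillis
val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)

val fs = FileSystem.get(sc.hadoopConfiguration)

// keep only the files whose embedded epoch falls in [thirtydaysago, epoch]
val paths = fs.listStatus(new Path("InputFolder"))
  .map(_.getPath)
  .filter(_.getName.startsWith("batch-"))
  .filter { p =>
    val ts = p.getName.stripPrefix("batch-").toLong
    ts >= thirtydaysago && ts <= epoch
  }
  .map(_.toString)

// sc.textFile accepts a comma-separated list of paths
val rdd = sc.textFile(paths.mkString(","))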

Regardless, I think you can do better by partitioning the directories. With
a file created every 5 minutes you will have 288 files generated daily
(12 * 24). Just partition into daily sub-directories; Flume can do that for
you, or you can do it in a shell script. Reading then looks like the sketch
below.
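
With that layout the read collapses to a handful of directory paths. A
sketch, assuming a (hypothetical) layout of InputFolder/yyyy-MM-dd/batch-<epochMillis>:

import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
val today = new DateTime()

// one sub-directory per day for the last 30 days
val dailyDirs = (0 until 30).map(d => s"InputFolder/${fmt.print(today.minusDays(d))}")

// Spark reads every file under each directory; directories that do not
// exist yet would need to be filtered out first (e.g. with fs.exists)
val rdd = sc.textFile(dailyDirs.mkString(","))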

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: read multiple files

Posted by Peter Figliozzi <pe...@gmail.com>.
If you're up for a fancy but excellent solution:

   - Store your data in Cassandra.
   - Use the expiring data feature (TTL)
   <https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html>
   so data will automatically be removed a month later (see the sketch after
   this list).
   - Now in your Spark process, just read from the database and you don't
   have to worry about the timestamp.
   - You'll still have all your old files if you need to refer back to them.
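
A minimal sketch of the TTL idea with the DataStax spark-cassandra-connector;
the keyspace, table, and row schema here are made up for illustration:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

case class Event(id: String, ts: Long, payload: String)

// write each five-minute batch with a 30-day TTL; Cassandra expires the rows itself
// (events: RDD[Event] stands in for whatever the job already produces)
events.saveToCassandra("mykeyspace", "events",
  SomeColumns("id", "ts", "payload"),
  writeConf = WriteConf(ttl = TTLOption.constant(30 * 24 * 60 * 60)))

// reading back needs no timestamp filtering: expired rows are simply gone
val lastMonth = sc.cassandraTable[Event]("mykeyspace", "events")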

Pete
