Posted to user@spark.apache.org by Masf <ma...@gmail.com> on 2015/08/19 19:16:17 UTC

SQLContext load. Filtering files

Hi.

I'd like to read Avro files using this library
https://github.com/databricks/spark-avro

I need to load only some of the files in a folder, not all of them. Is there
some functionality to filter which files are loaded?

And... is it possible to know the names of the files loaded from a folder?

My problem is that I have a folder where an external process is inserting
files every X minutes. I need to process each file exactly once, and I can't
move, rename or copy the source files.
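
For reference, here is a minimal sketch of how the folder is loaded today
(the paths and the glob pattern are illustrative; I assume a Hadoop glob
passed to load() is one way to narrow the files):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load every Avro file in the folder
val all = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/data/incoming")

// A Hadoop glob pattern in the path should restrict which files are read,
// e.g. only the files for one day (the pattern below is illustrative)
val some = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/data/incoming/events_2015-08-19_*.avro")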


Thanks
-- 

Regards
Miguel Ángel

Re: SQLContext load. Filtering files

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
If you have enabled checkpointing, Spark will handle that for you.
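
For example, the usual pattern looks something like this (a rough sketch;
the checkpoint directory, app name and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/avro-ingest"  // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("AvroIngest")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)  // enable metadata checkpointing
  // define the fileStream and output operations here
  ssc
}

// Recover the context from the checkpoint if one exists,
// otherwise build a fresh one
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()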

Thanks
Best Regards

On Thu, Aug 27, 2015 at 4:21 PM, Masf <ma...@gmail.com> wrote:

> Thanks Akhil, I will have a look.
>
> I have a question about Spark Streaming and fileStream. If Spark
> Streaming crashes and new files land in the input folder while Spark is
> down, how can I process those files when Spark Streaming is launched
> again?
>
> Thanks.
> Regards.
> Miguel.
>
>
>
> On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Have a look at Spark Streaming. You can make use of
>> ssc.fileStream.
>>
>> Eg:
>>
>> val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
>>       AvroKeyInputFormat[GenericRecord]](input)
>>
>> You can also specify a filter function
>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext>
>> as the second argument.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Aug 19, 2015 at 10:46 PM, Masf <ma...@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> I'd like to read Avro files using this library
>>> https://github.com/databricks/spark-avro
>>>
>>> I need to load only some of the files in a folder, not all of them. Is
>>> there some functionality to filter which files are loaded?
>>>
>>> And... is it possible to know the names of the files loaded from a folder?
>>>
>>> My problem is that I have a folder where an external process is inserting
>>> files every X minutes. I need to process each file exactly once, and I
>>> can't move, rename or copy the source files.
>>>
>>>
>>> Thanks
>>> --
>>>
>>> Regards
>>> Miguel Ángel
>>>
>>
>>
>
>
> --
>
>
> Regards.
> Miguel Ángel
>

Re: SQLContext load. Filtering files

Posted by Masf <ma...@gmail.com>.
Thanks Akhil, I will have a look.

I have a question about Spark Streaming and fileStream. If Spark
Streaming crashes and new files land in the input folder while Spark is
down, how can I process those files when Spark Streaming is launched
again?

Thanks.
Regards.
Miguel.



On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Have a look at Spark Streaming. You can make use of ssc.fileStream.
>
> Eg:
>
> val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
>       AvroKeyInputFormat[GenericRecord]](input)
>
> You can also specify a filter function
> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext>
> as the second argument.
>
> Thanks
> Best Regards
>
> On Wed, Aug 19, 2015 at 10:46 PM, Masf <ma...@gmail.com> wrote:
>
>> Hi.
>>
>> I'd like to read Avro files using this library
>> https://github.com/databricks/spark-avro
>>
>> I need to load only some of the files in a folder, not all of them. Is
>> there some functionality to filter which files are loaded?
>>
>> And... is it possible to know the names of the files loaded from a folder?
>>
>> My problem is that I have a folder where an external process is inserting
>> files every X minutes. I need to process each file exactly once, and I
>> can't move, rename or copy the source files.
>>
>>
>> Thanks
>> --
>>
>> Regards
>> Miguel Ángel
>>
>
>


-- 


Regards.
Miguel Ángel

Re: SQLContext load. Filtering files

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Have a look at Spark Streaming. You can make use of ssc.fileStream.

Eg:

val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]](input)

You can also specify a filter function
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext>
as the second argument.
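
For example (a rough sketch; the onlyAvro name and the .avro suffix check
are illustrative):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable

// Keep only completed .avro files, skipping any temporary files
// the producer may still be writing
val onlyAvro = (path: Path) => path.getName.endsWith(".avro")

val filteredStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]](input, onlyAvro, newFilesOnly = true)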

Thanks
Best Regards

On Wed, Aug 19, 2015 at 10:46 PM, Masf <ma...@gmail.com> wrote:

> Hi.
>
> I'd like to read Avro files using this library
> https://github.com/databricks/spark-avro
>
> I need to load only some of the files in a folder, not all of them. Is
> there some functionality to filter which files are loaded?
>
> And... is it possible to know the names of the files loaded from a folder?
>
> My problem is that I have a folder where an external process is inserting
> files every X minutes. I need to process each file exactly once, and I
> can't move, rename or copy the source files.
>
>
> Thanks
> --
>
> Regards
> Miguel Ángel
>