Posted to user@spark.apache.org by Tathagata Das <ta...@gmail.com> on 2014/02/05 10:05:25 UTC

Re: Streaming files as a whole

This is a very late reply for this thread. If you are trying to read xml
files from a directory and put it into a stream, there are two ways that
may work.

1. Something like this  -  streamingContext.fileStream[LongWritable, Text, XMLInputFormat](<directory>)
The XMLInputFormat class
<https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>
is what Woody suggested. If this InputFormat works correctly, then any new
XML files created in the <directory> should get read as an RDD in a
DStream. However, there is no guarantee that it will read one file at a
time. If two files are generated within the same batch interval, then both
will be read together in the same batch.
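
A rough, untested sketch of option 1 in Scala. It assumes the Mahout
XmlInputFormat class linked above is on the classpath and that it reads
its start/end tags from the "xmlinput.start"/"xmlinput.end" keys; the
directory path, batch interval, and <record> tag names below are
placeholders to replace with your own.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.classifier.bayes.XmlInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("XmlFileStream")
val ssc = new StreamingContext(sparkConf, Seconds(30))

// XmlInputFormat emits one record per block delimited by these tags
// (placeholder tag names, set via the keys the Mahout class expects).
val hadoopConf = ssc.sparkContext.hadoopConfiguration
hadoopConf.set("xmlinput.start", "<record>")
hadoopConf.set("xmlinput.end", "</record>")

// Each batch's RDD holds the XML fragments of files created in the
// monitored directory during that batch interval (possibly several files).
val xmlStream = ssc.fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///path/to/xml-dir")
xmlStream.map { case (_, fragment) => fragment.toString }.print()

ssc.start()
ssc.awaitTermination()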

2. If you want to manually control how the RDDs are fed, then take a look
at streamingContext.queueStream. This allows you to create RDDs manually
and push them into a queue. Spark Streaming will pull those RDDs and treat
them as a stream.
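
A similar untested sketch of option 2 using queueStream. The file paths,
batch interval, and the use of wholeTextFiles to build one RDD per file
are illustrative only.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("QueueStreamXml")
val ssc = new StreamingContext(sparkConf, Seconds(30))

// Spark Streaming pulls RDDs off this queue, one per batch interval when
// oneAtATime = true, so each file becomes its own batch.
val rddQueue = new mutable.Queue[RDD[String]]()
val fileStream = ssc.queueStream(rddQueue, oneAtATime = true)
fileStream.print()

ssc.start()

// Build one RDD per file and push it onto the queue. wholeTextFiles
// returns (path, content) pairs; here we keep just the file content.
for (path <- Seq("hdfs:///data/file1.xml", "hdfs:///data/file2.xml")) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.wholeTextFiles(path).map(_._2)
  }
}

ssc.awaitTermination()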

Hope this helps. Apologies for the late response.


On Thu, Jan 30, 2014 at 5:55 AM, Mayur Rustagi <ma...@gmail.com> wrote:

> Hi,
> I am using Spark Streaming for this; in Streaming I am trying to open the
> file as a text file and create a DStream.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Thu, Jan 30, 2014 at 7:17 PM, Woody Christy <wc...@cloudera.com> wrote:
>
>> Take a look at the Mahout XmlInputFormat class. That should get you
>> started.
>>
>>
>> On Thu, Jan 30, 2014 at 5:08 AM, Mayur Rustagi <ma...@gmail.com> wrote:
>>
>>> I am trying to load XML in streaming, convert it to CSV, and store it.
>>> When I use textFile it splits the file on "\n" and hence breaks the
>>> parser. Is it possible to receive the data one file at a time from the
>>> HDFS folder?
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> http://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>
>>
>>
>> --
>>
>> Woody Christy
>> Solutions Architect | Partner Engineering | Cloudera Inc
>> @woodychristy
>>
>>
>>
>>
>