Posted to user@spark.apache.org by Oleg Proudnikov <ol...@gmail.com> on 2014/06/01 14:37:42 UTC

sc.textFileGroupByPath("*/*.txt")

Hi All,

Is it possible to create an RDD from a directory tree of the following form?

RDD[(PATH, Seq[TEXT])]

Thank you,
Oleg

Re: sc.textFileGroupByPath("*/*.txt")

Posted by Oleg Proudnikov <ol...@gmail.com>.
Nicholas,

The new wholeTextFiles() in 1.0 gets me exactly what I need. It would be
great to have this functionality for an arbitrary directory tree.
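
A minimal sketch of that grouping (my assumed layout: batch directories
under one root, matched by a glob, and assuming wholeTextFiles accepts the
same glob patterns as textFile; it yields (path, content) pairs, which can
be keyed by their parent directory):

val byBatch = sc.wholeTextFiles("root/*/*.txt")
  .map { case (path, text) => (path.substring(0, path.lastIndexOf('/')), text) }
  .groupByKey()
  .mapValues(_.toSeq)    // roughly RDD[(PATH, Seq[TEXT])]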

Thank you,
Oleg



Re: sc.textFileGroupByPath("*/*.txt")

Posted by Nicholas Chammas <ni...@gmail.com>.
sc.wholeTextFiles()
<http://spark.apache.org/docs/latest/api/python/pyspark.context.SparkContext-class.html#wholeTextFiles>
will get you close. Alternatively, you could write a loop with plain
sc.textFile() that loads all the files under each batch into a separate RDD.
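
A hedged sketch of that loop, assuming the batch names are known or listed
beforehand; every record is tagged with its batch name:

val batches = Seq("batch-1", "batch-2", "batch-3")   // hypothetical list
val perBatch = batches.map { b =>
  sc.textFile(s"root/$b/*.txt").map(line => (b, line))
}
val all = sc.union(perBatch)   // RDD[(batchName, line)]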


On Sun, Jun 1, 2014 at 4:40 PM, Oleg Proudnikov <ol...@gmail.com>
wrote:

> I have a large number of directories under a common root:
>
> batch-1/file1.txt
> batch-1/file2.txt
> batch-1/file3.txt
> ...
> batch-2/file1.txt
> batch-2/file2.txt
> batch-2/file3.txt
> ...
> batch-N/file1.txt
> batch-N/file2.txt
> batch-N/file3.txt
> ...
>
> I would like to read them into an RDD like
>
> {
> "batch-1" : [ content1, content2, content3,...]
> "batch-2" : [ content1, content2, content3,...]
> ...
> "batch-N" : [ content1, content2, content3,...]
> }
>
> Thank you,
> Oleg
>
>
>
> On 1 June 2014 17:00, Nicholas Chammas <ni...@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma-delimited list of paths.
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov <ol...@gmail.com> wrote:
>>
>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>>>
>
>
> --
> Kind regards,
>
> Oleg
>
>

Re: sc.textFileGroupByPath("*/*.txt")

Posted by Oleg Proudnikov <ol...@gmail.com>.
I have a large number of directories under a common root:

batch-1/file1.txt
batch-1/file2.txt
batch-1/file3.txt
...
batch-2/file1.txt
batch-2/file2.txt
batch-2/file3.txt
...
batch-N/file1.txt
batch-N/file2.txt
batch-N/file3.txt
...

I would like to read them into an RDD like

{
"batch-1" : [ content1, content2, content3,...]
"batch-2" : [ content1, content2, content3,...]
...
"batch-N" : [ content1, content2, content3,...]
}

Thank you,
Oleg



On 1 June 2014 17:00, Nicholas Chammas <ni...@gmail.com> wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma-delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov <ol...@gmail.com> wrote:
>
> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
>>
>>


-- 
Kind regards,

Oleg

Re: sc.textFileGroupByPath("*/*.txt")

Posted by Oleg Proudnikov <ol...@gmail.com>.
Anwar,

I will try this, as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.

I am now thinking that I could start with an RDD of file paths and map it
into (path, content) pairs, provided each worker can read the files.
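
A rough sketch of that idea (assuming every worker can reach the
filesystem; a Hadoop Configuration is built on the workers because the
driver's is not serializable):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical path list; in practice collected on the driver first
val paths = Seq("hdfs:///root/batch-1/file1.txt", "hdfs:///root/batch-1/file2.txt")
val contents = sc.parallelize(paths).map { p =>
  val fs = FileSystem.get(new URI(p), new Configuration())
  val in = fs.open(new Path(p))
  try { (p, scala.io.Source.fromInputStream(in).mkString) }
  finally { in.close() }
}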

Thank you,
Oleg



On 1 June 2014 18:41, Anwar Rizal <an...@gmail.com> wrote:

> I presume that you need to have access to the path of each file you are
> reading.
>
> I don't know whether there is a good way to do that for HDFS; I need to
> read the files myself, something like:
>
> def openWithPath(inputPath: String, sc: SparkContext) = {
>   val path    = new Path(inputPath)
>   val fs      = path.getFileSystem(sc.hadoopConfiguration)
>   val filesIt = fs.listFiles(path, false)
>   val paths   = new ListBuffer[URI]
>   while (filesIt.hasNext) {
>     paths += filesIt.next.getPath.toUri
>   }
>   val withPaths = paths.toList.map { p =>
>     sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
>       .map { case (_, s) => (p, s.toString) }
>   }
>   withPaths.reduce(_ ++ _)
> }
> ...
>
> I would be interested if there is a better way to do the same thing ...
>
> Cheers,
> a:
>
>
> On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma-delimited list of paths.
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov <ol...@gmail.com> wrote:
>>
>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>>>
>


-- 
Kind regards,

Oleg

Re: sc.textFileGroupByPath("*/*.txt")

Posted by Anwar Rizal <an...@gmail.com>.
I presume that you need to have access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS; I need to
read the files myself, something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

def openWithPath(inputPath: String, sc: SparkContext) = {
  val path    = new Path(inputPath)
  val fs      = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)       // non-recursive listing
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One RDD per file, every record tagged with the file's URI
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce(_ ++ _)
}
...
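
Hypothetical usage, keying every line under one directory by its file URI:

val lines = openWithPath("hdfs:///root/batch-1", sc)   // RDD[(URI, String)]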

I would be interested if there is a better way to do the same thing ...

Cheers,
a:


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas <nicholas.chammas@gmail.com
> wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma-delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov <ol...@gmail.com> wrote:
>
> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
>>
>>

Re: sc.textFileGroupByPath("*/*.txt")

Posted by Nicholas Chammas <ni...@gmail.com>.
Could you provide an example of what you mean?

I know it's possible to create an RDD from a path with wildcards, like in
the subject.

For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
provide a comma-delimited list of paths.
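
For instance (hypothetical dates under the same bucket), globs and a
comma-delimited list can be combined in a single call:

val rdd = sc.textFile("s3n://bucket/2014-06-01/*.gz,s3n://bucket/2014-06-02/*.gz")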

Nick

On Sunday, June 1, 2014, Oleg Proudnikov <ol...@gmail.com> wrote:

> Hi All,
>
> Is it possible to create an RDD from a directory tree of the following
> form?
>
> RDD[(PATH, Seq[TEXT])]
>
> Thank you,
> Oleg
>
>