Posted to user@spark.apache.org by Fengyun RAO <ra...@gmail.com> on 2014/07/30 06:02:51 UTC

Is it possible to read file head in each partition?

Hi, all

We are migrating from MapReduce to Spark and have encountered a problem.

Our input files are IIS logs, each with a file header. It's easy to get the
header if we process only one file, e.g.

val lines = sc.textFile("hdfs://*/u_ex14073011.log")
val head = lines.take(4)

Then we can write our map method using this header.

However, if we input multiple files, each of which may have a different
header, how can we get the right header for each partition?

It seems we have two options:

1. Still use textFile() to get the lines.

Since each partition may have a different header, we would have to write a
mapPartitionsWithContext method. However, we can't find a way to get the
header for each partition.

In our former MapReduce program, we could simply use

Path path = ((FileSplit) context.getInputSplit()).getPath();

but there seems to be no way to do this in Spark, since HadoopPartition,
which wraps the InputSplit inside HadoopRDD, is a private class (a possible
workaround is sketched after option 2 below).

2. Use wholeTextFiles() to get whole file contents.

It's easy to get the header for each file, but according to the
documentation, this API is better suited to small files.
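
A possible workaround for option 1 might be to drop down to hadoopFile(),
which keeps a handle on the underlying HadoopRDD; newer Spark releases expose
a developer API, mapPartitionsWithInputSplit, that hands each partition its
InputSplit, much like the MapReduce snippet above. A rough sketch only (the
glob and variable names are examples, and the API may not exist in older
versions):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// hadoopFile() builds a HadoopRDD under the hood; the developer API
// mapPartitionsWithInputSplit exposes each partition's InputSplit, from
// which the file path can be recovered as in the MapReduce code above.
val raw = sc.hadoopFile("hdfs://*/u_ex*.log",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val hadoopRdd = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]

val linesWithPath = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
  val path = split.asInstanceOf[FileSplit].getPath.toString
  // Hadoop reuses Text objects, so copy each value out with toString.
  iter.map { case (_, text) => (path, text.toString) }
}

With the path in hand, the four header lines of each file could be read once
on the driver (take(4) per file) and broadcast as a map from path to header.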


Any suggestions on how to process these files with headers?
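
Another, version-independent possibility is to enumerate the files on the
driver, read each file's header with take(4), and union the per-file RDDs.
This is only a sketch; the glob below is a made-up example, and pairing each
line with its header stands in for the real parsing logic:

import org.apache.hadoop.fs.{FileSystem, Path}

// Enumerate the log files on the driver, grab each file's 4-line header,
// and tag every data line with the header of the file it came from.
val fs = FileSystem.get(sc.hadoopConfiguration)
val logFiles = fs.globStatus(new Path("/logs/*/u_ex*.log")).map(_.getPath.toString)

val perFile = logFiles.map { file =>
  val lines  = sc.textFile(file)
  val header = lines.take(4)            // #Software, #Version, #Date, #Fields
  lines.filter(!_.startsWith("#"))      // drop the header lines themselves
       .map(line => (header, line))     // real code would parse the line here
}
val tagged = sc.union(perFile.toSeq)

The take(4) calls launch one small job per file, which should be cheap for a
moderate number of files, and the union preserves the normal splitting of each
large file into partitions.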

Re: Is it possible to read file head in each partition?

Posted by Fengyun RAO <ra...@gmail.com>.
Of course we can filter them out. A typical file header looks like this:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken

Simply filtering all the headers out doesn't help: for a given partition, we
don't know which header to use, because we don't know which file the
partition belongs to.
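
For what it's worth, once the right header is known, using it is
straightforward; a small sketch based on the sample header above (plain
Scala, names made up):

// Build a field-name -> column-index map from the "#Fields:" line.
val header = Seq(
  "#Software: Microsoft Internet Information Services 7.5",
  "#Version: 1.0",
  "#Date: 2013-07-04 20:00:00",
  "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken"
)

val fieldIndex: Map[String, Int] = header
  .find(_.startsWith("#Fields:"))
  .map(_.stripPrefix("#Fields:").trim.split("\\s+").zipWithIndex.toMap)
  .getOrElse(Map.empty)

// A data line split on spaces can then be addressed by field name, e.g.
// cols(fieldIndex("cs-uri-stem")) for the requested URI.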


2014-07-30 14:47 GMT+08:00 Cheng Lian <li...@gmail.com>:

> What's the format of the file header? Is it possible to filter them out by
> prefix string matching or regex?

Re: Is it possible to read file head in each partition?

Posted by Cheng Lian <li...@gmail.com>.
What's the format of the file headers? Is it possible to filter them out
by prefix string matching or a regex?
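
(If the headers only needed to be dropped, that would indeed be a one-line
filter, assuming lines is the RDD from textFile:)

// Every IIS header line starts with "#", so a prefix match is enough.
val dataLines = lines.filter(line => !line.startsWith("#"))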


On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO <ra...@gmail.com> wrote:

> It will certainly cause bad performance, since it reads the whole content
> of a large file into one value, instead of splitting it into partitions.
>
> Typically one file is 1 GB. Suppose we have 3 large files, in this way,
> there would only be 3 key-value pairs, and thus 3 tasks at most.

Re: Is it possible to read file head in each partition?

Posted by Fengyun RAO <ra...@gmail.com>.
It will certainly cause bad performance, since it reads the whole content of
each large file into a single value instead of splitting it into partitions.

Typically one file is about 1 GB. Suppose we have 3 large files: this way
there would be only 3 key-value pairs, and thus at most 3 tasks.


2014-07-30 12:49 GMT+08:00 Hossein <fa...@gmail.com>:

> You can use SparkContext.wholeTextFile().
>
> Please note that the documentation suggests: "Small files are preferred,
> large file is also allowable, but may cause bad performance."
>
> --Hossein

Re: Is it possible to read file head in each partition?

Posted by Hossein <fa...@gmail.com>.
You can use SparkContext.wholeTextFiles().

Please note that the documentation suggests: "Small files are preferred,
large file is also allowable, but may cause bad performance."
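
For reference, a rough sketch of that route (the glob is only an example
path): wholeTextFiles() yields (filePath, fileContent) pairs, so each file's
header is simply the first lines of its own content.

// Each element is (path, full file content); split the content and peel
// off the header, then tag the remaining lines with it.
val files = sc.wholeTextFiles("/logs/*/u_ex*.log")
val tagged = files.flatMap { case (path, content) =>
  val lines  = content.split("\r?\n")
  val header = lines.take(4)
  lines.filter(!_.startsWith("#")).map(line => (header, line))
}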

--Hossein


On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> This is an interesting question. I’m curious to know as well how this
> problem can be approached.
>
> Is there a way, perhaps, to ensure that each input file matching the glob
> expression gets mapped to exactly one partition? Then you could probably
> get what you want using RDD.mapPartitions().
>
> Nick

Re: Is it possible to read file head in each partition?

Posted by Nicholas Chammas <ni...@gmail.com>.
This is an interesting question. I’m curious to know as well how this
problem can be approached.

Is there a way, perhaps, to ensure that each input file matching the glob
expression gets mapped to exactly one partition? Then you could probably
get what you want using RDD.mapPartitions().
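
If that one-file-per-partition property could be guaranteed (for example with
an unsplittable input format, or files no larger than a single block), the
mapPartitions step might look roughly like the sketch below, where lines is
the RDD from textFile:

// Peel the leading "#" header lines off each partition's iterator, then tag
// the remaining lines with that header (a real job would parse them instead).
val parsed = lines.mapPartitions { iter =>
  val buf = iter.buffered
  val header = scala.collection.mutable.ArrayBuffer[String]()
  while (buf.hasNext && buf.head.startsWith("#")) header += buf.next()
  val h = header.toArray
  buf.map(line => (h, line))
}

The buffered iterator lets the header be peeled off the front of the
partition without losing the first data line.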

Nick