Posted to user@spark.apache.org by Bahubali Jain <ba...@gmail.com> on 2016/07/12 12:54:28 UTC

Large files with wholetextfile()

Hi,
We have a requirement wherein we need to process a set of xml files, each of
which contains several records (e.g.:
<RECORD>
     data of record 1......
</RECORD>

<RECORD>
    data of record 2......
</RECORD>

Expected output is   <filename and individual records>

Since we needed the file name in the output as well, we chose
wholeTextFiles(). We decided against using StreamXmlRecordReader and
StreamInputFormat since I could not find a way to retrieve the filename.
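
For reference, here is roughly what we have in mind, as a rough Scala
sketch (the input path and the record-splitting regex are just
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("xml-records"))

    // wholeTextFiles() returns an RDD of (filePath, fileContent) pairs,
    // with the full content of each file held as a single String.
    val files = sc.wholeTextFiles("s3://bucket/path/to/xml/")

    // Split each file's content into individual <RECORD>...</RECORD>
    // blocks, keeping the source file name with every record.
    val records = files.flatMap { case (fileName, content) =>
      "(?s)<RECORD>.*?</RECORD>".r.findAllIn(content).map(r => (fileName, r))
    }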

These xml files could be pretty big; occasionally they could reach a size of
1GB. Since the contents of each file would be put into a single partition,
would such big files be an issue?
The AWS cluster (50 nodes) that we use is fairly strong, with each machine
having around 60GB of memory.

Thanks,
Baahu

Re: Large files with wholetextfile()

Posted by Hyukjin Kwon <gu...@gmail.com>.
Otherwise, please consider using https://github.com/databricks/spark-xml.

Actually, there is a function to find the input file name: the
input_file_name function,
https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L948

This is available from 1.6.0.
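
For example, roughly like this (an untested sketch, assuming Spark 1.6+
with the spark-xml package on the classpath; the input path and app name
are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.input_file_name

    val sc = new SparkContext(new SparkConf().setAppName("xml-with-filename"))
    val sqlContext = new SQLContext(sc)

    // Parse each <RECORD> element into a row with spark-xml, then attach
    // the source file of every row using input_file_name().
    val records = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "RECORD")
      .load("s3://bucket/path/to/xml/")
      .withColumn("filename", input_file_name())

Whether input_file_name() gets populated can depend on the underlying data
source, so please test this on your version first.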

Please refer to https://github.com/apache/spark/pull/13806 and
https://github.com/apache/spark/pull/13759.




2016-07-12 22:04 GMT+09:00 Prashant Sharma <sc...@gmail.com>:

> Hi Baahu,
>
> That should not be a problem, given you allocate sufficient buffer for
> reading.
>
> I was just working on implementing a patch [1] to support reading
> wholeTextFiles in SQL. This can actually be a slightly better approach,
> because here we read into off-heap memory to hold the data (using the
> unsafe interface).
>
> 1. https://github.com/apache/spark/pull/14151
>
> Thanks,
>
>
>
> --Prashant
>
>
> On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <ba...@gmail.com> wrote:
>
>> Hi,
>> We have a requirement wherein we need to process a set of xml files, each
>> of which contains several records (e.g.:
>> <RECORD>
>>      data of record 1......
>> </RECORD>
>>
>> <RECORD>
>>     data of record 2......
>> </RECORD>
>>
>> Expected output is   <filename and individual records>
>>
>> Since we needed the file name in the output as well, we chose
>> wholeTextFiles(). We decided against using StreamXmlRecordReader and
>> StreamInputFormat since I could not find a way to retrieve the filename.
>>
>> These xml files could be pretty big; occasionally they could reach a size
>> of 1GB. Since the contents of each file would be put into a single
>> partition, would such big files be an issue?
>> The AWS cluster (50 nodes) that we use is fairly strong, with each
>> machine having around 60GB of memory.
>>
>> Thanks,
>> Baahu
>>
>
>

Re: Large files with wholetextfile()

Posted by Prashant Sharma <sc...@gmail.com>.
Hi Baahu,

That should not be a problem, given you allocate sufficient buffer for
reading.
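
For instance (the numbers below are purely illustrative), the main thing is
to give each executor enough memory to hold a whole ~1GB file as a single
in-memory record, with headroom for whatever processing follows:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative sizing only: each file arrives as one large String in
    // executor memory, so leave generous headroom beyond the largest
    // expected file.
    val conf = new SparkConf()
      .setAppName("xml-records")
      .set("spark.executor.memory", "16g")
    val sc = new SparkContext(conf)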

I was just working on implementing a patch [1] to support reading
wholeTextFiles in SQL. This can actually be a slightly better approach,
because here we read into off-heap memory to hold the data (using the
unsafe interface).

1. https://github.com/apache/spark/pull/14151

Thanks,



--Prashant


On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <ba...@gmail.com> wrote:

> Hi,
> We have a requirement wherein we need to process a set of xml files, each
> of which contains several records (e.g.:
> <RECORD>
>      data of record 1......
> </RECORD>
>
> <RECORD>
>     data of record 2......
> </RECORD>
>
> Expected output is   <filename and individual records>
>
> Since we needed the file name in the output as well, we chose
> wholeTextFiles(). We decided against using StreamXmlRecordReader and
> StreamInputFormat since I could not find a way to retrieve the filename.
>
> These xml files could be pretty big; occasionally they could reach a size
> of 1GB. Since the contents of each file would be put into a single
> partition, would such big files be an issue?
> The AWS cluster (50 nodes) that we use is fairly strong, with each machine
> having around 60GB of memory.
>
> Thanks,
> Baahu
>