Posted to user@spark.apache.org by Dennis Hunziker <de...@gmail.com> on 2016/06/01 20:51:05 UTC

Re: Spark input size when filtering on parquet files

Thanks, that makes sense. What I wonder, though, is this: with parquet
metadata caching, Spark should be able to execute queries much faster over a
large number of small .parquet files than over a small number of large ones,
at least as long as the min/max indexing is effective (i.e. the data is
grouped/ordered). However, I'm not seeing this in my tests because of the
overhead of creating many tasks for all these small files, most of which end
up doing nothing at all. Is it possible to prevent that? I assume that would
only work if the driver were able to inspect the cached metadata and avoid
creating tasks for files that aren't needed in the first place.
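
Roughly what I have in mind, as a sketch against the parquet-hadoop footer
API (the column name "col1", the Long type and the helper name
candidateFiles are just made up for illustration, not something Spark does
today):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Keep only the .parquet parts whose row-group min/max for "col1" could
// possibly contain the value we filter on (e.g. col1 = 500000).
def candidateFiles(dir: String, value: Long): Seq[Path] = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val parts = fs.listStatus(new Path(dir)).map(_.getPath)
    .filter(_.getName.endsWith(".parquet")).toSeq
  parts.filter { path =>
    val footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
    footer.getBlocks.asScala.exists { block =>
      block.getColumns.asScala.find(_.getPath.toDotString == "col1") match {
        case None => true   // no such column chunk: keep the file, to be safe
        case Some(col) =>
          val stats = col.getStatistics
          if (stats == null || stats.isEmpty) {
            true              // no statistics: cannot prune this file
          } else {
            val min = stats.genericGetMin.asInstanceOf[java.lang.Long].longValue
            val max = stats.genericGetMax.asInstanceOf[java.lang.Long].longValue
            min <= value && value <= max
          }
      }
    }
  }
}

The idea would be to only create tasks for the paths this returns, instead of
one task per part file.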

On 27 May 2016 at 04:25, Takeshi Yamamuro <li...@gmail.com> wrote:

> Hi,
>
> Spark just prints #bytes in the web UI, accumulated from
> InputSplit#getLength (which is just the length of the files).
> Therefore, I'm afraid this metric does not reflect the actual #bytes read
> for parquet.
> To get that metric, you need to use other tools such as iostat.
>
> // maropu
>
>
> On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <
> dennis.hunziker@gmail.com> wrote:
>
>> Hi all
>>
>>
>>
>> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) to understand
>> the improvements made in filtering/scanning parquet files when querying
>> tables with Spark SQL, and how these changes relate to the new filter API
>> introduced in Parquet 1.7.0.
>>
>>
>>
>> After checking the usual sources, I still can’t make sense of some of the
>> numbers shown on the Spark UI. As an example, I’m looking at the collect
>> stage for a query that’s selecting a single row from a table containing 1
>> million numbers using a simple where clause (i.e. col1 = 500000) and this
>> is what I see on the UI:
>>
>>
>>
>> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>>
>> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>>
>> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>>
>> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>>
>>
>>
>> Based on the min/max statistics of each of the parquet parts, it makes
>> sense not to expect any records for 3 out of the 4, because the record I’m
>> looking for can only be in a single file. But why is the input size above
>> shown as 2.4 MB per task, adding up to an overall input size of 9.7 MB for
>> the whole stage? Aren't those tasks just meant to read the metadata and
>> ignore the contents of the files?
>>
>>
>>
>> Regards,
>>
>> Dennis
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
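
If the metric is just the file length, then the 2.4 MB per task presumably
just equals each part's size on disk. A quick check along these lines (a
sketch assuming the standard Hadoop FileSystem API and a made-up table path)
should add up to the 9.7 MB the stage reports:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sum the raw sizes of the .parquet parts; this is essentially what the web
// UI shows, since it just accumulates InputSplit#getLength per task.
val fs = FileSystem.get(new Configuration())
val totalBytes = fs.listStatus(new Path("/path/to/table"))  // hypothetical path
  .filter(_.getPath.getName.endsWith(".parquet"))
  .map(_.getLen)
  .sum
println(f"${totalBytes / 1024.0 / 1024.0}%.1f MB")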

Re: Spark input size when filtering on parquet files

Posted by Takeshi Yamamuro <li...@gmail.com>.
Technically, yes.
I'm not sure there is a Parquet API for easily fetching file statistics
(min, max, ...) though; if there is, it seems we could skip some file splits
in `ParquetFileFormat`.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L273
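
As a rough sketch of what I mean, the filter2 API can at least express the
predicate so that parquet itself can compare it against the row-group
statistics in the footers (assuming a Long column named "col1", as in your
example query; I'm not sure how well this maps onto what the Spark code
above actually does):

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

val conf = new Configuration()
// Predicate equivalent to `col1 = 500000`; parquet can check it against the
// min/max statistics of each row group and drop the groups that cannot match.
val pred = FilterApi.eq(FilterApi.longColumn("col1"),
  java.lang.Long.valueOf(500000L))
ParquetInputFormat.setFilterPredicate(conf, pred)

Whether the splits themselves get dropped on the driver or only the row
groups inside each task depends on where the footers are read, so take this
as an idea rather than something I've verified.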

// maropu

On Thu, Jun 2, 2016 at 5:51 AM, Dennis Hunziker <de...@gmail.com>
wrote:

> Thanks, that makes sense. What I wonder, though, is this: with parquet
> metadata caching, Spark should be able to execute queries much faster over a
> large number of small .parquet files than over a small number of large ones,
> at least as long as the min/max indexing is effective (i.e. the data is
> grouped/ordered). However, I'm not seeing this in my tests because of the
> overhead of creating many tasks for all these small files, most of which end
> up doing nothing at all. Is it possible to prevent that? I assume that would
> only work if the driver were able to inspect the cached metadata and avoid
> creating tasks for files that aren't needed in the first place.
>
>
> On 27 May 2016 at 04:25, Takeshi Yamamuro <li...@gmail.com> wrote:
>
>> Hi,
>>
>> Spark just prints #bytes in the web UI, accumulated from
>> InputSplit#getLength (which is just the length of the files).
>> Therefore, I'm afraid this metric does not reflect the actual #bytes read
>> for parquet.
>> To get that metric, you need to use other tools such as iostat.
>>
>> // maropu
>>
>>
>> On Fri, May 27, 2016 at 5:45 AM, Dennis Hunziker <
>> dennis.hunziker@gmail.com> wrote:
>>
>>> Hi all
>>>
>>>
>>>
>>> I was looking into Spark 1.6.1 (Parquet 1.7.0, Hive 1.2.1) to understand
>>> the improvements made in filtering/scanning parquet files when querying
>>> tables with Spark SQL, and how these changes relate to the new filter API
>>> introduced in Parquet 1.7.0.
>>>
>>>
>>>
>>> After checking the usual sources, I still can’t make sense of some of
>>> the numbers shown on the Spark UI. As an example, I’m looking at the
>>> collect stage for a query that’s selecting a single row from a table
>>> containing 1 million numbers using a simple where clause (i.e. col1 =
>>> 500000) and this is what I see on the UI:
>>>
>>>
>>>
>>> 0 SUCCESS ... 2.4 MB (hadoop) / 0
>>>
>>> 1 SUCCESS ... 2.4 MB (hadoop) / 250000
>>>
>>> 2 SUCCESS ... 2.4 MB (hadoop) / 0
>>>
>>> 3 SUCCESS ... 2.4 MB (hadoop) / 0
>>>
>>>
>>>
>>> Based on the min/max statistics of each of the parquet parts, it makes
>>> sense not to expect any records for 3 out of the 4, because the record I’m
>>> looking for can only be in a single file. But why is the input size above
>>> shown as 2.4 MB per task, adding up to an overall input size of 9.7 MB for
>>> the whole stage? Aren't those tasks just meant to read the metadata and
>>> ignore the contents of the files?
>>>
>>>
>>>
>>> Regards,
>>>
>>> Dennis
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro