Posted to user@drill.apache.org by Stone Zhong <st...@gmail.com> on 2020/01/08 01:02:13 UTC

Query against large Parquet file returns empty columns

Hi,

This is my first message to the user email list.

*Symptom*: I have a large Parquet file (size = 1 TB). When I run a query
against it, like the one below:

SELECT * FROM `dfs`.`/root/data/hwd/machine_main2`
WHERE
    verifiedSerial = '1547NM70EN'
LIMIT 20;

However, in both the Drill Web UI and the API, the returned
`verifiedSerial` column has an empty value, which should be impossible,
since the query's explicit condition means `verifiedSerial` cannot be empty.

*My configuration*:

   - I am running Apache Drill 1.17 (the latest version as of this
   writing) in embedded mode
   - The Parquet file is generated by Apache Spark (3.x). For example, I
   use code like this to generate it:
   `df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
   - The Parquet file is about 1 TB in size
   - The machine running Drill has 32 GB RAM, and Drill is pretty much the
   only app running on it


What is weird is that if I filter the Parquet file in Spark (which results
in a smaller Parquet file), everything works fine.
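
For reference, the Spark filtering step looks roughly like the sketch
below (the app name and output path are illustrative, not my exact job):

# Rough sketch of the Spark-side filtering (illustrative names/paths only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-machine-main").getOrCreate()
df = spark.read.parquet("/root/data/hwd/machine_main2")
small = df.filter(df.verifiedSerial == "1547NM70EN")
small.coalesce(1).write.mode("overwrite").parquet("/tmp/machine_main2_small.parquet")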

My questions are:

   1. Is the behavior I described expected?
   2. If (1) is not expected, how can I avoid this behavior? What kind of
   configuration change is needed? Or is this perhaps a bug?

Thanks in advance for any help!

Regards,
Stone Zhong

Re: Query against large Parquet file returns empty columns

Posted by Stone Zhong <st...@gmail.com>.
Thank you, Paul, for the help. I double-checked my configuration, and it
seems my Drill deployment is not using the official bits: I built it from
source shortly before 1.17 was officially released, so something is likely
wrong with my build.

Sorry for the inconvenience and thanks again for the help!

Regards,
Stone

On Tue, Jan 7, 2020 at 6:48 PM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> [quoted text trimmed]

Re: Query against large Parquet file returns empty columns

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Stone,

Welcome to the Drill mailing list!

The situation you describe is not the expected behavior. Something is going wrong.

You mention that, in the large-file case, you do get results, but the value is empty. This suggests that the query did, in fact, run (that is, it didn't run out of memory or encounter some other internal problem). The question is: why were the columns empty?

And when you say "empty", do you mean NULL (SQL NULL) or an empty string? (Or might they be a string of spaces?)

You mention that you used both the Web console and the API. Do you mean the REST API? What results do you get if you use the SQLLine command-line tool?
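
If it helps, a raw REST call shows the values exactly as Drill returns them, which would distinguish NULL from an empty string or spaces. Here is a rough Python sketch, assuming an embedded Drillbit on the default port 8047:

# Rough sketch: query Drill's REST API directly to inspect raw values.
# Assumes an embedded Drillbit listening on the default port 8047.
import json
import urllib.request

payload = json.dumps({
    "queryType": "SQL",
    "query": "SELECT verifiedSerial FROM dfs.`/root/data/hwd/machine_main2` "
             "WHERE verifiedSerial = '1547NM70EN' LIMIT 20",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8047/query.json",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

for row in result["rows"]:
    print(repr(row["verifiedSerial"]))  # repr() exposes None, '' and whitespace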


Your configuration seems fine. Your query is a simple scan and filter, so not much memory should be needed; 32 GB should be plenty. Are you using the default Drill memory options?
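(If you did change them, the usual place is `conf/drill-env.sh`, via the `DRILL_HEAP` and `DRILL_MAX_DIRECT_MEMORY` settings.)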

The odd thing is that you have two files: one small, one large. The small one works. Is the schema the same for the two Parquet files?

Parquet files are divided into row groups. How large are the row groups in your file?
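
A quick way to check is with the pyarrow Python package (assuming it is installed), along these lines:

# Rough sketch: print row-group sizes of a Parquet file with PyArrow.
# The part-file name below is hypothetical; point it at an actual file
# inside the directory Spark wrote.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/root/data/hwd/machine_main2/part-00000.parquet")
meta = pf.metadata
print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")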


With only the data you provided, my hunch is that the problem is related to file size somehow, but we will have to poke around to find where the problem lies.


Please do file a bug. It would be super helpful if you can include:
* The same description as in your e-mail.
* The actual output of the query (from SQLLine or the REST API).
* An example small Parquet file. Mostly we will need to see the schema so we can try to reproduce the problem.

Here is a possible workaround. Drill does not need the entire data set in one file: you can create a directory of files. Try doing this with files of, say, 100 MB or 256 MB in size. A side benefit is that, for real queries (not just LIMIT 20), having multiple files will let Drill divide up the query and run scans in parallel, so it can use all the CPUs on your machine.
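
In Spark, that could look roughly like the sketch below; the partition count and per-file record cap are guesses to tune until files land around 100 to 256 MB, and the output path is just an example:

# Rough sketch: rewrite the data set as a directory of smaller Parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-parquet").getOrCreate()
df = spark.read.parquet("/root/data/hwd/machine_main2")

# Option 1: fix the number of output files.
df.repartition(4096).write.mode("overwrite").parquet("/root/data/hwd/machine_main2_split")

# Option 2 (Spark 2.2+): cap the records per output file instead.
(df.write
    .option("maxRecordsPerFile", 1000000)  # illustrative; tune toward the target file size
    .mode("overwrite")
    .parquet("/root/data/hwd/machine_main2_split"))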

Thanks,
- Paul

 

    On Tuesday, January 7, 2020, 5:14:12 PM PST, Stone Zhong <st...@gmail.com> wrote:

[quoted text trimmed]