Posted to user@drill.apache.org by Shashank Sharma <sh...@jungleworks.com> on 2020/04/15 19:08:44 UTC

Drill large data build up in fragment by using join

Hi folks,

I have two large JSON data sets and I am querying them on a distributed Apache
Drill cluster. Can anyone explain why Drill builds up billions of records to
scan in a fragment when I join the two data sets (with either a hash join or a
merge join), even though the data set is only about 60,000 records, stored as
files in an S3 bucket?

-- 


Shashank Sharma

Software Engineer

Phone: +91 8968101068


Re: Drill large data build up in fragment by using join

Posted by Ted Dunning <te...@gmail.com>.
Can you give a sample query?




Re: Drill large data build up in fragment by using join

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Shashank,

Let me make sure I understand the question. You have two large JSON data files? You are on a distributed Drill cluster. You want to know why you are seeing a billion rows in one fragment rather than the work being distributed across multiple fragments? Is this an accurate summary?

The key thing to know is that Drill (and most Hadoop-based systems) relies on files being "block-splittable". That is, if your file is 1 GB in size, Drill needs to be able to read, say, blocks of 256 MB from the file so that we can have four Drill fragments read that single 1 GB file. This is true even if you store the files in S3.


CSV, Parquet, SequenceFile and others are block-splittable. As it turns out, JSON is not. The reason is simple: there is no way to jump into the middle of a typical JSON file and scan for the start of the next record. With CSV, newlines are record separators. Parquet has row groups. With JSON, there may or may not be newlines between records, and there may or may not be newlines within records.

It turns out that there is an emerging standard called jsonlines [1] which requires that there be newlines between, but not within, JSON records. Using jsonlines would make JSON a block-splittable format. Drill does not yet support this specialized JSON format, but doing so would be a good enhancement for data files that adhere to the jsonlines convention. Is your data in jsonlines format?
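
For illustration only (the field names here are made up), a jsonlines file simply puts one complete JSON object on each line:

{"order_id": 1, "amount": 25.0}
{"order_id": 2, "amount": 17.5}
{"order_id": 3, "amount": 40.0}

Because every newline is guaranteed to be a record boundary, a reader can seek to an arbitrary block offset, skip forward to the next newline, and start parsing from there.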


For now, the solution is simple: rather than storing your data in a single large JSON file, simply split the data into multiple small files within a single directory. Drill will read each file in a separate fragment, giving you the parallelism you want. Make each file on the order of 100MB, say. The key is to ensure that you have at least as many files as you have minor fragments. The number of minor fragments will be 70% of your CPU count per node. If you have 10 CPUs, say, Drill will create 7 fragments per node. Then, multiply this by the number of nodes. If you have 4 nodes, say, you'll have 28 minor fragments total. You want to have at least 28 JSON files so you can keep each fragment busy.

If your code generates the JSON, then you can change the code to split the data into smaller files. If you obtain the JSON from somewhere else, then your options may be more limited.
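
As a rough sketch of that splitting step (assuming your source is a single large JSON array; the paths and the records-per-file figure below are placeholders you would tune so each output file lands somewhere near 100 MB), plain Python along these lines would work:

import json
import os

SRC = "orders.json"          # hypothetical: one big JSON array
OUT_DIR = "orders_split"     # directory that Drill will scan
RECORDS_PER_FILE = 10000     # tune for your record size

os.makedirs(OUT_DIR, exist_ok=True)

with open(SRC) as f:
    records = json.load(f)   # loads the whole array into memory

for i in range(0, len(records), RECORDS_PER_FILE):
    chunk = records[i:i + RECORDS_PER_FILE]
    path = os.path.join(OUT_DIR, "part_{:05d}.json".format(i // RECORDS_PER_FILE))
    with open(path, "w") as out:
        # one JSON object per line, jsonlines style
        for record in chunk:
            out.write(json.dumps(record) + "\n")

Upload the resulting directory to S3 and point the join query at the directory rather than at the original single file; as described above, Drill will then assign the files to separate fragments.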

Will any of this help resolve your issue?


Thanks,
- Paul

 
[1] http://jsonlines.org/
