You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@asterixdb.apache.org by "Wail Y. Alkowaileet (Jira)" <ji...@apache.org> on 2023/10/24 16:47:00 UTC
[jira] [Created] (ASTERIXDB-3286) Support COPY TO

Wail Y. Alkowaileet created ASTERIXDB-3286:
----------------------------------------------

             Summary: Support COPY TO
                 Key: ASTERIXDB-3286
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-3286
             Project: Apache AsterixDB
          Issue Type: Epic
          Components: COMP - Compiler, RT - Runtime
    Affects Versions: 0.9.9
            Reporter: Wail Y. Alkowaileet
            Assignee: Wail Y. Alkowaileet
             Fix For: 0.9.9


Currently, AsterixDB do not have a clean way to extract query result or dump a dataset to a storage device. The only channel provided currently is the Query Service (i.e., running the query and write it somehow at the client side). We need to support a way to write query results (or dump a dataset) in parallel to a storage device.

 

To illustrate we want to do the following:
{noformat}
USE CopyToDataverse;

COPY ColumnDataset
TO localfs
PATH("localhost:///media/backup/CopyToResult")
WITH {
    "format" : "json"
};{noformat}
In this example, the data in ColumnDataset will be written in each node at the provided path localhost:///media/backup/CopyToResult. Simply, each node will write its own partitions of the data of ColumnDataset locally. The written files will be in raw JSON format.

 

Another example:
{noformat}
USE CopyToDataverse;

COPY (SELECT cd.uid uid, 
             cd.sensor_info.name name, 
             to_bigint(cd.sensor_info.battery_status) battery_status
      FROM ColumnDataset cd
) toWrite
TO s3 
PATH("CopyToResult/" || to_string(b))
OVER (
   PARTITION BY toWrite.battery_status b
   ORDER BY toWrite.name
)
WITH {
    "format" : "json",
    "compression": "gzip",
    "max-objects-per-file": 100,
    "container": "myBucket",
    "accessKeyId": "<access-key>",
    "secretAccessKey": "<secret-key>",
    "region": "us-west-2"
};{noformat}
The second example shows how to write the result of a query and also partition the result so that each partition will be written to a certain path. In this example, we partition by the battery_status (say an integer value from 0 to 100). The final result will be written to myBucke in Amazon S3. 

Each partition will have the path CopyToResult/<battery_status>. For example CopyToResult/0, CopyToResult/1 ..., CopyToResult/99, CopyToResult/100). This partitioning scheme can be useful if a user wants to exploit dynamic prefixes (external data filters) (see ASTERIXDB-3073)

Additionally, the records in each partition will be ordered by the sensor_name (toWrite.name). Note that this ordering isn't global but per partition.

Also, the written files will be compressed using *gzip* and each file should have at most 100 records max ({*}max-objects-per-file{*}).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)