You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/09 07:49:00 UTC

[GitHub] [arrow] jhorstmann opened a new pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

jhorstmann opened a new pull request #8409:
URL: https://github.com/apache/arrow/pull/8409


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove closed pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove closed pull request #8409:
URL: https://github.com/apache/arrow/pull/8409


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#issuecomment-706036980


   https://issues.apache.org/jira/browse/ARROW-10240


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jhorstmann commented on pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

jhorstmann commented on pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#issuecomment-706505896


   That's indeed interesting. Could the issue actually be the batch size? Seems the MemTable::scan method ignores the batch size parameter and instead uses the hardcoded one used for loading.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jhorstmann commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

jhorstmann commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502249059



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -59,6 +61,10 @@ struct TpchOpt {
     /// File format: `csv` or `parquet`
     #[structopt(short = "f", long = "format", default_value = "csv")]
     file_format: String,
+
+    /// Load the data into a MemTable before executing the query
+    #[structopt(short = "l", long = "load")]

Review comment:
       There is probably a better/clearer name for this parameter




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jhorstmann commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

jhorstmann commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502679298



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -73,31 +79,42 @@ async fn main() -> Result<()> {
 
     let path = opt.path.to_str().unwrap();
 
-    match opt.file_format.as_str() {
-        // dbgen creates .tbl ('|' delimited) files
-        "tbl" => {
-            let path = format!("{}/lineitem.tbl", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new()
-                .schema(&schema)
-                .delimiter(b'|')
-                .file_extension(".tbl");
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "csv" => {
-            let path = format!("{}/lineitem", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new().schema(&schema).has_header(true);
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "parquet" => {
-            let path = format!("{}/lineitem", path);
-            ctx.register_parquet("lineitem", &path)?
-        }
-        other => {
-            println!("Invalid file format '{}'", other);
-            process::exit(-1);
-        }
+    let tableprovider: Box<dyn TableProvider + Send + Sync> =
+        match opt.file_format.as_str() {
+            // dbgen creates .tbl ('|' delimited) files
+            "tbl" => {
+                let path = format!("{}/lineitem.tbl", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new()
+                    .schema(&schema)
+                    .delimiter(b'|')
+                    .file_extension(".tbl");
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "csv" => {
+                let path = format!("{}/lineitem", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new().schema(&schema).has_header(true);
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "parquet" => {
+                let path = format!("{}/lineitem", path);
+                Box::new(ParquetTable::try_new(&path)?)
+            }
+            other => {
+                println!("Invalid file format '{}'", other);
+                process::exit(-1);
+            }
+        };
+
+    if opt.load {
+        let memtable = MemTable::load(tableprovider.as_ref()).await?;

Review comment:
       That's a very good idea




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#issuecomment-706592358


   It's looking much better now :rocket: 
   
   ```
   Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: true }
   Loading data into memory
   Loaded data into memory in 4334 ms
   Query 1 iteration 0 took 174 ms
   Query 1 iteration 1 took 144 ms
   Query 1 iteration 2 took 148 ms
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502598284



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -73,31 +79,42 @@ async fn main() -> Result<()> {
 
     let path = opt.path.to_str().unwrap();
 
-    match opt.file_format.as_str() {
-        // dbgen creates .tbl ('|' delimited) files
-        "tbl" => {
-            let path = format!("{}/lineitem.tbl", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new()
-                .schema(&schema)
-                .delimiter(b'|')
-                .file_extension(".tbl");
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "csv" => {
-            let path = format!("{}/lineitem", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new().schema(&schema).has_header(true);
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "parquet" => {
-            let path = format!("{}/lineitem", path);
-            ctx.register_parquet("lineitem", &path)?
-        }
-        other => {
-            println!("Invalid file format '{}'", other);
-            process::exit(-1);
-        }
+    let tableprovider: Box<dyn TableProvider + Send + Sync> =
+        match opt.file_format.as_str() {
+            // dbgen creates .tbl ('|' delimited) files
+            "tbl" => {
+                let path = format!("{}/lineitem.tbl", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new()
+                    .schema(&schema)
+                    .delimiter(b'|')
+                    .file_extension(".tbl");
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "csv" => {
+                let path = format!("{}/lineitem", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new().schema(&schema).has_header(true);
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "parquet" => {
+                let path = format!("{}/lineitem", path);
+                Box::new(ParquetTable::try_new(&path)?)
+            }
+            other => {
+                println!("Invalid file format '{}'", other);
+                process::exit(-1);
+            }
+        };
+
+    if opt.load {
+        let memtable = MemTable::load(tableprovider.as_ref()).await?;

Review comment:
       This is just a nit but it would be nice to have some printlns here showing that the data is loading, and how long it takes




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502598284



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -73,31 +79,42 @@ async fn main() -> Result<()> {
 
     let path = opt.path.to_str().unwrap();
 
-    match opt.file_format.as_str() {
-        // dbgen creates .tbl ('|' delimited) files
-        "tbl" => {
-            let path = format!("{}/lineitem.tbl", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new()
-                .schema(&schema)
-                .delimiter(b'|')
-                .file_extension(".tbl");
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "csv" => {
-            let path = format!("{}/lineitem", path);
-            let schema = lineitem_schema();
-            let options = CsvReadOptions::new().schema(&schema).has_header(true);
-            ctx.register_csv("lineitem", &path, options)?
-        }
-        "parquet" => {
-            let path = format!("{}/lineitem", path);
-            ctx.register_parquet("lineitem", &path)?
-        }
-        other => {
-            println!("Invalid file format '{}'", other);
-            process::exit(-1);
-        }
+    let tableprovider: Box<dyn TableProvider + Send + Sync> =
+        match opt.file_format.as_str() {
+            // dbgen creates .tbl ('|' delimited) files
+            "tbl" => {
+                let path = format!("{}/lineitem.tbl", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new()
+                    .schema(&schema)
+                    .delimiter(b'|')
+                    .file_extension(".tbl");
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "csv" => {
+                let path = format!("{}/lineitem", path);
+                let schema = lineitem_schema();
+                let options = CsvReadOptions::new().schema(&schema).has_header(true);
+
+                Box::new(CsvFile::try_new(&path, options)?)
+            }
+            "parquet" => {
+                let path = format!("{}/lineitem", path);
+                Box::new(ParquetTable::try_new(&path)?)
+            }
+            other => {
+                println!("Invalid file format '{}'", other);
+                process::exit(-1);
+            }
+        };
+
+    if opt.load {
+        let memtable = MemTable::load(tableprovider.as_ref()).await?;

Review comment:
       This is just a nit but it would be nice to have some printlns here showing that the data is loaded, and how long it takes




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#issuecomment-706429236


   The results are pretty interesting for me.
   
   Without `--mem-table`:
   
   ```
   Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: false }
   Query 1 iteration 0 took 241 ms
   Query 1 iteration 1 took 164 ms
   Query 1 iteration 2 took 167 ms
   ```
   
   With `--mem-table`:
   
   ```
   Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/s1/parquet", file_format: "parquet", mem_table: true }
   Loading data into memory
   Loaded data into memory in 11240 ms
   Query 1 iteration 0 took 353 ms
   Query 1 iteration 1 took 302 ms
   Query 1 iteration 2 took 322 ms
   ```
   
   I filed https://issues.apache.org/jira/browse/ARROW-10251 to fix the single-threaded loading in MemTable but I'm not sure why the actual query time is slower for mem tables than for Parquet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502573069



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -59,6 +61,10 @@ struct TpchOpt {
     /// File format: `csv` or `parquet`
     #[structopt(short = "f", long = "format", default_value = "csv")]
     file_format: String,
+
+    /// Load the data into a MemTable before executing the query
+    #[structopt(short = "l", long = "load")]

Review comment:
       Perhaps `--mem-table` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jhorstmann commented on a change in pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

jhorstmann commented on a change in pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#discussion_r502249059



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -59,6 +61,10 @@ struct TpchOpt {
     /// File format: `csv` or `parquet`
     #[structopt(short = "f", long = "format", default_value = "csv")]
     file_format: String,
+
+    /// Load the data into a MemTable before executing the query
+    #[structopt(short = "l", long = "load")]

Review comment:
       There is probably a better/clearer name for this parameter




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8409: ARROW-10240: [Rust] Optionally load data into memory before running benchmark query

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8409:
URL: https://github.com/apache/arrow/pull/8409#issuecomment-706036980


   https://issues.apache.org/jira/browse/ARROW-10240


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org