You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jychen7 (via GitHub)" <gi...@apache.org> on 2023/02/26 05:56:49 UTC

[GitHub] [arrow-datafusion] jychen7 opened a new issue, #5404: Datafusion v19.rc1 scan parquet slower than DuckDB

jychen7 opened a new issue, #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404

   **Describe the problem**
   Datafusion v19.rc1 by default turn on  `repartition_file_scans` at https://github.com/apache/arrow-datafusion/pull/5295
   
   with my local Macbook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for following query on clickbench 14GB `hits.parquet`:
   - v19.rc1 took 12.343 seconds (yeah, 8x faster than v18, was 83.863 seconds)
   - DuckDB v0.6.1 took `real 0.566 user 1.876031 sys 0.357483`
       - clock time 566ms
       - cpu time 1.87s
       - I think clock time is smaller than cpu time, because of it uses multiple CPU cores in parallel.
   
   **To Reproduce**
   Setup
   ```
   wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
   ```
   
   Datafusion
   ```
   git clone https://github.com/apache/arrow-datafusion.git
   git checkout 19.0.0-rc1
   cd datafusion-cli
   cargo build --release
   
   target/release/datafusion-cli -f create.sql q23_limit_1.sql
   // output: 1 row in set. Query took 12.343 seconds
   ```
   where
   ```
   # create.sql 
   CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 'hits.parquet';
   # q23_limit_1.sql
   SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
   ```
   
   DuckDB
   ```
   brew install duckdb
   duckdb
   > .timer on
   > SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
   // output: Run Time (s): real 0.566 user 1.876031 sys 0.357483
   ```
   
   **Expected behavior**
   1. with single core, datafusion-cli tooks 2s (like cpu time of DuckDB)
   2. with multi cores, datafusion-cli tooks 0.6s (like real time of DuckDB)
   
   **Additional context**
   Add any other context about the problem here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1447470120

   > I checked in my 16-core linux with SSD, duckdb read parquet still faster.
   > DuckDB v0.6.0: 0.957s
   > Datafusion 2.849s
   
   my macOS only have 6 Core, but inspiring by your message, I find out github build version is faster than brew one (Brew is macOS package manager). More details here: https://github.com/duckdb/duckdb/issues/6495


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sundy-li commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445319733

   TopK is a partial factor.
   
   1.  Lazy projection(aka Later projection) can improve this case, we just fetch `URL` column at the first query and apply the order limit then projection other columns by rowids.
   2.  `URL` is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model. You can prove that by `select max(URL) from table`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1501025273

   > Perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions
   
   @tustvold based on the `explain` result in https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1447472667, 
   Expect: predicate pushdown, limit pushdown
   Actual: predicate pushdown, limit NOT pushdown
   
   I understand that your PR https://github.com/apache/arrow-datafusion/pull/5416 support `limit pushdown` in the physical plan, but looks like the above query does not have a limit pushdown in the logical plan or physical plan.
   
   Do you think this is something to improve ❓  I haven't checked the related code in the logical plan, so I will try to take a look tomorrow.
   (I feel like when ALL predicates are pushed down, we can push down the limit as well. Is this generally true?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445370933

   > Lazy projection(aka Later projection) can improve this case mostly, with this we just fetch URL column at the first query and apply the order limit then projection other columns by rowids.
   
   wow, you are right, select one `URL` column makes datafusion and duckdb closer, 1.75x.
   
   ```
   # datafusion v19.rc1
   > SELECT "URL" FROM hits WHERE "URL" LIKE '%google%';
   // 15911 rows in set. Query took 5.626 seconds
   
   # duckdb v0.6.1
   > SELECT URL FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%';
   // 15911 rows. Run Time (s): real 3.207 user 34.520757 sys 1.555060
   ```
   
   > URL is a large binary column in the hits dataset, duckdb optimized reading parquet to it's memory model (reused the original buffer). You can prove that by select max(URL) from table
   
   thanks for the info. Do you happen to know the code/blog link to "reused the original buffer"?
   I did a test and found datafusion and duckdb performs basically same in `SELECT max("URL") FROM hits`
   
   ```
   # datafusion v19.rc1
   SELECT max("URL") FROM hits;
   +-----------------------------------------+
   | MAX(hits.URL)                           |
   +-----------------------------------------+
   | https://yugra-advert2792270][to]=&input |
   +-----------------------------------------+
   1 row in set. Query took 2.726 seconds.
   
   # duckdb v0.6.1
   ┌─────────────────────────────────────────┐
   │               max("URL")                │
   │                 varchar                 │
   ├─────────────────────────────────────────┤
   │ https://yugra-advert2792270][to]=&input │
   └─────────────────────────────────────────┘
   Run Time (s): real 2.746 user 28.837000 sys 2.205152
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet slower than DuckDB

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445277653

   > FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767
   
   looks like Datafusion does no push down `limit` to parquet scanner?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445388517

   > I'm sorry, arrow-rs already implement it, arrow2 had this issue.
   
   @sundy-li TIL, thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1501065487

   It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sundy-li commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445557500

   @jychen7  I checked in my 16-core linux with SSD, duckdb read parquet still faster.
   
   duckdb v0.6.0
   ```
   D CREATE VIEW hits AS
   > SELECT *
   > REPLACE
   > (epoch_ms(EventTime * 1000) AS EventTime,
   >  DATE '1970-01-01' + INTERVAL (EventDate) DAYS AS EventDate)
   > FROM read_parquet('hits.parquet', binary_as_string=True);
   D
   D .timer on
   D select count(1),  max(URL) from hits;
   ┌──────────┬─────────────────────────────────────────┐
   │ count(1) │               max("URL")                │
   │  int64   │                 varchar                 │
   ├──────────┼─────────────────────────────────────────┤
   │ 99997497 │ https://yugra-advert2792270][to]=&input │
   └──────────┴─────────────────────────────────────────┘
   Run Time (s): real 0.957 user 21.641735 sys 4.245200
   ```
   
   datafusion:
   ```
   ❯ select max("URL") from hits;
   +-----------------------------------------+
   | MAX(hits.URL)                           |
   +-----------------------------------------+
   | https://yugra-advert2792270][to]=&input |
   +-----------------------------------------+
   1 row in set. Query took 2.849 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1446218984

   > Lazy projection(aka Later projection) can improve this case mostly
   
   Support for this was added to parquet not too long ago, it may just be a case of hooking it up - https://github.com/apache/arrow-rs/pull/3633


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet slower than DuckDB

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445277435

   **Additional context**
   Datafusion Plan
   ```
   > explain ANALYZE SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | ProjectionExec: expr=[WatchID@0 as WatchID, JavaEnable@1 as JavaEnable, Title@2 as Title, GoodEvent@3 as GoodEvent, EventTime@4 as EventTime, EventDate@5 as EventDate, CounterID@6 as CounterID, ClientIP@7 as ClientIP, RegionID@8 as RegionID, UserID@9 as UserID, CounterClass@10 as CounterClass, OS@11 as OS, UserAgent@12 as UserAgent, URL@13 as URL, Referer@14 as Referer, IsRefresh@15 as IsRefresh, RefererCategoryID@16 as RefererCategoryID, RefererRegionID@17 as RefererRegionID, URLCategoryID@18 as URLCategoryID, URLRegionID@19 as URLRegionID, ResolutionWidth@20 as ResolutionWidth, ResolutionHeight@21 as ResolutionHeight, ResolutionDepth@22 as ResolutionDepth, FlashMajor@23 as FlashMajor, FlashMinor@24 as FlashMinor, FlashMinor2@25 as FlashMinor2, NetMajor@26 as NetMajor, NetMinor@27 as NetMinor, UserAgentMajor@28 as UserAgentMajor, UserAgentMinor@29 as UserAgentMinor, CookieEnable@30 as CookieEnable, JavascriptEnable@31 as JavascriptEnable, IsMobile@32 as IsMo
 bile, MobilePhone@33 as MobilePhone, MobilePhoneModel@34 as MobilePhoneModel, Params@35 as Params, IPNetworkID@36 as IPNetworkID, TraficSourceID@37 as TraficSourceID, SearchEngineID@38 as SearchEngineID, SearchPhrase@39 as SearchPhrase, AdvEngineID@40 as AdvEngineID, IsArtifical@41 as IsArtifical, WindowClientWidth@42 as WindowClientWidth, WindowClientHeight@43 as WindowClientHeight, ClientTimeZone@44 as ClientTimeZone, ClientEventTime@45 as ClientEventTime, SilverlightVersion1@46 as SilverlightVersion1, SilverlightVersion2@47 as SilverlightVersion2, SilverlightVersion3@48 as SilverlightVersion3, SilverlightVersion4@49 as SilverlightVersion4, PageCharset@50 as PageCharset, CodeVersion@51 as CodeVersion, IsLink@52 as IsLink, IsDownload@53 as IsDownload, IsNotBounce@54 as IsNotBounce, FUniqID@55 as FUniqID, OriginalURL@56 as OriginalURL, HID@57 as HID, IsOldCounter@58 as IsOldCounter, IsEvent@59 as IsEvent, IsParameter@60 as IsParameter, DontCountHits@61 as DontCountHits, WithHash@62 
 as WithHash, HitColor@63 as HitColor, LocalEventTime@64 as LocalEventTime, Age@65 as Age, Sex@66 as Sex, Income@67 as Income, Interests@68 as Interests, Robotness@69 as Robotness, RemoteIP@70 as RemoteIP, WindowName@71 as WindowName, OpenerName@72 as OpenerName, HistoryLength@73 as HistoryLength, BrowserLanguage@74 as BrowserLanguage, BrowserCountry@75 as BrowserCountry, SocialNetwork@76 as SocialNetwork, SocialAction@77 as SocialAction, HTTPError@78 as HTTPError, SendTiming@79 as SendTiming, DNSTiming@80 as DNSTiming, ConnectTiming@81 as ConnectTiming, ResponseStartTiming@82 as ResponseStartTiming, ResponseEndTiming@83 as ResponseEndTiming, FetchTiming@84 as FetchTiming, SocialSourceNetworkID@85 as SocialSourceNetworkID, SocialSourcePage@86 as SocialSourcePage, ParamPrice@87 as ParamPrice, ParamOrderID@88 as ParamOrderID, ParamCurrency@89 as ParamCurrency, ParamCurrencyID@90 as ParamCurrencyID, OpenstatServiceName@91 as OpenstatServiceName, OpenstatCampaignID@92 as OpenstatCampaign
 ID, OpenstatAdID@93 as OpenstatAdID, OpenstatSourceID@94 as OpenstatSourceID, UTMSource@95 as UTMSource, UTMMedium@96 as UTMMedium, UTMCampaign@97 as UTMCampaign, UTMContent@98 as UTMContent, UTMTerm@99 as UTMTerm, FromTag@100 as FromTag, HasGCLID@101 as HasGCLID, RefererHash@102 as RefererHash, URLHash@103 as URLHash, CLID@104 as CLID], metrics=[output_rows=1, elapsed_compute=21.342µs, spill_count=0, spilled_bytes=0, mem_used=0] |
   |                   |   GlobalLimitExec: skip=0, fetch=1, metrics=[output_rows=1, elapsed_compute=55.749µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
   |                   |     CoalescePartitionsExec, metrics=[output_rows=410, elapsed_compute=19.372µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
   |                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=410, elapsed_compute=25.377822ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
   |                   |         FilterExec: URL@13 LIKE %google%, metrics=[output_rows=10767, elapsed_compute=12.106422195s, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
   |                   |           ParquetExec: limit=None, partitions={12 groups: [[hits.parquet:0..1231664704], [hits.parquet:1231664704..2463329408], [hits.parquet:2463329408..3694994112], [hits.parquet:3694994112..4926658816], [hits.parquet:4926658816..6158323520], [hits.parquet:6158323520..7389988224], [hits.parquet:7389988224..8621652928], [hits.parquet:8621652928..9853317632], [hits.parquet:9853317632..11084982336], [hits.parquet:11084982336..12316647040], [hits.parquet:12316647040..13548311744], [hits.parquet:13548311744..14779976446]]}, predicate=URL LIKE Utf8("%google%"), projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobi
 le, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], metrics=[ou
 tput_rows=73363236, elapsed_compute=12ns, spill_count=0, spilled_bytes=0, mem_used=0, predicate_evaluation_errors=0, pushdown_rows_filtered=0, row_groups_pruned=0, bytes_scanned=11496382994, page_index_rows_filtered=0, num_predicate_creation_errors=0, time_elapsed_scanning_total=151.775198747s, pushdown_eval_time=24ns, time_elapsed_processing=133.355923578s, time_elapsed_opening=1.135712999s, page_index_eval_time=24ns, time_elapsed_scanning_until_data=2.005873968s]                                                                                                                                                                                      |
   |                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row in set. Query took 12.817 seconds.
   ```
   
   DuckDB Plan
   ```
   > explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
   
   ┌─────────────────────────────┐
   │┌───────────────────────────┐│
   │└───────────────────────────┘│
   └─────────────────────────────┘
   ┌─────────────────────────────────────┐
   │┌───────────────────────────────────┐│
   ││    Query Profiling Information    ││
   │└───────────────────────────────────┘│
   └─────────────────────────────────────┘
   explain ANALYZE SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
   ┌─────────────────────────────────────┐
   │┌───────────────────────────────────┐│
   ││         Total Time: 0.475s        ││
   │└───────────────────────────────────┘│
   └─────────────────────────────────────┘
   ┌───────────────────────────┐
   │      EXPLAIN_ANALYZE      │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │             0             │
   │          (0.00s)          │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │           LIMIT           │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │             1             │
   │          (0.00s)          │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          WatchID          │
   │         JavaEnable        │
   │           Title           │
   │         GoodEvent         │
   │         EventTime         │
   │         EventDate         │
   │         CounterID         │
   │          ClientIP         │
   │          RegionID         │
   │           UserID          │
   │        CounterClass       │
   │             OS            │
   │         UserAgent         │
   │            URL            │
   │          Referer          │
   │         IsRefresh         │
   │     RefererCategoryID     │
   │      RefererRegionID      │
   │       URLCategoryID       │
   │        URLRegionID        │
   │      ResolutionWidth      │
   │      ResolutionHeight     │
   │      ResolutionDepth      │
   │         FlashMajor        │
   │         FlashMinor        │
   │        FlashMinor2        │
   │          NetMajor         │
   │          NetMinor         │
   │       UserAgentMajor      │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │           FILTER          │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │  contains(URL, 'google')  │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │             29            │
   │          (0.07s)          │
   └─────────────┬─────────────┘
   ┌─────────────┴─────────────┐
   │        PARQUET_SCAN       │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            URL            │
   │          WatchID          │
   │         JavaEnable        │
   │           Title           │
   │         GoodEvent         │
   │         EventTime         │
   │         EventDate         │
   │         CounterID         │
   │          ClientIP         │
   │          RegionID         │
   │           UserID          │
   │        CounterClass       │
   │             OS            │
   │         UserAgent         │
   │          Referer          │
   │         IsRefresh         │
   │     RefererCategoryID     │
   │      RefererRegionID      │
   │       URLCategoryID       │
   │        URLRegionID        │
   │      ResolutionWidth      │
   │      ResolutionHeight     │
   │      ResolutionDepth      │
   │         FlashMajor        │
   │         FlashMinor        │
   │        FlashMinor2        │
   │          NetMajor         │
   │          NetMinor         │
   │       UserAgentMajor      │
   └───────────────────────────┘
   Run Time (s): real 0.590 user 1.888174 sys 0.381627
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sundy-li commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445388169

   > Do you happen to know the code/blog link to "reused the original buffer"
   
   I'm sorry, `arrow-rs` already implement it, arrow2 had this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1501221831

   > It may be https://github.com/apache/arrow-datafusion/issues/4028 is required for this
   
   yes, I think https://github.com/apache/arrow-datafusion/issues/4028 should help this one 👍 
   (I try hardcode `TableProviderFilterPushDown::Exact` for the above query and the limit is pushed down. In this case, it improves from `6s` to `0.2s`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jychen7 commented on issue #5404: Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data

Posted by "jychen7 (via GitHub)" <gi...@apache.org>.
jychen7 commented on issue #5404:
URL: https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1447472667

   > perhaps you may be able to try with https://github.com/apache/arrow-datafusion/pull/5416 and the various predicate pushdown features enabled on ParquetOptions
   
   @tustvold thank you, nice work! let me try master (as of now, c676d1026f8fa25c1495ec139b8a379ce1f2f86b) which includes #5416 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org