You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/28 15:51:36 UTC

[GitHub] [arrow-datafusion] Igosuki opened a new pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Igosuki opened a new pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697


   # Which issue does this PR close?
   
   None
   
    # Rationale for this change
   Updates the arrow2 branch with the latest of arrow2 and datafusion master
   
   # What changes are included in this PR?
   Extension traits to keep the API consistent and make future datafusion master merging easier.
   Internally, some methods start using `Chunk`, I haven't done any perf tests.
   
   # Are there any user-facing changes?
   RecordBatch is `datafusion::record_batch::RecordBatch` instead of `datafusion::arrow::record_batch::RecordBatch`
   
   # Issues 
   Some tests fail, I haven't fixed everything.
   Notably, checking schema validity between columns and schemas in record batch fails for decimal tests because arrow creates a Decimal(32, 32) column for Int128Array : https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/mod.rs#L295


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Igosuki commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024900121


   @houqp Some fixes here https://github.com/Igosuki/arrow-datafusion/tree/arrow_testfix_20220129 remain a parquet projection issue, the decimal type issue I mentioned here and a test that fails because it's not spilling any memory when it's expected that it will.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp merged pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
houqp merged pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024360582


   I changed the base to arrow2 branch rather than master.
   Epic work @Igosuki  L


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Igosuki commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024598102


   Although the extension traits help, I think the subtle differences between arrow-rs and arrow2 such as in Array creation will make merging always a bit painful, especially when merging unit tests... Not sure this way will be sustainable in the long run.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#discussion_r794756250



##########
File path: datafusion/src/field_util.rs
##########
@@ -109,3 +111,321 @@ pub fn struct_array_from(pairs: Vec<(Field, ArrayRef)>) -> StructArray {
     let values = pairs.iter().map(|v| v.1.clone()).collect();
     StructArray::from_data(DataType::Struct(fields), values, None)
 }
+
+/// Imitate arrow-rs Schema behavior by extending arrow2 Schema
+pub trait SchemaExt {

Review comment:
       this is cool

##########
File path: datafusion/src/datasource/file_format/csv.rs
##########
@@ -102,25 +103,18 @@ impl FileFormat for CsvFormat {
                 .has_headers(self.has_header)
                 .from_reader(obj_reader?.sync_reader()?);
 
-            let schema = csv::read::infer_schema(
+            let (fields, records_read) = csv::read::infer_schema(
                 &mut reader,
                 Some(records_to_read),
                 self.has_header,
                 &csv::read::infer,
             )?;
 
-            // if records_read == 0 {
-            //     continue;
-            // }
-            // schemas.push(schema.clone());
-            // records_to_read -= records_read;
-            // if records_to_read == 0 {
-            //     break;
-            // }
-            //
-            // FIXME: return recods_read from infer_schema
-            schemas.push(schema.clone());
-            records_to_read -= records_to_read;
+            if records_read == 0 {

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024835493


   I believe we should not use squash merge to maintain long running branches, so I am going to manually push these changes into the arrow2 branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Igosuki commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
Igosuki commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024381468


   Woops 😆
   
   Le ven. 28 janv. 2022 à 17:04, Andrew Lamb ***@***.***> a
   écrit :
   
   > I changed the base to arrow2 branch rather than master.
   > Epic work @Igosuki <https://github.com/Igosuki> L
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024360582>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADDFBRRG6CVTWP63DXKXU3UYK5AZANCNFSM5NBDF3ZA>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   >
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024837733


   @Igosuki if you are planning on fixing some of the tests in the short term, then let me know so I won't be stepping on your toes. Otherwise, I will help with the tests this weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1697: [arrow2] Merge arrow2 and datafusion latest

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1697:
URL: https://github.com/apache/arrow-datafusion/pull/1697#issuecomment-1024834134


   Really nice work on both the extension trait and recordbatch struct :+1:
   
   With regards to unit tests merging, @Jimexist did some prior work to help reduce conflicts at: https://github.com/apache/arrow-datafusion/pull/1588. But I agree with you that merging is a lot of work. I think the goal is not to maintain the in the long run, but rather close the gaps in the arrow2 milestones as fast as we could, then make a final decision on whether we want to merge this into master or not as a community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org