Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2024/03/18 10:50:20 UTC

[I] Add `DataFrame::with_column` [arrow-datafusion]

alamb opened a new issue, #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672

   ### Is your feature request related to a problem or challenge?
   
   This comes from a Discord discussion: 
   https://discord.com/channels/885562378132000778/1166447479609376850/1218573269662437477
   
   Hello. I'm trying to add a brand new string column to an existing dataframe. How can it be done in an idiomatic way? 
   I have this dataframe as input:
   ```
   +----+------+
   | id | data |
   +----+------+
   | 1  | 42   |
   | 2  | 43   |
   | 3  | 44   |
   +----+------+
   ```
   I want to get output similar to this (it's from polars):
   ```
   ┌─────┬──────┬─────────┐
   │ id  ┆ data ┆ new_col │
   │ --- ┆ ---  ┆ ---     │
   │ i32 ┆ i32  ┆ str     │
   ╞═════╪══════╪═════════╡
   │ 1   ┆ 42   ┆ foo     │
   │ 2   ┆ 43   ┆ bar     │
   │ 3   ┆ 44   ┆ baz     │
   └─────┴──────┴─────────┘
   ```
   Code example:
   ```rust
   // in polars I can do like this:
       let mut df = df!(
           "id" => &[1, 2, 3],
           "data" => &[42, 43, 44], 
       )?;
       let new_col = vec!["foo", "bar", "baz"];
       let s = Series::new("new_col", new_col);
       let df = df.with_column(s)?;
       println!("{:?}", df);
   
   // don't understand how to do the same with datafusion
       let schema = Arc::new(Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("data", DataType::Int32, true),
       ]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(Int32Array::from(vec![1, 2, 3])),
               Arc::new(Int32Array::from(vec![42, 43, 44])),
           ],
       )?;
       let ctx = SessionContext::new();
       ctx.register_batch("t", batch)?;
       let df = ctx.table("t").await?;
       let data = vec!["foo", "bar", "baz"];
   // mismatched type expected struct `GenericListArray<i32>` found struct `Vec<&str>`
       let res = df.with_column("new_col", Expr::Literal(ScalarValue::List(Arc::new(data))))?; 
   ```
   
   ### Describe the solution you'd like
   
   Add a `DataFrame::with_column` that does the same as 
   
   https://docs.rs/polars/latest/polars/frame/struct.DataFrame.html#method.with_column
   
   
   
   ### Describe alternatives you've considered
   
   The user suggests: https://discord.com/channels/885562378132000778/1166447479609376850/1218908027609157702
   
   The only way I've found that works is to create new dataframe with required column and join them. Please, let me know if I'm missing something?
   ```rust
       let schema = Arc::new(Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("data", DataType::Int32, true),
       ]));
       let batch1 = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(Int32Array::from(vec![1, 2, 3])),
               Arc::new(Int32Array::from(vec![42, 43, 44])),
           ],
       )?;
       let ctx = SessionContext::new();
       ctx.register_batch("t1", batch1.clone())?;
       let new_col = vec!["foo", "bar", "baz"];
       let ids = schema.field_with_name("id")?.to_owned();
       let ids_data = batch1.column_by_name("id").unwrap().clone();
       let schema = Arc::new(Schema::new(vec![
           ids,
           Field::new("new_col", DataType::Utf8, true),
       ]));
       let batch2 = RecordBatch::try_new(
           schema, 
           vec![
               ids_data,
               Arc::new(StringArray::from(new_col)),
           ]
       )?;
       ctx.register_batch("t2", batch2)?;
       let res = ctx
           .sql("select t1.id, t1.data, t2.new_col \
               from t1 \
               inner join t2 on t1.id = t2.id").await?;
       
       res.show().await?;
   ```
   
   @Omega359  suggests https://discord.com/channels/885562378132000778/1166447479609376850/1218943627443830945
   
   I would think that using the new unnest function (not yet in a released version that I'm aware of) would work ... however when I tried it I get an error
   ```rust
       let schema = Arc::new(Schema::new(vec![
           Field::new("id", DataType::Int32, false),
           Field::new("data", DataType::Int32, true),
       ]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(Int32Array::from(vec![1, 2, 3])),
               Arc::new(Int32Array::from(vec![42, 43, 44])),
           ],
       )?;
       let ctx = SessionContext::new();
       ctx.register_batch("t", batch)?;
       let df = ctx.table("t").await?;
       let data = ["foo", "bar", "baz"];
       let expr = make_array(data.iter().map(|&d| lit(d)).collect());
       let res = df.with_column("new_col", Expr::Unnest(Unnest { exprs: vec![expr] }))?;
   
       res.show().await?;
   ```
   
   Error: Context("type_coercion", Internal("Unnest should be rewritten to LogicalPlan::Unnest before type coercion"))
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2007586518

   I'd like to look at this and find out how to do it. First I will try to see whether `Unnest` could do a similar thing. :)




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "Omega359 (via GitHub)" <gi...@apache.org>.
Omega359 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016585052

   Is this the first request for this? If so I'd say that while it seems useful it's not actually something that is an issue in general.




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #9672: Add `DataFrame::with_column`
URL: https://github.com/apache/arrow-datafusion/issues/9672




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "Omega359 (via GitHub)" <gi...@apache.org>.
Omega359 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016466398

   My guess is that the idea for the Polars implementation came from Pandas where to add a column you do something like 
   ```
   df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
   ```
   which returns a new df 
   
   My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016454379

   > But dataframe in Datafusion is of a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
   
   That does sound correct.
   
   Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (via https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
   
   However, that API sounds somewhat specialized -- and I am not sure it makes sense
   
   So maybe this request doesn't make sense and we should close the issue 🤔 




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016777711

   Sounds good -- closing the ticket for now and we can reopen / revisit if it is requested again.
   
   Thanks @Omega359  and @yyy1000 




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016503424

   > My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.
   
   Do you mean using the `with_column` method in DF? I think that doesn't make sense, and it can only add existing column by adding projection. 🤔




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016581536

   > Do you mean using the with_column method in DF? I think that doesn't make sense, and it can only add existing column by adding projection. 🤔
   
   As you point out, DataFusion's DataFrame can already add a new column as a derived expression (via a projection, e.g. `with_column` or `select`).
   
   What it can't do is append a new column to an existing DataFrame:
   
   ```rust
   // Read 100 rows in
   let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
   
   // Create a new column with 100 integers
   // (`try_from_iter` expects `ArrayRef` values, hence the cast)
   let new_column = RecordBatch::try_from_iter(vec![
     ("foo", Arc::new(Int32Array::from_iter_values(0..100)) as ArrayRef)
   ]).unwrap();
   
   // Append the new column to the dataframe
   // (errors if the row counts don't match)
   // NOTE: `append_column` is the hypothetical API under discussion; it does not exist
   let df = df.append_column(new_column).await?;
   ```
   
   However, I am not sure how useful this feature would be




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2015307720

   I have some updates to share:
   
   The `with_column` implementation in DataFusion can't add a column from external data. It behaves like Spark's `withColumn`, whose documentation at https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html says:
   
   > Returns a new [DataFrame](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame) by adding a column or replacing the existing column that has the same name.
   > 
   > The column expression must be an expression over this [DataFrame](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame); attempting to add a column from some other [DataFrame](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame) will raise an error.
   
   So using `unnest` is not a solution IMO.
   
   When I tried to implement a new method, I got stuck on how to retrieve the data from a dataframe. I think a DataFrame in `Polars` consists of columns (see https://docs.rs/polars-core/0.38.3/src/polars_core/frame/mod.rs.html#134), so it looks more like a physical one. But a DataFrame in DataFusion wraps a `LogicalPlan`, so I think it may be different and looks more like a logical one? 🤔
   
   Correct me if I'm wrong, I'm not very familiar with this. :)
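
To make the physical-versus-logical distinction above concrete, here is a toy model in plain Rust. It is only a sketch and not DataFusion's or Polars' actual code: `EagerFrame`, `Plan`, `Expr`, and `eval` are hypothetical names invented for illustration, and columns are simplified to `Vec<i64>`. The eager frame owns its data, so appending an external column is a push plus a row-count check. The lazy plan can only gain columns that are expressions over its input, and external data can be attached only after `collect` has executed it.

```rust
/// Eager frame: owns its column data (loosely analogous to a Polars DataFrame).
#[derive(Debug, Clone, PartialEq)]
struct EagerFrame {
    columns: Vec<(String, Vec<i64>)>,
}

impl EagerFrame {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, v)| v.len())
    }

    /// Appends externally supplied data; fails if the row counts differ.
    fn append_column(mut self, name: &str, values: Vec<i64>) -> Result<Self, String> {
        if !self.columns.is_empty() && values.len() != self.num_rows() {
            return Err(format!("expected {} rows, got {}", self.num_rows(), values.len()));
        }
        self.columns.push((name.to_string(), values));
        Ok(self)
    }
}

/// Expression evaluated against the columns of the plan's input.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Lazy plan: a description of a computation; results exist only after `collect`.
#[derive(Debug, Clone)]
enum Plan {
    Scan(EagerFrame),
    Project { input: Box<Plan>, name: String, expr: Expr },
}

impl Plan {
    /// Adds a derived column by wrapping the plan in a projection; nothing runs yet.
    fn with_column(self, name: &str, expr: Expr) -> Plan {
        Plan::Project { input: Box::new(self), name: name.to_string(), expr }
    }

    /// Executes the plan and materializes its rows.
    fn collect(&self) -> Result<EagerFrame, String> {
        match self {
            Plan::Scan(frame) => Ok(frame.clone()),
            Plan::Project { input, name, expr } => {
                let frame = input.collect()?;
                let values = eval(expr, &frame)?;
                frame.append_column(name, values)
            }
        }
    }
}

fn eval(expr: &Expr, frame: &EagerFrame) -> Result<Vec<i64>, String> {
    match expr {
        Expr::Column(name) => frame
            .columns
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, v)| v.clone())
            .ok_or_else(|| format!("unknown column {name}")),
        Expr::Literal(v) => Ok(vec![*v; frame.num_rows()]),
        Expr::Add(l, r) => {
            let (l, r) = (eval(l, frame)?, eval(r, frame)?);
            Ok(l.iter().zip(&r).map(|(a, b)| a + b).collect())
        }
    }
}

fn main() -> Result<(), String> {
    let base = EagerFrame {
        columns: vec![
            ("id".to_string(), vec![1, 2, 3]),
            ("data".to_string(), vec![42, 43, 44]),
        ],
    };
    // Lazy side: a derived column is just another plan node.
    let plan = Plan::Scan(base).with_column(
        "data_plus_one",
        Expr::Add(Box::new(Expr::Column("data".to_string())), Box::new(Expr::Literal(1))),
    );
    // External data can only be attached once the plan has been executed.
    let frame = plan.collect()?.append_column("new_col", vec![7, 8, 9])?;
    println!("{frame:?}");
    Ok(())
}
```

In this model the cost discussed in the thread is visible directly: any API that appends raw data to a lazy frame has to call `collect` first, which is why such a method would force execution.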




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016590106

   Agreed. I think adding the method the user asked for would not be a good choice, because it would have to execute the DataFrame.




Re: [I] Add `DataFrame::with_column` [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9672:
URL: https://github.com/apache/arrow-datafusion/issues/9672#issuecomment-2016503017

   > > But dataframe in Datafusion is of a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
   > 
   > That does sound correct.
   > 
   > Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (via https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
   > 
   > However, that API sounds somewhat specialized -- and I am not sure it makes sense
   > 
   > So maybe this request doesn't make sense and we should close the issue 🤔
   
   Yeah, I also think we can only append the column to the `RecordBatch`, so closing this issue makes sense to me.

