Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/07 07:19:07 UTC
[GitHub] [iceberg] wangxujin1221 opened a new issue #3858: How do I quickly traverse a very large table?
wangxujin1221 opened a new issue #3858:
URL: https://github.com/apache/iceberg/issues/3858
Hi team,
I'm new to Iceberg, and I have a question about querying a big table.
We have a Hive table with a total of 3.6 million records and 120 fields per record, and we want to transfer all the records in this table to other systems, such as PostgreSQL, Kafka, etc.
Currently we do this:
```
Dataset<Row> dataset = connection.client.read().format("iceberg").load("default.table");
// this is where it gets stuck for a very long time
dataset.foreachPartition(par -> {
    par.forEachRemaining(row -> {
        // ... (row handling not shown in the original post)
    });
});
```
However, it can get stuck for a long time in the foreach step.
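One possible contributor to a slow per-row loop like the one above is that every row causes its own round trip to the target system; the post does not confirm that this is the cause here. As a minimal sketch only, assuming a hypothetical `sink` callback that accepts a list of rows (the names `BatchedSink` and `forEachBatch` are illustrative and not part of Spark or Iceberg), a partition's iterator could be drained in fixed-size batches:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchedSink {
    /**
     * Drains the iterator and hands rows to the sink in lists of at most
     * batchSize elements. Returns the number of batches handed to the sink.
     */
    static <T> int forEachBatch(Iterator<T> rows, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int batches = 0;
        while (rows.hasNext()) {
            buffer.add(rows.next());
            if (buffer.size() == batchSize) {
                // Pass a copy so the sink may keep the list after we clear the buffer.
                sink.accept(new ArrayList<>(buffer));
                buffer.clear();
                batches++;
            }
        }
        if (!buffer.isEmpty()) {
            // Flush the final, possibly smaller, batch.
            sink.accept(new ArrayList<>(buffer));
            batches++;
        }
        return batches;
    }
}
```

Inside `foreachPartition`, the iterator `par` would be passed as `rows`, and the sink would perform one bulk write per batch instead of one write per row.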
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #3858: How do I quickly traverse a very large table?
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #3858:
URL: https://github.com/apache/iceberg/issues/3858#issuecomment-1012321440
I would look at the Spark connectors for those other systems; they should be much more efficient than writing your own sinks in the foreach. For example:
spark.read.format("iceberg")....write.format("jdbc") ....
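To make the JDBC route a little more concrete, here is a small sketch of the options such a write might carry. The option keys `url`, `dbtable`, `batchsize`, and `numPartitions` are documented Spark JDBC data source options; the helper class `JdbcWriteOptions` and every value passed to it are hypothetical and not taken from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcWriteOptions {
    /** Builds an option map suitable for DataFrameWriter.options(Map). */
    static Map<String, String> build(String url, String table, int batchSize, int numPartitions) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("url", url);                                       // JDBC connection URL
        opts.put("dbtable", table);                                 // target table name
        opts.put("batchsize", Integer.toString(batchSize));         // rows per JDBC batch insert
        opts.put("numPartitions", Integer.toString(numPartitions)); // cap on parallel connections
        return opts;
    }
}
```

With Spark on the classpath, the map could be used as `dataset.write().format("jdbc").options(JdbcWriteOptions.build(url, table, 1000, 8)).mode("append").save();`, where `url` and `table` are placeholders for your own target database.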
[GitHub] [iceberg] wangxujin1221 edited a comment on issue #3858: How do I quickly traverse a very large table?
Posted by GitBox <gi...@apache.org>.
wangxujin1221 edited a comment on issue #3858:
URL: https://github.com/apache/iceberg/issues/3858#issuecomment-1011057333
@powerzhangquan I think the key issue is that your table has too many small files, so scanning it is very slow. You can check your partition rule or use `df.repartition()` to decrease the number of Spark partitions.
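As a rough illustration of how a partition count for `df.repartition()` might be chosen, the sketch below divides the table's total size by a target size per partition and rounds up. The helper `PartitionCount.forSize` and the idea of sizing by bytes are assumptions for illustration; the thread itself does not say how to pick the number.

```java
class PartitionCount {
    /**
     * Returns ceil(totalBytes / targetBytesPerPartition), with a minimum of 1
     * and capped at Integer.MAX_VALUE.
     */
    static int forSize(long totalBytes, long targetBytesPerPartition) {
        if (totalBytes < 0 || targetBytesPerPartition <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        long n = totalBytes / targetBytesPerPartition;
        if (totalBytes % targetBytesPerPartition != 0) {
            n++; // round up so the last partial chunk gets its own partition
        }
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }
}
```

The result could then be passed as `dataset.repartition(n)`. Note that `repartition` performs a full shuffle; when the goal is only to reduce the number of partitions, `coalesce(n)` avoids that shuffle.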
[GitHub] [iceberg] wangxujin1221 commented on issue #3858: How do I quickly traverse a very large table?
Posted by GitBox <gi...@apache.org>.
wangxujin1221 commented on issue #3858:
URL: https://github.com/apache/iceberg/issues/3858#issuecomment-1011057333
@powerzhangquan I think the key issue is that your table has too many small files, so scanning it is very slow. You can check your partition rule or use `df.repartition()` to reduce the number of Spark partitions.
[GitHub] [iceberg] wangxujin1221 closed issue #3858: How do I quickly traverse a very large table?
Posted by GitBox <gi...@apache.org>.
wangxujin1221 closed issue #3858:
URL: https://github.com/apache/iceberg/issues/3858
[GitHub] [iceberg] powerzhangquan commented on issue #3858: How do I quickly traverse a very large table?
Posted by GitBox <gi...@apache.org>.
powerzhangquan commented on issue #3858:
URL: https://github.com/apache/iceberg/issues/3858#issuecomment-1011013109
Hi wangxujin1221, I have a similar scenario. Have you found a good way to handle it?