You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/19 18:18:47 UTC

[GitHub] [iceberg] rymurr opened a new pull request #1216: [Python] create table support

rymurr opened a new pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216


   First step in write support for python.
   
   Adds support & tests for create on:
   * `HiveTables`
   * `FilesystemTables`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r476274629



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -85,20 +89,20 @@ def commit(self, base, metadata):
 
             tbl.sd = storage_descriptor(metadata)
             metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
-            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            base_metadata_location = base.location if base else None

Review comment:
       should be good to go now @rdblue, we had missed a property in `TableMetadata` in python 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-679244957


   @rymurr, thanks for the update. I think this is fine to commit without the integration tests, as long as someone has tested it by hand to make sure it actually works to create tables in Hive. It works, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r476574660



##########
File path: python/iceberg/core/filesystem/filesystem_tables.py
##########
@@ -15,39 +15,39 @@
 # specific language governing permissions and limitations
 # under the License.
 
-
-from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
-
 from .filesystem_table_operations import FilesystemTableOperations
+from .. import TableOperations
 from ..table_metadata import TableMetadata
+from ...api import PartitionSpec, Schema, Table, Tables
+from ...exceptions import NoSuchTableException
 
 
 class FilesystemTables(Tables):
 
-    def __init__(self, conf=None):
+    def __init__(self: "FilesystemTables", conf: dict = None) -> None:
         self.conf = conf if conf is not None else dict()
 
-    def load(self, location):
+    def load(self: "FilesystemTables", table_identifier: str) -> Table:
         from ..base_table import BaseTable
-        ops = self.new_table_ops(location)
+        ops = self.new_table_ops(table_identifier)
         if ops.current() is None:
-            raise NoSuchTableException("Table does not exist at location: %s" % location)
+            raise NoSuchTableException("Table does not exist at location: %s" % table_identifier)
 
-        return BaseTable(ops, location)
+        return BaseTable(ops, table_identifier)
 
-    def create(self, schema, table_identifier=None, spec=None, properties=None, location=None):
+    def create(self: "FilesystemTables", schema: Schema, table_identifier: str, spec: PartitionSpec = None,
+               properties: dict = None, location: str = None) -> Table:

Review comment:
       Should we check that `location is None` to ensure that a user doesn't pass both a table identifier (location) and a different location?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475209811



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):
+        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        if len(parts) > 1:
+            database = parts[0]
+            table = parts[1]
+        else:
+            database = "default"
+            table = parts[0]
         ops = self.new_table_ops(self.conf, database, table)
         if ops.current() is None:
             raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
         return BaseTable(ops, "{}.{}".format(database, table))
 
-    def create(self, schema, spec, table_identifier=None, database=None, table=None):
-        raise RuntimeError("Not Yet Implemented")
+    def create(self, schema, table_identifier=None, spec=None, properties=None):
+        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        ops = self.new_table_ops(self.conf, database, table)
+        if ops.current() is not None:
+            raise AlreadyExistsException("Table already exists: " + table_identifier)
+
+        base_location = self.default_warehouse_location(self.conf, database, table)
+
+        metadata = TableMetadata.new_table_metadata(ops, schema, spec, base_location, dict() if properties is None else properties)

Review comment:
       yeah, its strange it wasn't caught by flake8. Fixed now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475798239



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -85,20 +89,20 @@ def commit(self, base, metadata):
 
             tbl.sd = storage_descriptor(metadata)
             metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
-            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            base_metadata_location = base.location if base else None

Review comment:
       ahh, i see now. It appears that the Java and Python impls are a bit out of sync. Python doesn't use `file` at all, which is where java takes its location from. Will look at figuring out what happened there and update the PR this eve or tomorrow




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r476574223



##########
File path: python/iceberg/api/types/type.py
##########
@@ -22,22 +22,22 @@
 
 @unique
 class TypeID(Enum):
-    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1}
-    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2}
-    LONG = {"java_class": "Long.class", "python_class": int, "id": 3}
-    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4}
-    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5}
-    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6}
-    TIME = {"java_class": "Long.class", "python_class": int, "id": 7}
-    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8}
-    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9}
-    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10}
-    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11}
-    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12}
-    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13}
-    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14}
-    LIST = {"java_class": "Void.class", "python_class": None, "id": 15}
-    MAP = {"java_class": "Void.class", "python_class": None, "id": 16}
+    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1, "hive_name": 'boolean'}
+    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2, "hive_name": 'int'}
+    LONG = {"java_class": "Long.class", "python_class": int, "id": 3, "hive_name": 'bigint'}
+    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4, "hive_name": 'float'}
+    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5, "hive_name": 'double'}
+    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6, "hive_name": 'date'}
+    TIME = {"java_class": "Long.class", "python_class": int, "id": 7, "hive_name": 'string'}
+    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8, "hive_name": 'timestamp'}
+    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9, "hive_name": 'string'}
+    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10, "hive_name": 'string'}
+    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11, "hive_name": 'binary'}
+    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12, "hive_name": "binary"}
+    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13, "hive_name": None}
+    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14, "hive_name": None}
+    LIST = {"java_class": "Void.class", "python_class": None, "id": 15, "hive_name": None}
+    MAP = {"java_class": "Void.class", "python_class": None, "id": 16, "hive_name": None}

Review comment:
       I don't see this `hive_name` property used anywhere. Did you mean to remove these changes from types?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-678774677


   thanks for the review @rdblue! I have addressed your comments and made the code more pythonic. For some odd reason I thought keeping python close to Java was a good idea when I first raised the PR :-)
   
   As for the metastore tests: adding an integration test with HMS has proved slightly more annoying than I hoped so we can either leave this open while I work on it or I can add it as part of another PR later


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467337774



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):
+        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        if len(parts) > 1:
+            database = parts[0]
+            table = parts[1]
+        else:
+            database = "default"
+            table = parts[0]
         ops = self.new_table_ops(self.conf, database, table)
         if ops.current() is None:
             raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
         return BaseTable(ops, "{}.{}".format(database, table))
 
-    def create(self, schema, spec, table_identifier=None, database=None, table=None):
-        raise RuntimeError("Not Yet Implemented")
+    def create(self, schema, table_identifier=None, spec=None, properties=None):
+        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)

Review comment:
       In `load`, the number of items returned is checked. Should that be done here as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467343967



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):
+        with self._client as open_client:
+            open_client.unlock(lock_id)
+
+    def acquire_lock(self):
+        lock_component = LockComponent(LockType.EXCLUSIVE, LockLevel.TABLE, self.database, self.table)
+
+        lock_request = LockRequest([lock_component], user=getpass.getuser(), hostname=socket.gethostname())
+        with self._client as open_client:
+            lock_response = open_client.lock(lock_request)
+
+        state = lock_response.state
+        lock_id = lock_response.lockid
+        start = int(time.time())
+        duration = 0
+        timeout = False
+        while not timeout and state == LockState.WAITING:
+            with self._client as open_client:
+                lock_response = open_client.check_lock(lock_response)
+            state = lock_response.state
+
+            duration = int(time.time()) - start
+            if duration > 3 * 60 * 1000:
+                timeout = True
+            else:
+                time.sleep(0.05)
+
+        if timeout and state != LockState.ACQUIRED:
+            raise CommitFailedException("Timed out after {} ms waiting for lock on {}.{}".format(duration,
+                                                                                                 self.database,
+                                                                                                 self.table))
+
+        if state != LockState.ACQUIRED:
+            raise CommitFailedException(
+                "Could not acquire the lock on {}.{}, lock request ended in state {}".format(self.database, self.table,
+                                                                                             state))
+        return lock_id
 
     def io(self):
-        raise NotImplementedError()
+        return get_fs(self.base_location, self.conf)
 
     def close(self):
         self._client.close()
+
+
+def storage_descriptor(metadata):
+    ser_de_info = SerDeInfo(serializationLib="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
+    return StorageDescriptor(columns(metadata.schema),
+                             metadata.location,
+                             "org.apache.hadoop.mapred.FileInputFormat",
+                             "org.apache.hadoop.mapred.FileOutputFormat",
+                             serdeInfo=ser_de_info)
+
+
+def columns(schema):
+    return [FieldSchema(col.name, convert_hive_type(col.type), "") for col in schema.columns()]
+
+
+def convert_hive_type(col_type):
+    try:
+        type_id = col_type.type_id.value['hive_name']
+        if type_id is not None:
+            return type_id
+    except:  # NOQA
+        raise NotImplementedError("Not yet implemented column type " + col_type)

Review comment:
       Since this is identical to the fall through case, can this be `pass` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475213270



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:

Review comment:
       correct, fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467342804



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):
+        with self._client as open_client:
+            open_client.unlock(lock_id)
+
+    def acquire_lock(self):
+        lock_component = LockComponent(LockType.EXCLUSIVE, LockLevel.TABLE, self.database, self.table)
+
+        lock_request = LockRequest([lock_component], user=getpass.getuser(), hostname=socket.gethostname())
+        with self._client as open_client:
+            lock_response = open_client.lock(lock_request)
+
+        state = lock_response.state
+        lock_id = lock_response.lockid
+        start = int(time.time())
+        duration = 0
+        timeout = False

Review comment:
       Nit: I think `timedOut` would be a better name?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475213464



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):
+        with self._client as open_client:
+            open_client.unlock(lock_id)
+
+    def acquire_lock(self):
+        lock_component = LockComponent(LockType.EXCLUSIVE, LockLevel.TABLE, self.database, self.table)
+
+        lock_request = LockRequest([lock_component], user=getpass.getuser(), hostname=socket.gethostname())
+        with self._client as open_client:
+            lock_response = open_client.lock(lock_request)
+
+        state = lock_response.state
+        lock_id = lock_response.lockid
+        start = int(time.time())
+        duration = 0
+        timeout = False
+        while not timeout and state == LockState.WAITING:
+            with self._client as open_client:
+                lock_response = open_client.check_lock(lock_response)
+            state = lock_response.state
+
+            duration = int(time.time()) - start
+            if duration > 3 * 60 * 1000:
+                timeout = True
+            else:
+                time.sleep(0.05)
+
+        if timeout and state != LockState.ACQUIRED:
+            raise CommitFailedException("Timed out after {} ms waiting for lock on {}.{}".format(duration,
+                                                                                                 self.database,
+                                                                                                 self.table))
+
+        if state != LockState.ACQUIRED:
+            raise CommitFailedException(
+                "Could not acquire the lock on {}.{}, lock request ended in state {}".format(self.database, self.table,
+                                                                                             state))
+        return lock_id
 
     def io(self):
-        raise NotImplementedError()
+        return get_fs(self.base_location, self.conf)
 
     def close(self):
         self._client.close()
+
+
+def storage_descriptor(metadata):
+    ser_de_info = SerDeInfo(serializationLib="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")
+    return StorageDescriptor(columns(metadata.schema),
+                             metadata.location,
+                             "org.apache.hadoop.mapred.FileInputFormat",
+                             "org.apache.hadoop.mapred.FileOutputFormat",
+                             serdeInfo=ser_de_info)
+
+
+def columns(schema):
+    return [FieldSchema(col.name, convert_hive_type(col.type), "") for col in schema.columns()]
+
+
+def convert_hive_type(col_type):
+    try:
+        type_id = col_type.type_id.value['hive_name']
+        if type_id is not None:
+            return type_id
+    except:  # NOQA
+        raise NotImplementedError("Not yet implemented column type " + col_type)

Review comment:
       agreed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475208467



##########
File path: python/iceberg/core/filesystem/filesystem_tables.py
##########
@@ -36,15 +35,15 @@ def load(self, location):
 
         return BaseTable(ops, location)
 
-    def create(self, schema, table_identifier=None, spec=None, properties=None, location=None):
+    def create(self, schema, table_identifier=None, spec=None, properties=None):

Review comment:
       agreed, fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-661956214


   I merged #1214. Could you please rebase this one? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475213365



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):

Review comment:
       yup, fixed

##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):
+        with self._client as open_client:
+            open_client.unlock(lock_id)
+
+    def acquire_lock(self):
+        lock_component = LockComponent(LockType.EXCLUSIVE, LockLevel.TABLE, self.database, self.table)
+
+        lock_request = LockRequest([lock_component], user=getpass.getuser(), hostname=socket.gethostname())
+        with self._client as open_client:
+            lock_response = open_client.lock(lock_request)
+
+        state = lock_response.state
+        lock_id = lock_response.lockid
+        start = int(time.time())
+        duration = 0
+        timeout = False

Review comment:
       agreed, fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-661981645


   > I merged #1214. Could you please rebase this one? Thanks!
   
   :+1: rebased and ready for review!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467338885



##########
File path: python/iceberg/core/filesystem/filesystem_tables.py
##########
@@ -36,15 +35,15 @@ def load(self, location):
 
         return BaseTable(ops, location)
 
-    def create(self, schema, table_identifier=None, spec=None, properties=None, location=None):
+    def create(self, schema, table_identifier=None, spec=None, properties=None):

Review comment:
       I agree with using table_identifier here. Should we also rename it in `load` since the location is the table identifier for FS tables?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-680128767


   Everything looks good to me. There are a couple of minor things, like still having the Hive types defined (I'm not sure if you intended to remove those). Tests are passing except what looks like a flaky test in Java.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475783316



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -85,20 +89,20 @@ def commit(self, base, metadata):
 
             tbl.sd = storage_descriptor(metadata)
             metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
-            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            base_metadata_location = base.location if base else None

Review comment:
       `metadataFileLocation` is the Java name, `location` is the Python name. Same value though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475780731



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -14,46 +14,40 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Tuple
 
-from iceberg.api import Tables
-from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
-
+from . import TableOperations
 from .base_table import BaseTable
 from .table_metadata import TableMetadata
+from ..api import PartitionSpec, Schema, Table, Tables
+from ..exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 
 class BaseMetastoreTables(Tables):
-    DOT = '.'
 
-    def __init__(self, conf):
+    def __init__(self: "BaseMetastoreTables", conf: dict) -> None:
         self.conf = conf
 
-    def new_table_ops(self, conf, database, table):
+    def new_table_ops(self: "BaseMetastoreTables", conf: dict, database: str, table: str) -> "TableOperations":
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, table_identifier):
-        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
-        if len(parts) > 1:
-            database = parts[0]
-            table = parts[1]
-        else:
-            database = "default"
-            table = parts[0]
+    def load(self: "BaseMetastoreTables", table_identifier: str) -> Table:
+        database, table = _parse_table_identifier(table_identifier)
         ops = self.new_table_ops(self.conf, database, table)
-        if ops.current() is None:
-            raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
-
-        return BaseTable(ops, "{}.{}".format(database, table))
+        if ops.current():
+            return BaseTable(ops, "{}.{}".format(database, table))
+        raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
-    def create(self, schema, table_identifier=None, spec=None, properties=None):
-        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+    def create(self: "BaseMetastoreTables", schema: Schema, table_identifier: str, spec: PartitionSpec = None,
+               properties: dict = None) -> Table:
+        database, table = _parse_table_identifier(table_identifier)
         ops = self.new_table_ops(self.conf, database, table)
-        if ops.current() is not None:
+        if ops.current() is not None:  # not None check here to ensure MagicMocks aren't treated as None

Review comment:
       just wasn't mocked correctly, fixed now and removed the `is not None` check




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475759675



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -85,20 +89,20 @@ def commit(self, base, metadata):
 
             tbl.sd = storage_descriptor(metadata)
             metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
-            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            base_metadata_location = base.location if base else None

Review comment:
       I think this does need to be the metadata file location and not the table location.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-668688905


   Hey @danielcweeks & @rdblue anyone have some time to take a peek at this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jacques-n commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
jacques-n commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-679246500


   > It works, right?
   
   This comment made my morning :)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-680908629


   fixed the build error, should be good to go now :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475789732



##########
File path: python/iceberg/api/types/type.py
##########
@@ -22,22 +22,22 @@
 
 @unique
 class TypeID(Enum):
-    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1}
-    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2}
-    LONG = {"java_class": "Long.class", "python_class": int, "id": 3}
-    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4}
-    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5}
-    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6}
-    TIME = {"java_class": "Long.class", "python_class": int, "id": 7}
-    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8}
-    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9}
-    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10}
-    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11}
-    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12}
-    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13}
-    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14}
-    LIST = {"java_class": "Void.class", "python_class": None, "id": 15}
-    MAP = {"java_class": "Void.class", "python_class": None, "id": 16}
+    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1, "hive_name": 'boolean'}
+    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2, "hive_name": 'int'}
+    LONG = {"java_class": "Long.class", "python_class": int, "id": 3, "hive_name": 'bigint'}
+    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4, "hive_name": 'float'}
+    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5, "hive_name": 'double'}
+    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6, "hive_name": 'date'}
+    TIME = {"java_class": "Long.class", "python_class": int, "id": 7, "hive_name": 'string'}
+    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8, "hive_name": 'timestamp'}
+    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9, "hive_name": 'string'}
+    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10, "hive_name": 'string'}
+    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11, "hive_name": 'binary'}
+    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12, "hive_name": "binary"}
+    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13, "hive_name": None}
+    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14, "hive_name": None}
+    LIST = {"java_class": "Void.class", "python_class": None, "id": 15, "hive_name": None}
+    MAP = {"java_class": "Void.class", "python_class": None, "id": 16, "hive_name": None}

Review comment:
       agreed. thats a good point. Thats fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r476619221



##########
File path: python/iceberg/core/filesystem/filesystem_tables.py
##########
@@ -37,7 +37,19 @@ def load(self: "FilesystemTables", table_identifier: str) -> Table:
 
     def create(self: "FilesystemTables", schema: Schema, table_identifier: str, spec: PartitionSpec = None,
                properties: dict = None, location: str = None) -> Table:
+        """
+        Create a new table on the filesystem.
+
+        Note: it is expected that the filesystem has atomic operations to ensure consistency for metadata updates.
+        Filesystems that don't have this guarantee could lead to data loss.
+
+        Only table_identifier or location should be not None. Will throw an error if both are not None.

Review comment:
       This doesn't agree with what is enforced. I like that `location` must always be `None` in the code so it would be good to state that here. It would also be easier to read since it avoids "should be not None".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-670803702


   @rymurr, overall this looks good to me. We might want to make it more pythonic in areas, like relying on `if reference` instead of `if reference is not None`, but those are minor.
   
   Have you tested this in practice by actually connecting to a Hive Metastore? It looks like the tests only mock the behavior. We might want to repurpose some of the Java test code to run a metastore for the duration of python testing from a script.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-668923338


   @rymurr, I should be able to review it this week and hopefully tomorrow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475209143



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):

Review comment:
       I agree, I have added types in places where this PR has made changes. Hopefully we can build full type coverage over time rather than having someone slog through the code base and add them individually




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r476622260



##########
File path: python/iceberg/core/filesystem/filesystem_tables.py
##########
@@ -37,7 +37,19 @@ def load(self: "FilesystemTables", table_identifier: str) -> Table:
 
     def create(self: "FilesystemTables", schema: Schema, table_identifier: str, spec: PartitionSpec = None,
                properties: dict = None, location: str = None) -> Table:
+        """
+        Create a new table on the filesystem.
+
+        Note: it is expected that the filesystem has atomic operations to ensure consistency for metadata updates.
+        Filesystems that don't have this guarantee could lead to data loss.
+
+        Only table_identifier or location should be not None. Will throw an error if both are not None.

Review comment:
       yup, apologies. Changed my mind halfway through impl. Fixed now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467337995



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):
+        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        if len(parts) > 1:
+            database = parts[0]
+            table = parts[1]
+        else:
+            database = "default"
+            table = parts[0]
         ops = self.new_table_ops(self.conf, database, table)
         if ops.current() is None:
             raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
         return BaseTable(ops, "{}.{}".format(database, table))
 
-    def create(self, schema, spec, table_identifier=None, database=None, table=None):
-        raise RuntimeError("Not Yet Implemented")
+    def create(self, schema, table_identifier=None, spec=None, properties=None):
+        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        ops = self.new_table_ops(self.conf, database, table)
+        if ops.current() is not None:
+            raise AlreadyExistsException("Table already exists: " + table_identifier)
+
+        base_location = self.default_warehouse_location(self.conf, database, table)
+
+        metadata = TableMetadata.new_table_metadata(ops, schema, spec, base_location, dict() if properties is None else properties)

Review comment:
       I'm getting a warning about PEP 8 and limiting lines to 120 chars. Is that something we want to do in Python?
   
   Seems to make sense to me to limit lines to 120.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475786217



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -85,20 +89,20 @@ def commit(self, base, metadata):
 
             tbl.sd = storage_descriptor(metadata)
             metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
-            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            base_metadata_location = base.location if base else None

Review comment:
       That seems confusing to me. How do you get the table's base location in Python?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-660686390


   Should be  merged after #1214 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467338601



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):

Review comment:
       Minor: if we add types, the docs are better. For example, changing this to `table_identifier: str` made it so that documentation for `str.rsplit` is found, at least in PyCharm. We should probably start adding types where it makes sense, like in the base class `Tables`.
   
   Not something we need to do in this commit, but it really helps reviewing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475780993



##########
File path: python/iceberg/api/types/type.py
##########
@@ -22,22 +22,22 @@
 
 @unique
 class TypeID(Enum):
-    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1}
-    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2}
-    LONG = {"java_class": "Long.class", "python_class": int, "id": 3}
-    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4}
-    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5}
-    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6}
-    TIME = {"java_class": "Long.class", "python_class": int, "id": 7}
-    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8}
-    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9}
-    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10}
-    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11}
-    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12}
-    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13}
-    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14}
-    LIST = {"java_class": "Void.class", "python_class": None, "id": 15}
-    MAP = {"java_class": "Void.class", "python_class": None, "id": 16}
+    BOOLEAN = {"java_class": "Boolean.class", "python_class": bool, "id": 1, "hive_name": 'boolean'}
+    INTEGER = {"java_class": "Integer.class", "python_class": int, "id": 2, "hive_name": 'int'}
+    LONG = {"java_class": "Long.class", "python_class": int, "id": 3, "hive_name": 'bigint'}
+    FLOAT = {"java_class": "Float.class", "python_class": float, "id": 4, "hive_name": 'float'}
+    DOUBLE = {"java_class": "Double.class", "python_class": float, "id": 5, "hive_name": 'double'}
+    DATE = {"java_class": "Integer.class", "python_class": int, "id": 6, "hive_name": 'date'}
+    TIME = {"java_class": "Long.class", "python_class": int, "id": 7, "hive_name": 'string'}
+    TIMESTAMP = {"java_class": "Long.class", "python_class": int, "id": 8, "hive_name": 'timestamp'}
+    STRING = {"java_class": "CharSequence.class", "python_class": str, "id": 9, "hive_name": 'string'}
+    UUID = {"java_class": "java.util.UUID.class", "python_class": uuid.UUID, "id": 10, "hive_name": 'string'}
+    FIXED = {"java_class": "ByteBuffer.class", "python_class": bytes, "id": 11, "hive_name": 'binary'}
+    BINARY = {"java_class": "ByteBuffer.class", "python_class": bytearray, "id": 12, "hive_name": "binary"}
+    DECIMAL = {"java_class": "BigDecimal.class", "python_class": Decimal, "id": 13, "hive_name": None}
+    STRUCT = {"java_class": "Void.class", "python_class": None, "id": 14, "hive_name": None}
+    LIST = {"java_class": "Void.class", "python_class": None, "id": 15, "hive_name": None}
+    MAP = {"java_class": "Void.class", "python_class": None, "id": 16, "hive_name": None}

Review comment:
       I don't really like the idea of exposing hive types through Iceberg types natively.  Seems like we should externalize this and have a separate hive<->iceberg type mapping utility class/function.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475209778



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -16,28 +16,51 @@
 # under the License.
 
 from iceberg.api import Tables
-from iceberg.exceptions import NoSuchTableException
+from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 from .base_table import BaseTable
+from .table_metadata import TableMetadata
 
 
 class BaseMetastoreTables(Tables):
+    DOT = '.'
 
     def __init__(self, conf):
         self.conf = conf
 
     def new_table_ops(self, conf, database, table):
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, database, table):
+    def load(self, table_identifier):
+        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+        if len(parts) > 1:
+            database = parts[0]
+            table = parts[1]
+        else:
+            database = "default"
+            table = parts[0]
         ops = self.new_table_ops(self.conf, database, table)
         if ops.current() is None:
             raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
         return BaseTable(ops, "{}.{}".format(database, table))
 
-    def create(self, schema, spec, table_identifier=None, database=None, table=None):
-        raise RuntimeError("Not Yet Implemented")
+    def create(self, schema, table_identifier=None, spec=None, properties=None):
+        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)

Review comment:
       yes, fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467342183



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:

Review comment:
       Couldn't this just be `if self.current_metadata_location`? I think both `None` and `""` evaluate to `False` in python if statements.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rymurr commented on pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rymurr commented on pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#issuecomment-679267344


   > > It works, right?
   > 
   > This comment made my morning :)
   
   Thanks @jacques-n  :rofl:  and yes @rdblue it is working :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r475757923



##########
File path: python/iceberg/core/base_metastore_tables.py
##########
@@ -14,46 +14,40 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from typing import Tuple
 
-from iceberg.api import Tables
-from iceberg.exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
-
+from . import TableOperations
 from .base_table import BaseTable
 from .table_metadata import TableMetadata
+from ..api import PartitionSpec, Schema, Table, Tables
+from ..exceptions import AlreadyExistsException, CommitFailedException, NoSuchTableException
 
 
 class BaseMetastoreTables(Tables):
-    DOT = '.'
 
-    def __init__(self, conf):
+    def __init__(self: "BaseMetastoreTables", conf: dict) -> None:
         self.conf = conf
 
-    def new_table_ops(self, conf, database, table):
+    def new_table_ops(self: "BaseMetastoreTables", conf: dict, database: str, table: str) -> "TableOperations":
         raise RuntimeError("Abstract Implementation")
 
-    def load(self, table_identifier):
-        parts = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
-        if len(parts) > 1:
-            database = parts[0]
-            table = parts[1]
-        else:
-            database = "default"
-            table = parts[0]
+    def load(self: "BaseMetastoreTables", table_identifier: str) -> Table:
+        database, table = _parse_table_identifier(table_identifier)
         ops = self.new_table_ops(self.conf, database, table)
-        if ops.current() is None:
-            raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
-
-        return BaseTable(ops, "{}.{}".format(database, table))
+        if ops.current():
+            return BaseTable(ops, "{}.{}".format(database, table))
+        raise NoSuchTableException("Table does not exist: {}.{}".format(database, table))
 
-    def create(self, schema, table_identifier=None, spec=None, properties=None):
-        database, table = table_identifier.rsplit(BaseMetastoreTables.DOT, 1)
+    def create(self: "BaseMetastoreTables", schema: Schema, table_identifier: str, spec: PartitionSpec = None,
+               properties: dict = None) -> Table:
+        database, table = _parse_table_identifier(table_identifier)
         ops = self.new_table_ops(self.conf, database, table)
-        if ops.current() is not None:
+        if ops.current() is not None:  # not None check here to ensure MagicMocks aren't treated as None

Review comment:
       I'm not sure I understand this comment. Will a mock object evaluate to `False` in an if?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1216: [Python] create table support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1216:
URL: https://github.com/apache/iceberg/pull/1216#discussion_r467342449



##########
File path: python/iceberg/hive/hive_table_operations.py
##########
@@ -50,10 +63,146 @@ def refresh(self):
         return self.current()
 
     def commit(self, base, metadata):
-        raise NotImplementedError()
+        new_metadata_location = self.write_new_metadata(metadata, self.version + 1)
+
+        threw = True
+        lock_id = None
+        try:
+            lock_id = self.acquire_lock()
+            if base is not None:
+                with self._client as open_client:
+                    tbl = open_client.get_table(self.database, self.table)
+            else:
+                current_time_millis = int(time.time())
+                tbl = Table(self.table,
+                            self.database,
+                            getpass.getuser(),
+                            current_time_millis // 1000,
+                            current_time_millis // 1000,
+                            sd=storage_descriptor(metadata),
+                            tableType="EXTERNAL_TABLE",
+                            parameters={'EXTERNAL': 'TRUE'})
+
+            tbl.sd = storage_descriptor(metadata)
+            metadata_location = tbl.parameters.get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, None)
+            base_metadata_location = base.metadataFileLocation() if base is not None else None
+            if base_metadata_location != metadata_location:
+                raise CommitFailedException(
+                    "Base metadata location '%s' is not same as the current table metadata location '%s' for %s.%s",
+                    base_metadata_location, metadata_location, self.database, self.table)
+
+            self.set_parameters(new_metadata_location, tbl)
+
+            if base is not None:
+                with self._client as open_client:
+                    env_context = EnvironmentContext(
+                        {"DO_NOT_UPDATE_STATS": "true"}
+                    )
+                    open_client.alter_table(self.database, self.table, tbl, env_context)
+
+            else:
+                with self._client as open_client:
+                    open_client.create_table(tbl)
+            threw = False
+        except AlreadyExistsException:
+            raise IcebergAlreadyExistsException("Table already exists: {}.{}".format(self.database, self.table))
+        except TException as e:
+            if e is not None and "Table/View 'HIVE_LOCKS' does not exist" in str(e):
+                raise Exception("""Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't
+                                exist, this probably happened when using embedded metastore or doesn't create a
+                                transactional meta table. To fix this, use an alternative metastore""", e)
+
+            raise Exception("Metastore operation failed for {}.{}".format(self.database, self.table), e)
+        finally:
+            if threw:
+                self.io().delete(new_metadata_location)
+            self.unlock(lock_id)
+
+    def set_parameters(self, new_metadata_location, tbl):
+        parameters = tbl.parameters
+
+        if not parameters:
+            parameters = dict()
+
+        parameters[BaseMetastoreTableOperations.TABLE_TYPE_PROP] = \
+            BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.upper()
+        parameters[BaseMetastoreTableOperations.METADATA_LOCATION_PROP] = new_metadata_location
+
+        if self.current_metadata_location is not None and len(self.current_metadata_location) > 0:
+            parameters[BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP] = self.current_metadata_location
+
+        tbl.parameters = parameters
+
+    def unlock(self, lock_id):
+        if lock_id:
+            try:
+                self.do_unlock(LockResponse(lock_id))
+            except Exception as e:
+                logging.warning("Failed to unlock {}.{}".format(self.database, self.table), e)
+
+    def do_unlock(self, lock_id):

Review comment:
       I think you can probably embed this method in `unlock`. I think the `doUnlock` method in Java is only there because of the use of `Optional` for a lock id.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org