You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/07/27 06:19:14 UTC
[GitHub] [beam] chamikaramj commented on a change in pull request #15126: Add option in ReadAll Transforms to add filename to the output

chamikaramj commented on a change in pull request #15126:
URL: https://github.com/apache/beam/pull/15126#discussion_r677140730



##########
File path: sdks/python/apache_beam/io/avroio.py
##########
@@ -181,6 +182,9 @@ def __init__(
                        splitting the input into bundles.
       use_fastavro (bool); when set, use the `fastavro` library for IO, which
         is significantly faster, and is now the default.
+      with_context: If True, returns a Key Value with the key being the file
+        path and the value being the actual read. If False, it only returns

Review comment:
       s/file path/file name ?
   s/read/data ?

##########
File path: sdks/python/apache_beam/io/parquetio.py
##########
@@ -64,14 +64,24 @@
 class _ArrowTableToRowDictionaries(DoFn):
   """ A DoFn that consumes an Arrow table and yields a python dictionary for
   each row in the table."""
-  def process(self, table):
+  def process(self, table, with_context=False):
+    if with_context:
+      path = table[0]
+      table = table[1]
     num_rows = table.num_rows
     data_items = table.to_pydict().items()
-    for n in range(num_rows):
-      row = {}
-      for column, values in data_items:
-        row[column] = values[n]
-      yield row
+    if with_context:
+      for n in range(num_rows):

Review comment:
       We should rearrange the code so that we don't copy blocks for code.

##########
File path: sdks/python/apache_beam/io/textio.py
##########
@@ -484,6 +485,9 @@ def __init__(
         from each source file. Must be 0 or higher. Large number of skipped
         lines might impact performance.
       coder: Coder used to decode each line.
+      with_context: If True, returns a Key Value with the key being the file

Review comment:
       Not sure if you noticed, but we also have a "ReadFromTextWithFilename". Probably good to describe the difference between the two in the module level documentation.

##########
File path: sdks/python/apache_beam/io/avroio.py
##########
@@ -171,6 +171,7 @@ def __init__(
       min_bundle_size=0,
       desired_bundle_size=DEFAULT_DESIRED_BUNDLE_SIZE,
       use_fastavro=True,
+      with_context=False,

Review comment:
       Should this be called "with_filename" or are you hoping to extend to add more information to the context ? (if latter we should return a more general object instead of a string).

##########
File path: sdks/python/apache_beam/io/filebasedsource.py
##########
@@ -377,8 +382,13 @@ def process(self, element, *args, **kwargs):
     if not source_list:
       return
     source = source_list[0].source
-    for record in source.read(range.new_tracker()):
-      yield record
+
+    if self._with_context:

Review comment:
       Probably cleaner to write as:
   
   for record in source.read(range.new_tracker()):
     if self._with_context:
       ...
     else:
       ...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org