Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/22 18:51:37 UTC

[GitHub] [beam] tvalentyn opened a new issue, #22000: [Bug]: Possible regression in 0de98210f .

tvalentyn opened a new issue, #22000:
URL: https://github.com/apache/beam/issues/22000

   ### What happened?
   
   https://github.com/apache/beam/commit/0de98210f4531fbfd88265bc02052b27bd299602 introduces a regression in a TFT unit test.
   
   ```
   ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() [while running 'AnalyzeAndTransformDataset/AnalyzeDataset/FlattenLists[approximate_vocabulary]/FlattenLists']
   ```
   
   Repro (in a clean environment):
   
   `pip install tensorflow-transform`
   
   create a file test.py:
   
   ```
   import os
   
   import apache_beam as beam
   
   import tensorflow as tf
   import tensorflow_transform as tft
   from tensorflow_transform import common_types
   from tensorflow_transform.beam import analyzer_impls
   from tensorflow_transform.beam import impl as beam_impl
   from tensorflow_transform.beam import tft_unit
   from tensorflow_transform.beam.tft_beam_io import transform_fn_io
   
   from tensorflow_metadata.proto.v0 import schema_pb2
   
   _COMPOSITE_COMPUTE_AND_APPLY_VOCABULARY_TEST_CASES = [
       dict(
           testcase_name='sparse',
           input_data=[
               {
                   'val': ['hello'],
                   'idx0': [0],
                   'idx1': [0]
               },
               {
                   'val': ['world'],
                   'idx0': [1],
                   'idx1': [1]
               },
               {
                   'val': ['hello', 'goodbye'],
                   'idx0': [0, 1],
                   'idx1': [1, 2]
               },
               {
                   'val': ['hello', 'goodbye', ' '],
                   'idx0': [0, 1, 1],
                   'idx1': [0, 1, 2]
               },
           ],
           input_metadata=tft.DatasetMetadata.from_feature_spec({
               'x': tf.io.SparseFeature(['idx0', 'idx1'], 'val', tf.string, [2, 3])
           }),
           expected_data=[{
               'index$sparse_indices_0': [0],
               'index$sparse_indices_1': [0],
               'index$sparse_values': [0],
           }, {
               'index$sparse_indices_0': [1],
               'index$sparse_indices_1': [1],
               'index$sparse_values': [2],
           }, {
               'index$sparse_indices_0': [0, 1],
               'index$sparse_indices_1': [1, 2],
               'index$sparse_values': [0, 1],
           }, {
               'index$sparse_indices_0': [0, 1, 1],
               'index$sparse_indices_1': [0, 1, 2],
               'index$sparse_values': [0, 1, 3],
           }],
           expected_vocab_file_contents={
               'my_vocab': [b'hello', b'goodbye', b'world', b' ']
           }),
   ]
   
   
   class VocabularyIntegrationTest(tft_unit.TransformTestCase):
   
     def setUp(self):
       tf.compat.v1.logging.info('Starting test case: %s', self._testMethodName)
       super().setUp()
   
     def _VocabFormat(self):
       return 'text'
   
     _WITH_LABEL_PARAMS = tft_unit.cross_named_parameters([
         dict(
             testcase_name='_string',
             x_data=[
                 b'hello', b'hello', b'hello', b'goodbye', b'aaaaa', b'aaaaa',
                 b'goodbye', b'goodbye', b'aaaaa', b'aaaaa', b'goodbye', b'goodbye'
             ],
             x_feature_spec=tf.io.FixedLenFeature([], tf.string),
             expected_vocab_file_contents=[(b'goodbye', 1.9753224),
                                           (b'aaaaa', 1.6600707),
                                           (b'hello', 1.2450531)]),
         dict(
             testcase_name='_int64',
             x_data=[3, 3, 3, 1, 2, 2, 1, 1, 2, 2, 1, 1],
             x_feature_spec=tf.io.FixedLenFeature([], tf.int64),
             expected_vocab_file_contents=[(b'1', 1.9753224), (b'2', 1.6600707),
                                           (b'3', 1.2450531)]),
     ], [
         dict(
             testcase_name='with_label',
             label_data=[1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
             label_feature_spec=tf.io.FixedLenFeature([], tf.int64),
             min_diff_from_avg=0.0,
             store_frequency=True),
     ])
   
     @tft_unit.named_parameters(*tft_unit.cross_named_parameters(
         [
             dict(
                 testcase_name='_string',
                 input_data=[{
                     'x': b'hello'
                 }, {
                     'x': b'hello'
                 }, {
                     'x': b'hello'
                 }, {
                     'x': b'goodbye'
                 }, {
                     'x': b'aaaaa'
                 }, {
                     'x': b'aaaaa'
                 }, {
                     'x': b'goodbye'
                 }, {
                     'x': b'goodbye'
                 }, {
                     'x': b'aaaaa'
                 }, {
                     'x': b'aaaaa'
                 }, {
                     'x': b'goodbye'
                 }, {
                     'x': b'goodbye'
                 }],
                 make_feature_spec=lambda:  # pylint: disable=g-long-lambda
                 {'x': tf.io.FixedLenFeature([], tf.string)},
                 top_k=2,
                 make_expected_vocab_fn=(
                     lambda _: [(b'goodbye', 5), (b'aaaaa', 4)])),
         ],
         [
             dict(testcase_name='no_frequency', store_frequency=False),
         ]))
     def testApproximateVocabulary(self, input_data, make_feature_spec, top_k,
                                   make_expected_vocab_fn, store_frequency):
       input_metadata = tft.DatasetMetadata.from_feature_spec(
           tft_unit.make_feature_spec_wrapper(make_feature_spec))
   
       def preprocessing_fn(inputs):
         x = inputs['x']
         weights = inputs.get('weights')
         # Note even though the return value is not used, calling
         # tft.experimental.approximate_vocabulary will generate the vocabulary as
         # a side effect, and since we have named this vocabulary it can be looked
         # up using public APIs.
         tft.experimental.approximate_vocabulary(
             x,
             top_k,
             store_frequency=store_frequency,
             weights=weights,
             vocab_filename='my_approximate_vocab',
             file_format=self._VocabFormat())
         return inputs
   
       expected_vocab_file_contents = make_expected_vocab_fn(self._VocabFormat())
       if not store_frequency:
         expected_vocab_file_contents = [
             token for token, _ in expected_vocab_file_contents
         ]
       self.assertAnalyzeAndTransformResults(
           input_data,
           input_metadata,
           preprocessing_fn,
           expected_vocab_file_contents={
               'my_approximate_vocab': expected_vocab_file_contents
           })
   
   if __name__ == '__main__':
     tft_unit.main()
   ```
   Then run `python -m test`.
   
   
   ### Issue Priority
   
   Priority: 1
   
   ### Issue Component
   
   Component: sdk-py-core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] TheNeuralBit commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1275030912

   Hey @pabloem it looks like you may have missed my reply. This was due to my change and has been addressed in #22006


[GitHub] [beam] pabloem commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
pabloem commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1274197292

   Ah ok sorry I was confused. @TheNeuralBit can you take a look?
   
   On Wed, Jun 22, 2022, 12:26 PM tvalentyn ***@***.***> wrote:
   
   > ah nvm this may not be related to @TheNeuralBit
   > <https://github.com/TheNeuralBit> 's change
   >
   > I confirmed the commit by bisection
   >
   


[GitHub] [beam] TheNeuralBit closed issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
TheNeuralBit closed issue #22000: [Bug]: Possible regression in 0de98210f .
URL: https://github.com/apache/beam/issues/22000


[GitHub] [beam] TheNeuralBit commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163593155

   Thanks @tvalentyn. I repro'ed and got the full stack trace:
   
   ```
         Root cause: Python exception: Traceback (most recent call last):
     File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
       return self.do_fn_invoker.invoke_process(windowed_value)
     File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
       self.output_handler.handle_process_outputs(
     File "apache_beam/runners/common.py", line 1564, in apache_beam.runners.common._OutputHandler.handle_process_outputs
       results = results or []
   ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
   ```
   
   This looks to be due to an unnecessary cleanup I included in 0de98210f4531fbfd88265bc02052b27bd299602: https://github.com/apache/beam/commit/0de98210f4531fbfd88265bc02052b27bd299602#diff-350e908b1c77bb2c78ff5a16f77e7da591ea87dd3911a296be8c598162896b19R1608
   
   I will roll that part back and propose a cherrypick.
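
To illustrate why `results = results or []` fails here, a minimal sketch follows. `ArrayLike` is a hypothetical stand-in that mimics how a multi-element `numpy.ndarray` refuses truth-value testing, so the snippet runs without numpy. `collect_with_or` and `collect_with_none_check` are illustrative names, not Beam functions, and the `is None` variant is only one possible way to avoid the error, not necessarily what the rollback does.

```python
class ArrayLike:
    """Hypothetical stand-in for a numpy.ndarray with more than one element."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)

    def __bool__(self):
        # A multi-element numpy array raises a ValueError like this one
        # whenever it is used in a boolean context.
        raise ValueError(
            "The truth value of an array with more than one element is "
            "ambiguous. Use a.any() or a.all()")


def collect_with_or(results):
    # Pattern from the traceback: `or` calls bool(results) before
    # deciding which operand to keep, so array-like results raise.
    results = results or []
    return list(results)


def collect_with_none_check(results):
    # Only None is replaced; bool(results) is never evaluated,
    # so array-like results are simply iterated.
    if results is None:
        results = []
    return list(results)
```

With these definitions, `collect_with_none_check(ArrayLike([1, 2, 3]))` iterates the elements normally, while `collect_with_or(ArrayLike([1, 2, 3]))` raises the `ValueError` shown in the traceback. Both helpers return an empty list for `None`.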


[GitHub] [beam] pabloem commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
pabloem commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163490947

   fyi: @TheNeuralBit 


[GitHub] [beam] tvalentyn commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163517021

   > ah nvm this may not be related to @TheNeuralBit 's change
   
   I confirmed the commit by bisection


[GitHub] [beam] TheNeuralBit commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163594023

   For impact, I think this breaks any DoFn that returns a numpy array and expects beam to iterate over it (as in a FlatMap).
   


[GitHub] [beam] pabloem commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
pabloem commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163491357

   ah nvm this may not be related to @TheNeuralBit 's change


[GitHub] [beam] tvalentyn commented on issue #22000: [Bug]: Possible regression in 0de98210f .

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #22000:
URL: https://github.com/apache/beam/issues/22000#issuecomment-1163488896

   cc: @pabloem 
   

