Posted to user@flink.apache.org by Marco Villalobos <mv...@kineteque.com> on 2022/08/17 04:56:44 UTC

without DISTINCT unique lines show up many times in FLINK SQL

Hello everybody,

When I perform this simple set of queries, a unique line from the source file
shows up many times.

I have verified repeatedly that a single unique line in the source shows up as many as 100 times in the output of the SELECT statement.

Is this the correct behavior for Flink 1.15.1?

FYI, it does show the correct results when I perform a DISTINCT query.

Here is the SQL:


CREATE TABLE historical_raw_source_template(
        `file.path`              STRING NOT NULL METADATA,
        `file.name`              STRING NOT NULL METADATA,
        `file.size`              BIGINT NOT NULL METADATA,
        `file.modification-time` TIMESTAMP_LTZ(3) NOT NULL METADATA,
        line                    STRING
      ) WITH (
        'connector' = 'filesystem',   -- required: specify the connector
        'format' = 'raw'              -- required: the filesystem connector needs a format
      );


CREATE TABLE historical_raw_source
      WITH (
        'path' = 's3://raw/'      -- required: path to a directory
      ) LIKE historical_raw_source_template;


SELECT
        `file.modification-time` AS modification_time,
        `file.path` AS file_path,
        line
      FROM
        historical_raw_source;
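
For reference, the DISTINCT query mentioned above was not included in the post. A minimal sketch of what it might look like, assuming it selects the same three columns from the same table, followed by a diagnostic query that counts how often each (file, line) pair is emitted. The alias `occurrences` is hypothetical, and a line that legitimately occurs more than once in the same file would also be counted here:

```sql
-- Assumed form of the DISTINCT variant: same columns, duplicates collapsed.
SELECT DISTINCT
        `file.modification-time` AS modification_time,
        `file.path` AS file_path,
        line
      FROM
        historical_raw_source;

-- Diagnostic sketch: list every (file, line) pair emitted more than once.
SELECT
        `file.path` AS file_path,
        line,
        COUNT(*) AS occurrences   -- hypothetical alias
      FROM
        historical_raw_source
      GROUP BY
        `file.path`,
        line
      HAVING
        COUNT(*) > 1;
```

Comparing `occurrences` against the actual file contents would show whether the duplication is tied to particular files or spread evenly across the directory.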

Re: without DISTINCT unique lines show up many times in FLINK SQL

Posted by yuxia <lu...@alumni.sjtu.edu.cn>.
This seems to be the same problem as the one discussed in [1].

[1]:https://lists.apache.org/thread/3lvkd8hryb1zdxs3o8z65mrjyoqzs88l

Best regards,
Yuxia
