You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "kou (via GitHub)" <gi...@apache.org> on 2023/04/17 09:28:17 UTC

[GitHub] [arrow] kou opened a new pull request, #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

kou opened a new pull request, #35178:
URL: https://github.com/apache/arrow/pull/35178

   ### Rationale for this change
   
   No ordering is unnecessarily limiting.  Systems can and do implement distributed sorts, but they can’t reflect this in Flight RPC.
   
   ### What changes are included in this PR?
   
   These changes add `FlightInfo.ordered`.
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1511007502

   * Closes: #34852


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1516193333

   @github-actions crossbow submit preview-docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1180943748


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Thanks for the suggestion - the new wording sounds good to me (Kou already pointed out the typo)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1540623944

   Benchmark runs are scheduled for baseline = aa8ffbe1a1fe630316c8ce8a0785a0559d0ffb03 and contender = b73ddc3336f9834ae61c14a285e8fd243e979222. b73ddc3336f9834ae61c14a285e8fd243e979222 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/96090c8d777a4e2b8d24aae25d4f52fc...2f29f45ec24d44f7ac6c9efb153c2316/)
   [Finished :arrow_down:2.82% :arrow_up:0.23%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/afb80cb3985c4b4782a3b4eb35d522e6...26f058375b2146d19bd0fa65e2c4758a/)
   [Finished :arrow_down:0.26% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/72485ee38f534b72853b6768f7fd37e4...f0f03a8144cd41ddace31663ae5bdb10/)
   [Finished :arrow_down:4.25% :arrow_up:3.76%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ed9ebb62f8f144e596d467f1a22d634c...80080f6681aa4654bd60ec3a3a542129/)
   Buildkite builds:
   [Finished] [`b73ddc33` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2834)
   [Finished] [`b73ddc33` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2871)
   [Finished] [`b73ddc33` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2834)
   [Finished] [`b73ddc33` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2861)
   [Finished] [`aa8ffbe1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2833)
   [Finished] [`aa8ffbe1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2870)
   [Finished] [`aa8ffbe1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2833)
   [Finished] [`aa8ffbe1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2860)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1514044399

   An integration test is added.
   We can review this but we can NOT merge this yet. We need one more implementation and a formal vote before we merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1170955786


##########
format/Flight.proto:
##########
@@ -266,14 +266,28 @@ message FlightInfo {
    * In other words, an application can use multiple endpoints to
    * represent partitioned data.
    *
-   * There is no ordering defined on endpoints. Hence, if the returned
-   * data has an ordering, it should be returned in a single endpoint.
+   * If the returned data has an ordering, an application can use
+   * "FlightInfo.ordered = true" or should return the all data in a
+   * single endpoint.
+   *
+   * Note that a client may ignore "FlightInfo.ordered = true". If an
+   * ordering is important for an application, an application must
+   * choose one of them:
+   *
+   * * An application requires that all clients must read data in
+   *   returned endpoints order.
+   * * An application must return the all data in a single endpoint.

Review Comment:
   Compared to the wording in Flight.rst, it is not as clear what this is trying to say



##########
cpp/src/arrow/flight/sql/example/sqlite_server.cc:
##########
@@ -305,7 +305,7 @@ class SQLiteFlightSqlServer::Impl {
                           EncodeTransactionQuery(query, command.transaction_id));
     std::vector<FlightEndpoint> endpoints{FlightEndpoint{std::move(ticket), {}}};
     ARROW_ASSIGN_OR_RAISE(auto result,
-                          FlightInfo::Make(*schema, descriptor, endpoints, -1, -1))
+                          FlightInfo::Make(*schema, descriptor, endpoints, -1, -1, false))

Review Comment:
   SQLite doesn't tell us whether the result set is sorted or not, so I suppose technically this should always be 'true'? (That said, this server also only ever returns one endpoint so it's moot either way.)



##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,13 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. The client can read ordered data by
+   reading data from returned endpoints in order from front to
+   back. Note that a client may ignore ``FlightInfo.ordered``. If an
+   ordering is important and the client may ignore
+   ``FlightInfo.ordered``, applications should return data in a single
+   endpoint.

Review Comment:
   Maybe to be clear, leave in an "Otherwise, there is no ordering defined on endpoints or the data within"?



##########
dev/archery/archery/integration/runner.py:
##########
@@ -430,6 +430,11 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True,
             "middleware",
             description="Ensure headers are propagated via middleware.",
         ),
+        Scenario(
+            "ordered",
+            description="Ensure FlightInfo.ordered is supported.",
+            skip={"Java", "JS", "C#", "Go", "Rust"},

Review Comment:
   I can put together Java support too.



##########
cpp/src/arrow/flight/integration_tests/test_integration.cc:
##########
@@ -271,6 +274,148 @@ class MiddlewareScenario : public Scenario {
   std::shared_ptr<TestClientMiddlewareFactory> client_middleware_;
 };
 
+/// \brief The server used for testing FlightInfo.ordered.
+///
+/// If the given command is "ordered", the server sets
+/// FlightInfo.ordered. The client that supports FlightInfo.ordered
+/// must read data from endpoints from front to back. The client that
+/// doesn't support FlightInfo.ordered may read data from endpoints in
+/// random order.
+///
+/// This scenario is passed only when the client supports
+/// FlightInfo.ordered.
+class OrderedServer : public FlightServerBase {
+  Status GetFlightInfo(const ServerCallContext& context,
+                       const FlightDescriptor& descriptor,
+                       std::unique_ptr<FlightInfo>* result) override {
+    const auto ordered = (descriptor.type == FlightDescriptor::DescriptorType::CMD &&
+                          descriptor.cmd == "ordered");
+    auto schema = BuildSchema();
+    std::vector<FlightEndpoint> endpoints;
+    if (ordered) {
+      endpoints.push_back(FlightEndpoint{{"1"}, {}});
+      endpoints.push_back(FlightEndpoint{{"2"}, {}});
+      endpoints.push_back(FlightEndpoint{{"3"}, {}});
+    } else {
+      endpoints.push_back(FlightEndpoint{{"1"}, {}});
+      endpoints.push_back(FlightEndpoint{{"3"}, {}});
+      endpoints.push_back(FlightEndpoint{{"2"}, {}});
+    }
+    ARROW_ASSIGN_OR_RAISE(
+        auto info, FlightInfo::Make(*schema, descriptor, endpoints, -1, -1, ordered));
+    *result = std::make_unique<FlightInfo>(info);
+    return Status::OK();
+  }
+
+  Status DoGet(const ServerCallContext& context, const Ticket& request,
+               std::unique_ptr<FlightDataStream>* stream) override {
+    ARROW_ASSIGN_OR_RAISE(auto builder, RecordBatchBuilder::Make(
+                                            BuildSchema(), arrow::default_memory_pool()));
+    auto number_builder = builder->GetFieldAs<Int32Builder>(0);
+    if (request.ticket == "1") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(1));
+      ARROW_RETURN_NOT_OK(number_builder->Append(2));
+      ARROW_RETURN_NOT_OK(number_builder->Append(3));
+    } else if (request.ticket == "2") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(10));
+      ARROW_RETURN_NOT_OK(number_builder->Append(20));
+      ARROW_RETURN_NOT_OK(number_builder->Append(30));
+    } else if (request.ticket == "3") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(100));
+      ARROW_RETURN_NOT_OK(number_builder->Append(200));
+      ARROW_RETURN_NOT_OK(number_builder->Append(300));
+    } else {
+      return Status::KeyError("Could not find flight: ", request.ticket);
+    }
+    ARROW_ASSIGN_OR_RAISE(auto record_batch, builder->Flush());
+    std::vector<std::shared_ptr<RecordBatch>> record_batches{record_batch};
+    ARROW_ASSIGN_OR_RAISE(auto record_batch_reader,
+                          RecordBatchReader::Make(record_batches));
+    *stream = std::make_unique<RecordBatchStream>(record_batch_reader);
+    return Status::OK();
+  }
+
+ private:
+  std::shared_ptr<Schema> BuildSchema() {
+    return arrow::schema({arrow::field("number", arrow::int32())});
+  }
+};
+
+/// \brief The ordered scenario.
+///
+/// This tests that the server and client get expected header values.
+class OrderedScenario : public Scenario {
+  Status MakeServer(std::unique_ptr<FlightServerBase>* server,
+                    FlightServerOptions* options) override {
+    server->reset(new OrderedServer());
+    return Status::OK();
+  }
+
+  Status MakeClient(FlightClientOptions* options) override { return Status::OK(); }
+
+  Status RunClient(std::unique_ptr<FlightClient> client) override {
+    ARROW_ASSIGN_OR_RAISE(auto info,
+                          client->GetFlightInfo(FlightDescriptor::Command("ordered")));
+    const auto& endpoints = info->endpoints();
+    std::vector<FlightEndpoint> ordered_endpoints(endpoints.size());
+    // If the data is ordered, read data from front to back.
+    // If the data is NOT ordered, read data from back to front.
+    if (info->ordered()) {

Review Comment:
   Shouldn't we be asserting that this is true here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1172002810


##########
cpp/src/arrow/flight/types.h:
##########
@@ -554,6 +556,9 @@ class ARROW_FLIGHT_EXPORT FlightInfo {
   /// The total number of bytes in the dataset. If unknown, set to -1
   int64_t total_bytes() const { return data_.total_bytes; }
 
+  /// Whether endpoints are in the same order as the data.
+  bool ordered() const { return data_.ordered; }

Review Comment:
   Good question, but no, this is just a semantic flag to indicate that (unlike the current status) there is _some_ ordering on the dataset. It's still up to the application to define what that means.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174733194


##########
go/arrow/internal/flight_integration/scenario.go:
##########
@@ -526,6 +528,168 @@ func (m *middlewareScenarioTester) GetFlightInfo(ctx context.Context, desc *flig
 	}, nil
 }
 
+type orderedScenarioTester struct {
+	flight.BaseFlightServer
+}
+
+func (m *orderedScenarioTester) RunClient(addr string, opts ...grpc.DialOption) error {
+	client, err := flight.NewClientWithMiddleware(addr, nil, nil, opts...)
+	if err != nil {
+		return err
+	}
+	defer client.Close()
+
+	ctx := context.Background()
+	info, err := client.GetFlightInfo(ctx, &flight.FlightDescriptor{Type: flight.DescriptorCMD, Cmd: []byte("ordered")})
+	if err != nil {
+		return err
+	}
+
+	if !info.GetOrdered() {
+		return fmt.Errorf("expected to server return FlightInfo.ordered = true")
+	}
+
+	recs := make([]arrow.Record, len(info.Endpoint))
+	for i, ep := range info.Endpoint {
+		if len(ep.Location) != 0 {
+			return fmt.Errorf("expected to receive empty locations to use the original service: %s",
+				ep.Location)
+		}
+
+		stream, err := client.DoGet(ctx, ep.Ticket)
+		if err != nil {
+			return err
+		}
+
+		rdr, err := flight.NewRecordReader(stream)
+		if err != nil {
+			return err
+		}
+		defer rdr.Release()
+
+		for rdr.Next() {
+			record := rdr.Record()
+			record.Retain()
+			defer record.Release()
+			recs[i] = record
+		}
+		if rdr.Err() != nil {
+			return rdr.Err()
+		}
+	}
+
+	// Build expected records
+	mem := memory.DefaultAllocator
+	schema := arrow.NewSchema(
+		[]arrow.Field{
+			{Name: "number", Type: arrow.PrimitiveTypes.Int32},
+		},
+		nil,
+	)
+	expected_table, _ := array.TableFromJSON(mem, schema, []string{
+		`[
+                   {"number": 1},
+                   {"number": 2},
+                   {"number": 3}
+                 ]`,
+		`[
+                   {"number": 10},
+                   {"number": 20},
+                   {"number": 30}
+                 ]`,
+		`[
+                   {"number": 100},
+                   {"number": 200},
+                   {"number": 300}
+                 ]`,
+	})
+	defer expected_table.Release()
+
+	table := array.NewTableFromRecords(schema, recs)
+	defer table.Release()
+	if !array.TableEqual(table, expected_table) {
+		return fmt.Errorf("read data isn't expected\n" +
+			"Expected:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d\n" +
+			"Actual:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d",
+			expected_table.Schema(),
+			expected_table.NumRows(),
+			expected_table.NumCols(),
+			table.Schema(),
+			table.NumRows(),
+			table.NumCols())

Review Comment:
   Created an issue for `arrow.Table.ToString()`: https://github.com/apache/arrow/issues/35296



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1517118146

   I took a look at the Go implementation. It seems that we don't need to change the Go implementation for this. (We just need to regenerated `Flight.pb.go`.)
   
   I'll add a Go integration test for this before we start a discussion for this proposal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1511007560

   :warning: GitHub issue #34852 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alamb commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1181038605


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Updated wording:
   
   ```suggestion
      requested. 
      
      If ``FlightInfo.ordered`` is true, this signals there is some
      order between data from different endpoints. 
      Clients should produce the same results as if the data returned 
      from each of the endpoints was concatenated, in order, from front to back. 
      If ``FlightInfo.ordered`` is not set, the client may return data  from any 
      of the endpoints in arbitrary order. Data from any specific endpoint
      must be returned in order, but the data from different endpoints may be 
      interleaved to allow parallel fetches.  
      Note that since some clients may ignore ``FlightInfo.ordered``,
      if ordering is important and client support can not be ensured, 
      servers should return a single endpoint.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1182048427


##########
docs/source/format/Flight.rst:
##########
@@ -117,7 +129,9 @@ A client that wishes to download the data would:
    The client must consume all endpoints to retrieve the complete data
    set. The client can consume endpoints in any order, or even in
    parallel, or distribute the endpoints among multiple machines for
-   consumption; this is up to the application to implement.
+   consumption; this is up to the application to implement. The client
+   can also use ``FlightInfo.ordered``. See the previous item for
+   details of ``FlightInfo.ordered``.

Review Comment:
   I forgot to update this sentence...
   If anyone has a suggestion for wording, please share it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1172002140


##########
cpp/src/arrow/flight/types.h:
##########
@@ -554,6 +556,9 @@ class ARROW_FLIGHT_EXPORT FlightInfo {
   /// The total number of bytes in the dataset. If unknown, set to -1
   int64_t total_bytes() const { return data_.total_bytes; }
 
+  /// Whether endpoints are in the same order as the data.
+  bool ordered() const { return data_.ordered; }

Review Comment:
   I am not familiar with flight so maybe a dumb question: does the user know which order it is? Like ascending/descending and what are the sorted columns?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1177164694


##########
dev/archery/archery/integration/runner.py:
##########
@@ -430,6 +430,11 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True,
             "middleware",
             description="Ensure headers are propagated via middleware.",
         ),
+        Scenario(
+            "ordered",
+            description="Ensure FlightInfo.ordered is supported.",
+            skip={"Java", "JS", "C#", "Go", "Rust"},

Review Comment:
   Thanks!
   Merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1516196388

   Revision: a0e1184bb338162c175b13cae0b6f0c15c1fdfea
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-8737544410](https://github.com/ursacomputing/crossbow/branches/all?query=actions-8737544410)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8737544410-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4753922764/jobs/8446104447)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alamb commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1181038605


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Updated wording:
   
   ```suggestion
      requested. 
      
      If ``FlightInfo.ordered`` is true, this signals there is some
      order between data from different endpoints. 
      Clients should produce the same results as if the data returned 
      from each of the endpoints was concatenated, in order, from front to back. 
      If ``FlightInfo.ordered`` is not set, the client may return data from any 
      of the endpoints in arbitrary order. Data from any specific endpoint
      must be returned in order, but the data from different endpoints may be 
      interleaved to allow parallel fetches.  
      Note that since some clients may ignore ``FlightInfo.ordered``,
      if ordering is important and client support can not be ensured, 
      servers should return a single endpoint.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1180934735


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Thanks!!!
   
   > data  from from
   
   Is this a typo of `data from`?
   
   @lidavidm Could you also review this suggestion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1528769876

   @github-actions crossbow submit preview-docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1530817128

   Revision: 2cc83046a3767b4813553d2f019ce6553b361d3f
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-4c9d228232](https://github.com/ursacomputing/crossbow/branches/all?query=actions-4c9d228232)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-4c9d228232-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4857292315/jobs/8657640342)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1530815596

   @github-actions crossbow submit preview-docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou merged pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou merged PR #35178:
URL: https://github.com/apache/arrow/pull/35178


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1178102333


##########
go/arrow/flight/server.go:
##########
@@ -52,14 +52,9 @@ type (
 	Empty                           = flight.Empty
 )
 
-// FlightService_ServiceDesc is the grpc.ServiceDesc for the FlightService
-// server. It should only be used for direct call of grpc.RegisterService,
-// and not introspected or modified (even as a copy).
-var FlightService_ServiceDesc = flight.FlightService_ServiceDesc
-

Review Comment:
   Interesting, i see that the updated versions for the `Register` method actually takes a `*grpc.Server` which should likely be sufficient to add a flight service to an existing server. So I think this is fine, but we should definitely mark this as a breaking change as a result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174728740


##########
go/arrow/internal/flight_integration/scenario.go:
##########
@@ -526,6 +528,168 @@ func (m *middlewareScenarioTester) GetFlightInfo(ctx context.Context, desc *flig
 	}, nil
 }
 
+type orderedScenarioTester struct {
+	flight.BaseFlightServer
+}
+
+func (m *orderedScenarioTester) RunClient(addr string, opts ...grpc.DialOption) error {
+	client, err := flight.NewClientWithMiddleware(addr, nil, nil, opts...)
+	if err != nil {
+		return err
+	}
+	defer client.Close()
+
+	ctx := context.Background()
+	info, err := client.GetFlightInfo(ctx, &flight.FlightDescriptor{Type: flight.DescriptorCMD, Cmd: []byte("ordered")})
+	if err != nil {
+		return err
+	}
+
+	if !info.GetOrdered() {
+		return fmt.Errorf("expected to server return FlightInfo.ordered = true")
+	}
+
+	recs := make([]arrow.Record, len(info.Endpoint))
+	for i, ep := range info.Endpoint {
+		if len(ep.Location) != 0 {
+			return fmt.Errorf("expected to receive empty locations to use the original service: %s",
+				ep.Location)
+		}
+
+		stream, err := client.DoGet(ctx, ep.Ticket)
+		if err != nil {
+			return err
+		}
+
+		rdr, err := flight.NewRecordReader(stream)
+		if err != nil {
+			return err
+		}
+		defer rdr.Release()
+
+		for rdr.Next() {
+			record := rdr.Record()
+			record.Retain()
+			defer record.Release()
+			recs[i] = record
+		}
+		if rdr.Err() != nil {
+			return rdr.Err()
+		}
+	}
+
+	// Build expected records
+	mem := memory.DefaultAllocator
+	schema := arrow.NewSchema(
+		[]arrow.Field{
+			{Name: "number", Type: arrow.PrimitiveTypes.Int32},
+		},
+		nil,
+	)
+	expected_table, _ := array.TableFromJSON(mem, schema, []string{
+		`[
+                   {"number": 1},
+                   {"number": 2},
+                   {"number": 3}
+                 ]`,
+		`[
+                   {"number": 10},
+                   {"number": 20},
+                   {"number": 30}
+                 ]`,
+		`[
+                   {"number": 100},
+                   {"number": 200},
+                   {"number": 300}
+                 ]`,
+	})
+	defer expected_table.Release()
+
+	table := array.NewTableFromRecords(schema, recs)
+	defer table.Release()
+	if !array.TableEqual(table, expected_table) {
+		return fmt.Errorf("read data isn't expected\n" +
+			"Expected:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d\n" +
+			"Actual:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d",
+			expected_table.Schema(),
+			expected_table.NumRows(),
+			expected_table.NumCols(),
+			table.Schema(),
+			table.NumRows(),
+			table.NumCols())

Review Comment:
   I think that separated issue is better to focus on "ordered data support" in this pull request.
   We don't have `array.TableToJSON()` for now... (We have `array.RecordToJSON()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174732322


##########
go/arrow/flight/server.go:
##########
@@ -52,14 +52,9 @@ type (
 	Empty                           = flight.Empty
 )
 
-// FlightService_ServiceDesc is the grpc.ServiceDesc for the FlightService
-// server. It should only be used for direct call of grpc.RegisterService,
-// and not introspected or modified (even as a copy).
-var FlightService_ServiceDesc = flight.FlightService_ServiceDesc
-

Review Comment:
   I got the following error with this:
   
   ```text
    arrow/flight/server.go:58:40: undefined: flight.FlightService_ServiceDesc
   ```
   
   `FlightService_ServiceDesc` was renamed to `_FlightService_ServiceDesc` (`_` was prepended) in generated `Flight_grpc.pb.go`: https://github.com/apache/arrow/pull/35178/files#diff-3b9deca07311cc509c9563b866266d044d8d368cc3225f8c4524172576c50f43R632
   
   And we can't use it because it's not exported:
   
   ```text
   arrow/flight/server.go:58:40: _FlightService_serviceDesc not exported by package flight
   ```
   
   If this is important for us, could you regenerate `go/arrow/flight/internal/flight/{Flight.pb.go,Flight_grpc.pb.go}` with old `protoc` and push them to this branch? I don't have old `protoc`...
   
   See also https://github.com/apache/arrow/pull/35178/files#diff-8739a8ac1525784934c2c972465c29a068f72cb631ab2e6b36f74cfb2849abaeR21 for `protoc` versions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1181068164


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Thanks! Applied!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alamb commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1539792289

   Thank you @kou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1539247532

   The vote thread: https://lists.apache.org/thread/gnxv7cm5l01zgfos7mw52gr2337qofln
   
   The vote result (passed):  https://lists.apache.org/thread/v05hxvffc118qtd4z4yqclryb7qslgfv
   
   I'll merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174603139


##########
go/arrow/internal/flight_integration/scenario.go:
##########
@@ -526,6 +528,168 @@ func (m *middlewareScenarioTester) GetFlightInfo(ctx context.Context, desc *flig
 	}, nil
 }
 
+type orderedScenarioTester struct {
+	flight.BaseFlightServer
+}
+
+func (m *orderedScenarioTester) RunClient(addr string, opts ...grpc.DialOption) error {
+	client, err := flight.NewClientWithMiddleware(addr, nil, nil, opts...)
+	if err != nil {
+		return err
+	}
+	defer client.Close()
+
+	ctx := context.Background()
+	info, err := client.GetFlightInfo(ctx, &flight.FlightDescriptor{Type: flight.DescriptorCMD, Cmd: []byte("ordered")})
+	if err != nil {
+		return err
+	}
+
+	if !info.GetOrdered() {
+		return fmt.Errorf("expected to server return FlightInfo.ordered = true")
+	}
+
+	recs := make([]arrow.Record, len(info.Endpoint))
+	for i, ep := range info.Endpoint {
+		if len(ep.Location) != 0 {
+			return fmt.Errorf("expected to receive empty locations to use the original service: %s",
+				ep.Location)
+		}
+
+		stream, err := client.DoGet(ctx, ep.Ticket)
+		if err != nil {
+			return err
+		}
+
+		rdr, err := flight.NewRecordReader(stream)
+		if err != nil {
+			return err
+		}
+		defer rdr.Release()
+
+		for rdr.Next() {
+			record := rdr.Record()
+			record.Retain()
+			defer record.Release()
+			recs[i] = record
+		}
+		if rdr.Err() != nil {
+			return rdr.Err()
+		}
+	}
+
+	// Build expected records
+	mem := memory.DefaultAllocator
+	schema := arrow.NewSchema(
+		[]arrow.Field{
+			{Name: "number", Type: arrow.PrimitiveTypes.Int32},
+		},
+		nil,
+	)
+	expected_table, _ := array.TableFromJSON(mem, schema, []string{
+		`[
+                   {"number": 1},
+                   {"number": 2},
+                   {"number": 3}
+                 ]`,
+		`[
+                   {"number": 10},
+                   {"number": 20},
+                   {"number": 30}
+                 ]`,
+		`[
+                   {"number": 100},
+                   {"number": 200},
+                   {"number": 300}
+                 ]`,
+	})
+	defer expected_table.Release()
+
+	table := array.NewTableFromRecords(schema, recs)
+	defer table.Release()
+	if !array.TableEqual(table, expected_table) {
+		return fmt.Errorf("read data isn't expected\n" +
+			"Expected:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d\n" +
+			"Actual:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d",
+			expected_table.Schema(),
+			expected_table.NumRows(),
+			expected_table.NumCols(),
+			table.Schema(),
+			table.NumRows(),
+			table.NumCols())

Review Comment:
   I agree, we should definitely add a `String()` method to `arrow.Table`. Do you think it's worth adding it to this PR? or should we just make an issue and have it as a separate change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1178509500


##########
go/arrow/flight/server.go:
##########
@@ -52,14 +52,9 @@ type (
 	Empty                           = flight.Empty
 )
 
-// FlightService_ServiceDesc is the grpc.ServiceDesc for the FlightService
-// server. It should only be used for direct call of grpc.RegisterService,
-// and not introspected or modified (even as a copy).
-var FlightService_ServiceDesc = flight.FlightService_ServiceDesc
-

Review Comment:
   OK.
   I also add "**This PR includes breaking changes to public APIs.**" to the pull request description.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1176389754


##########
dev/archery/archery/integration/runner.py:
##########
@@ -430,6 +430,11 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True,
             "middleware",
             description="Ensure headers are propagated via middleware.",
         ),
+        Scenario(
+            "ordered",
+            description="Ensure FlightInfo.ordered is supported.",
+            skip={"Java", "JS", "C#", "Go", "Rust"},

Review Comment:
   https://github.com/kou/arrow/pull/11



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1516908124

   @kou do you want to try your hand at a Go implementation of this as a second impl? Or should I try to carve out some time next week to do so? :smile:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1517118598

   Preview URL: http://crossbow.voltrondata.com/pr_docs/35178/format/Flight.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1172091013


##########
dev/archery/archery/integration/runner.py:
##########
@@ -430,6 +430,11 @@ def run_all_tests(with_cpp=True, with_java=True, with_js=True,
             "middleware",
             description="Ensure headers are propagated via middleware.",
         ),
+        Scenario(
+            "ordered",
+            description="Ensure FlightInfo.ordered is supported.",
+            skip={"Java", "JS", "C#", "Go", "Rust"},

Review Comment:
   Wow! Thanks!
   Could you push it to this branch when you work on it? Or you can create another branch for it.



##########
format/Flight.proto:
##########
@@ -266,14 +266,28 @@ message FlightInfo {
    * In other words, an application can use multiple endpoints to
    * represent partitioned data.
    *
-   * There is no ordering defined on endpoints. Hence, if the returned
-   * data has an ordering, it should be returned in a single endpoint.
+   * If the returned data has an ordering, an application can use
+   * "FlightInfo.ordered = true" or should return the all data in a
+   * single endpoint.
+   *
+   * Note that a client may ignore "FlightInfo.ordered = true". If an
+   * ordering is important for an application, an application must
+   * choose one of them:
+   *
+   * * An application requires that all clients must read data in
+   *   returned endpoints order.
+   * * An application must return the all data in a single endpoint.

Review Comment:
   I've copied some missing descriptions from `Flight.rst`.



##########
cpp/src/arrow/flight/sql/example/sqlite_server.cc:
##########
@@ -305,7 +305,7 @@ class SQLiteFlightSqlServer::Impl {
                           EncodeTransactionQuery(query, command.transaction_id));
     std::vector<FlightEndpoint> endpoints{FlightEndpoint{std::move(ticket), {}}};
     ARROW_ASSIGN_OR_RAISE(auto result,
-                          FlightInfo::Make(*schema, descriptor, endpoints, -1, -1))
+                          FlightInfo::Make(*schema, descriptor, endpoints, -1, -1, false))

Review Comment:
   We can use `true` here because this flight server always returns one endpoint.
   But I think that `false` is better here because `false` means "unknown order".
   
   A real Flight SQL server will set `true` here only when it analyzes the given query and detects `ORDER BY` in a main `SELECT`.
   How about adding a comment something like "TODO: Set ordered to true only when the query has ORDER BY" here?



##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,13 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. The client can read ordered data by
+   reading data from returned endpoints in order from front to
+   back. Note that a client may ignore ``FlightInfo.ordered``. If an
+   ordering is important and the client may ignore
+   ``FlightInfo.ordered``, applications should return data in a single
+   endpoint.

Review Comment:
   OK!



##########
cpp/src/arrow/flight/integration_tests/test_integration.cc:
##########
@@ -271,6 +274,148 @@ class MiddlewareScenario : public Scenario {
   std::shared_ptr<TestClientMiddlewareFactory> client_middleware_;
 };
 
+/// \brief The server used for testing FlightInfo.ordered.
+///
+/// If the given command is "ordered", the server sets
+/// FlightInfo.ordered. The client that supports FlightInfo.ordered
+/// must read data from endpoints from front to back. The client that
+/// doesn't support FlightInfo.ordered may read data from endpoints in
+/// random order.
+///
+/// This scenario is passed only when the client supports
+/// FlightInfo.ordered.
+class OrderedServer : public FlightServerBase {
+  Status GetFlightInfo(const ServerCallContext& context,
+                       const FlightDescriptor& descriptor,
+                       std::unique_ptr<FlightInfo>* result) override {
+    const auto ordered = (descriptor.type == FlightDescriptor::DescriptorType::CMD &&
+                          descriptor.cmd == "ordered");
+    auto schema = BuildSchema();
+    std::vector<FlightEndpoint> endpoints;
+    if (ordered) {
+      endpoints.push_back(FlightEndpoint{{"1"}, {}});
+      endpoints.push_back(FlightEndpoint{{"2"}, {}});
+      endpoints.push_back(FlightEndpoint{{"3"}, {}});
+    } else {
+      endpoints.push_back(FlightEndpoint{{"1"}, {}});
+      endpoints.push_back(FlightEndpoint{{"3"}, {}});
+      endpoints.push_back(FlightEndpoint{{"2"}, {}});
+    }
+    ARROW_ASSIGN_OR_RAISE(
+        auto info, FlightInfo::Make(*schema, descriptor, endpoints, -1, -1, ordered));
+    *result = std::make_unique<FlightInfo>(info);
+    return Status::OK();
+  }
+
+  Status DoGet(const ServerCallContext& context, const Ticket& request,
+               std::unique_ptr<FlightDataStream>* stream) override {
+    ARROW_ASSIGN_OR_RAISE(auto builder, RecordBatchBuilder::Make(
+                                            BuildSchema(), arrow::default_memory_pool()));
+    auto number_builder = builder->GetFieldAs<Int32Builder>(0);
+    if (request.ticket == "1") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(1));
+      ARROW_RETURN_NOT_OK(number_builder->Append(2));
+      ARROW_RETURN_NOT_OK(number_builder->Append(3));
+    } else if (request.ticket == "2") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(10));
+      ARROW_RETURN_NOT_OK(number_builder->Append(20));
+      ARROW_RETURN_NOT_OK(number_builder->Append(30));
+    } else if (request.ticket == "3") {
+      ARROW_RETURN_NOT_OK(number_builder->Append(100));
+      ARROW_RETURN_NOT_OK(number_builder->Append(200));
+      ARROW_RETURN_NOT_OK(number_builder->Append(300));
+    } else {
+      return Status::KeyError("Could not find flight: ", request.ticket);
+    }
+    ARROW_ASSIGN_OR_RAISE(auto record_batch, builder->Flush());
+    std::vector<std::shared_ptr<RecordBatch>> record_batches{record_batch};
+    ARROW_ASSIGN_OR_RAISE(auto record_batch_reader,
+                          RecordBatchReader::Make(record_batches));
+    *stream = std::make_unique<RecordBatchStream>(record_batch_reader);
+    return Status::OK();
+  }
+
+ private:
+  std::shared_ptr<Schema> BuildSchema() {
+    return arrow::schema({arrow::field("number", arrow::int32())});
+  }
+};
+
+/// \brief The ordered scenario.
+///
+/// This tests that the server and client get expected header values.
+class OrderedScenario : public Scenario {
+  Status MakeServer(std::unique_ptr<FlightServerBase>* server,
+                    FlightServerOptions* options) override {
+    server->reset(new OrderedServer());
+    return Status::OK();
+  }
+
+  Status MakeClient(FlightClientOptions* options) override { return Status::OK(); }
+
+  Status RunClient(std::unique_ptr<FlightClient> client) override {
+    ARROW_ASSIGN_OR_RAISE(auto info,
+                          client->GetFlightInfo(FlightDescriptor::Command("ordered")));
+    const auto& endpoints = info->endpoints();
+    std::vector<FlightEndpoint> ordered_endpoints(endpoints.size());
+    // If the data is ordered, read data from front to back.
+    // If the data is NOT ordered, read data from back to front.
+    if (info->ordered()) {

Review Comment:
   Ah, it may be a more straightforward approach.
   I'll use the approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174584004


##########
go/arrow/internal/flight_integration/scenario.go:
##########
@@ -526,6 +528,168 @@ func (m *middlewareScenarioTester) GetFlightInfo(ctx context.Context, desc *flig
 	}, nil
 }
 
+type orderedScenarioTester struct {
+	flight.BaseFlightServer
+}
+
+func (m *orderedScenarioTester) RunClient(addr string, opts ...grpc.DialOption) error {
+	client, err := flight.NewClientWithMiddleware(addr, nil, nil, opts...)
+	if err != nil {
+		return err
+	}
+	defer client.Close()
+
+	ctx := context.Background()
+	info, err := client.GetFlightInfo(ctx, &flight.FlightDescriptor{Type: flight.DescriptorCMD, Cmd: []byte("ordered")})
+	if err != nil {
+		return err
+	}
+
+	if !info.GetOrdered() {
+		return fmt.Errorf("expected to server return FlightInfo.ordered = true")
+	}
+
+	recs := make([]arrow.Record, len(info.Endpoint))
+	for i, ep := range info.Endpoint {
+		if len(ep.Location) != 0 {
+			return fmt.Errorf("expected to receive empty locations to use the original service: %s",
+				ep.Location)
+		}
+
+		stream, err := client.DoGet(ctx, ep.Ticket)
+		if err != nil {
+			return err
+		}
+
+		rdr, err := flight.NewRecordReader(stream)
+		if err != nil {
+			return err
+		}
+		defer rdr.Release()
+
+		for rdr.Next() {
+			record := rdr.Record()
+			record.Retain()
+			defer record.Release()
+			recs[i] = record
+		}
+		if rdr.Err() != nil {
+			return rdr.Err()
+		}
+	}
+
+	// Build expected records
+	mem := memory.DefaultAllocator
+	schema := arrow.NewSchema(
+		[]arrow.Field{
+			{Name: "number", Type: arrow.PrimitiveTypes.Int32},
+		},
+		nil,
+	)
+	expected_table, _ := array.TableFromJSON(mem, schema, []string{
+		`[
+                   {"number": 1},
+                   {"number": 2},
+                   {"number": 3}
+                 ]`,
+		`[
+                   {"number": 10},
+                   {"number": 20},
+                   {"number": 30}
+                 ]`,
+		`[
+                   {"number": 100},
+                   {"number": 200},
+                   {"number": 300}
+                 ]`,
+	})
+	defer expected_table.Release()
+
+	table := array.NewTableFromRecords(schema, recs)
+	defer table.Release()
+	if !array.TableEqual(table, expected_table) {
+		return fmt.Errorf("read data isn't expected\n" +
+			"Expected:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d\n" +
+			"Actual:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d",
+			expected_table.Schema(),
+			expected_table.NumRows(),
+			expected_table.NumCols(),
+			table.Schema(),
+			table.NumRows(),
+			table.NumCols())

Review Comment:
   If we have `arrow.Table.ToString()`, we can simplify this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #35178: GH-34852: [C++][Go][FlightRPC] Add support for ordered data

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1174603446


##########
go/arrow/internal/flight_integration/scenario.go:
##########
@@ -526,6 +528,168 @@ func (m *middlewareScenarioTester) GetFlightInfo(ctx context.Context, desc *flig
 	}, nil
 }
 
+type orderedScenarioTester struct {
+	flight.BaseFlightServer
+}
+
+func (m *orderedScenarioTester) RunClient(addr string, opts ...grpc.DialOption) error {
+	client, err := flight.NewClientWithMiddleware(addr, nil, nil, opts...)
+	if err != nil {
+		return err
+	}
+	defer client.Close()
+
+	ctx := context.Background()
+	info, err := client.GetFlightInfo(ctx, &flight.FlightDescriptor{Type: flight.DescriptorCMD, Cmd: []byte("ordered")})
+	if err != nil {
+		return err
+	}
+
+	if !info.GetOrdered() {
+		return fmt.Errorf("expected to server return FlightInfo.ordered = true")
+	}
+
+	recs := make([]arrow.Record, len(info.Endpoint))
+	for i, ep := range info.Endpoint {
+		if len(ep.Location) != 0 {
+			return fmt.Errorf("expected to receive empty locations to use the original service: %s",
+				ep.Location)
+		}
+
+		stream, err := client.DoGet(ctx, ep.Ticket)
+		if err != nil {
+			return err
+		}
+
+		rdr, err := flight.NewRecordReader(stream)
+		if err != nil {
+			return err
+		}
+		defer rdr.Release()
+
+		for rdr.Next() {
+			record := rdr.Record()
+			record.Retain()
+			defer record.Release()
+			recs[i] = record
+		}
+		if rdr.Err() != nil {
+			return rdr.Err()
+		}
+	}
+
+	// Build expected records
+	mem := memory.DefaultAllocator
+	schema := arrow.NewSchema(
+		[]arrow.Field{
+			{Name: "number", Type: arrow.PrimitiveTypes.Int32},
+		},
+		nil,
+	)
+	expected_table, _ := array.TableFromJSON(mem, schema, []string{
+		`[
+                   {"number": 1},
+                   {"number": 2},
+                   {"number": 3}
+                 ]`,
+		`[
+                   {"number": 10},
+                   {"number": 20},
+                   {"number": 30}
+                 ]`,
+		`[
+                   {"number": 100},
+                   {"number": 200},
+                   {"number": 300}
+                 ]`,
+	})
+	defer expected_table.Release()
+
+	table := array.NewTableFromRecords(schema, recs)
+	defer table.Release()
+	if !array.TableEqual(table, expected_table) {
+		return fmt.Errorf("read data isn't expected\n" +
+			"Expected:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d\n" +
+			"Actual:\n" +
+			"%s\n" +
+			"num-rows: %d\n" +
+			"num-cols: %d",
+			expected_table.Schema(),
+			expected_table.NumRows(),
+			expected_table.NumCols(),
+			table.Schema(),
+			table.NumRows(),
+			table.NumCols())

Review Comment:
   for now, maybe just marshal the table to JSON for the error output?



##########
go/arrow/flight/server.go:
##########
@@ -52,14 +52,9 @@ type (
 	Empty                           = flight.Empty
 )
 
-// FlightService_ServiceDesc is the grpc.ServiceDesc for the FlightService
-// server. It should only be used for direct call of grpc.RegisterService,
-// and not introspected or modified (even as a copy).
-var FlightService_ServiceDesc = flight.FlightService_ServiceDesc
-

Review Comment:
   Why is this being removed? We added it to make it easy for others to incorporate a Flight Server into an existing grpc service.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1528770205

   Revision: 4c2b78c12952ca9ec37bc3b30b8a8a8dde2d6657
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-0d6f29303c](https://github.com/ursacomputing/crossbow/branches/all?query=actions-0d6f29303c)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-0d6f29303c-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4838581592/jobs/8623112432)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35178:
URL: https://github.com/apache/arrow/pull/35178#issuecomment-1540626788

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/afb80cb3985c4b4782a3b4eb35d522e6...26f058375b2146d19bd0fa65e2c4758a/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alamb commented on a diff in pull request #35178: GH-34852: [C++][Go][Java][FlightRPC] Add support for ordered data

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #35178:
URL: https://github.com/apache/arrow/pull/35178#discussion_r1180349793


##########
docs/source/format/Flight.rst:
##########
@@ -90,9 +90,14 @@ A client that wishes to download the data would:
    An endpoint contains a list of locations (server addresses) where
    this data can be retrieved from, and a ``Ticket``, an opaque binary
    token that the server will use to identify the data being
-   requested. There is no ordering defined on endpoints or the data
-   within, so if the dataset is sorted, applications should return
-   data in a single endpoint.
+   requested. If ``FlightInfo.ordered`` is set, returned endpoints are
+   in the same order as the data. Otherwise, there is no ordering
+   defined on endpoints or the data within. The client can read
+   ordered data by reading data from returned endpoints in order from
+   front to back. Note that a client may ignore
+   ``FlightInfo.ordered``. If an ordering is important and the client
+   may ignore ``FlightInfo.ordered``, applications should return data
+   in a single endpoint.

Review Comment:
   Here is some suggested updates to this wording based on the mailing list discussion https://lists.apache.org/thread/0q1s84p17rtdz3q81wvylckob1hsx83z
   
   I am not sure this is correct but I tried to encode my understanding
   
   ```suggestion
      requested. 
      
      If ``FlightInfo.ordered`` is true, this signals there is some
      order between data from different endpoints. 
      Clients should produce the same results as if the data returned 
      from each of the endpoints was concatenated, in order, from front to back. 
      If ``FlightInfo.ordered`` is not set, the client may return data  from from any 
      of the endpoints in arbitrary order. Data from any specific endpoint
      must be returned in order, but the data from different endpoints may be 
      interleaved to allow parallel fetches.  
      Note that since some clients may ignore ``FlightInfo.ordered``,
      if ordering is important and client support can not be ensured, 
      servers should return a single endpoint.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org