You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@devlake.apache.org by GitBox <gi...@apache.org> on 2022/09/19 06:53:52 UTC

[GitHub] [incubator-devlake] keon94 opened a new issue, #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

keon94 opened a new issue, #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   Do a POC on adding support for streamed, incremental data collections using the [Singer spec](https://www.singer.io/). The existing incremental update support does not protect against a half-way failure or error in the collection process, forcing the user to have to start all over again.
   
   ### Use case
   
   The goal would be to add resilience to the collection phase of devlake plugins, and to do so using a standard. Currently if there's an interruption in the middle of collection we don't have a means of recovery/resuming from that interrupted point. This becomes a problem when pulling data from large data sources. We can alleviate this problem by introducing incremental states as the collection takes place. If, say, we have to make 10 API calls to perform 10 DB writes for a given collector, and a failure happens on the 5th invocation, our next attempt should try to pick the collection back up from that step.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] klesh closed issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
klesh closed issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec
URL: https://github.com/apache/incubator-devlake/issues/3123


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] keon94 commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
keon94 commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1289191947

   Summary of impressions:
   
   Pros:
   1. API calls are completely abstracted away, with no code needed from our side. We just have to study the streams defined in the catalog.json of the tap to see what's available to us, and that's straightforward.
   2. Responses are captured in JSON schemas (in catalog.json) - so, using code-generation, we can have compile-time type safety, and eliminate the boilerplate of manually writing models for them
   3. Built-in support for states to be able to resume Streams (where supported). Fully abstracted away from the developer.
   
   Cons:
   1. Gives us less control over the API calls. Things like managing parallelism and rate-limiting are not within our control. I general, we don't have access to the query-params/headers that get sent. The "config.json" is the most we can control.
   2. Quality of the Tap implementations can be questionable. See my comment above. Some are not well-maintained, or necessarily in sync with the most current API contracts.
   
   My conclusion:
   I think singer-plugins can be useful as long as the plugin doesn't have a need for the Collector layer. Plugins such as Jira do need the raw layer for things like custom fields and modifying transformation rules without re-collection. My current adaptation of the Singer-framework bypasses the Collector stage, and goes straight to Extract. 
   If Collectors are absolutely necessary for all plugins, I don't think we're gaining anything with Singer - in fact, we lose, since we no longer have granular access to the API calls.
   If there can be more trivial plugins that can get away with no Collectors, and they come with well-maintained Singer-taps, I think the framework is useful for them.
   
   @klesh @Startrekzky @hezyin please share your thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] keon94 commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
keon94 commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1289362311

   Next steps for me, in order to get some more concrete data points:
   1. Study the streams provided by Github and PagerDuty (issue #3498) and determine if they are comprehensive enough for us. 
   2. Study the rate limiting logic of Github and Pagerduty taps to see if they honor any sort of API rate limits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] keon94 commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
keon94 commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1296409113

   The PagerDuty tap also supports rate-limiting, but it's not as sophisticated. The logic is [here](https://github.com/singer-io/tap-pagerduty/blob/7e8fcaf585c472c02e45c9224482fb9252e12de4/tap_pagerduty/streams.py#L85). It only sleeps for a constant 60s internval once in the case of hitting a rate-limit. There will be no additional retries. Not configurable either. cc @hezyin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] hezyin commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
hezyin commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1289926488

   @keon94 Thanks for the summary, Keon.
   
   Regarding the current singer-spec framework implementation: it's worth considering bringing back the raw data layer for two reasons: easier e2e testing and improved dev efficiency when modifying extraction/conversion logic. It's also not too much work as singer tap naturally spits out JSON messages and we just need to dump those JSON into our raw layer. Airbyte's taking a similar approach I believe.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] keon94 commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
keon94 commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1250634946

   cc @hezyin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-devlake] keon94 commented on issue #3123: [Feature][Framework][POC] Streamed data collection using Singer Spec

Posted by GitBox <gi...@apache.org>.
keon94 commented on issue #3123:
URL: https://github.com/apache/incubator-devlake/issues/3123#issuecomment-1296406058

   The Github tap does provide rate-limiting support upon closer look. The logic is [here](https://github.com/singer-io/tap-github/blob/8973411ee2f2cf007fe74adcae203f116750180f/tap_github/client.py) called from [here|(https://github.com/singer-io/tap-github/blob/8973411ee2f2cf007fe74adcae203f116750180f/tap_github/client.py#L201)]. There's an undocumented config called "max_sleep_seconds" which defines the upper bound of "X-RateLimit-Reset" (wait-time until next API call). We can add that to the config JSON to override its default of 600s. Simply put, if we exceed the rate limit, the tap sleeps for up to this many seconds before issuing the next API call. It'll fail if it has to sleep for longer. cc @hezyin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@devlake.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org