Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2020/08/17 19:19:16 UTC
[GitHub] [incubator-gobblin] chris9692 opened a new pull request #3082: gobblin-1225 new multistage connector with HTTP protocol
chris9692 opened a new pull request #3082:
URL: https://github.com/apache/incubator-gobblin/pull/3082
Many HTTP data sources, including OData and GraphQL based data sources, use a set of common HTTP request methods,
but their parameters can vary significantly case by case. The Gobblin ecosystem has multiple connectors that were
built to suit those needs, for example the Salesforce connector and the Google Search Console connector.
The multistage connector framework is a paradigm that makes connectors more generic and more reusable. LinkedIn has
used this framework to handle the wide variety of external data sources encountered in data integration.
HTTP data sources benefit most from this framework, including REST API, OData, and SOAP-based data
sources. S3 and GCS (Google Cloud Storage) data sources can also benefit from it for small data volumes
(less than 10TB), without using an SDK.
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
### JIRA
- [X] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1225
### Description
- [X] Here are some details about my PR, including screenshots (if applicable):
The multistage connector framework is a paradigm that makes connectors more generic and more reusable. LinkedIn has used this framework to handle the wide variety of external data sources encountered in data integration.
HTTP data sources benefit most from this framework, including REST API, OData, and SOAP-based data sources. S3 and GCS (Google Cloud Storage) data sources can also benefit from it for small data volumes (less than 10TB), without using an SDK.
We have also verified that the HTTP multistage connector can greatly simplify data ingestion from Salesforce.com. However, this PR does not yet include the component needed for that, the CSV Extractor.
This PR includes the following functionality:
- This PR contains a separate multistage module that works with HTTP data sources
- It supports HTTP GET, PUT, POST, and DELETE operations
- It supports nested JSON format
- It supports REST API calls and OData services derived from them
- It supports the following pagination methods:
  - By offset and page size
  - By page number
  - By a total record count
  - By the signal of an empty response
  - By next page URL
  - By a cursor or next page locator
- It supports session state control as a mechanism for pagination or asynchronous requests
- It supports pulling files from GCS or S3 via HTTP requests
- It supports field projection, including nested field projection, based on a given schema
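
To illustrate what schema-based projection of nested fields means, here is a minimal sketch. It is not the module's implementation: the class `FieldProjectionSketch`, its `project` method, and the use of plain `java.util.Map` objects as stand-ins for JSON records and schemas are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of this PR: projects a record onto a schema.
final class FieldProjectionSketch {
    private FieldProjectionSketch() {}

    /**
     * Returns a copy of record that keeps only the fields named in schema.
     * A schema value that is itself a Map describes a nested object and is
     * applied recursively; any other schema value marks a field to copy as-is.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> project(Map<String, Object> record, Map<String, Object> schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : schema.entrySet()) {
            String name = field.getKey();
            if (!record.containsKey(name)) {
                continue; // fields absent from the record are skipped
            }
            Object value = record.get(name);
            Object subSchema = field.getValue();
            if (subSchema instanceof Map && value instanceof Map) {
                // nested object: keep only the nested fields the schema names
                out.put(name, project((Map<String, Object>) value, (Map<String, Object>) subSchema));
            } else {
                out.put(name, value);
            }
        }
        return out;
    }
}
```

With a record holding `id`, `name`, and an `address` object with `city` and `zip`, a schema naming only `id` and `address.city` would drop `name` and `address.zip` from the output.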
Sample applications of this module will be added to the Read the Docs documentation.
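
Until those samples are available, the sketch below shows how three of the pagination styles listed above can combine: advancing by offset and page size, stopping at a known total record count, and stopping on an empty response. It is only an illustration under stated assumptions. The class `OffsetPaginationSketch`, the `fetchAll` method, and the `fetchPage` callback (a stand-in for an HTTP request) are hypothetical names and do not come from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical helper, not part of this PR: drains a paginated source.
final class OffsetPaginationSketch {
    private OffsetPaginationSketch() {}

    /**
     * Collects records page by page.
     *
     * @param fetchPage  stand-in for an HTTP call: (offset, pageSize) -> records
     * @param pageSize   number of records requested per call, must be positive
     * @param totalCount total record count if the source reports one, or -1 if unknown
     */
    static <T> List<T> fetchAll(BiFunction<Integer, Integer, List<T>> fetchPage,
                                int pageSize, long totalCount) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<T> all = new ArrayList<>();
        int offset = 0;
        // Stop early when the known total has been reached.
        while (totalCount < 0 || all.size() < totalCount) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page == null || page.isEmpty()) {
                break; // an empty response signals the end of the data
            }
            all.addAll(page);
            offset += pageSize; // advance to the next page
        }
        return all;
    }
}
```

When the total count is known, the loop avoids the final request that would otherwise return the empty page. When it is unknown (`-1` here), the empty response is the only stop signal.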
### Tests
- [X] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
multistage.configuration.MultistagePropertiesTest. PASSED
multistage.extractor.JsonExtractorTest. PASSED
multistage.extractor.MultistageExtractorTest.testCheckContentType PASSED
multistage.extractor.MultistageExtractorTest.testClose PASSED
multistage.extractor.MultistageExtractorTest.testDeriveEpoc PASSED
multistage.extractor.MultistageExtractorTest.testExtractText PASSED
multistage.extractor.MultistageExtractorTest.testGetExpectedRecordCount PASSED
multistage.extractor.MultistageExtractorTest.testGetHighWatermark PASSED
multistage.extractor.MultistageExtractorTest.testGetOrInferSchema PASSED
multistage.extractor.MultistageExtractorTest.testGetSchema PASSED
multistage.extractor.MultistageExtractorTest.testGetSessionKeyValue PASSED
multistage.extractor.MultistageExtractorTest.testHoldExecutionUnitPresetStartTime PASSED
multistage.extractor.MultistageExtractorTest.testJobProperties PASSED
multistage.extractor.MultistageExtractorTest.testProcessInputStream PASSED
multistage.extractor.MultistageExtractorTest.testReadRecord PASSED
multistage.extractor.MultistageExtractorTest.testSetRowFilter PASSED
multistage.extractor.MultistageExtractorTest.testWorkUnitWatermark PASSED
multistage.extractor.MultistageExtractorTest.testsAddDerivedFieldsToAltSchema PASSED
multistage.extractor.MultistageExtractorTest.testsFailWorkUnit PASSED
multistage.factory.ApacheHttpClientFactoryTest.testGet PASSED
multistage.filter.JsonSchemaBasedFilterTest. PASSED
multistage.filter.MultistageSchemaBasedFilterTest.testFilter PASSED
multistage.keys.SourceKeysTest.testHasSourceSchema PASSED
multistage.keys.SourceKeysTest.testIsPaginationEnabled PASSED
multistage.keys.SourceKeysTest.testIsSessionStateEnabled PASSED
multistage.keys.SourceKeysTest.testValidation PASSED
multistage.source.HttpSourceTest.retriesTest PASSED
multistage.source.HttpSourceTest.testCloseStream PASSED
multistage.source.HttpSourceTest.testExecute PASSED
multistage.source.HttpSourceTest.testGetAuthenticationHeader PASSED
multistage.source.HttpSourceTest.testGetAuthenticationHeader2 PASSED
multistage.source.HttpSourceTest.testGetExtractor PASSED
multistage.source.HttpSourceTest.testGetHttpStatusReasons PASSED
multistage.source.HttpSourceTest.testGetHttpStatuses PASSED
multistage.source.HttpSourceTest.testGetNext PASSED
multistage.source.HttpSourceTest.testGetResponseContentType PASSED
multistage.source.HttpSourceTest.testShutdown PASSED
multistage.source.MultistageSource2Test.testDecode PASSED
multistage.source.MultistageSource2Test.testGetEncodedUtf8 PASSED
multistage.source.MultistageSource2Test.testGetHadoopFsDecoded PASSED
multistage.source.MultistageSource2Test.testGetHadoopFsEncoded PASSED
multistage.source.MultistageSource2Test.testGetWorkUnitSpecificString PASSED
multistage.source.MultistageSource2Test.testInitialize PASSED
multistage.source.MultistageSource2Test.testReplaceVariablesInParameters PASSED
multistage.source.MultistageSourceTest.testAppendActivationParameter PASSED
multistage.source.MultistageSourceTest.testConvertListToInputStream PASSED
multistage.source.MultistageSourceTest.testDerivedFields PASSED
multistage.source.MultistageSourceTest.testGenerateWorkUnitsWithException1 PASSED
multistage.source.MultistageSourceTest.testGenerateWorkUnitsWithException2 PASSED
multistage.source.MultistageSourceTest.testGetDefaultFieldTypes PASSED
multistage.source.MultistageSourceTest.testGetExtractorWithException PASSED
multistage.source.MultistageSourceTest.testGetInitialWorkUnitVariableValues PASSED
multistage.source.MultistageSourceTest.testGetNext PASSED
multistage.source.MultistageSourceTest.testGetPaginationFields PASSED
multistage.source.MultistageSourceTest.testGetPaginationInitialValues PASSED
multistage.source.MultistageSourceTest.testGetPreviousHighWatermarks PASSED
multistage.source.MultistageSourceTest.testGetUpdatedWorkUnitActivation PASSED
multistage.source.MultistageSourceTest.testGetUpdatedWorkUnitVariableValues PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitPartitionTypes PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitPartitionTypesWithExceptions1 PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitPartitionTypesWithExceptions2 PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsDefault PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsTooManyPartitions PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsWithSecondaryInput PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsWithSecondaryInputWithAuthenticationRetriesDefined PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsWithSecondaryInputWithAuthenticationRetriesNotDefined PASSED
multistage.source.MultistageSourceTest.testGetWorkUnitsWithSecondaryInputWithNullAuthentication PASSED
multistage.source.MultistageSourceTest.testHadoopFsEncoding PASSED
multistage.source.MultistageSourceTest.testIsSecondaryAuthenticationEnabled PASSED
multistage.source.MultistageSourceTest.testIsSecondaryAuthenticationEnabledWithInvalidSecondaryInput PASSED
multistage.source.MultistageSourceTest.testOutputSchema PASSED
multistage.source.MultistageSourceTest.testParallismMaxSetting PASSED
multistage.source.MultistageSourceTest.testParseSecondaryInputRetry PASSED
multistage.source.MultistageSourceTest.testReadSecondaryAuthentication PASSED
multistage.source.MultistageSourceTest.testReadSecondaryInputs PASSED
multistage.source.MultistageSourceTest.testSourceParameters PASSED
multistage.source.MultistageSourceTest.testUnitWatermark PASSED
multistage.source.MultistageSourceTest.testUrlEncoding PASSED
multistage.source.MultistageSourceTest.testWorkUnitPacingConversion PASSED
multistage.source.MultistageSourceTest.testWorkUnitPacingDef PASSED
multistage.source.MultistageSourceTest.testWorkUnitPartitionDef PASSED
multistage.util.DateTimeUtilsTest. PASSED
multistage.util.EncryptionUtilsTest.testDecryption PASSED
multistage.util.EncryptionUtilsTest.testEncryption PASSED
multistage.util.HdfsReaderTest. PASSED
multistage.util.HttpRequestMethodTest. PASSED
multistage.util.JsonElementTypesTest. PASSED
multistage.util.JsonIntermediateSchemaTest. PASSED
multistage.util.JsonParameterTest. PASSED
multistage.util.JsonSchemaGeneratorTest. PASSED
multistage.util.JsonSchemaTest. PASSED
multistage.util.JsonUtilsTest. PASSED
multistage.util.VariableUtilsTest. PASSED
multistage.util.WatermarkDefinitionTest. PASSED
multistage.util.WorkUnitPartitionTypesTest. PASSED
multistage.util.WorkUnitStatusTest. PASSED
### Commits
- [X] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org