You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sqoop.apache.org by "Veena Basavaraj (JIRA)" <ji...@apache.org> on 2015/02/03 10:20:35 UTC
[jira] [Comment Edited] (SQOOP-1804) Repository Structure + API: Storing/Retrieving the From/To state of the incremental read/ write

    [ https://issues.apache.org/jira/browse/SQOOP-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283997#comment-14283997 ] 

Veena Basavaraj edited comment on SQOOP-1804 at 2/3/15 9:20 AM:
----------------------------------------------------------------

Summary from the above discussions, broadly we had two proposals

*Proposal 1*
Connector has Config object and Config object has inputs. All these inputs are editable by the user or connector. But in some cases some of these variable may be just created and edited by connector ( temp state variables). In this proposal there is a possibility that the user 's initial input is lost and the connector overrides it with another value. 

Example:
Say we have 3 variables the connector exposes in the config
{code}
1. PrimaryKey
2. lastValue
3. TempDirectory
{code}

Initially user will create a job with the config object

{code}
1. PrimaryKey -"id" - STRING
2. lastValue- "1" - LONG
3. TempDirectory -   STRING ( Editable by the connector only)
{code}
and then the connector will run the job and at then end of the run we have the config object looking like this

{code}
1. PrimaryKey -"id"
2. lastValue- "45"
3. TempDirectory - "/tmp/foo/bar 
{code}

We then took some use cases, where there may not be a one-one mapping between the initial inputs user enters and connectors will create. Here is an example related to Kafka
Say we have one variable in kafka connector 

1. Offset - earliest/latest - ENUM
2. InternalOffset -  - MAP/LIST/ STRING

The user then chooses the value for offset not worrying about partitions etc and the connector stores additional information it can use for the next run in  internaloffset as LIST/MAP with partition/offset info. So in this case, connector never modified  the original input, but it created new input that will be more useful for it in the next delta run. We may also want to expose this to be edited by the user next time. 

We cannot be sure that all connectors will have one-one correspondence to a inputs users gives and the state variables the connectors create to prepare itself optimal next run. In some cases the types will be very different, the user may input a ENUM or a LONG, but the next job run might need more info like a LIST or MAP to optimally run the job for delta records. This was pretty much the proposal 2

*Proposal 2*
Connectors can have a config objects have params some are  Inputs and some are States, they are all declared in the same config object so that the connector declares upfront what it is working with, Inputs can be annotated, States as well can be annotated to provide more information to the user and the connector. There are two main parties in this design, USER and CONNECTOR DEVELOPER/CONNECTOR and they both together control the job inputs.
One way to express this was given in the earlier discussion in the same JIRA ( Names are subject to change)
{code}
public class DeltaFetchConfig1 {
 
  @Input(size = 255) public String column;
  // validate supported operators, can provide a default value etc...
  @Input public String operator;
// if we edit this, override last_value
  @Input(overrides="last_value) public String value;
// relatedTo signifies to the connector that last_value relates to value
  @State(relatedTo="value") public String last_value; 
}

{code}

Starting from scratch always seems to help I guess, so restating the 2 proposals with examples it became clear that what we all wanted was a flexible design to state what inputs are editable by whom ( whether they are editable by user only, connector only, and or by both ), we wanted to support all 3, but leave it to the connector to annotate each input properly ( Now it is an implementation detail, if we want to use attributes on the @Input annotation or use another annotation like @State). It was also pretty clear that all this data now will be now stored in SQ_INPUT. and its values will be in SQ_JOB_INPUT. We may need to add more columns if we add more attributes, but that is again an implementation detail. We may have to add new TYPES to the ENUM - INPUT_TYPE or we may have to add a new class STATE and a new table STATE if we add a new annotation like @State. But all this is up for discussion from an implementation perspective. But when a user says show job, he/she will see all this information in one place. When the user says edit job, he/she can edit the editable inputs in one place and look at the non-editable inputs in one place.

Here are the following top things we will tackle .

1. Clearly define the semantics of the annotations/ attributes to support USER/ CONNECTOR/ BOTH editable.
2. We have define a way to define relationship between the inputs, say Input1 depends on/relatesTo/ overrides Input2, this gives connectors more control and ease to choose the correct value when running subsequent job runs
3. We should be careful about inputs that both user and connector can edit, since the ordering of edits might lead to some race conditions since we do not maintain history yet, so add proper validation on the relationships between the attribute values across inputs.
4. Ability to see all the configs for jobs and then all the inputs and its attributes they have - a command line tool commands
5. Finally if would be nice to start storing history when we make config edits, i.e not just have last snapshot of the configs, but somehow we can map the configs used for every job run. ( This is covered in SQOOP-2025)


For the #1-#4, here are the details of the implementation we propose

#1.We will keep the @Input annotation for everything,  and add "editable" attribute on the Input annotation. The options for the editable will be a ENUM( USER_ONLY, CONNECTOR_ONLY, ANY, os that in future we have more it can include them too, not just BOTH)
{code}
@Input ( editable =USER_ONLY) String primaryKey;
@Input ( editable =CONNECTOR_ONLY) String tempDirectory;
@Input ( editable =ANY) Long lastValue;
{code}
#2. For relationships, we will add a new attribute overrides, ( that can be 1-many at this point), i,e a input can override values of one or more inputs. 
Example, a connector might choose to implement
{code}
@Input ( editable =USER_ONLY, overrides="last_value") Long user_value;
@Input ( editable =CONNECTOR_ONLY) Long lastValue;
{code}
Implementation: We will add a new table to store this 1-M relationship. We will at the minimum add some checks to ensure there is cycles, and ensure such complex relationships between config attributes wont be allowed. We also will allow this for intra config only and for attributes across config objects. If we need to define relationships between attributes, it is best to group them one config object.


#3. Since we want to control the relationships between inputs, and check for cycles and make sure it does not manifest itself as confusing to the users we will do some validations.
There might some edge cases here, that I will think through more in implementation, for now these are validation rules
{code}
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
USER_ONLY                    ( not itself!! ) other  CONNECTOR_ONLY and ANY inputs, dont allow overriding other USER_ONLY attributes
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
CONNECTOR_ONLY      can override anything, it can override other CONNECTOR_ONLY inputs possible, so lets allow everything, 
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
ANY                                  can override anything
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|

{code}
#4 - SQOOP-1516 will answer it, to provide command to see configs/ edit configs etc.



was (Author: vybs):
Summary from the above discussions, broadly we had two proposals

*Proposal 1*
Connector has Config object and Config object has inputs. All these inputs are editable by the user or connector. But in some cases some of these variable may be just created and edited by connector ( temp state variables). In this proposal there is a possibility that the user 's initial input is lost and the connector overrides it with another value. 

Example:
Say we have 3 variables the connector exposes in the config
{code}
1. PrimaryKey
2. lastValue
3. TempDirectory
{code}

Initially user will create a job with the config object

{code}
1. PrimaryKey -"id" - STRING
2. lastValue- "1" - LONG
3. TempDirectory -   STRING ( Editable by the connector only)
{code}
and then the connector will run the job and at then end of the run we have the config object looking like this

{code}
1. PrimaryKey -"id"
2. lastValue- "45"
3. TempDirectory - "/tmp/foo/bar 
{code}

We then took some use cases, where there may not be a one-one mapping between the initial inputs user enters and connectors will create. Here is an example related to Kafka
Say we have one variable in kafka connector 

1. Offset - earliest/latest - ENUM
2. InternalOffset -  - MAP/LIST/ STRING

The user then chooses the value for offset not worrying about partitions etc and the connector stores additional information it can use for the next run in  internaloffset as LIST/MAP with partition/offset info. So in this case, connector never modified  the original input, but it created new input that will be more useful for it in the next delta run. We may also want to expose this to be edited by the user next time. 

We cannot be sure that all connectors will have one-one correspondence to a inputs users gives and the state variables the connectors create to prepare itself optimal next run. In some cases the types will be very different, the user may input a ENUM or a LONG, but the next job run might need more info like a LIST or MAP to optimally run the job for delta records. This was pretty much the proposal 2

*Proposal 2*
Connectors can have a config objects have params some are  Inputs and some are States, they are all declared in the same config object so that the connector declares upfront what it is working with, Inputs can be annotated, States as well can be annotated to provide more information to the user and the connector. There are two main parties in this design, USER and CONNECTOR DEVELOPER/CONNECTOR and they both together control the job inputs.
One way to express this was given in the earlier discussion in the same JIRA ( Names are subject to change)
{code}
public class DeltaFetchConfig1 {
 
  @Input(size = 255) public String column;
  // validate supported operators, can provide a default value etc...
  @Input public String operator;
// if we edit this, override last_value
  @Input(overrides="last_value) public String value;
// relatedTo signifies to the connector that last_value relates to value
  @State(relatedTo="value") public String last_value; 
}

{code}

Starting from scratch always seems to help I guess, so restating the 2 proposals with examples it became clear that what we all wanted was a flexible design to state what inputs are editable by whom ( whether they are editable by user only, connector only, and or by both ), we wanted to support all 3, but leave it to the connector to annotate each input properly ( Now it is an implementation detail, if we want to use attributes on the @Input annotation or use another annotation like @State). It was also pretty clear that all this data now will be now stored in SQ_INPUT. and its values will be in SQ_JOB_INPUT. We may need to add more columns if we add more attributes, but that is again an implementation detail. We may have to add new TYPES to the ENUM - INPUT_TYPE or we may have to add a new class STATE and a new table STATE if we add a new annotation like @State. But all this is up for discussion from an implementation perspective. But when a user says show job, he/she will see all this information in one place. When the user says edit job, he/she can edit the editable inputs in one place and look at the non-editable inputs in one place.

Here are the following top things we will tackle .

1. Clearly define the semantics of the annotations/ attributes to support USER/ CONNECTOR/ BOTH editable.
2. We have define a way to define relationship between the inputs, say Input1 depends on/relatesTo/ overrides Input2, this gives connectors more control and ease to choose the correct value when running subsequent job runs
3. We should be careful about inputs that both user and connector can edit, since the ordering of edits might lead to some race conditions since we do not maintain history yet, so add proper validation on the relationships between the attribute values across inputs.
4. Ability to see all the configs for jobs and then all the inputs and its attributes they have - a command line tool commands
5. Finally if would be nice to start storing history when we make config edits, i.e not just have last snapshot of the configs, but somehow we can map the configs used for every job run. ( This is covered in SQOOP-2025)


For the #1-#4, here are the details of the implementation we propose

#1.We will keep the @Input annotation for everything,  and add "editable" attribute on the Input annotation. The options for the editable will be a ENUM( USER_ONLY, CONNECTOR_ONLY, ANY, os that in future we have more it can include them too, not just BOTH)
{code}
@Input ( editable =USER_ONLY) String primaryKey;
@Input ( editable =CONNECTOR_ONLY) String tempDirectory;
@Input ( editable =ANY) Long lastValue;
{code}
#2. For relationships, we will add a new attribute overrides, ( that can be 1-many at this point), i,e a input can override values of one or more inputs. 
Example, a connector might choose to implement
{code}
@Input ( editable =USER_ONLY, overrides="last_value") Long user_value;
@Input ( editable =CONNECTOR_ONLY) Long lastValue;
{code}
Implementation: We will add a new table to store this 1-M relationship. We will at the minimum add some checks to ensure there is cycles, and ensure such complex relationships between config attributes wont be allowed. We also will allow this for intra config only and for attributes across config objects. If we need to define relationships between attributes, it is best to group them one config object.


#3. Since we want to control the relationships between inputs, and check for cycles and make sure it does not manifest itself as confusing to the users we will do some validations.
There might some edge cases here, that I will think through more in implementation, for now these are validation rules

------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
USER_ONLY                    ( not itself!! ) other  CONNECTOR_ONLY and ANY inputs, dont allow overriding other USER_ONLY attributes
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
CONNECTOR_ONLY      can override anything, it can override other CONNECTOR_ONLY inputs possible, so lets allow everything, 
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
ANY                                  can override anything
------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|

#4 - SQOOp-1516 will answer it, to provide command to see configs/ edit configs etc.


> Repository Structure + API: Storing/Retrieving the From/To state of the incremental read/ write
> -----------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-1804
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1804
>             Project: Sqoop
>          Issue Type: Sub-task
>            Reporter: Veena Basavaraj
>            Assignee: Veena Basavaraj
>             Fix For: 1.99.5
>
>         Attachments: SQOOP-1804.patch
>
>
> Details of this proposal are in the wiki.
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+And+Merge+Design#DeltaFetchAndMergeDesign-Wheretostoretheoutputinsqoop?
> Update: The above highlights the pros and cons of each approach. 
> #4 is chosen, since it is less intrusive, more clean and allows U/Edit per value in the output easily.
> Will use this ticket for more detailed discussion on storage options for the output from connectors
> 1. 
> {code}
> // will have FK to submission
>  public static final String QUERY_CREATE_TABLE_SQ_JOB_OUTPUT_SUBMISSION =
>      "CREATE TABLE " + TABLE_SQ_JOB_OUTPUT + " ("
>      + COLUMN_SQ_JOB_OUT_ID + " BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1), "
>      + COLUMN_SQ_JOB_OUT_KEY + " VARCHAR(32), "
>      + COLUMN_SQ_JOB_OUT_VALUE + " LONG VARCHAR,"
>      + COLUMN_SQ_JOB_OUT_TYPE + " VARCHAR(32),"
>      + COLUMN_SQD_ID + " VARCHAR(32)," // FK to the direction table, since this allows to distinguish output from FROM/ TO part of the job
>    + COLUMN_SQRS_SUBMISSION + " BIGINT, "
>    + "CONSTRAINT " + CONSTRAINT_SQRS_SQS + " "
>      + "FOREIGN KEY (" + COLUMN_SQRS_SUBMISSION + ") "
>        + "REFERENCES " + TABLE_SQ_SUBMISSION + "(" + COLUMN_SQS_ID + ") ON DELETE CASCADE "
> {code}
> 2.
> At the code level, we will define  MOutputType, one of the types can be BLOB as well, if a connector decides to store the value as a BLOB
> {code}
> class JobOutput {
> String key;
> Object value;
> MOutputType type;
> }
> {code}
> 3. 
> At the repository API, add a new API to get job output for a particular submission Id and allow updates on values. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)