If you’re using AWS Data Pipeline you’ve probably noticed that pipelines work better when run incrementally. There are many examples of how to use Data Pipeline on the AWS site, but the one example I was missing was how to set up an incremental sync between two databases.
It turns out that just setting OVERWRITE_EXISTING as the myInsertMode is not going to cut it. You need to write your insert query as an UPSERT.
To do this you can define it as follows:
{
  "database": {
    "ref": "destination_database"
  },
  "name": "DestTable",
  "insertQuery": "INSERT INTO #{table} VALUES (?,?,?) ON DUPLICATE KEY UPDATE id=VALUES(id), email=VALUES(email), is_subscribed=VALUES(is_subscribed);",
  "id": "SqlDataNodeId_xxxx",
  "type": "SqlDataNode",
  "table": "my_destination_table"
}
In this example the query is defined as:
INSERT INTO #{table} VALUES (?,?,?) ON DUPLICATE KEY UPDATE id=VALUES(id), email=VALUES(email), is_subscribed=VALUES(is_subscribed);
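To see what this does on the MySQL side, here is a minimal sketch of the statement the pipeline effectively runs for each row. The table definition is hypothetical, made up to match the column names in the query; your schema will differ:

-- Hypothetical destination table; ON DUPLICATE KEY UPDATE matches
-- against the PRIMARY KEY (or any UNIQUE index) on insert.
CREATE TABLE my_destination_table (
    id            INT          NOT NULL PRIMARY KEY,
    email         VARCHAR(255) NOT NULL,
    is_subscribed TINYINT(1)   NOT NULL DEFAULT 0
);

-- What the pipeline effectively executes, with the three ?
-- placeholders bound to (id, email, is_subscribed):
INSERT INTO my_destination_table VALUES (1, 'user@example.com', 1)
ON DUPLICATE KEY UPDATE
    id = VALUES(id),
    email = VALUES(email),
    is_subscribed = VALUES(is_subscribed);

-- Rerunning the statement with changed values updates the existing
-- row in place instead of failing with a duplicate-key error, which
-- is what makes the pipeline safe to run incrementally:
INSERT INTO my_destination_table VALUES (1, 'user@example.com', 0)
ON DUPLICATE KEY UPDATE
    id = VALUES(id),
    email = VALUES(email),
    is_subscribed = VALUES(is_subscribed);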
Make sure your destination table does not have any additional UNIQUE keys on the fields you’re syncing beyond the key the upsert matches on, as extra unique indexes will break the sync and the pipeline will complain about duplicate entries on import.
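To make that failure mode concrete, here is a sketch with the same hypothetical table but an extra UNIQUE key on email. An incoming row that matches one existing row on the primary key and a different existing row on the unique key cannot be resolved cleanly:

-- Same hypothetical table, but with a UNIQUE key on a synced field:
CREATE TABLE risky_destination_table (
    id            INT          NOT NULL PRIMARY KEY,
    email         VARCHAR(255) NOT NULL UNIQUE,
    is_subscribed TINYINT(1)   NOT NULL DEFAULT 0
);

INSERT INTO risky_destination_table VALUES
    (1, 'a@example.com', 1),
    (2, 'b@example.com', 1);

-- This row matches row 1 on the primary key but row 2 on the UNIQUE
-- email key. MySQL resolves the upsert against one index, and the
-- resulting UPDATE then violates the other, so the statement fails
-- with a duplicate-entry error (1062) and the pipeline run aborts:
INSERT INTO risky_destination_table VALUES (1, 'b@example.com', 0)
ON DUPLICATE KEY UPDATE
    id = VALUES(id),
    email = VALUES(email),
    is_subscribed = VALUES(is_subscribed);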