SQLServer CDC Connector #

The SQLServer CDC connector allows for reading snapshot data and incremental data from SQLServer database. This document describes how to setup the SQLServer CDC connector to run SQL queries against SQLServer databases.

Dependencies #

In order to setup the SQLServer CDC connector, the following table provides dependency information for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.

Maven dependency #

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-sqlserver-cdc</artifactId>
    <version>3.4.0</version>
</dependency>

SQL Client JAR #

Download link is available only for stable releases.

Download flink-sql-connector-sqlserver-cdc-3.4.0.jar and put it under <FLINK_HOME>/lib/.

Note: Refer to flink-sql-connector-sqlserver-cdc, more released versions will be available in the Maven central warehouse.

Setup SQLServer Database #

A SQL Server administrator must enable change data capture on the source tables that you want to capture. The database must already be enabled for CDC. To enable CDC on a table, a SQL Server administrator runs the stored procedure sys.sp_cdc_enable_table for the table.

Prerequisites:

CDC is enabled on the SQL Server database.
The SQL Server Agent is running.
You are a member of the db_owner fixed database role for the database.

Procedure:

Connect to the SQL Server database by database management studio.
Run the following SQL statement to enable CDC on the table.

USE MyDB
GO

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',     -- Specifies the schema of the source table.
@source_name   = N'MyTable', -- Specifies the name of the table that you want to capture.
@role_name     = N'MyRole',  -- Specifies a role MyRole to which you can add users to whom you want to grant SELECT permission on the captured columns of the source table. Users in the sysadmin or db_owner role also have access to the specified change tables. Set the value of @role_name to NULL, to allow only members in the sysadmin or db_owner to have full access to captured information.
@filegroup_name = N'MyDB_CT',-- Specifies the filegroup where SQL Server places the change table for the captured table. The named filegroup must already exist. It is best not to locate change tables in the same filegroup that you use for source tables.
@supports_net_changes = 0
GO

Verifying that the user has access to the CDC table

--The following example runs the stored procedure sys.sp_cdc_help_change_data_capture on the database MyDB:
USE MyDB;
GO
EXEC sys.sp_cdc_help_change_data_capture
GO

The query returns configuration information for each table in the database that is enabled for CDC and that contains change data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.

How to create a SQLServer CDC table #

The SqlServer CDC table can be defined as following:

-- register a SqlServer table 'orders' in Flink SQL
CREATE TABLE orders (
    id INT,
    order_date DATE,
    purchaser INT,
    quantity INT,
    product_id INT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'sqlserver-cdc',
    'hostname' = 'localhost',
    'port' = '1433',
    'username' = 'sa',
    'password' = 'Password!',
    'database-name' = 'inventory',
    'table-name' = 'dob.orders'
);

-- read snapshot and binlogs from orders table
SELECT * FROM orders;

Connector Options #

Option	Required	Default	Type	Description
connector	required	(none)	String	Specify what connector to use, here should be `'sqlserver-cdc'`.
hostname	required	(none)	String	IP address or hostname of the SQLServer database.
username	required	(none)	String	Username to use when connecting to the SQLServer database.
password	required	(none)	String	Password to use when connecting to the SQLServer database.
database-name	required	(none)	String	Database name of the SQLServer database to monitor.
table-name	required	(none)	String	Table name of the SQLServer database to monitor, e.g.: "db1.table1"
port	optional	1433	Integer	Integer port number of the SQLServer database.
server-time-zone	optional	UTC	String	The session time zone in database server, e.g. "Asia/Shanghai".
scan.incremental.snapshot.enabled	optional	true	Boolean	Whether enable parallelism snapshot.
chunk-meta.group.size	optional	1000	Integer	The group size of chunk meta, if the meta size exceeds the group size, the meta will be divided into multiple groups.
chunk-key.even-distribution.factor.lower-bound	optional	0.05d	Double	The lower bound of chunk key distribution factor. The distribution factor is used to determine whether the table is evenly distribution or not. The table chunks would use evenly calculation optimization when the data distribution is even, and the query for splitting would happen when it is uneven. The distribution factor could be calculated by (MAX(id) - MIN(id) + 1) / rowCount.
chunk-key.even-distribution.factor.upper-bound	optional	1000.0d	Double	The upper bound of chunk key distribution factor. The distribution factor is used to determine whether the table is evenly distribution or not. The table chunks would use evenly calculation optimization when the data distribution is even, and the query for splitting would happen when it is uneven. The distribution factor could be calculated by (MAX(id) - MIN(id) + 1) / rowCount.
debezium.*	optional	(none)	String	Pass-through Debezium's properties to Debezium Embedded Engine which is used to capture data changes from SQLServer. For example: `'debezium.snapshot.mode' = 'initial_only'`. See more about the Debezium's SQLServer Connector properties
scan.incremental.close-idle-reader.enabled	optional	false	Boolean	Whether to close idle readers at the end of the snapshot phase. The flink version is required to be greater than or equal to 1.14 when 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' is set to true. If the flink version is greater than or equal to 1.15, the default value of 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' has been changed to true, so it does not need to be explicitly configured 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' = 'true'
scan.incremental.snapshot.chunk.key-column	optional	(none)	String	The chunk key of table snapshot, captured tables are split into multiple chunks by a chunk key when read the snapshot of table. By default, the chunk key is the first column of the primary key. A column that is not part of the primary key can be used as a chunk key, but this may lead to slower query performance. Warning: Using a non-primary key column as a chunk key may lead to data inconsistencies. Please see Warning for details.
scan.incremental.snapshot.unbounded-chunk-first.enabled	optional	false	Boolean	Whether to assign the unbounded chunks first during snapshot reading phase. This might help reduce the risk of the TaskManager experiencing an out-of-memory (OOM) error when taking a snapshot of the largest unbounded chunk. Experimental option, defaults to false.

Available Metadata #

The following format metadata can be exposed as read-only (VIRTUAL) columns in a table definition.

Key	DataType	Description
table_name	STRING NOT NULL	Name of the table that contain the row.
schema_name	STRING NOT NULL	Name of the schema that contain the row.
database_name	STRING NOT NULL	Name of the database that contain the row.
op_ts	TIMESTAMP_LTZ(3) NOT NULL	It indicates the time that the change was made in the database. If the record is read from snapshot of the table instead of the change stream, the value is always 0.

Limitation #

Can’t perform checkpoint during scanning snapshot of tables #

During scanning snapshot of database tables, since there is no recoverable position, we can’t perform checkpoints. In order to not perform checkpoints, SqlServer CDC source will keep the checkpoint waiting to timeout. The timeout checkpoint will be recognized as failed checkpoint, by default, this will trigger a failover for the Flink job. So if the database table is large, it is recommended to add following Flink configurations to avoid failover because of the timeout checkpoints:

execution.checkpointing.interval: 10min
execution.checkpointing.tolerable-failed-checkpoints: 100
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647

The extended CREATE TABLE example demonstrates the syntax for exposing these metadata fields:

CREATE TABLE products (
    table_name STRING METADATA  FROM 'table_name' VIRTUAL,
    schema_name STRING METADATA  FROM 'schema_name' VIRTUAL,
    db_name STRING METADATA FROM 'database_name' VIRTUAL,
    operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
    id INT NOT NULL,
    name STRING,
    description STRING,
    weight DECIMAL(10,3)
) WITH (
    'connector' = 'sqlserver-cdc',
    'hostname' = 'localhost',
    'port' = '1433',
    'username' = 'sa',
    'password' = 'Password!',
    'database-name' = 'inventory',
    'table-name' = 'dbo.products'
);

Features #

Exactly-Once Processing #

The SQLServer CDC connector is a Flink Source connector which will read database snapshot first and then continues to read change events with exactly-once processing even failures happen. Please read How the connector works.

Startup Reading Position #

The config option scan.startup.mode specifies the startup mode for SQLServer CDC consumer. The valid enumerations are:

initial (default): Takes a snapshot of structure and data of captured tables; useful if topics should be populated with a complete representation of the data from the captured tables.
initial-only: Takes a snapshot of structure and data like initial but instead does not transition into streaming changes once the snapshot has completed.
latest-offset: Takes a snapshot of the structure of captured tables only; useful if only changes happening from now onwards should be propagated to topics.

Note: the mechanism of scan.startup.mode option relying on Debezium’s snapshot.mode configuration. So please do not use them together. If you specific both scan.startup.mode and debezium.snapshot.mode options in the table DDL, it may make scan.startup.mode doesn’t work.

Single Thread Reading #

The SQLServer CDC source can’t work in parallel reading, because there is only one task can receive change events.

DataStream Source #

The SQLServer CDC connector can also be a DataStream source. You can create a SourceFunction as the following shows:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.cdc.connectors.sqlserver.SqlServerSource;

public class SqlServerSourceExample {
  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = SqlServerSource.<String>builder()
      .hostname("localhost")
      .port(1433)
      .database("sqlserver") // monitor sqlserver database
      .tableList("dbo.products") // monitor products table
      .username("sa")
      .password("Password!")
      .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute();
  }
}

The SQLServer CDC incremental connector (after 2.4.0) can be used as the following shows:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import org.apache.flink.cdc.connectors.base.options.StartupOptions;
import org.apache.flink.cdc.connectors.sqlserver.source.SqlServerSourceBuilder;
import org.apache.flink.cdc.connectors.sqlserver.source.SqlServerSourceBuilder.SqlServerIncrementalSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;

public class SqlServerIncrementalSourceExample {
    public static void main(String[] args) throws Exception {
        SqlServerIncrementalSource<String> sqlServerSource =
                new SqlServerSourceBuilder()
                        .hostname("localhost")
                        .port(1433)
                        .databaseList("inventory")
                        .tableList("dbo.products")
                        .username("sa")
                        .password("Password!")
                        .deserializer(new JsonDebeziumDeserializationSchema())
                        .startupOptions(StartupOptions.initial())
                        .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // enable checkpoint
        env.enableCheckpointing(3000);
        // set the source parallelism to 2
        env.fromSource(
                        sqlServerSource,
                        WatermarkStrategy.noWatermarks(),
                        "SqlServerIncrementalSource")
                .setParallelism(2)
                .print()
                .setParallelism(1);

        env.execute("Print SqlServer Snapshot + Change Stream");
    }
}

Tables Without primary keys #

Starting from version 3.4.0, SQLServer CDC support tables that do not have a primary key. To use a table without primary keys, you must configure the scan.incremental.snapshot.chunk.key-column option and specify one non-null field.

There are two places that need to be taken care of.

If there is an index in the table, try to use a column which is contained in the index in scan.incremental.snapshot.chunk.key-column. This will increase the speed of select statement.
The processing semantics of a SQLServer CDC table without primary keys is determined based on the behavior of the column that are specified by the scan.incremental.snapshot.chunk.key-column.

If no update operation is performed on the specified column, the exactly-once semantics is ensured.
If the update operation is performed on the specified column, only the at-least-once semantics is ensured. However, you can specify primary keys at downstream and perform the idempotence operation to ensure data correctness.

Warning #

Using a non-primary key column as the scan.incremental.snapshot.chunk.key-column for a SQLServer table with primary keys may lead to data inconsistencies. Below is a scenario illustrating this issue and recommendations to mitigate potential problems.

Problem Scenario #

Table Structure:
- Primary Key: id
- Chunk Key Column: pid (Not a primary key)
Snapshot Splits:
- Split 0: 1 < pid <= 3
- Split 1: 3 < pid <= 5
Operation:
- Two different subtasks are reading Split 0 and Split 1 concurrently.
- An update operation changes pid from 2 to 4 for id=0 while both splits are being read. This update occurs between the low and high watermark of both splits.
Result:
- Split 0: Contains the record [id=0, pid=2]
- Split 1: Contains the record [id=0, pid=4]

Since the order of processing these records cannot be guaranteed, the final value of pid for id=0 may end up being either 2 or 4, leading to potential data inconsistencies.

Available Source metrics #

Metrics can help understand the progress of assignments, and the following are the supported Flink metrics:

Group	Name	Type	Description
namespace.schema.table	isSnapshotting	Gauge	Weather the table is snapshotting or not
namespace.schema.table	isStreamReading	Gauge	Weather the table is stream reading or not
namespace.schema.table	numTablesSnapshotted	Gauge	The number of tables that have been snapshotted
namespace.schema.table	numTablesRemaining	Gauge	The number of tables that have not been snapshotted
namespace.schema.table	numSnapshotSplitsProcessed	Gauge	The number of splits that is being processed
namespace.schema.table	numSnapshotSplitsRemaining	Gauge	The number of splits that have not been processed
namespace.schema.table	numSnapshotSplitsFinished	Gauge	The number of splits that have been processed
namespace.schema.table	snapshotStartTime	Gauge	The time when the snapshot started
namespace.schema.table	snapshotEndTime	Gauge	The time when the snapshot ended

Notice:

The group name is namespace.schema.table, where namespace is the actual database name, schema is the actual schema name, and table is the actual table name.
For SQLServer, the group name will be like test_database.test_schema.test_table.

Data Type Mapping #

SQLServer type	Flink SQL type
char(n)	CHAR(n)
varchar(n) nvarchar(n) nchar(n)	VARCHAR(n)
text ntext xml	STRING
decimal(p, s) money smallmoney	DECIMAL(p, s)
numeric	NUMERIC
float real	DOUBLE
bit	BOOLEAN
int	INT
tinyint	SMALLINT
smallint	SMALLINT
bigint	BIGINT
date	DATE
time(n)	TIME(n)
datetime2 datetime smalldatetime	TIMESTAMP(n)
datetimeoffset	TIMESTAMP_LTZ(3)