Syncing Redshift & PostgreSQL in real-time with Kafka Connect
Syncing Redshift & PostgreSQL in real-time with Kafka Connect #
Before showing Kafka Connect, we'll walk through some setup:
- Setting up a distributed Kafka cluster
- Setting up a PostgreSQL database on AWS RDS
- Setting up an AWS Redshift instance
- Setting up Confluent's open source platform

If you're curious about how Kafka Connect works, I highly recommend reading about the concepts, architecture, and internals of Kafka Connect in Confluent's platform documentation.

Now that we have some data in our PostgreSQL table, we can use Kafka Connect to get these rows as messages in a Kafka topic and have a process listening for any inserts/updates on this table.
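The table and sample rows from the original walkthrough aren't shown here; as a purely hypothetical sketch, assume a users table with an auto-incrementing id and a modified timestamp column, since those are the two columns the connector configuration below relies on:

CREATE TABLE users (
    id       SERIAL PRIMARY KEY,       -- hypothetical incrementing ID column
    name     VARCHAR(100) NOT NULL,
    modified TIMESTAMP DEFAULT now()   -- hypothetical last-modified timestamp column
);

INSERT INTO users (name) VALUES ('alice'), ('bob');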
In the JDBC source connector configuration, change the following properties:
- timestamp.column.name: the column name which has the timestamps
- incrementing.column.name: the column which has incremental IDs
- topic.prefix: to identify the Kafka topics ingested from PostgreSQL, we can specify a prefix value which will be prepended to the table name, so the topic name will be prefix + table name

The source-postgres.properties should look like this:
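(A minimal sketch, assuming the hypothetical users table above and placeholder RDS connection details.)

name=source-postgres
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
# Replace the endpoint, database and credentials with your own RDS values
connection.url=jdbc:postgresql://<rds-endpoint>:5432/<database>?user=<user>&password=<password>
mode=timestamp+incrementing
timestamp.column.name=modified
incrementing.column.name=id
table.whitelist=users
topic.prefix=postgres_

With topic.prefix=postgres_ and a users table, the resulting topic name is postgres_users, which matches the topic we create below.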
Schema Registry #
The JDBC connector from Confluent uses Schema Registry to store the schemas for the messages. We can start Schema Registry as follows:
~$ /usr/local/confluent/bin/schema-registry-start /usr/local/confluent/etc/schema-registry/schema-registry.properties &
Ingest data from PostgreSQL tables to Kafka topics #
Let’s create a topic in which we want to consume the updates from PostgreSQL.
~$ /usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic postgres_users
You can check that the topic exists using the following command:
~$ /usr/local/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic postgres_users
We will be using Kafka Connect in standalone mode, and we can start the standalone job to begin consuming data from the PostgreSQL table as follows:
~$ /usr/local/confluent/bin/connect-standalone /usr/local/confluent/etc/schema-registry/connect-avro-standalone.properties /usr/local/confluent/etc/kafka-connect-jdbc/source-postgres.properties
The JDBC connector serializes the data using Avro, and we can use the Avro console consumer provided by Confluent to consume these messages from the Kafka topic.
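For example (a sketch; on newer Confluent versions the consumer takes --bootstrap-server and a schema.registry.url property instead of --zookeeper):

~$ /usr/local/confluent/bin/kafka-avro-console-consumer --zookeeper localhost:2181 --topic postgres_users --from-beginning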
For the Redshift sink connector, change the following properties:
- name: name for the connector
- topics: the Kafka topic to consume data from
- connection.url: JDBC endpoint for Redshift
- auto.create: it is true by default, and we change it to false as we've already created the table in Redshift that this data should be written to

The sink-redshift.properties should look as follows:
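(A minimal sketch, assuming placeholder Redshift connection details; here a PostgreSQL JDBC URL is pointed at the Redshift endpoint, since Redshift speaks the PostgreSQL wire protocol, while a jdbc:redshift:// URL can be used instead if the Redshift JDBC driver is installed.)

name=sink-redshift
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=postgres_users
# Replace the cluster endpoint, database and credentials with your own Redshift values
connection.url=jdbc:postgresql://<redshift-endpoint>:5439/<database>?user=<user>&password=<password>
auto.create=false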
Send messages from the Kafka topic to Redshift #
We are all set to have messages from the Kafka topic written to the Redshift table.
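One way to do this, sketched here under the assumption that sink-redshift.properties sits next to the source config and that we restart the standalone worker so a single process runs both connectors, is:

~$ /usr/local/confluent/bin/connect-standalone /usr/local/confluent/etc/schema-registry/connect-avro-standalone.properties /usr/local/confluent/etc/kafka-connect-jdbc/source-postgres.properties /usr/local/confluent/etc/kafka-connect-jdbc/sink-redshift.properties

From here, any insert or update on the users table should show up as a message on postgres_users and then as a row in the Redshift table.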