Initialization / Createstandby

Use the action createstandby to initially setup a primary/standby environment. The following steps must be performed as user root:

The createstandby action can cause a short outage of the SKOOR system due to a restart of the database (possible automatic configuration change). It will be fully functional and accessible afterwards. 

Be aware of the following:

  • Databases with a size of 100GB or larger normally need more WAL files than the default of 2048. The parameter wal_keep_size must be increased before the replication can be initialized:

    vi /var/lib/pgsql/data/postgresql.skoor.conf

    Change the parameter (the value can be larger than 4096, consider the free space on the disk):

    wal_keep_size = 4096
  • The script will sync DB tables from the primary to the standby. Check the size of the whole DB using:

    du -sh /var/lib/pgsql

    It is very likely that there will be several gigabytes of data to transfer. In this case, initializing the replication is best done in the evening.

  • Reinitialization does not copy over all files. If an initialization was already done once with createstandby, but it has to be repeated because for some reason the replication is broken, much less data will be transferred from the primary to the standby.

  • However, each database file must be compared by checksum on both sides so the replication mechanisms is able to determine which files need to be transferred. Syncing based solely on database file sizes and their modification times is not reliable enough. Therefore, before the actual syncing starts, the whole database needs to be checksummed on both sides, which can take several hours, depending on host performance and database size.

  • It is a good idea to complete a full backup on the current active system before starting the initialization. This can be done with the command:

    cd /var/lib/pgsql
    sudo -u postgres bash -l -c '/opt/eranger/bin/eranger-server-backup.sh full'

    which will save a full backup to /opt/eranger/server/backups/.

To start the initialization, the createstandby command must be issued on the primary. Since this command can run for several hours, depending on the amount of data that needs to be transferred and the network bandwidth between primary and standby, it is recommended to run the command within a screen session. The screen command allows running a command remotely without the command being terminated if the connection to the remote host is disconnected for some reason. Issue a new screen session (with a large enough history scrollback buffer) on the primary as root using the following command:

screen -h 10000

Then within this screen session run the following command to start the standby initialization process:

/opt/eranger/bin/eranger-server-replication.pl createstandby

The following is an example output of the above command:

192.168.56.221 prepare standby system 192.168.56.222 (primary system 192.168.56.221)..
192.168.56.221
192.168.56.221
192.168.56.221 ============================================================
192.168.56.221 == Setting up standby system on 192.168.56.222
192.168.56.221 == THIS WILL DESTROY DATABASE on 192.168.56.222
192.168.56.221 == make sure you have a backup
192.168.56.221 ============================================================
192.168.56.221
192.168.56.221 if you continue, we will call above command with ssh
press ENTER to continue, Ctrl-C to abort >

Press Enter to continue. The following output will be shown:

192.168.56.221 checking ssh for user reranger
192.168.56.222 stopped skoor syncfs
Removed symlink /etc/systemd/system/multi-user.target.wants/eranger-server.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/eranger-report.service.
192.168.56.221 restart postgresql-13.service..
192.168.56.221 SKOOR Engine start httpd at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 httpd already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-report at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-report already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-server at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-server already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-ethd at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-ethd (service eranger-ethd )..
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-eth-alerter at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-eth-alerter (service eranger-eth-alerter )..
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-collector at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-collector already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-agent at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-agent already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-ic-alerter at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-ic-alerter (service eranger-ic-alerter )..
192.168.56.221 done
192.168.56.221 copied file to /var/lib/pgsql/13/data/
192.168.56.221 copied file to /var/lib/pgsql/data/ng_tblspc/
192.168.56.221 rsync done
NOTICE:  all required WAL segments have been archived
192.168.56.222 stopped skoor syncfs
192.168.56.222 Enable and start eranger-replication-tunnel.service at /opt/eranger/bin/eranger-server-replication.pl line 1751.
192.168.56.221 started skoor syncfs

The createstandby command can also be run in non-interactive mode using the -f parameter. The output will be shorter in this case:

Please be aware that this command makes this system to the primary server and immediately destroys the database on the other node

# eranger-server-replication.pl -f createstandby
192.168.56.221 prepare standby system 192.168.56.222 (primary system 192.168.56.221)..
192.168.56.221
192.168.56.221 checking ssh for user reranger
192.168.56.222 stopped skoor syncfs
192.168.56.221 restart postgresql-13.service..
192.168.56.221 SKOOR Engine start httpd at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 httpd already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-report at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-report already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-server at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-server already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-ethd at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-ethd (service eranger-ethd )..
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-eth-alerter at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-eth-alerter (service eranger-eth-alerter )..
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-collector at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-collector already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-agent at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 eranger-agent already running (not starting)
192.168.56.221 done
192.168.56.221 SKOOR Engine start eranger-ic-alerter at /opt/eranger/bin/eranger-server-replication.pl line 1818.
192.168.56.221 start eranger-ic-alerter (service eranger-ic-alerter )..
192.168.56.221 done
192.168.56.221 copied file to /var/lib/pgsql/13/data/
192.168.56.221 copied file to /var/lib/pgsql/data/ng_tblspc/
192.168.56.221 rsync done
NOTICE:  all required WAL segments have been archived
192.168.56.222 stopped skoor syncfs
192.168.56.222 Enable and start eranger-replication-tunnel.service at /opt/eranger/bin/eranger-server-replication.pl line 1751.
192.168.56.221 started skoor syncfs

The screen session can be left anytime by pressing first "Ctrl-a" and then "d". To re-enter the screen session, enter the following re-attach command:

screen -h 10000 -r

The script does the following:

  • Compare the checksum of every database file between primary and standby.

  • Stop the eranger-server service on the standby if it is running

  • The files /var/lib/pgsql/data/recovery.signal and skoor-replication.conf (which is included by postgresql.skoor.conf) are created on the standby. Those allow the standby to know its primary's IP address, the replication port, credentials, and the path to the trigger file. The following shows an example of a skoor-replication.con file on the standby:

    primary_conninfo = 'host=10.1.0.88 port=5432 user=replication password=replication'
    promote_trigger_file = '/tmp/postgresql.trigger.5432'

    The promote_trigger_file does not exist if the replication is running. Creating it causes the replication to end (i.e. in case of a switch/failover).
    recovery.signal is only a marker file and hence empty.

  • The parts of the database that differ between the two hosts are transferred using an rsync over ssh process. The following paths are excluded from the sync process:

    • /var/lib/pgsql/data/pg_xlog/ (the binary logs, i.e. the WAL files are not synced during createstandby, their content is only replicated after the createstandby process has started)

    • /var/lib/pgsql/data/recovery.signal

    • /var/lib/pgsql/data/skoor-replication.conf

    • /var/lib/pgsql/data/postmaster.pid

  • The replication is started

 During the createstandby process, the primary can be used normally to accept measurement data from collectors. Also the SKOOR web interface can be used, however, the performance may be degraded due to the checksum calculations and the transfer of large files across the network.
The time it takes to initialize a replication for the first time mainly depends upon the size of the database, the network speed and the host's performance.

Each comment about what the script does is prefixed by the IP address of the host on which the task is being processed.

After the command has finished running, check its exit code. It should be zero:

...
10.1.0.88 done
10.1.0.88 copied file to /var/lib/pgsql/data/
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
# echo $?
0

The replication is now started and from now on all database queries that are executed on the primary are replicated to the standby immediately.