Sunday, July 29, 2018

Network Slowness Caused Database Contention That Caused GoldenGate Lag

I got paged because a GoldenGate extract was lagging behind. Checked the extract configuration: it was a normal extract, and it seemed stuck without reporting any error in ggserr.log or anywhere else. It hadn't abended either; it was in the RUNNING state.

Tried stopping and restarting it, but it came back in the RUNNING state, still doing nothing, while the lag kept increasing. So the issue was clearly outside GoldenGate. Checked the database, starting with the alert log, and didn't see any errors there either.

Jumped into the database and ran some queries to see which sessions were active and what they were running. After going through the active sessions, it turned out that a few of them were running long transactions over a dblink; these sessions were several hours old and seemed stuck. They were also putting heavy pressure on the temp tablespace and blocking other sessions. Between the undersized temp tablespace and these stuck long-running transactions, overall database performance was slower than usual too.
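The session triage above can be sketched in Python. This is a minimal, hypothetical sketch: it assumes a snapshot of a few V$SESSION columns (SID, SERIAL#, STATUS, LAST_CALL_ET, BLOCKING_SESSION) has already been fetched into Python objects, for example with a database driver. The class and function names and the one-hour threshold are illustrative choices, not part of the original troubleshooting. It flags ACTIVE sessions whose current call has run longer than the threshold and that other sessions report as their blocker, then builds ALTER SYSTEM KILL SESSION statements for a DBA to review. It does not check whether a session is actually using a dblink; that would need the SQL text or other views.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SessionRow:
    """A few V$SESSION columns from a (hypothetical) snapshot."""
    sid: int
    serial: int                       # SERIAL# in V$SESSION
    status: str                       # 'ACTIVE' or 'INACTIVE'
    last_call_et: int                 # seconds since the session's current call began
    blocking_session: Optional[int]   # SID of the session blocking this one, if any


def stuck_blockers(rows: List[SessionRow], min_seconds: int = 3600) -> List[SessionRow]:
    """Return ACTIVE sessions running at least min_seconds that block another session."""
    blocker_sids = {r.blocking_session for r in rows if r.blocking_session is not None}
    return [
        r for r in rows
        if r.status == "ACTIVE"
        and r.last_call_et >= min_seconds
        and r.sid in blocker_sids
    ]


def kill_statements(rows: List[SessionRow]) -> List[str]:
    """Build KILL SESSION statements for review; nothing is executed here."""
    return [f"ALTER SYSTEM KILL SESSION '{r.sid},{r.serial}' IMMEDIATE;" for r in rows]


if __name__ == "__main__":
    # Made-up snapshot: SID 101 has been active for 5 hours and is blocking SID 202.
    snapshot = [
        SessionRow(sid=101, serial=7, status="ACTIVE", last_call_et=5 * 3600, blocking_session=None),
        SessionRow(sid=202, serial=3, status="ACTIVE", last_call_et=40, blocking_session=101),
        SessionRow(sid=303, serial=9, status="INACTIVE", last_call_et=9000, blocking_session=None),
    ]
    for stmt in kill_statements(stuck_blockers(snapshot)):
        print(stmt)
```

Keeping detection and statement generation separate means the candidate list can be inspected before anything is killed, which matters when the sessions in question belong to an application rather than to you.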

Ran a select statement over that dblink, and it was very slow. Used tnsping to reach the remote database, and the response came back with a noticeable delay. Then network commands such as ping and tracert all pointed to a delay in the network.
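tnsping reports how long it took to reach the remote listener. A rough complement that needs no Oracle client is to time plain TCP connects to the listener port. This is a hedged sketch, assuming the listener is on port 1521 (Oracle's default); it only measures the TCP handshake (plus name resolution, unless you pass an IP address), not an Oracle Net exchange, so it supplements tnsping, ping and tracert rather than replacing them. The function names are hypothetical.

```python
import socket
import statistics
import time
from typing import List, Tuple


def connect_latency_ms(host: str, port: int = 1521, attempts: int = 5,
                       timeout: float = 5.0) -> List[float]:
    """Time TCP connects to host:port and return each attempt in milliseconds.

    Each connection is closed immediately. A refused or timed-out connect raises
    OSError, which is itself a useful signal when checking a remote listener.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Tuple[float, float, float]:
    """Return (min, median, max) of the latency samples."""
    return min(samples), statistics.median(samples), max(samples)
```

For example, `summarize(connect_latency_ms("10.0.0.5"))` against a hypothetical remote address gives a quick min/median/max. Times that are consistently high, or that vary widely compared with a known healthy baseline, point in the same direction as slow ping and tracert results.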

Killed the long-running transaction, as it was going nowhere. That eased the pressure on the temp tablespace, which in turn let the extract catch up on the lag.
