Discussion:
CREATE REPLICATION SLOT fails on a timeout
Steve Singer
2014-05-16 20:37:16 UTC
Permalink
I am finding that my logical walsender connections are being terminated
due to a timeout on the CREATE REPLICATION SLOT command. with
"terminating walsender process due to replication timeout"

Below is the stack trace when this happens

#3 0x000000000067df28 in WalSndCheckTimeOut
(now=***@entry=453585463823871) at walsender.c:1748
#4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
#5 logical_read_xlog_page (state=<optimized out>,
targetPagePtr=691724288, reqLen=<optimized out>,
targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005",
pageTLI=<optimized out>) at walsender.c:741
#6 0x00000000004f41bf in ReadPageInternal (state=***@entry=0x17aa140,
pageptr=***@entry=691724288,
reqLen=***@entry=3600) at xlogreader.c:525
#7 0x00000000004f454a in XLogReadRecord (state=0x17aa140,
RecPtr=691727856, ***@entry=0,
errormsg=***@entry=0x7fff7f668c28) at xlogreader.c:228
#8 0x0000000000675e5c in DecodingContextFindStartpoint
(ctx=***@entry=0x17a0358) at logical.c:460
#9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command (
cmd_string=***@entry=0x17f1478 "CREATE_REPLICATION_SLOT
\"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"")
at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
argv=***@entry=0x177db50, dbname=0x177db30 "test1",

(gdb) p last_reply_timestamp
$1 = 0


I propose the attached patch sets last_reply_timestamp to now() it
starts processing a command. Since receiving a command is hearing
something from the client.


Steve
Andres Freund
2014-05-16 20:43:31 UTC
Permalink
Hi,
I am finding that my logical walsender connections are being terminated due
to a timeout on the CREATE REPLICATION SLOT command. with "terminating
walsender process due to replication timeout"
Below is the stack trace when this happens
at walsender.c:1748
#4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at
walsender.c:1216
...
#9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at
walsender.c:800
#10 exec_replication_command () at walsender.c:1291
#11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>,
(gdb) p last_reply_timestamp
$1 = 0
I propose the attached patch sets last_reply_timestamp to now() it starts
processing a command. Since receiving a command is hearing something from
the client.
Hm. Yes, that's a problem.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index 5c11d68..56a2f10
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** exec_replication_command(const char *cmd
*** 1276,1281 ****
--- 1276,1282 ----
parse_rc))));
cmd_node = replication_parse_result;
+ last_reply_timestamp = GetCurrentTimestamp();
switch (cmd_node->type)
{
I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.

Greetings,

Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Steve Singer
2014-05-16 21:02:33 UTC
Permalink
Post by Andres Freund
Hi,
I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.
Greetings,
Andres Freund
That makes sense.
A patch that does that is attached.

Steve
Andres Freund
2014-05-16 23:34:25 UTC
Permalink
Post by Steve Singer
Post by Andres Freund
I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.
That makes sense.
A patch that does that is attached.
I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...

Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.

Thanks for the bugreport!

Greetings,

Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund
2014-05-28 22:41:56 UTC
Permalink
Hi,
Post by Andres Freund
Post by Steve Singer
Post by Andres Freund
I don't think that's going to cut it though. The creation can take
longer than whatever wal_sender_timeout is set to (when there's lots of
longrunning transactions). I think checking whether last_reply_timestamp
= 0 during timeout checking is more robust.
That makes sense.
A patch that does that is attached.
I've attached a heavily revised version of that patch. Didn't touch all
the necessary places...
Survives creation of logical replication slots under 'intensive
pressure', with a wal_sender_timeout=10ms.
Thanks for the bugreport!
Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?

Thanks!

Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Steve Singer
2014-05-30 13:35:33 UTC
Permalink
Post by Andres Freund
Hi,
Pushed a fix for it. I am pretty sure it will, but could you still test
that it fixes your problem?
Thanks!
The fix seems to work (I am no longer getting the timeout on slot creation)

Thanks
Post by Andres Freund
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Loading...