xlog.c: WALInsertLock vs. WALWriteLock

fazool mein

2010-10-22 19:08:54 UTC

Hello guys,

I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you briefly explain how 'LW_SHARED' is used? Can we
use it for read purposes?

Thanks a lot.

David Fetter

2010-10-23 20:17:43 UTC

Post by fazool mein
Hello guys,
I'm writing a function that will read data from the buffer in xlog
(i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
sure that I am doing it correctly.

Got an example of what the function might look like?

Post by fazool mein
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
Can we use it for read purposes?

Help me understand. Do you foresee some kind of concurrency issue,
and if so, what?

Cheers,
David.

Post by fazool mein
Thanks a lot.

--
David Fetter <***@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: ***@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tallat Mahmood

2010-10-24 05:34:51 UTC

Post by fazool mein
I'm writing a function that will read data from the buffer in xlog
(i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
sure that I am doing it correctly.

Got an example of what the function might look like?

Say something like this:

bool ReadLogFromBuffer(char *buf, int len, XLogRecPtr p)

which would mean that we want to read len bytes of log (records), starting
at position p, into buf.
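
To make the proposed signature concrete, here is a minimal sketch over a simplified model of the WAL buffers. It is not xlog.c code: WalBufferModel, WalPos, PAGE_SIZE, NUM_PAGES, and the rule that a page lives in slot "page number % NUM_PAGES" are all assumptions made for illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192          /* stand-in for the WAL page size */
#define NUM_PAGES 8             /* stand-in for the number of WAL buffers */

/* Simplified model: a WAL position is a plain 64-bit byte offset. */
typedef uint64_t WalPos;

typedef struct
{
    char    pages[NUM_PAGES][PAGE_SIZE];    /* models XLogCtl->pages */
    WalPos  xlblocks[NUM_PAGES];            /* end position of the page in each slot */
} WalBufferModel;

/*
 * Copy len bytes starting at position p into buf.
 * Returns false if a needed page is not in the ring (buf may then hold a
 * partial copy). Locking is deliberately omitted from this sketch.
 */
bool
ReadLogFromBuffer(const WalBufferModel *ctl, char *buf, int len, WalPos p)
{
    assert(len >= 0);

    while (len > 0)
    {
        uint64_t pageno = p / PAGE_SIZE;
        int      slot = (int) (pageno % NUM_PAGES);    /* assumed mapping */
        int      off = (int) (p % PAGE_SIZE);
        int      n = PAGE_SIZE - off;

        if (n > len)
            n = len;

        /* The slot must currently hold exactly the page that covers p. */
        if (ctl->xlblocks[slot] != (pageno + 1) * PAGE_SIZE)
            return false;

        memcpy(buf, ctl->pages[slot] + off, (size_t) n);
        buf += n;
        p += (WalPos) n;
        len -= n;
    }
    return true;
}
```

The check against xlblocks is what lets the function report that the requested piece of log has already been recycled, in which case the caller would have to get it from somewhere else.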

Post by fazool mein
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
Can we use it for read purposes?

Help me understand. Do you foresee some kind of concurrency issue,
and if so, what?

Yes. For example, while a process is reading from the buffer, another
process may insert new records into the buffer. To give a specific example,
walsender might want to read data from the buffer instead of reading log
from disk. In parallel, there might be transactions on the server that
modify the buffer.

Regards,
Tallat

Robert Haas

2010-10-26 01:31:04 UTC

Post by fazool mein
I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
use it for read purposes?

Holding WALInsertLock in shared mode prevents other processes from
inserting WAL, or in other words it keeps the "end" position from
moving, while holding WALWriteLock in shared mode prevents other
processes from writing the WAL from the buffers out to the operating
system, or in other words it keeps the "start" position from moving.
So you could probably take WALInsertLock in shared mode, figure out
the current end of WAL position, release the lock; then take
WALWriteLock in shared mode, read any WAL before the end of WAL
position, and release the lock. But note that this wouldn't guarantee
that you read all WAL as it's generated....
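
As a rough illustration of what shared mode means here, the sketch below models only the compatibility rules: any number of shared holders may coexist, while an exclusive holder excludes everyone else. ToyLock, toy_try_acquire, and toy_release are hypothetical names invented for this example. The toy is not thread-safe and is not the LWLock implementation; it only shows why a reader holding a lock in shared mode keeps out a process that needs the same lock exclusively.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock modes, named after LW_SHARED / LW_EXCLUSIVE. */
typedef enum { MODE_SHARED, MODE_EXCLUSIVE } LockMode;

/* Toy lock state: NOT thread-safe, it only models which requests are compatible. */
typedef struct
{
    int  shared_holders;    /* number of current shared holders */
    bool exclusive_held;    /* true while one exclusive holder exists */
} ToyLock;

/* Returns true if the lock is granted, false if the caller would have to wait. */
static bool
toy_try_acquire(ToyLock *lock, LockMode mode)
{
    if (lock->exclusive_held)
        return false;               /* an exclusive holder blocks everyone */
    if (mode == MODE_SHARED)
    {
        lock->shared_holders++;     /* shared holders can pile up */
        return true;
    }
    if (lock->shared_holders > 0)
        return false;               /* exclusive must wait for all shared holders */
    lock->exclusive_held = true;
    return true;
}

static void
toy_release(ToyLock *lock, LockMode mode)
{
    if (mode == MODE_SHARED)
    {
        assert(lock->shared_holders > 0);
        lock->shared_holders--;
    }
    else
        lock->exclusive_held = false;
}
```

In this model a reader corresponds to a MODE_SHARED holder and a process inserting or writing WAL to a MODE_EXCLUSIVE requester, so the reader delays the inserter for as long as it holds the lock.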

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Jeff Janes

2010-10-26 15:22:38 UTC

Post by Robert Haas

Post by fazool mein
I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
use it for read purposes?

Holding WALInsertLock in shared mode prevents other processes from
inserting WAL, or in other words it keeps the "end" position from
moving, while holding WALWriteLock in shared mode prevents other
processes from writing the WAL from the buffers out to the operating
system, or in other words it keeps the "start" position from moving.
So you could probably take WALInsertLock in shared mode, figure out
the current end of WAL position, release the lock;

Once you release the WALInsertLock, someone else can grab it and
overwrite the part of the buffer you think you are reading.
So I think you have to hold WALInsertLock throughout the duration of
the operation.

Of course it couldn't be overwritten if the wal record itself is not
yet written from buffer to the OS/disk. But since you are not yet
holding the WALWriteLock, this could be happening at any time.

Post by Robert Haas
then take
WALWriteLock in shared mode, read any WAL before the end of WAL
position, and release the lock. But note that this wouldn't guarantee
that you read all WAL as it's generated....

I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

Jeff

Alvaro Herrera

2010-10-26 15:52:38 UTC

Post by Jeff Janes
I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

And horrible for performance, I imagine. Those locks are highly trafficked.

--
Álvaro Herrera <***@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Tom Lane

2010-10-26 16:09:16 UTC

Post by Alvaro Herrera

Post by Jeff Janes
I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

And horrible for performance, I imagine. Those locks are highly trafficked.

I think you might actually need *both* locks to ensure the WAL buffers
aren't changing underneath you. If you don't have the walwriter locked
out, it is free to change the state of a buffer from "dirty" to
"written" and then to "prepared to receive next page of WAL". If the
latter doesn't involve changing the content of the buffer today, it
still could tomorrow.

And on top of all that, there remains the problem that the piece of WAL
you want might already be gone from the buffers.

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.
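
For comparison, reading already-written log back from a file needs no coordination with the WAL buffers at all. The following is only a sketch under assumptions, not walsender's code: SEG_SIZE (16 MB), locate_in_segment, read_wal_from_file, and the idea that the caller supplies the segment file path are all invented for illustration, and plain stdio is used to keep the example portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE (16u * 1024u * 1024u)  /* assumed segment size (16 MB) */

/* Split a byte position into a segment number and an offset inside that segment. */
static void
locate_in_segment(uint64_t pos, uint64_t *segno, long *off)
{
    *segno = pos / SEG_SIZE;
    *off = (long) (pos % SEG_SIZE);
}

/*
 * Read len bytes at offset off from the file at path (a hypothetical
 * segment file path chosen by the caller). Returns true only if all
 * len bytes could be read.
 */
static bool
read_wal_from_file(const char *path, long off, char *buf, size_t len)
{
    FILE   *f;
    size_t  got = 0;

    assert(buf != NULL);

    f = fopen(path, "rb");
    if (f == NULL)
        return false;
    if (fseek(f, off, SEEK_SET) == 0)
        got = fread(buf, 1, len, f);
    fclose(f);
    return got == len;
}
```

Because the file contents stay in place once written, the reader can go at its own pace without holding either lock.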

regards, tom lane

fazool mein

2010-10-26 18:03:55 UTC

Post by Tom Lane
Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers. Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?
The locking issue might not be a problem in the case of synchronous
replication. There, the primary has to wait anyway for the standby to send
a confirmation before it can do more WAL inserts. Hence, reading from
buffers might be better in this case.

So, as I understand from the emails, we need to lock both WALWriteLock and
WALInsertLock in exclusive mode for reading from buffers. Agreed?

Thanks.

Heikki Linnakangas

2010-10-26 18:13:57 UTC

Post by fazool mein

Post by Tom Lane
Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

Why not? If the reason is performance, I'd like to see some performance
numbers to show that it's worth the trouble. You could perhaps do a
quick and dirty hack that doesn't do the locking 100% correctly first,
and do some benchmarking on that to get a ballpark number of how much
potential there is. Or run oprofile on the current walsender
implementation to see how much time is currently spent reading WAL from
the kernel buffers.

Post by fazool mein
Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Robert Haas

2010-10-26 18:23:32 UTC

On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas

Post by Heikki Linnakangas

Post by fazool mein

Post by Tom Lane
Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

Why not? If the reason is performance, I'd like to see some performance
numbers to show that it's worth the trouble. You could perhaps do a quick
and dirty hack that doesn't do the locking 100% correctly first, and do some
benchmarking on that to get a ballpark number of how much potential there
is. Or run oprofile on the current walsender implementation to see how much
time is currently spent reading WAL from the kernel buffers.

Post by fazool mein
Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

fazool mein

2010-10-26 18:57:22 UTC

Post by Robert Haas
On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas

Post by Heikki Linnakangas

Post by fazool mein
Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

I agree that the standby might get ahead, but this doesn't necessarily lead
to database corruption. Here, the interesting case is what happens when the
primary fails, which can lead to *either* of the following two cases:
1) The standby, due to some triggering mechanism, becomes the new primary.
In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will connect
again to the primary. At this point, *if* somehow we are able to detect that
the standby is ahead, then we should abort the standby and create a standby
from scratch.
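
The comparison itself is simple once both positions are known; obtaining them reliably is the hard part and is not shown. A minimal sketch, assuming a two-field position modeled on the xlogid/xrecoff layout of XLogRecPtr (WalPtr and standby_is_ahead are hypothetical names used only here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a WAL position: log file id plus byte offset within it. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} WalPtr;

/* True if position a is strictly before position b. */
static bool
wal_ptr_lt(WalPtr a, WalPtr b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/*
 * True if the standby has received WAL beyond what the primary still has
 * after coming back, i.e. the standby would need to be rebuilt.
 */
static bool
standby_is_ahead(WalPtr standby_end, WalPtr primary_end)
{
    return wal_ptr_lt(primary_end, standby_end);
}
```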

I agree with Heikki that going through all this trouble only makes sense if
there is a huge performance boost.

Robert Haas

2010-10-26 19:00:05 UTC

Post by fazool mein

Post by Robert Haas
On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas

Post by Heikki Linnakangas

Post by fazool mein
Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

I agree that the standby might get ahead, but this doesn't necessarily lead
to database corruption. Here, the interesting case is what happens when the
primary fails, which can lead to *either* of the following two cases:
1) The standby, due to some triggering mechanism, becomes the new primary.
In this case, even if the standby was ahead, it's fine.

True.

Post by fazool mein
2) The primary comes back as primary. In this case, the standby will connect
again to the primary. At this point, *if* somehow we are able to detect that
the standby is ahead, then we should abort the standby and create a standby
from scratch.

Unless you set restart_after_crash=off, the master could
crash-and-restart before you can do anything about it. But that
setting doesn't exist in the 9.0 branch.

Post by fazool mein
I agree with Heikki that going through all this trouble only makes sense if
there is a huge performance boost.

There's probably quite a large performance boost in the sync rep case
from allowing the master and standby to fsync() in parallel, but first
we need to get something that works at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Josh Berkus

2010-10-26 19:00:05 UTC

Post by fazool mein
I agree that the standby might get ahead, but this doesn't necessarily
lead to database corruption. Here, the interesting case is what happens
when the primary fails, which can lead to *either* of the following two
cases:
1) The standby, due to some triggering mechanism, becomes the new
primary. In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will
connect again to the primary. At this point, *if* somehow we are able to
detect that the standby is ahead, then we should abort the standby and
create a standby from scratch.

Yes. And we weren't able to implement that for 9.0. It's worth
revisiting for 9.1. In fact, the issue of "is the standby ahead of the
master" has come up repeatedly in potential failure scenarios; I think
we're going to need a fairly bulletproof method to determine this.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

Robert Haas

2010-10-26 19:02:27 UTC

Post by fazool mein
I agree that the standby might get ahead, but this doesn't necessarily
lead to database corruption. Here, the interesting case is what happens
when the primary fails, which can lead to *either* of the following two
cases:
1) The standby, due to some triggering mechanism, becomes the new
primary. In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will
connect again to the primary. At this point, *if* somehow we are able to
detect that the standby is ahead, then we should abort the standby and
create a standby from scratch.

Post by Josh Berkus
Yes. And we weren't able to implement that for 9.0. It's worth
revisiting for 9.1. In fact, the issue of "is the standby ahead of the
master" has come up repeatedly in potential failure scenarios; I think
we're going to need a fairly bulletproof method to determine this.

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Fujii Masao

2010-10-27 19:53:29 UTC

Post by fazool mein

Post by Tom Lane
Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

I previously implemented a patch that makes walsender read WAL from the
buffers without holding either WALInsertLock or WALWriteLock. That might be
helpful for you. Please see the following post.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00661.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Markus Wanner

2010-10-27 14:44:20 UTC

Post by Alvaro Herrera
And horrible for performance, I imagine. Those locks are highly trafficked.

Note, however, that offloading this to the file-system just moves
congestion there. So we are effectively saying that we expect
filesystems to do a better job (in that aspect) than our WAL implementation.

(Note that I'm not claiming that is or is not true - I didn't measure).

Regards

Markus Wanner

Alvaro Herrera

2010-10-27 15:06:17 UTC

Post by Markus Wanner

Post by Alvaro Herrera
And horrible for performance, I imagine. Those locks are highly trafficked.

Note, however, that offloading this to the file-system just moves
congestion there. So we are effectively saying that we expect
filesystems to do a better job (in that aspect) than our WAL implementation.

Well, you can just read at your pace from the filesystem; the data is
going to stay there for a long time. WAL buffers are constantly moving,
and aren't as big.

--
Álvaro Herrera <***@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
