Discussion:
wal-size limited to 16MB - Performance issue for subsequent backup
j***@krogh.cc
2014-10-20 19:03:59 UTC
Permalink
Hi.

One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.

That being said, along comes the backup, scheduled ones a day and tries to
read off these wal-files, which to the backup looks like "an awfull lot of
small files", our backup utillized a single thread to read of those files
and levels of at reading through 30-40MB/s from a 21 drive Raid50 of
rotating drives, which is quite bad. That causes a daily incremental run
to take in the order of 24h. Differential picking up larger deltas and
full are even worse.

One could of-course question a lot of the other things here, but
currently the only limiting factor is actually the backup being
able to keep up transporting the wal-log away from the system due
to the small wal size.

A short test like:
find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null
confirms the backup speed to be roughly the same as seen by the backup
software.
Another test from the same volume doing:
find . -type f -ctime -1 | tail -n 50 | xargs cat > largefile
And then wait for the fs to not cache the file any more and
cat largefile | pipebench > /dev/null
confirms that the disk-subsystem can do way (150-200MB/s) better on larger
files.

Thread here around the same topic:
http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
But not a warm welcoming workaround.

Suggestions are welcome. An archive-command/restore command that could
combine/split wal-segments might be the easiest workaround, but how about
crash-safeness?
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane
2014-10-20 19:08:29 UTC
Permalink
Post by j***@krogh.cc
http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
But not a warm welcoming workaround.
configure --with-wal-segsize=something ?

regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
j***@krogh.cc
2014-10-20 19:27:39 UTC
Permalink
Post by Tom Lane
Post by j***@krogh.cc
http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
But not a warm welcoming workaround.
configure --with-wal-segsize=something ?
Yes, but there are good reasons not to go down that route. One is that:

1) It looks like I'am going to be the only one, beware of the dragons.
2) It requires apparently a re-initdb, thus dump/restore of 4.5TB of data

The latter can absolutely be done by scheduling downtime, but the former
would compromise any level of good practice around our DB-operations.

Jesper
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane
2014-10-20 19:39:03 UTC
Permalink
Post by j***@krogh.cc
Post by Tom Lane
configure --with-wal-segsize=something ?
1) It looks like I'am going to be the only one, beware of the dragons.
2) It requires apparently a re-initdb, thus dump/restore of 4.5TB of data
I think a clean shutdown and pg_resetxlog would be sufficient. I agree
you'd need to do some testing though ...

regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
k***@rice.edu
2014-10-20 19:27:21 UTC
Permalink
Post by j***@krogh.cc
Hi.
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.
...
Suggestions are welcome. An archive-command/restore command that could
combine/split wal-segments might be the easiest workaround, but how about
crash-safeness?
Hi,

Have you considered using something like tar/star in the archive command
to pack them into much larger tar archives?

Regards,
Ken
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund
2014-10-20 19:35:08 UTC
Permalink
Hi,
Post by j***@krogh.cc
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.
Have you analyzed what the source of that volume is? Which version of
postgres are you using? What's your checkpoint_timeout/segments
settings?

Greetings,

Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
j***@krogh.cc
2014-10-20 19:41:26 UTC
Permalink
Post by Andres Freund
Post by j***@krogh.cc
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.
Have you analyzed what the source of that volume is? Which version of
postgres are you using? What's your checkpoint_timeout/segments
settings?
Suggestions are surely welcome. I do suspect the majority is from 30
concurrent processes updating an 506GB GIN index, but it would be nice to
confirm that. There is also a message-queue in the DB with a fairly high
turnaround.

Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
checkpoint_timeout = 30min, checkpoint_segments = 4096.

According to logs checkpoints are roughly 15 minutes apart.

Jesper
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund
2014-10-20 19:49:40 UTC
Permalink
Post by j***@krogh.cc
Post by Andres Freund
Post by j***@krogh.cc
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files
to limit actual disk-usage, this makes the files roughly 30-50% in size.
I'm a bit doubtful that 16MB vs., say, 64MB files really changes
anything substantial for you. If it indeed is a problem, it's simple
enough to join the files temporarily?
Post by j***@krogh.cc
Post by Andres Freund
Have you analyzed what the source of that volume is? Which version of
postgres are you using? What's your checkpoint_timeout/segments
settings?
Suggestions are surely welcome.
Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
a backport of the facility for 9.3 (looking somewhat different than what
is now in 9.5) at
http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com

That'd tell you a fair bit more. It's noticeably harder to backport to <
9.3.
Post by j***@krogh.cc
I do suspect the majority is from 30 concurrent processes updating an
506GB GIN index, but it would be nice to confirm that. There is also a
message-queue in the DB with a fairly high turnaround.
A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?

I'd suspect that the message queue isn't the primary culprit, but it's
hard to say for sure.
Post by j***@krogh.cc
Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
checkpoint_timeout = 30min, checkpoint_segments = 4096.
Generally a high checkpoint_timeout can significantly reduce the WAL
volume because of fewer full page writes. I've seen cases where spacing
checkpoint further apart by a factor of two reduced the overall WAL
volume by more than two.
Post by j***@krogh.cc
According to logs checkpoints are roughly 15 minutes apart.
Can you show log_checkpoints output?

Greetings,

Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
j***@krogh.cc
2014-10-20 20:02:05 UTC
Permalink
Post by Andres Freund
Post by j***@krogh.cc
Post by Andres Freund
Post by j***@krogh.cc
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is
about
Post by Andres Freund
Post by j***@krogh.cc
2.4TB worth of WAL files. I wouldn't say that isn't an issue by
itself,
Post by Andres Freund
Post by j***@krogh.cc
but the system does indeed work fine. We do subsequently gzip the
files
Post by Andres Freund
Post by j***@krogh.cc
to limit actual disk-usage, this makes the files roughly 30-50% in
size.
I'm a bit doubtful that 16MB vs., say, 64MB files really changes
anything substantial for you. If it indeed is a problem, it's simple
enough to join the files temporarily?
I am trying to get my head around a good way to do that. 64MB probably
isn't a silverbullet. But it would definetely benefit the backup in terms
of single thread access to data on rotating drives.
Post by Andres Freund
Post by j***@krogh.cc
Post by Andres Freund
Have you analyzed what the source of that volume is? Which version of
postgres are you using? What's your checkpoint_timeout/segments
settings?
Suggestions are surely welcome.
Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
a backport of the facility for 9.3 (looking somewhat different than what
is now in 9.5) at
http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com
That'd tell you a fair bit more. It's noticeably harder to backport to <
9.3.
I'll bookmark that one.
Post by Andres Freund
Post by j***@krogh.cc
I do suspect the majority is from 30 concurrent processes updating an
506GB GIN index, but it would be nice to confirm that. There is also a
message-queue in the DB with a fairly high turnaround.
A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?
It is for full-text-search, but it is being updated entirely regulary,
~100M records. A dump/restore cycle typically reduces the size to 30-40%
of current size.
Post by Andres Freund
Post by j***@krogh.cc
Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
checkpoint_timeout = 30min, checkpoint_segments = 4096.
Generally a high checkpoint_timeout can significantly reduce the WAL
volume because of fewer full page writes. I've seen cases where spacing
checkpoint further apart by a factor of two reduced the overall WAL
volume by more than two.
I'll work with that, I was just uncomfortable bumping checkpoint_segments
up much higher, any field experience in that corner?
Post by Andres Freund
Post by j***@krogh.cc
According to logs checkpoints are roughly 15 minutes apart.
Can you show log_checkpoints output?
2014-10-20 18:10:22 CEST LOG: checkpoint starting: time
2014-10-20 18:15:44 CEST LOG: checkpoint complete: wrote 76851 buffers
(7.3%); 0 transaction log file(s) added, 0 removed, 3238 recycled;
write=295.834 s, sync=23.903 s, total=322.011 s; sync files=2115,
longest=0.278 s, average=0.011 s
2014-10-20 18:40:22 CEST LOG: checkpoint starting: time
2014-10-20 18:44:30 CEST LOG: checkpoint complete: wrote 60550 buffers
(5.8%); 0 transaction log file(s) added, 0 removed, 3460 recycled;
write=224.678 s, sync=21.795 s, total=248.340 s; sync files=2090,
longest=0.963 s, average=0.010 s
2014-10-20 19:10:22 CEST LOG: checkpoint starting: time
2014-10-20 19:14:11 CEST LOG: checkpoint complete: wrote 42720 buffers
(4.1%); 0 transaction log file(s) added, 0 removed, 3598 recycled;
write=206.259 s, sync=21.185 s, total=229.254 s; sync files=2065,
longest=0.945 s, average=0.010 s
2014-10-20 19:40:22 CEST LOG: checkpoint starting: time
2014-10-20 19:43:31 CEST LOG: checkpoint complete: wrote 32897 buffers
(3.1%); 0 transaction log file(s) added, 0 removed, 3626 recycled;
write=161.801 s, sync=26.936 s, total=189.635 s; sync files=2115,
longest=0.458 s, average=0.012 s
2014-10-20 20:10:22 CEST LOG: checkpoint starting: time
2014-10-20 20:14:04 CEST LOG: checkpoint complete: wrote 37557 buffers
(3.6%); 0 transaction log file(s) added, 0 removed, 3285 recycled;
write=205.011 s, sync=16.550 s, total=222.442 s; sync files=2113,
longest=0.935 s, average=0.007 s
2014-10-20 20:40:22 CEST LOG: checkpoint starting: time
2014-10-20 20:45:18 CEST LOG: checkpoint complete: wrote 58012 buffers
(5.5%); 0 transaction log file(s) added, 0 removed, 3678 recycled;
write=252.750 s, sync=39.178 s, total=295.107 s; sync files=2075,
longest=0.990 s, average=0.018 s
2014-10-20 21:10:22 CEST LOG: checkpoint starting: time
2014-10-20 21:13:31 CEST LOG: checkpoint complete: wrote 40530 buffers
(3.9%); 0 transaction log file(s) added, 0 removed, 3652 recycled;
write=167.925 s, sync=19.719 s, total=189.057 s; sync files=2077,
longest=0.470 s, average=0.009 s
2014-10-20 21:40:22 CEST LOG: checkpoint starting: time
2014-10-20 21:44:20 CEST LOG: checkpoint complete: wrote 45158 buffers
(4.3%); 0 transaction log file(s) added, 0 removed, 3449 recycled;
write=202.986 s, sync=32.564 s, total=237.441 s; sync files=2100,
longest=0.445 s, average=0.015 s
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas
2014-10-22 06:56:02 UTC
Permalink
Post by j***@krogh.cc
Post by Andres Freund
Post by j***@krogh.cc
I do suspect the majority is from 30 concurrent processes updating an
506GB GIN index, but it would be nice to confirm that. There is also a
message-queue in the DB with a fairly high turnaround.
A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?
It is for full-text-search, but it is being updated entirely regulary,
~100M records. A dump/restore cycle typically reduces the size to 30-40%
of current size.
Try 9.4 beta. The on-disk format of GIN indexes was rewritten in 9.4,
making them a lot smaller. That might help with WAL volume too. Or not,
but I'd love to hear what the impact is, in a real-life database :-).

- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jeff Janes
2014-10-21 21:12:11 UTC
Permalink
Post by j***@krogh.cc
Hi.
One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.
That being said, along comes the backup, scheduled ones a day and tries to
read off these wal-files, which to the backup looks like "an awfull lot of
small files", our backup utillized a single thread to read of those files
and levels of at reading through 30-40MB/s from a 21 drive Raid50 of
rotating drives, which is quite bad. That causes a daily incremental run
to take in the order of 24h. Differential picking up larger deltas and
full are even worse.
Why not have archive_command (which gets the files while they are still
cached) put the files directly into their final destination on the backup
server?
Post by j***@krogh.cc
Suggestions are welcome. An archive-command/restore command that could
combine/split wal-segments might be the easiest workaround, but how about
crash-safeness?
I think you would just have to combine them by looking at the file name and
seeking to a specific spot in the large file (rather than just appending to
it) so that if the archive_command fails and gets rerun, it will still end
up in the correct place. I don't see what other crash-safeness issues you
would have, other than the ones you already have. You would want to do the
compression afterward combining, not before, so that all segments are of
predictable size.

It should be pretty easy as long as want your combined files to consist of
either 16 or 256 (or 255 in older versions) WAL files.

You would have to pass through directly any files not matching the filename
pattern of ordinary WAL files.

Cheers,

Jeff
Craig Ringer
2014-10-22 00:43:20 UTC
Permalink
Post by j***@krogh.cc
That being said, along comes the backup, scheduled ones a day and tries to
read off these wal-files, which to the backup looks like "an awfull lot of
small files", our backup utillized a single thread to read of those files
and levels of at reading through 30-40MB/s from a 21 drive Raid50 of
rotating drives, which is quite bad. That causes a daily incremental run
to take in the order of 24h. Differential picking up larger deltas and
full are even worse.
What's the backup system?

151952 files should be a trivial matter for any backup system. I'm very
surprised you're seeing those kind of run times for 2TB of WAL, and
think it's worth investigating just why the backup system is behaving
this way.

What does 'filefrag' say about the WAL segments? Are they generally a
single extent each? If not, how many extents?

It'd be useful to know the kernel version, file system, RAID controller,
whether you use LVM, and other relevant details? What's your RAID
array's stripe size?
Post by j***@krogh.cc
find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null
confirms the backup speed to be roughly the same as seen by the backup
software.
find . -type f -ctime -1 | tail -n 50 | xargs cat > largefile
And then wait for the fs to not cache the file any more and
cat largefile | pipebench > /dev/null
confirms that the disk-subsystem can do way (150-200MB/s) better on larger
files.
OK, so a larger contiguously allocated file looks like it's probably
read faster. That doesn't mean there's any guarantee that big WAL
segment would be allocated contiguously if there are lots of other
writes interspersed, but the FS will try.

(What does 'filefrag' say about your 'largefile'?)

I'm wondering if you're having issues related to a RAID stripe size that
is close to, or bigger than, your WAL segment size. So each segment is
only being read from one disk or a couple of disks. If that's the case
you're probably not getting ideal write performance either.

That said, I don't see any particular reason why readahead wouldn't
result in you getting similar results from multiple smaller WAL segments
that're allocated contiguously, and they usually would be if they're
created one after the other. What are your readahead settings? (There
are often several at different levels; what exists depends on how your
storage is configured, use of LVM, use of SW RAID, etc).

In my opinion RAID 50, or RAID 5, are generally pretty poor options for
a database file system in performance terms anyway. Especially for
transaction logs. RAID 50 is also not wonderfully durable for arrays of
larger numbers of bigger disks given modern disks' sizes, even with the
low block error rates and relatively low disk failure rates. I
personally tend to consider two parity disks the minimum acceptable for
arrays of more than four or five disks. I'd certainly want continuous
archiving or streaming replication in place if I was running RAID 50 on
a big array.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Loading...