Inefficient barriers on solaris with sun cc

Post by Andres Freund
Binaries compiled on solaris using sun studio cc currently don't have
compiler and memory barriers implemented. That means we fall back to
relatively slow generic implementations for those. Especially compiler,
read, write barriers will be much slower than necessary (since they all
just need to prevent compiler reordering as both sparc and x86 are run
in TSO mode under solaris).
Since my estimate is that we'll use more and more barriers, that's going
to hurt more and more.
I do *not* plan to do anything about it atm, I just thought it might be
helpful to have this stated somewhere searchable.

To put that another way:

If there are any Sun Studio users out there who care about performance
on big iron, please send a patch to fix this...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Oskari Saarenmaa

2014-09-26 12:36:02 UTC

Attached patch implements compiler and memory barriers for Solaris
Studio based on documentation at
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html

I defined read and write barriers as acquire and release barriers
instead of pure read and write ones as that's what other platforms
appear to do.

/ Oskari

Robert Haas

2014-09-26 12:39:38 UTC

Attached patch implements compiler and memory barriers for Solaris Studio
based on documentation at
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
I defined read and write barriers as acquire and release barriers instead of
pure read and write ones as that's what other platforms appear to do.

So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier? That would be
surprising. It's certainly not true in general.

Oskari Saarenmaa

2014-09-26 12:55:52 UTC

Attached patch implements compiler and memory barriers for Solaris Studio
based on documentation at
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
I defined read and write barriers as acquire and release barriers instead of
pure read and write ones as that's what other platforms appear to do.

So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier? That would be
surprising. It's certainly not true in general.

The above doc describes the difference: read barrier requires loads
before the barrier to be completed before loads after the barrier - an
acquire barrier is the same, but it also requires loads to be complete
before stores after the barrier.

Similarly write barrier requires stores before the barrier to be
completed before stores after the barrier - a release barrier is the
same, but it also requires loads before the barrier to be completed
before stores after the barrier.

So acquire is read + loads-before-stores and release is write +
loads-before-stores.

The generic gcc atomics also define read barrier to __ATOMIC_ACQUIRE and
write barrier to __ATOMIC_RELEASE.

/ Oskari

Robert Haas

2014-09-26 14:28:21 UTC


Oskari Saarenmaa

2014-09-26 16:14:03 UTC

Post by Robert Haas
So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier? That would be
surprising. It's certainly not true in general.

The above doc describes the difference: read barrier requires loads before
the barrier to be completed before loads after the barrier - an acquire
barrier is the same, but it also requires loads to be complete before stores
after the barrier.
Similarly write barrier requires stores before the barrier to be completed
before stores after the barrier - a release barrier is the same, but it also
requires loads before the barrier to be completed before stores after the
barrier.
So acquire is read + loads-before-stores and release is write +
loads-before-stores.

Hmm. My impression was that an acquire barrier means that loads and
stores can migrate forward across the barrier but not backward; and
that a release barrier means that loads and stores can migrate
backward across the barrier but not forward. I'm actually not really
sure what this means unless the barrier also does something in and of

[...]

Post by Robert Haas
With the definition you (and Oracle) propose, this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1. What good is that? Apparently, I don't
have any idea!

I'm not proposing any definition for acquire or release barriers, I was
just proposing to use the things Solaris Studio defines as acquire and
release barriers to implement read and write barriers in PostgreSQL
because similar barrier names are used with gcc and on Solaris Studio
acquire is a stronger read barrier and release is a stronger write
barrier. atomics.h's definition of pg_(read|write)_barrier doesn't have
any requirements for loads before stores, though, so we could use
__machine_r_barrier and __machine_w_barrier instead.

But as Andres pointed out all this is probably unnecessary and we could
define read and write barrier as __compiler_barrier with Solaris Studio
cc. It's only available for Solaris (x86 and Sparc) and Linux (x86).
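A minimal sketch of the read/write variant, assuming the `__compiler_barrier`, `__machine_r_barrier`, `__machine_w_barrier` and `__machine_rw_barrier` intrinsics from Solaris Studio's `<mbarrier.h>` as described in the Oracle document linked above. The `demo_*` names are hypothetical and are not PostgreSQL's; the non-Sun branch uses GCC builtins only so that the sketch builds elsewhere.

```c
#if defined(__SUNPRO_C)
#include <mbarrier.h>
#define demo_compiler_barrier()  __compiler_barrier()
#define demo_memory_barrier()    __machine_rw_barrier()
#define demo_read_barrier()      __machine_r_barrier()
#define demo_write_barrier()     __machine_w_barrier()
#else
/* Assumption: a GCC-compatible compiler; used only so the sketch compiles. */
#define demo_compiler_barrier()  __asm__ __volatile__("" ::: "memory")
#define demo_memory_barrier()    __atomic_thread_fence(__ATOMIC_SEQ_CST)
#define demo_read_barrier()      __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define demo_write_barrier()     __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

/* Hypothetical shared state, volatile in the style of barrier-based code. */
static volatile int demo_payload;
static volatile int demo_ready;

/* Writer: store the payload, then set the flag, in that order. */
static void demo_publish(int value)
{
    demo_payload = value;
    demo_write_barrier();   /* stores before are ordered before stores after */
    demo_ready = 1;
}

/* Reader: returns 1 and fills *out once the flag has been observed. */
static int demo_consume(int *out)
{
    if (!demo_ready)
        return 0;
    demo_read_barrier();    /* loads before are ordered before loads after */
    *out = demo_payload;
    return 1;
}
```

If the TSO argument holds for every platform the compiler targets, `demo_read_barrier()` and `demo_write_barrier()` could instead expand to `demo_compiler_barrier()`.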

/ Oskari

Andres Freund

2014-10-02 14:34:57 UTC

The above doc describes the difference: read barrier requires loads before
the barrier to be completed before loads after the barrier - an acquire
barrier is the same, but it also requires loads to be complete before stores
after the barrier.
Similarly write barrier requires stores before the barrier to be completed
before stores after the barrier - a release barrier is the same, but it also
requires loads before the barrier to be completed before stores after the
barrier.
So acquire is read + loads-before-stores and release is write +
loads-before-stores.

It's actually more complex than that :(

Simple things first:

Oracle's definition seems pretty iron clad:
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
__machine_acq_barrier is a clear superset of __machine_r_barrier and
__machine_rel_barrier is a clear superset of __machine_w_barrier

And that's what we're essentially discussing, no? That said, there seems
to be no reason to avoid using __machine_r/w_barrier().

But for the reason why I defined pg_read_barrier/write_barrier to
__atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):

The C11/C++11 definition it's made for is hellishly hard to
understand. There are very subtle differences between acquire/release
operations and acquire/release fences. 29.8.2/7.17.4 seem to be the relevant
parts of the standards. I think it essentially guarantees the mapping
we're talking about, but it's not entirely clear.

The way acquire/release fences are defined is that they form a
'synchronizes-with' relationship with each other. Which would, I think,
be sufficient given that without a release-like operation on the other
thread a read/write barrier isn't worth much. But there's a rub in that
it requires an atomic operation involved somewhere to give that guarantee.

I *did* check that the emitted code on relevant architectures is sane,
but that doesn't guarantee anything for the future.

Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
definitely guaranteeing what we need, even if superfluously heavy on some
platforms. It still is significantly more efficient than
__sync_synchronize() which is what was used before. I.e. it generates no
code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
otherwise, although I don't know why) and similar on ia64.

As a reference, relevant standard sections are:
C11: 5.1.2.4 5); 7.17.4
C++11: 29.3; 1.10
Not that we can rely on those, but I think it's a good thing to orient on.

Post by Robert Haas
I'm actually not really sure what this means unless the barrier also
does something in and of itself.
    some stuff
    CAS(&lock, 0, 1) // i am an acquire barrier
    more stuff
    lock = 0 // i am a release barrier
    even more stuff
If the CAS() and lock = 0 instructions were FULL barriers, then we'd
be saying that the stuff that happens in the critical section needs to
be exactly "more stuff". But if they are acquire and release
barriers, respectively, then the CPU is allowed to move "some stuff"
or "even more stuff" into the critical section; but what it can't do
is move "more stuff" out.
Now if you just have a naked acquire barrier that is not doing
anything itself, I don't really know what the semantics of that should
be.
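The lock pattern quoted above can be sketched in C11 as follows. This assumes `<stdatomic.h>` is available; the `demo_*` names are hypothetical and the code is an illustration of acquire/release *operations*, not anything from the PostgreSQL tree.

```c
#include <stdatomic.h>

static atomic_int demo_lock;    /* 0 = free, 1 = held */
static int demo_counter;        /* protected by demo_lock */

static void demo_acquire(void)
{
    int expected = 0;

    /* Acquire on success: accesses after the CAS cannot move above it. */
    while (!atomic_compare_exchange_weak_explicit(&demo_lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;   /* a failed CAS stores the observed value here */
}

static void demo_release(void)
{
    /* Release: accesses before this store cannot move below it. */
    atomic_store_explicit(&demo_lock, 0, memory_order_release);
}

/* "more stuff": must stay between the acquire and the release. */
static int demo_increment(void)
{
    int v;

    demo_acquire();
    v = ++demo_counter;
    demo_release();
    return v;
}
```

Work placed before `demo_acquire()` or after `demo_release()` may still be moved into the critical section by the compiler or CPU; only movement out of it is forbidden.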

Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

Post by Robert Haas
Say I want to appear to only change things while flag is 1, so I
    flag = 1
    acquire barrier
    things++
    release barrier
    flag = 0
With the definition you (and Oracle) propose

As written above, I don't think that applies to oracle's definition?

Post by Robert Haas
this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1. What good is that? Apparently, I don't
have any idea!

I hope it's a bit clearer now?

Greetings,

Andres Freund

Robert Haas

2014-10-02 14:55:06 UTC

Post by Andres Freund
It's actually more complex than that :(
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
__machine_acq_barrier is a clear superset of __machine_r_barrier and
__machine_rel_barrier is a clear superset of __machine_w_barrier
And that's what we're essentially discussing, no? That said, there seems
to be no reason to avoid using __machine_r/w_barrier().

So let's use those, then.

Post by Andres Freund
But for the reason why I defined pg_read_barrier/write_barrier to
__atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE):
The C11/C++11 definition it's made for is hellishly hard to
understand. There are very subtle differences between acquire/release
operations and acquire/release fences. 29.8.2/7.17.4 seem to be the relevant
parts of the standards. I think it essentially guarantees the mapping
we're talking about, but it's not entirely clear.
The way acquire/release fences are defined is that they form a
'synchronizes-with' relationship with each other. Which would, I think,
be sufficient given that without a release-like operation on the other
thread a read/write barrier isn't worth much. But there's a rub in that
it requires an atomic operation involved somewhere to give that guarantee.
I *did* check that the emitted code on relevant architectures is sane,
but that doesn't guarantee anything for the future.
Therefore I'm proposing to replace it with __ATOMIC_ACQ_REL which is
definitely guaranteeing what we need, even if superfluously heavy on some
platforms. It still is significantly more efficient than
__sync_synchronize() which is what was used before. I.e. it generates no
code on x86 (MFENCE otherwise), and only a lwsync on PPC (hwsync
otherwise, although I don't know why) and similar on ia64.

A full barrier on x86 should be an mfence, right? With only a
compiler barrier, you have loads ordered with respect to loads and
stores ordered with respect to stores, but the load/store ordering
isn't fully defined.
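A sketch of the case where that matters: a store followed by a load of a different location, which x86 may reorder unless a full barrier sits between them. This assumes C11 `<stdatomic.h>`; the `demo_*` names and the two-party handshake are hypothetical and only illustrate the store/load ordering.

```c
#include <stdatomic.h>

static atomic_int demo_want[2];     /* demo_want[i] != 0: party i wants in */

/*
 * Returns 1 if party `me` (0 or 1) may proceed, 0 if the peer had already
 * announced itself. With the full fence, two concurrent callers cannot
 * both miss each other's store.
 */
static int demo_try_enter(int me)
{
    atomic_store_explicit(&demo_want[me], 1, memory_order_relaxed);

    /* Full barrier: orders the store above before the load below.
     * An acquire or release fence alone does not order store -> load. */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&demo_want[1 - me], memory_order_relaxed))
    {
        /* Peer announced first (or at the same time): back off. */
        atomic_store_explicit(&demo_want[me], 0, memory_order_relaxed);
        return 0;
    }
    return 1;
}
```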

Post by Andres Freund
Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

Post by Robert Haas
Say I want to appear to only change things while flag is 1, so I
    flag = 1
    acquire barrier
    things++
    release barrier
    flag = 0
With the definition you (and Oracle) propose
this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1. What good is that? Apparently, I don't
have any idea!

As written above, I don't think that applies to oracle's definition?

Oracle's definition doesn't look sufficient there. The acquire
barrier guarantees that the load operations before the barrier will be
completed before the load and store operations after the barrier, but
the only operation before the barrier is a store, not a load, so it
guarantees nothing.

Andres Freund

2014-10-02 15:18:39 UTC

So let's use those, then.

Right, I've never contended that.

A full barrier on x86 should be an mfence, right?

Right. I've not talked about changing full barrier semantics. What I was
referring to is that until the atomics patch we always defined
read/write barriers as full barriers when using gcc intrinsics.

Post by Robert Haas
With only a compiler barrier, you have loads ordered with respect to
loads and stores ordered with respect to stores, but the load/store
ordering isn't fully defined.

Yes.

Post by Andres Freund
Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

Paired together they form a synchronized-with relationship. Problem #1
is that the standard's language isn't clear, to me at least, on whether
there are cases where that doesn't hold. Problem #2 is that our current
README.barrier definition doesn't actually require barriers to be
paired. Which imo is bad, but still a fact.

The definition of ACQ_REL is pretty clearly sufficient imo: "Full
barrier in both directions and synchronizes with acquire loads and
release stores in another thread.".

As written above, I don't think that applies to oracle's definition?

Oracle's definition doesn't look sufficient there.

Perhaps I'm just not understanding what you want to show with this
example. This started as a discussion of comparing acquire/release with
read/write barriers, right? Or are you generally wondering about the
point of acquire/release barriers?

Post by Robert Haas
The acquire
barrier guarantees that the load operations before the barrier will be
completed before the load and store operations after the barrier, but
the only operation before the barrier is a store, not a load, so it
guarantees nothing.

Well, 'acquire' operations always have to be related to a load. That's why
standalone 'acquire fences' or 'acquire barriers' are more heavyweight
than just an acquiring read.

And realistically, in the above example, you'd have to read flag to see
that it's not already 1, right?

Greetings,

Andres Freund

Robert Haas

2014-10-02 15:35:32 UTC

Post by Robert Haas
So let's use those, then.

Right, I've never contended that.

OK, cool.

Post by Robert Haas
A full barrier on x86 should be an mfence, right?

OK, got it. If there's a cheaper way to tell gcc "loads before loads"
or "stores before stores", I'm fine with doing that for those cases.

Post by Andres Freund
Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

I don't know what a "synchronized-with relationship" means.

Also, I pretty much designed those definitions to match what Linux
does. And it doesn't require that either, though it says that in most
cases it will work out that way.

Post by Andres Freund
The definition of ACQ_REL is pretty clearly sufficient imo: "Full
barrier in both directions and synchronizes with acquire loads and
release stores in another thread.".

I dunno. What's an acquire load? What's a release store? I know
what loads and stores are; I don't know what the adjectives mean.

Well, 'acquire' operations always have to be related to a load. That's why
standalone 'acquire fences' or 'acquire barriers' are more heavyweight
than just an acquiring read.

Again, I can't judge any of this, because you haven't defined the
terms anywhere.

Post by Andres Freund
And realistically, in the above example, you'd have to read flag to see
that it's not already 1, right?

Not necessarily. You could be the only writer. Think about the way
the backend entries in the stats system work. The point of setting
the flag may be for other people to know whether the data is in the
middle of being modified.

Andres Freund

2014-10-02 18:06:03 UTC

Post by Andres Freund
Which is why these acquire/release fences, in contrast to
acquire/release operations, have more guarantees... You put your finger
right onto the spot.

But, uh, we still don't seem to know what those guarantees actually ARE.

I don't know what a "synchronized-with relationship" means.

I'm using the standard's language here, given that I'm trying to reason
about its behaviour...

What it means is that if you have a matching pair of acquire/release
operations or barriers/fences, everything that happened *before* the last
release fence will be visible *after* executing the next acquire
operation in a different thread of execution. 'After' here holds once
the 'acquiring' thread can see the result of the 'releasing' operation.
I.e. no load after the acquire can see a stale value that was
overwritten before the release.

My problem with the definition in the standard is that it's not
particularly clear how acquire fences *without* an underlying explicit
atomic operation are defined.

I checked gcc's current code and it's fine in that regard. Also other
popular concurrent open source stuff like
http://git.qemu.org/?p=qemu.git;a=blob;f=include/qemu/atomic.h;hb=HEAD
does precisely what I'm talking about:

#ifndef smp_wmb
#ifdef __ATOMIC_RELEASE
#define smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#else
#define smp_wmb() __sync_synchronize()
#endif
#endif

#ifndef smp_rmb
#ifdef __ATOMIC_ACQUIRE
#define smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#else
#define smp_rmb() __sync_synchronize()
#endif
#endif

The commit that added it
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=5444e768ee1abe6e021bece19a9a932351f88c88
was written by one gcc guy and reviewed by another one...

So I think we can be pretty sure that gcc's __atomic_thread_fence()
behaves like we want. We probably have to be a bit more careful about
extending that definition (by including stdatomic.h and doing
atomic_thread_fence(memory_order_acquire)) to use general C11. Which is
probably a couple years away anyway.

Post by Robert Haas
Also, I pretty much designed those definitions to match what Linux
does. And it doesn't require that either, though it says that in most
cases it will work out that way.

My point is that read barriers aren't particularly meaningful
without a defined store order from another thread/process. Without any
form of pairing you don't have that. The writing side could just have
reordered the writes in a way you didn't want them. And the kernel docs
do say "A lack of appropriate pairing is almost certainly an error". But
since read barriers also pair with lock release operations, that's
normally not a big problem.

Post by Andres Freund
The definition of ACQ_REL is pretty clearly sufficient imo: "Full
barrier in both directions and synchronizes with acquire loads and
release stores in another thread.".

I dunno. What's an acquire load? What's a release store? I know
what loads and stores are; I don't know what the adjectives mean.

An acquire load is either an explicit atomic load (tas, cmpxchg, etc
also count) or a normal load combined with an acquire barrier. The symmetric
definition is true for release store.

(so, on x86 every load/store that prevents compiler reordering is
essentially an acquire load/release store)
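Both forms of that definition can be sketched in C11. This assumes `<stdatomic.h>`; the `demo_*` names are hypothetical, and the flag is made an atomic object accessed with relaxed ordering in the fence variants, since that is what the C11 fence rules are written against.

```c
#include <stdatomic.h>

static int demo_data;           /* plain payload */
static atomic_int demo_flag;    /* atomic object the ordering hangs off */

/* Release store, form 1: release fence, then a relaxed store. */
static void demo_publish_fence(int value)
{
    demo_data = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&demo_flag, 1, memory_order_relaxed);
}

/* Release store, form 2: the store itself has release semantics. */
static void demo_publish_op(int value)
{
    demo_data = value;
    atomic_store_explicit(&demo_flag, 1, memory_order_release);
}

/* Acquire load, form 1: relaxed load, then an acquire fence. */
static int demo_consume_fence(int *out)
{
    if (!atomic_load_explicit(&demo_flag, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);
    *out = demo_data;
    return 1;
}

/* Acquire load, form 2: the load itself has acquire semantics. */
static int demo_consume_op(int *out)
{
    if (!atomic_load_explicit(&demo_flag, memory_order_acquire))
        return 0;
    *out = demo_data;
    return 1;
}
```

Either publish function can be paired with either consume function; in each pairing a consumer that sees the flag set also sees the payload written before it.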

Post by Andres Freund
And realistically, in the above example, you'd have to read flag to see
that it's not already 1, right?

So you're thinking about something seqlock-like... Isn't the problem
then that you actually don't want acquire semantics, but release or
write barrier semantics on that store? The acquire/read barrier part
would be on the reader side, no?
I'm still unsure what you want to show with that example?
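For reference, a seqlock-like scheme with a single writer might look like the sketch below: the writer uses release/write-barrier ordering around its stores, and the acquire/read-barrier side lives in the reader. It assumes C11 `<stdatomic.h>`; all `demo_*` names are hypothetical, and the payload is accessed with relaxed atomics so that concurrent reads during an update are not a data race.

```c
#include <stdatomic.h>

static atomic_uint demo_seq;        /* odd while an update is in progress */
static atomic_int  demo_a, demo_b;  /* payload that must be read consistently */

/* Single writer only: no other thread may call this concurrently. */
static void demo_write(int a, int b)
{
    unsigned s = atomic_load_explicit(&demo_seq, memory_order_relaxed);

    atomic_store_explicit(&demo_seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* odd seq before payload */
    atomic_store_explicit(&demo_a, a, memory_order_relaxed);
    atomic_store_explicit(&demo_b, b, memory_order_relaxed);
    /* Release store: payload stores are ordered before the even seq. */
    atomic_store_explicit(&demo_seq, s + 2, memory_order_release);
}

/* Readers retry until they observe the same even sequence number twice. */
static void demo_read(int *a, int *b)
{
    unsigned s1, s2;

    do
    {
        s1 = atomic_load_explicit(&demo_seq, memory_order_acquire);
        *a = atomic_load_explicit(&demo_a, memory_order_relaxed);
        *b = atomic_load_explicit(&demo_b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* payload before recheck */
        s2 = atomic_load_explicit(&demo_seq, memory_order_relaxed);
    } while (s1 != s2 || (s1 & 1u));
}
```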

Greetings,

Andres Freund

Robert Haas

2014-10-06 15:38:47 UTC

Post by Robert Haas
Also, I pretty much designed those definitions to match what Linux
does. And it doesn't require that either, though it says that in most
cases it will work out that way.

Agreed, but it's possible to have a read-fence where an atomic
operation provides the ordering on the other side, or something like
that.

Post by Andres Freund
I'm still unsure what you want to show with that example?

Me, too. I think we've drifted off in the weeds. Do we know what we
need to know to fix $SUBJECT?

Andres Freund

2014-10-06 15:42:08 UTC

Post by Robert Haas
Also, I pretty much designed those definitions to match what Linux
does. And it doesn't require that either, though it says that in most
cases it will work out that way.

Agreed, but it's possible to have a read-fence where an atomic
operation provides the ordering on the other side, or something like
that.

Sure, that's one of the possible pairings. Most atomics have barrier
semantics...

Post by Andres Freund
I'm still unsure what you want to show with that example?

Me, too. I think we've drifted off in the weeds. Do we know what we
need to know to fix $SUBJECT?

I think we can pretty much apply Oskari's patch after replacing
acquire/release with read/write intrinsics.

I'm opening a bug with the gcc folks about clarifying the docs on their
intrinsics.

Greetings,

Andres Freund

Oskari Saarenmaa

2014-10-23 15:46:09 UTC

Post by Andres Freund
I think we can pretty much apply Oskari's patch after replacing
acquire/release with read/write intrinsics.

Attached a patch rebased to current master using read & write barriers.

/ Oskari

Andres Freund

2014-09-26 12:54:20 UTC