Discussion:
[PATCH] Simplify EXISTS subqueries containing LIMIT
Marti Raudsepp
2014-10-02 21:41:19 UTC
Permalink
Hi list,

Attached patch allows semijoin/antijoin/hashed SubPlan optimization
when an EXISTS subquery contains a LIMIT clause with a positive
constant. It seems to be a fairly common meme to put LIMIT 1 into
EXISTS() subqueries, and it even makes sense when you're not aware
that the database already does this optimization.

Do we want this?

It has come up in #postgresql, and at twice times on mailing lists:
http://www.postgresql.org/message-id/***@freemail.hu
http://www.postgresql.org/message-id/***@pingpong.net

And there may even be good reasons, such as writing performant
portable SQL code for Other Databases:
https://dev.mysql.com/doc/refman/5.1/en/optimizing-subqueries.html

----
The code is fairly straightforward. The only ugly part is that I need
to call eval_const_expressions() on the LIMIT expression because
subquery_planner() does subquery optimizations before constant
folding. A "LIMIT 1" clause will actually produce an int8(1)
expression. And I have to drag along PlannerInfo for that.

If it fails to yield a constant we've done some useless work, but it
should be nothing compared to the caller doing a deep copy of the
whole subquery.

Regards,
Marti
David Rowley
2014-10-19 10:22:39 UTC
Permalink
Post by Marti Raudsepp
Hi list,
Attached patch allows semijoin/antijoin/hashed SubPlan optimization
when an EXISTS subquery contains a LIMIT clause with a positive
constant. It seems to be a fairly common meme to put LIMIT 1 into
EXISTS() subqueries, and it even makes sense when you're not aware
that the database already does this optimization.
I had a quick look at this, and the code looks fairly simple. Although,
I've got mixed feelings about it;

I guess there's not really any real performance penalty in planning time
for everyone else who does not put LIMIT clauses into their exists
subqueries, so maybe it's worth it as it seems there could still be a few
people out there suffering from this, but at the same time, the argument
for this would have been much stronger if anti join support had just been
added last week. It's been quite a few years now and the argument for this
must be getting weaker with every release.

I think I'm leaning towards a +1 on this as it seems a shame for people who
have no control over the queries sent to their database to have to be
excluded from the benefits of semi join and anti join.

Regards

David Rowley

Do we want this?
Post by Marti Raudsepp
And there may even be good reasons, such as writing performant
https://dev.mysql.com/doc/refman/5.1/en/optimizing-subqueries.html
Marti Raudsepp
2014-10-21 10:00:35 UTC
Permalink
Hi

Thanks for taking a look.
the argument for this would
have been much stronger if anti join support had just been added last week.
It's been quite a few years now and the argument for this must be getting
weaker with every release.
I see your point, but I would put it another way: we've had this for a
few years, but people haven't learned and are *still* using LIMIT 1.

----
Actually I thought of a cleverer approach to solve this, that doesn't
require calling eval_const_expressions and works with any expression.

Given query:
EXISTS (SELECT ... WHERE foo LIMIT any_expr())
we could turn it into the almost-equivalent form:
EXISTS (SELECT ... WHERE foo AND any_expr() > 0)

The only problem is that we'd no longer be able to throw an error for
negative values and that seems like a deal-breaker.

Regards,
Marti
--
Sent via pgsql-hackers mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
David Rowley
2014-10-22 10:37:03 UTC
Permalink
Post by Marti Raudsepp
the argument for this would
have been much stronger if anti join support had just been added last
week.
It's been quite a few years now and the argument for this must be getting
weaker with every release.
I see your point, but I would put it another way: we've had this for a
few years, but people haven't learned and are *still* using LIMIT 1.
I've had a bit of a look at this and here's a couple of things:

/*
+ * LIMIT clause can be removed if it's a positive constant or ALL,
to
+ * prevent it from being an optimization barrier. It's a common
meme to put
+ * LIMIT 1 within EXISTS subqueries.
+ */

I think this comment may be better explained along the lines of:

"A subquery which has a LIMIT clause with a positive value is effectively a
no-op in this scenario. With EXISTS we only care about the first row
anyway, so any positive limit value will have no behavioral change to the
query, so we'll simply remove the LIMIT clause. If we're unable to prove
that the LIMIT value is a positive number then we'd better not touch it."


+ /* Checking for negative values is done later; 0 is just silly */
+ if (! limit->constisnull && DatumGetInt64(limit->constvalue) <= 0)

I'm a bit confused by the comment here. You're saying that we'll check for
negatives later... but you're checking for <= 0 on the next line... Did I
miss something or is this a mistake?


This test:

+select * from int4_tbl o where exists (select 1 limit 0);
+ f1
+----
+(0 rows)

I guess here you're just testing to ensure that you've not broken this and
that the new code does not accidentally remove the LIMIT clause. I think it
also might be worth adding some tests to ensure that co-related EXISTS
clauses with a positive LIMIT value get properly changed into a SEMI JOIN
instead of remaining as a subplan. And also make sure that a LIMIT 0 or
LIMIT -1 gets left as a subplan. I'd say such a test would supersede the
above test, and I think it's also the case you were talking about
optimising anyway?

You can use EXPLAIN (COSTS OFF) to get a stable explain output.

Regards

David Rowley

Loading...