Content from 2015-04
After working a lot more with PostgreSQL, I happened to stumble upon a few things, mostly related to query optimisation, that at least for me weren't quite readily apparent. Please note that this is all with regards to PostgreSQL 9.3, but unless otherwise noted it should still be the case for 9.4 as well.
Obviously, indexes are key to good query performance. If you're already using
EXPLAIN, with, or without
ANALYZE, chances are good you know how and why
your queries perform like they do. However I encountered a problem with a
DELETE query, where the output from
EXPLAIN was as expected, i.e. it was
using the optimal plan, but still the performance was abysmal; a query to
delete about 300 elements in bulk, like
DELETE FROM table WHERE id IN (...);,
was quite fast to remove the elements (as tested from a separately running
psql), but still the query took about three minutes(!) to complete.
In this scenario the table in question was about a million rows long, had a
primary index on the
id column and was referenced from three other tables,
which also had foreign key constraints set up; no other triggers were running
on any of the tables involved.
postgres server process in question wasn't doing anything interesting, it
was basically taking up one core with
semop calls as reported by
other I/O was observed though.
At that point I finally turned to
#postgresql. Since the information I could
share wasn't very helpful, there were no immediate replies, but one hint
finally helped me to fix this problem. It turns out that the (kinda obvious)
solution was to check for missing indexes on foreign key constraints from
other tables. With two out of three indexes missing I resumed to
INDEX CONCURRENTLY...; and about five minutes later the
DELETE was now
running in a few milliseconds.
The part where I was really frustrated here is that none of the available
statistics guided me in the search;
EXPLAIN ANALYZE apparently doesn't
include the runtime for foreign key constraints and they don't show up in
other places as well. In hindsight this is something that I should've checked
earlier (and from now on I will), but it's also a weakness of the analyze
framework not to help the developer to see the slowdowns involved in a
situation like this.
Common Table Expressions
Refactoring queries to reuse results with
WITH queries is absolutely worth it
and improved the runtime of rather large queries by a large factor. This is
something that can be seen from the query plan, so when you're using the same
expressions twice, start looking into this and see if it helps both for
readability (don't repeat yourself) and performance.
JSON result construction
We need nested JSON output in a few cases. This means (in 9.3, there are some
better functions available in 9.4) that a combination of
row_to_json(row(json_agg(...))) was necessary to get proper nesting of
aggregated sub-objects, as well as wrapping the end result in another object,
because the output had to be formatted as a JSON object (with curly brackets)
instead of a JSON array (with rectangular brackets).
Technicalities aside the JSON support is pretty good and since the initial code
was written I've discovered that since we in many cases don't actually have
multiple results (for
json_agg), not using that method will again
significantly improve performance.
That means instead of something like the following:
SELECT row_to_json(row(json_agg(...))) FROM ... JOIN ... GROUP BY id;
, where the input to
json_agg is a single result from the
JOIN, we can
write the following instead:
SELECT row_to_json(row(ARRAY[...])) FROM ... JOIN ...;
, which, if you examine the output of
EXPLAIN, means no sorting because of
GROUP BY clause. The convenience of
json_agg here doesn't really
justify the significant slowdown caused by the aggregation function.
Note that the array constructed via
ARRAY is properly converted to JSON, so
the end result is again proper JSON.
Yesterday the 8th European Lisp Symposium finished. In short it was a great experience (I was there the first time, but hopefully not the last). The variety and quality of talks was great, a good number of people attended both the actual talks as well as both(!) dinners, so there were lots of opportunities to exchange thoughts and quiz people, including on Lisp. Also except for one talk I believe all talks happened, which is also a very good ratio.
For the talks I still have to go through the proceedings a bit for details, but obviously the talk about the Lisp/C++ interoperability with Clasp was (at least for me) long awaited and very well executed. Both the background information on the origins, as well as the technical description on the use of LLVM and the integration of multiple other projects (ECL, SICL, Cleavir) were very interesting and informative.
There were also quite a number of Racket talks, which was surprising to me, but given the source of these projects it makes sense since the GUI is pretty good. VIGRA, although it's a bit unfortunate name, looks pretty nice. The fact that the bindings to a number of languages are available and in the case of the Lisps make the interaction a lot easier is good to see, so it might be a good alternative to OpenCV. It's also encouraging that students enjoy this approach and are as it seems productive with the library.
P2R, the Processing implementation in Racket is similarly interesting as there is a huge community using Processing and making programming CAD applications easier via a known environment is obviously nice and should give users more opportunities in that area.
If I remember correctly the final Racket talk was about constraining application behaviour, which was I guess more of a sketch how application modularity and user-understandable permissions could be both implemented and enforced. I still wonder about the applicability in e.g. a Lisp or regular *nix OS.
The more deeply technical talks regarding the garbage collector (be it in SBCL, or Allegro CL) were both very interesting in that normally I (and I imagine lots of people) don't have (a chance) to get down to that level and therefore learning about some details about those things is appreciated.
Same goes for the first talk by Robert Strandh, Processing List Elements in
Reverse Order, which was really great to hear about in the sense that I
usually appreciate the
:from-end parameter of all the sequence functions and
still didn't read the details of the interaction between actual order of
iteration vs. the final result of the function. Then again, the question
persists if any programs are actually processing really long lists in
reverse in production. Somehow the thought that even this case is optimised
would make me sleep easier, but then again, the tradeoff of maintainable code
vs. performance improvements remains (though I don't think that the presented
code was very unreadable).
Escaping the Heap was nice and it'll be great to see an open-sourced library for shared memory and off-heap data structures, be it just for special cases anyway.
Lots of content, so I doubt I'll get to the lightning talks. It'll be just this for now then. Hopefully I have time/opportunity to go to the next ELS or another Lisp conference; I can only recommend going.
For the longest time I've been using quite a number of Firefox extensions. The known problem with that is a steady slowdown, which is only amplified by my habit of soft bookmarks, i.e. having hundreds of open tabs with their corresponding state (which is the whole reason to do that).
However seeing that a lot of state is captured in a very inconvenient form, that is, it's hard to modify a long list of tabs, I want to make both this and incidentally also sharing of state between sessions and even browsers much easier.
The idea is to separate part of the browser state into a separate component, namely a database server for cookies (and other local storage), tabs, sessions and bookmarks. This way and by having a coarse control over loading of sessions the process of migrating state between sessions and browsers should be much easier.
Fortunately most of the browser extensions APIs seem to be usable enough to make this work for at least Firefox and Chrome, so at the moment I'm prototyping the data exchange. Weird as it is for Chrome you have to jump through some conversion hoops (aka local native extensions via a local process exchanging data via stdio), so it seems that the Firefox APIs, since they allow socket connections, are a bit friendlier to use. That said, the exchange format for Chrome, Pascal string encoded JSON, seems like a good idea with the exception of forcing local endianess, which is completely out of the question for a possibly network enabled system (which is to say, I'm definitely going to force network byte order instead).
The title sounds a bit too majestic too be true. However the though just occured to me, that much of what I've been doing over the last months often involved taking "dev ops" code, e.g. configuration code that lives in the deployment scripts, and putting it into a reusable format, i.e. an actual (Python) module.
That usually happens because what was once a depencency for an application (or service if you like), is now needed to be accessible from (Python) code for testing purposes, or because the setup routine isn't actually just setup any more, but happens more regularly as part of the application cycle.
Of course doing this has some major downsides, as the way scripts are written, using a specific library to access remote hosts, without much error handling, is fundamentally different from how a application code works, that is usually with a more powerful database interface, without any shell scripting underlying the commands (which will instead be replaced by "native" file manipulation commands) and with more proper data structures.
That leaves me with some real annoying busy work just to transform code from
one way of writing it to another. Maybe the thing to take away here is that
configuration code isn't and application code will sooner or later become
library code as well -- aka. build everything as reusable modules. This of
course means that using certain configuration (dev ops) frameworks is
prohibited, as they work from the outside in, e.g. by providing a wrapper
application (for Fabric that would be
fab) that invokes "tasks" written in a
Properly done, that would mean that configuration code would be a wrapper so thin that the actual code could still be reused from application code and different projects later on. The difference between configuration and business logic would then be more of a distinction between where code is used, not how and which framework it was written against.