Episode 496: Bruce Momjian on Multi-Version Concurrency Control in Postgres (MVCC) : Software Engineering Radio

This week, Postgres server developer Bruce Momjian joins host Robert Blumen for a discussion of multi-version concurrency control (MVCC) in the Postgres database. They begin with a discussion of the isolation requirement in database transactions (I in ACID); how isolation can be achieved with locking; limitations of locking; how locking limits concurrency and creates variability in query runtimes; multi-version concurrency control as a means to achieve isolation; how Postgres manages multiple versions of a row; snapshots; copy-on-write and snapshots; visibility; database transaction IDs; how tx ids, snapshots and versions interact; the need for locking when there are multiple writers; how MVCC was added to Postgres; and how to clean up unused space left over from aged-out versions.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Robert Blumen 00:01:05 For Software Engineering Radio, this is Robert Blumen. My guest today is Bruce Momjian. Bruce is a Senior Database Architect and a Postgres evangelist who has written widely on Postgres internals as a frequent conference speaker on that subject. He was an adjunct professor at Drexel University where he taught database courses and is the author of Postgres SQL Introduction and Concepts. Bruce was previously on Software Engineering Radio episode number 328, talking about the Postgres Query Planner. And we have done another episode on Postgres (SE Radio 454) on Postgres as an OLAP Database. Bruce, welcome back to Software Engineering Radio.

Bruce Momjian 00:01:54 Yeah, it’s great to be back. It’s been at least two, maybe three years now.

Robert Blumen 00:01:59 That’s about right. Is there anything you’d like the audience to know about you before we get started?

Bruce Momjian 00:02:06 I live in Philadelphia — obviously been home for a while because of COVID, because I normally travel quite a bit — but it looks like things are heating up. We just had an event in New York City last week; we have another event in Silicon Valley in January, and we’re going to be doing a big conference in Pasadena (CA) in March. We have one in Russia coming up, one in Ottawa. We have Kona live in Austin. There’s a lot of cool stuff going on.

Robert Blumen 00:02:31 We will be getting to multi-version concurrency control, but to start out please briefly describe what is the Postgres Database.

Bruce Momjian 00:02:42 Sure. So, Postgres was originally designed in 1986 at University of California, Berkeley, by Michael Stonebraker. He originally developed Ingress in the 1970s, which was one of the early relational systems, and he developed Postgres in 1986 as the next generation of relational system. That’s why I’ve called Postgres — or “PostIngress,” technically. What was really interesting about what he did was he felt at that time that extensibility for a database was critical. So, the idea of being able to add new data types, new indexing methods, new aggregates, new casts, new store procedure languages was going to be part of his new database. What’s kind of surprising is that within the first probably 20 years after he did it, that value of extensibility really was not appreciated. Even when I started in 1996, that extensibility is kind of a headache that we had to work through. But what’s really interesting if you look in the past 10-15 years is that extensibility that he designed so many decades ago has allowed Postgres to work, move very seamlessly into data warehouse tasks, joining JSON, doing full-text search, doing GIS — really kind of making it available to handle the new data needs, new data ingestion that we have in normal for our modern databases.

Bruce Momjian 00:04:17 Because back in the 70s, 80s and 90s, you know, everyone was at dumb terminals or PCs and they were sort of typing in the data. Now we have so much data coming from Internet of Things and web apps and mobile apps, telemetry data, and GIS data that social media texts come in. So, now we’ve got all this data ingestion and Postgres, because it was designed to be extendable, actually is in a really great position right now. And that’s fueling a lot of its popularity.

Robert Blumen 00:04:48 Postgres, like all relational databases, supports the concept of ACID. That’s ACID transactions. I don’t want to review the entire ACID. I think Software Engineering Radio even did an entire episode on this, and it was a very low show number — like three — but can you just talk about the isolation “I” component of the ACID. What does that mean?

Bruce Momjian 00:05:14 Sure, I’d be glad to. One of the great things about relational databases is that it allows you to, as an application programmer, to interact with the database kind of as though you’re the only person in the database. The best explanation I’ve heard is that, if you’ve got static data, whether it’s shared or not shared it’s very easy to work with. We can all think of cases like a web server who has static data. Well, that’s easy. You can make multiple copies of it and just every night, update it, whatever. If you have data that’s volatile and it’s private, that’s also very easy to deal with because you’ve only got one person there. But with databases, you basically have volatile data and it’s shared. Volatile data and shared, that’s very hard to work with for applications and application programmers.

Bruce Momjian 00:06:03 When I used to write applications, when I’d do a multi-user app, it was much harder than — you know, more than twice as hard as a single user app. So, as you said, the ACID capabilities of the database make it easy for programmers to interact in a simplified manner with the database. And what isolation does — the “I” — is basically to say that my changes are not going to be visible until… you know, I’m not going to see others people’s changes while I’m doing my own work. So, it gives you sort of a static view of the data as much as possible while other people may be changing the data at the same time. And obviously, by kind of allowing the application programmer not to have to worry about all those ACID issues, they can write a much cleaner application, and all the complex stuff gets pushed into the database where it’s much easier to deal with because you’ve got all of the shared state, basically be able to share within the database rather than trying to have the applications share with each other, which is almost a disaster waiting to happen because it’s just really hard to program in that kind of environment. So, what isolation does is to prevent you from seeing other people’s changes while you’re doing your own work.

Robert Blumen 00:07:23 One way to solve that problem might be to say, we’re going to make it appear that only one person can use the database at a time by locking. And in reality, then only one person can use the database at a time, or at least can only modify it. Talk about locking, how locking can achieve this isolation property, and what are some of the downsides to that?

Bruce Momjian 00:07:50 Sure. So, obviously if you just have one giant lock and then we’re going to prevent anyone from going into the database while I’m in there, and then when I’m finished, the next person goes in, that is technically a solution to handle the isolation requirement. The problem with that obviously is the concurrency is terrible, right? So, the database is from the 70s, 80s and early 90s; their approach was, “okay, we can’t lock this whole thing. Can’t lock the whole database for every person. So, we’re going to make the locks granular,” right? So, the early versions, you would lock a table at a time. So, while I was in the table, nobody else could get into the table; once I was done with it then somebody else could get in. So, you had this sort of table-level granularity.

Bruce Momjian 00:08:39 So, you didn’t lock the whole database, you locked the table you were working with. Then they got to the point where they would lock pages. So, you have a table, it might be a gigabyte in size — or at that point it was probably not a gigabyte in size; it was probably maybe a couple of megabytes in size and you broke it down into pages and you say, okay, I am going to be modifying this set of pages and I’m going to lock these. And people can do things with other pages, but these pages, I’m not going to allow somebody into. And then some of the databases got to row-level locking. So, all of a sudden, now I’m going to lock the role that I’m looking at, the row that I’m going to modify, but then people can do anything with any of the other rows. But the problem with that … there’s two, there’s two problems with that.

Bruce Momjian 00:09:25 One, it’s a huge amount of overhead. It’s a huge amount of locking. You’re really not solving the concurrency problem. You’re effectively just sort of pushing it into smaller pieces, right? So, the same problem we had when we were trying to lock the whole thing, now we’re just, we have this smaller problem. It’s just at the page level or the table or the row level. The second problem, and this is a more insidious problem, is something called “lock escalation.” So, the database sometimes doesn’t know what your intent was. So, you lock a row, then you lock another row on the same page. And then you lock another row in the same page and you start locking a lot of rows on that page. And then the database is like, Hmm, maybe I need to lock this page. So, now instead of having locks on individual rows, I need to escalate block escalation, escalate block to that page.

Bruce Momjian 00:10:13 Well, what if somebody else has locked other rows on that page? And I try and escalate the lock? Becomes a big problem, okay? And sometimes you’d have to potentially escalate a page lock to a table-level lock again; same problem. Do you even have access at that point? So, in the early 90s and prior, there was always this problem called lock escalation, where if you tried to sort of — it would try to be as granular as possible, but as your job got bigger and bigger, it started sort of spilling out into other places. I remember when I used to do database maintenance in the 90s on Informix, if I had to do a big update on a table, a lot of times I would start the update typically at night.

Bruce Momjian 00:11:02 So, there was no one in the database and I would start at like 8:00 at night. And then at like 9:15, I get an error and it would say “lock table overflow.” And you’d be like, “oh okay, now I have to update the first million rows in it. And then I got it update the next million. And then I got to do this until the thing gave the kind of guess how big the lock table is.” So, you don’t run another hour and 15 minutes and find out you overfloated again. So, there was this, there was basically this, not only a problem with lock escalation, but a problem when, just tracking all these locks and finding, making sure that the block table was big enough to do what you needed to do. And that gave kind of database the bad name, because people were like, it became this mysterious thing because your application one did that isolation responsibility, but you’d basically — your application would be fine.

Bruce Momjian 00:11:58 I didn’t change my application. Why is it failing all of a sudden? With someone else who’s doing something at the same time, you’d have to explain that the application programmers are like, well, what you do is fine, but then something else was running at the same time. And now that affected yours and maybe you need to run it at night or another time or go to that other person, tell them not to run that while you’re running this. You can imagine kind of what a sort of pain that was to kind of get going.

Robert Blumen 00:12:24 You talked about how, if all we have is read, everybody can share; it’s no problem. It’s clear that if you have different people trying to write, they can’t share. I believe that if somebody is writing that they can’t share it with readers because they might not be done updating the data. Is that correct?

Bruce Momjian 00:12:44 Yeah. That’s the problem with the isolation. So, when you only have a single copy of the row, then as soon as I modify it the old version is kind of gone. Like, you’ve overwritten it. It may exist somewhere in the system, but it’s not really in the table anymore. So, if somebody else comes and they want to read that row, well, we can’t show it to them because of that isolation requirement, right? But we can’t give them the old row either because we don’t know if it’s going to commit or not. And so, all of a sudden that was the other problem that even though the example I was giving previously was two people trying to write in the same table or the same pages, the readers were also affected because you only had one copy. And if that copy was in the process of being modified, then the readers would kind of stop and they’d block what’s going on. And that’s often the reason, for example, I had to do a lot of my work at night. Because I was doing massive updates to two application tables or, you know, and anything was massive. You couldn’t run two in a day cause you just, you just knock everybody out. Cause they would all be like, “oh, why is this system so slow? I did this half an hour ago and it took two seconds and now it’s been a minute and it still isn’t done.” How do you explain to somebody? Well, this other person over there is doing something and they haven’t finished. Or they started something and when they went to lunch. They have their terminal open and you’ll have to wait for them to come back because we can’t read that row at this time. And it was not pleasant.

Robert Blumen 00:14:25 We’ve been talking about the need for the isolation experience and that you can solve that with locking, but that may not be a great solution in a multi-user system. I think now is a good time to talk about our main topic, which is multi version concurrency control. What is it? And how does it compare with locking?

Bruce Momjian 00:14:48 Sure. Multi-version concurrency control was originally a paper written in the late 70’s and sort of became popular in the early 80’s as a different method of doing database updates. So as I said before, the traditional way, the space saving way, of doing updates was to have one copy of the row. But as you can kind of guess from the word “multi-version” in multi-version concurrency control, the way that this paper decided to solve it was to create multiple versions of individual rows. Now, you might think, kind of like, why would you do that? And how do you track that? You think that would be just the worst thing in the world because now you might have a single row, and it might have five copies in the database. And be like well, that seems like a bad idea, right?

Bruce Momjian 00:15:40 But it gets you around a lot of these problems. So, as I said, just to look at the most recent example, the problem of somebody coming to read data while somebody else’s writing it. If we do an update and, instead of overriding that row, we actually create a new version of the row with the new data and leave the old version in place, we can have all of the readers — because they want a consistent write isolated, consistent version of the data — they can effectively read the old version of the row and see a consistent copy of the database at the same time that another newer version of the row is being created may be committed, may not be committed. It depends, but that gives me the ability to give what we call “consistent snapshots” to all of the users in the database and to reduce the amount of blocking — particularly the problem of readers getting blocked by writers goes away. Because you always have one copy of the row that should be visible to anyone who’s currently doing a read operation in the database.

Robert Blumen 00:16:53 You used the word “snapshot,” which I think I can guess what you meant from the context, but that turns out to be a specific terminology in this domain. I’d like you to elaborate on that.

Bruce Momjian 00:17:05 Yeah. I mean, that’s really a loaded term, but the best, it’s really a concept I have to admit. When I was originally working on this way back, I used to read the code and then I would sort of get up from my desk and just walk around the house for like half an hour, because it took a while for the idea of what this was doing to sink in. Because, you know, we normally think of one object like one mug or one set of glasses or one handkerchief. But in this case, you’re actually creating multiple of these and it’s kind of hard to understand what is going on. But the reason the word snapshot is important is that the snapshot is a sort of record that’s created when you start your query. And that snapshot really controls the ACID, particularly the consistency and the isolation visibility of your query.

Bruce Momjian 00:18:16 So, once you take that snapshot at the beginning, the things that we record in that snapshot allow us to distinguish which of the multiple versions of a row should be visible to you. Right? So, let’s go back to the previous example of doing an update let’s suppose our five versions of a row, a row has been updated five times in the recent history. That snapshot should tell me which of those five rows is visible to my transaction. And only one of those five should be visible or maybe none of them are visible, right? It could be that the snapshot indicates that none of those rows should be visible to me, or it might indicate that the third version or the fifth version or the second version is the one that meets a consistent view of the database for my particular query. So that snapshot concept is not, it’s not unique to Postgres, but it is sort of a database term, internals term, because the concept of taking a snapshot is basically saying at the time I start my query or potentially the time I start my transaction, this is the time slot or the instant that I want to see the data at. Even if the data is drifting forward, even if updates are happening, inserts are happening and deletes are happening. That snapshot is going to tie me to a specific, consistent view of the database for the entire duration of my query.

Robert Blumen 00:19:46 Although you and I might both be using the database and in concept, we each have our own copy or snapshot of the entire database. In reality, I need to do a very limited amount of physical copying to make this work. Is that right?

Bruce Momjian 00:20:02 Yeah. I mean, that would you’re right. It would be kind of crazy for us to make a full copy just to run a query. So, the way that we do it is that every row has a creation transaction ID and potentially an expiration transaction ID. And again, if I look at the five copies of one row, each of those five versions of the row are going to have different transaction creation and potential expire expiration IDs on them. And using my snapshot, I can identify which of those five is visible to me. So, you’re right. We’re only really copying when somebody’s making a change to a row and we can trim off the old versions as soon as nobody finds those old versions visible. So, we basically get into a case where we can either prune away the old versions, if we say, okay, we currently have five versions of that row, but honestly only versions three to five are potentially visible to any currently running transaction. Version 1 and version 2 are so old that there is no running transaction that has a snapshot that would ever find those visible. And if that’s true, we can basically reuse that space right away.

Robert Blumen 00:21:19 So you’ve brought up now the idea that every transaction has an ID, how are those IDs assigned? Are they sequential?

Bruce Momjian 00:21:27 They are sequential. We’ve optimized this quite a bit. So, for example, if a transaction only is using read only queries like selects, it doesn’t even get a transaction ID because it’s not going to modify any data. It doesn’t need transaction ID, but any data modification transaction will get its own transaction ID. And those are 4-byte integers, obviously 4 billion. And then once it gets to 4 billion, it’ll wrap around to zero again, and then just go up to 4 billion, just keeps kind of looping around and we have maintenance tasks in the database, which basically handled the problem of looping. You know when it flips around to zero, again, we make sure that there that all of the old rows have proper, sort of fixed IDs that will not be interfered with during the wraparound.

Robert Blumen 00:22:16 Going to mention tangentially. I did research for this interview from a side deck that’s on your website and we’ll link to that in the show notes. You’ve used the term visibility several times. And again, I think it’s clear enough in context, but that does turn out to be another one of those words that is a term of art within your domain. Is there anything you’d like to say about how you use that word that you haven’t already said?

Bruce Momjian 00:22:44 Yeah. I’d love to talk about it again. It’s one of those concepts that I start walking around the house kind of scratching my head years ago to kind of understand what it is. So, I think the best way I can explain it is that if, if I am sitting in a room and my wife is sitting in the room and you know, we have a piece of paper on the desk. And I basically tell my wife, there’s a piece of paper on the desk. And my wife says, yes, I see the piece of paper. We have a shared reality. The two of us see reality the same. And that works if it’s a piece of paper. And we’re not writing on at the same time. But if we start writing on it at the same time then, and you want both people to write on the piece of paper at the same time, things that kind of complicated.

Bruce Momjian 00:23:34 So if she writes a one, but she isn’t finished yet, and I go write a two, should she see my two? And she’s by ACID requirement, she should not see my two. So, I see my two, but she doesn’t. And I don’t see her one yet actually. So, it gets really weird. So, what MVCC effectively does in terms of visibility is it basically says that different users in the database literally see the database differently, depending on when their query started, when their snapshot was taken. Because we have to guarantee that they see a consistent view of the database, even if the database is changing. So, somebody who started transaction before me or after me is potentially going to see a different set of values than I see. And that’s why you don’t hear the term visibility use too much in the real world, because there’s only one piece of paper on the desk.

Bruce Momjian 00:24:33 My wife can see it, I can see it. We have one reality. Well, we have a consistent visibility, but as we talked about earlier to handle the high volume, high concurrency and high write volume requirements of a database, you actually have to split apart the concept of visibility. So, what I see as visible and what some other user sees is visible may be different. And that’s why you don’t, it’s not a term. It’s a term of art because it’s almost, it’s almost like relativity where somebody is going very fast and they see the world differently than somebody standing still. You’re always kind of in that scope where we’re different people, who do things at different times, see actual different realities.

Robert Blumen 00:25:19 I want to go back into something you mentioned briefly before I started transaction, I get transaction ID 100. There are different versions of some rows that I’m interested in that have different snapshot IDs associated with them. What is the algorithm for determining which row that I might read or write? If there’s more than one version?

Bruce Momjian 00:25:47 Yeah. It’s kind of hard to do this without a diagram. I think the diagram is in my slides, but effectively the verbal way of explaining it is that when you start a snapshot, when you get your snapshot at the beginning, the snapshot should guarantee that you see all transactions that have committed before your snapshot. So, any committed work that happened in the past will be visible to you. And as a corollary to that, any work that is in progress and not committed or any work that starts after my snapshot is taken after my query starts, those will not be visible to me.

Robert Blumen 00:26:30 Okay. It’s good enough. There’s slogan that is associated with MVCC from your slide deck – Readers never block writers, writers never block readers? I think at this point it’s pretty clear why that would be the case. If you now have two transactions and they are both interested in writing the same rows, do you have to do something like that lock escalation procedure that you described earlier?

Bruce Momjian 00:26:58 You’re absolutely right. We say that writers don’t block readers, which is good. It solves the problem we talked about earlier, readers don’t block writers? That’s also good, right? For if you’re doing a maintenance operation, for example. But what we don’t say, obviously, readers don’t block other readers because that’s a non-issue. But we don’t say is that writers don’t block writers, right? In fact writers have to block writers. And the reason writers have to block writers is because when you’re updating a row or you’re inserting a row with a unique key that may already exist, we have to know if the previous transaction completes or not. When we do the update where we’re going to insert a duplicate value, we need to know is we need to update the most recent version of this row. So, we talked about isolation, but in fact, the isolation kind of goes out the window when you’re trying to update another row, because you effectively have to see the newest version of that row.

Bruce Momjian 00:28:02 We can’t have somebody updating an old version of that row while somebody is creating a new version of that row. Cause then you’d get all sorts of weird anomalies. So effectively what happens when you try and update a row, that’s worrying you being updated or trying to insert a row within as unique key where another row has already been inserted, but not committed yet is we basically have to stop the insert or update until that transaction either commits or aborts. And once that transaction commits the reports, we then obviously get a lock on it. And then we can decide if our update or our insert should continue.

Robert Blumen 00:28:39 I have this model in mind and it might not be correct. I’m thinking like get where I have master. And then I create a branch. I do the work on my branch. And at some point I need to merge. I work back into master. Is it anything like that? Or is it, we have a bunch of these versions and they all are still exist. And then the database has to show you the right version. And there is no real master.

Bruce Momjian 00:29:05 Yeah. It’s more like the ladder when you’re working with Git, you basically are continually pulling the most recent sources. And then if there’s any conflict you have to kind of manually fix your source code to kind of merge those in. And then once you do the commit, then you’re going to push everything up and you better hope you have the most recent version, because if you don’t, then you make a conflict on the push and then you know, the whole, then you get another error, right? That’s actually one of the things we don’t do because we don’t expect application programmers to sort of be doing sort of get merge, like clean up when something conflicts or whatever. We effectively say, okay, I’m going to update that row and therefore, if somebody else has that role lock, I’m going to wait for them to finish.

Bruce Momjian 00:29:55 And then I’m going to get the most current, I’m going to get a lock myself so nobody else can get in. I’m going to get the current version of that row I’m going to process it and put it back. So in Git the poll and then the push, you know, you might go days or weeks as you’re working on your patch, kind of going through and you’re continually sort of merging stuff in, but in a database, it doesn’t really work that way because you don’t, you don’t really want to, you don’t want to have two people committing like on different versions of the row and then somehow have to merge those two versions together. There are some database systems that do that, particularly if it’s a distributed database and they try and sort of have special data types, like add 10 to this row, but I don’t know what the value is and they kind of can merge another ad 10 together. And it’s 20, but that’s a very specialized use case in the relational systems that I know of in almost every case. You basically, if you’re going to update the row, you’re going to lock it and you’re going to wait for that lock to be given to you exclusively. You’re going to perform the update and then you’re going to send it back right away.

Robert Blumen 00:31:02 I’ve worked with another feature in an older database. I don’t know if this still exists or is popular. It was known at the time as optimistic concurrency control. The way that worked is if I started transaction and maybe I don’t even know if I’m going to lock or modify certain rows in that transaction, the database would give me some kind of a version ID. And then when I commit, I would hand the version ID back. And if that row had changed, then my version ID would be out of date and the transaction would fail. Which is fairly simple as go back to the beginning, just try to do it again. And you’ll refresh at that point. How is that different than what Postgres does when you have transactions that I think the use is I started out a transaction and I might need to modify a row?

Bruce Momjian 00:32:00 Sure. We really have effectively three different transaction isolation levels. These are defined by the SQL standard. The default one, the most common is called free committed. What that effectively means is that every new statement gets a new snapshot. So even if you’re in a multi statement transaction, every new query inside that multi statement transaction gets a new snapshot. We also have something called repeatable read, which means that all of the statements that I’m all the statement transaction get exactly the same snapshot. So you take the snapshot at the beginning of the transaction and that snapshot never changes. And that’s really great for reporting. You know that all your queries in that transaction are going to see a consistent view of the database, no matter what’s going on. Right? So that allows you to run financial reports like in the middle of the day and get an accurate number.

Bruce Momjian 00:32:49 Because in the old days, we could, we’d always have to run our financial reports at night because you never could get an accurate number during the day. Cause money was moving around, you know, as you were running your report. But we do have a third mode called serializable, which is much more similar to the one you’re talking about. And in serializable mode effectively, it does exactly that, as you are running through your multi-statement transaction, you may read some rows. You may not do select for update, right? So traditionally people do select for update. It locks the rows you’ve selected. And then you do, you update those rows. If you, if you want to do optimistic locking effect, or we just do your select, you don’t do the four update. You go to modify the rows. And when you do the commit, it will check to see if anything has been modified underneath you between the time we took the snapshot and the time you did your update, and it will throw an error.

Bruce Momjian 00:33:49 And so serializable mode has been in Postgres for probably 12 years, I think. And it’s really good if you’re do a try to do exactly what you’re saying, you either, aren’t able to do select for updates. You don’t want to do the locking, or maybe your application team doesn’t really want to do that. They don’t want to get involved with that. They don’t understand it. And if you run a serializable mode effectively, any time that something changes between the time you select it, the time you update it will be flagged by Postgres and you’ll get a serializable error and the transaction will have to be rerun.

Robert Blumen 00:34:26 We’ve been talking about MVCC and mainly as a solution to the concurrency problems introduced by excessive locking or solutions that rely on locking. If you are running a report, then you’ll get your own snapshot of the database. It won’t change underneath you while you’re running the report. Even people who start doing modifications while the report is running, you won’t see them. Is that what users want? Is that, is that probably a better solution from I’ll call it a customer standpoint than something that would give you a more frequently updated view of the data while you’re clearing it?

Bruce Momjian 00:35:16 There is a mode that some database is implemented, called dirty read, and in dirty read, you basically discard the ACID requirements. And you basically say, I want to see the data as it’s being part of. I don’t care if it’s not my snapshot, Postgres doesn’t even support that mode. And the reason, the reason that you hear people using dirty read at least years ago is that sometimes that was the only way you could get work done. Right? If you had a non MVCC database, you know, you’d be kind of like this number might be wrong that I’m computing, but it’s never going to finish if I don’t use dirty. So I’m just going to run it. And I’m going to have a lot of caveats about whether this number is accurate or not. Databases that use MVCC like Postgres, they really don’t need dirty read because they don’t have the problem of writers blocking readers anymore.

Bruce Momjian 00:36:16 So Postgres does not support that mode. I don’t know if anybody’s actually asked for that mode because the fact that we, that would mean that not, I’m not talking about the snapshot changing between queries, that’s the default for Postgres. But if you want the visibility change as the queries running and somebody, you know you’re on page 10, somebody adds something to page 11 and you see it right away, even though they haven’t even, you know, that row wasn’t even there when you started your scan. Most people don’t want that because it’s hard to really rely on the data, whereas with an MVCC system, because you have the writers not blocking readers, you get an accurate number. The number may be old. It may only be accurate to the time you started your query, but is accurate as of that time. And there are very few people who really want to see dirty data that effectively does not give them an accurate number of anything, because they could be moving a hundred dollars from one account to the other. You may see that a hundred dollars leave on page 11, and you may realize that it appears on page four, but you already read page four. So you don’t see it. And that’s the classic case where the number may be a little more current in terms of what it sees, but because it isn’t consistent, it isn’t really accurate anymore.

Robert Blumen 00:37:41 I know a lot of reporting would be things from the past, for example, on the first of the month, we want to run a financial report for the previous month. So, you’re really only dealing with the data that can’t change at that point anyway. And it’s definitely better that your query will reliably complete in a short time, then caring about transactions that occurred after the first, which aren’t even part of your query anyway.

Bruce Momjian 00:38:12 Yeah. Well, the problem is not, I don’t think people would be upset if we consistently showed changes from queries that happened after we started. What they don’t want is to see pieces of query of changes that happen. And that’s where the A in Anatomist comes from. So, the problem is that you might see the delete that happened, but the insert might be earlier in the table and you might’ve passed that already. So, imagine somebody scanning through a table, they’re adding a hundred dollars to one account bleeding, a hundred dollars from another account. The addition may be forward in the table for you,so you would see it. But the lesion may be behind you in the table so you wouldn’t see it. And that’s really the problem. There’s really no way that I can think of frankly, that we would show somebody a full completed transaction that had happened while the session was running.

Bruce Momjian 00:39:15 Because you have to realize it’s not just one table. It could be, we could be touching multiple tables. We could be doing a joint. There could be index entries involved, right? So, there’s all this stuff going on. And the idea that we would say, oh, okay, that was an insert that happened. And there’s no delete with it. And maybe that’s okay, because we’ll just throw that into the total, right? You just don’t know because you don’t know the SQL language really doesn’t give you the ability to say, I’m just doing an insert. If you want to show it to people before I commit, go ahead. I don’t have a delete associated with this. It’s only an insert, but then there’s all these things happening in the indexes and page splits. And it just really hard to understand how that would work effectively.

Robert Blumen 00:40:00 You mentioned that Postgres was designed from the beginning to be extensible so it could add new data types. For someone adding a new data type, are there operations or methods they need to write in order for it to work properly with MVCC?

Bruce Momjian 00:40:18 Actually, no, the, yeah, it’s kind of funny. A lot of databases seen Postgres of popularity. A lot of these spaces could have got into the extensible, you know, bandwagon, but you know, it’s really hard to do because Postgres was designed originally with this, we’ve been able to do it, but it’s really hard to sort of retrofit it into a system. So, because Postgres was designed at the beginning for this, it has all these system tables, which store all the data types, the ones that are built in and the extendable ones that once you add it has all the indexing stuff is stored in system tables. All of the stored procedure, language definitions are stored. All the aggregates are stored in system table. So effectively the API for how all of this stuff is handled. When you’re creating a new data type, you really have to worry about, you know, define how long it’s going to be or variable length.

Bruce Momjian 00:41:13 You have to define an input function and you have to find output function, right? That’s pretty much it. Now you probably want some other functions to work on the data. You might want some casting functions to get your data in and out of different data types, but it’s actually really easy to do. You don’t have to muck with all that other stuff. You just need to tell us how that data’s going to come in and Postgres, because it was designed this way, just kind of fits it into roads, automatically puts the transaction IDs on the front, and there’s really no special handling for any data type related to MVCC all at all that I can think of

Robert Blumen 00:41:52 In the enterprise database landscape do most or all the vendors support MVCC?

Bruce Momjian 00:42:00 Oracle does. They’ve had, I think since the late nineties, I believe Microsoft has it as an option, but last I looked, it was not enabled by default. I don’t know about Db2. I think they also have it available, but not on as a default. I think there are some others I want to say Cassandra uses something similar. There’s some of the NoSQL databases use it a little bit. I think, I don’t remember if MySQL MariaDB, they might use it, I don’t know. Postgres implementation is a little unusual because we just leave the old rows in place. And we put new rows in a lot of systems like Oracle don’t technically do that. They actually take the old row and they put it into like an undo segment, and they actually have like pointers. And when you go through the table that maybe isn’t the row, you want you to jump over somewhere else to kind of pull the right version and so for Postgres, just kind of leaves it in the table, which is sort of a unique approach to handling the MVCC problem. But getting back to actually what I just talked about, it has been difficult for traditional relational systems to add MVCC. I know that Microsoft tried it. I know Db2 has done some stuff with it as well, but the problem was that a lot of the applications written particularly for Microsoft SQL or so used to the locking behavior that they had trouble making a true MVCC system that would also work properly and perform it with the applications they currently were deployed on it.

Robert Blumen 00:43:45 Postgres run into that problem, or any other interesting challenges when this was added to Postgres?

Bruce Momjian 00:43:52 No, we did I think in 2000-2001 when Dean MacKay was the guy who sort of added it. At that time we already had sort of the vestiges of an MVCC system, when it was the original design of Postgres was to allow for time travel. So you could run a query and get the results as of like last week. And there was a concept that there were going to be worm drives, write only, read many, worm drives that would keep the old versions that you keep, maybe, you know, a year’s worth or 10 years’ worth of old versions. And you had these CDs, these worm drives which would, I guess, allow you to access old versions of the row. So, the concept of having multiple versions was kind of built into Postgres. What we didn’t have was the MVCC capability. But when it was added in 2000, 2001 1999, our community was so small that everyone was like, great, whatever you think is good with Dean you go at it.

Bruce Momjian 00:44:52 And it served us well. It is complicated. The cleanup of the rows can be challenging, particularly in very high write volume systems, but it behaves really well. And when you benchmark it against Oracle or other systems that actually behaves better in a lot of ways, partially because the way, because the old row stays in place and the new row gets added right next to it typically, you don’t have this sort of bottleneck in an undo segment where there’s this huge concurrency of people, all trying to find the right version of the row. We just kind of leave. It’s like, we just like leave him strewn across the floor. And then later we come off and we clean them up, but it turned out to be a pretty nice clean design for us. And one that doesn’t have a lot of downsides in terms of performance,

Robert Blumen 00:45:40 You just introduced the topic of cleanup. I’m aware from your side deck, that the system does do some cleanup. I could see that if you have a lot of writes going on, you end up with a lot of old rows that are no longer current for any query. How does the cleanup process work?

Bruce Momjian 00:46:01 Yeah, there’s really two scopes to the cleanup. One is what I call pruning. And this can happen at any time. It’s a very lightweight operation, even as select tactically can cause pruning and all pruning does is to remove old versions of the row. It looks at it as you’re doing a sequential scan, let’s say for a select, you read the page, you read all the rows on the page, you’re seeing the transaction IDs and you can look. Okay this was expired by transaction a hundred, all the snapshots currently don’t, can’t see anything older than that so that nobody can see that row. That’s what we referenced earlier. Some rows are very quick to identify this row, cannot be seen by any running transactions. And the system will just, will just restructure the page and free up that space right away, even as during a select. Postgres 14 added that capability to indexes.

Bruce Momjian 00:46:52 So if you’re spinning through an index and Postgres14, and you’re about to split the page and BG pages are split, it’s fairly expensive, very hard to undo a split. And one of our, you know, Peter Gagan was able to identify that we’re getting a lot of splits in cases where we really don’t need to split because there’s a lot of dead rows on in the index. So, we in Postgres14, he along with somebody from Russia, kind of worked together on kind of getting this sort of what we call index pruning working. I think that’s going to have great benefits to Postgres. But there are cases that don’t work that way. And we, an auto vacuum process that runs continually wakes up every minute, looks to see what tables potentially have a lot of dead rows in it, what index is needed to be cleaned up.

Bruce Momjian 00:47:38 And it just kind of runs in the background, freeing up that space and making it available. The nice thing for us is that that auto vacuum process is not done in the foreground. It’s not something that a query is normally going to be working with. It’s basically just kind of running at a low priority in the background, kind of just doing regular cleanup. And we would need that anyway, even if we didn’t use MVCC and we use the old version you still have, when you abort a transaction, you still need to clean up the old aborted rows. So even if we did MVCC differently, we would still have, imagine you do an insert of a thousand rows and you get 900 of a in, and your transaction aborts. Well, when somebody has got to get rid of those 900 rows, so fortunately we have an auto vacuum process that handles that and handles the issue of having multiple versions of an updated row at the same time and deleted rows. Of course, they need to be cleaned up to.

Robert Blumen 00:48:34 That was a lot like how garbage collection works in programming languages. Is that a good comparison?

Bruce Momjian 00:48:41 It is, there is some languages like C where you basically allocate everything and you free everything manually, right? Which is what Postgres is written in. So, I’m obviously very familiar with that method. And then you have more of the Pearl style where the language counts, the number of references and when the number of references drops to zero, it automatically freeze that memory. So, it kind of identifies it tracks where the variable is, is in its scope, as somebody sent a pointer to that somewhere else. And then as soon as it is in the scope anymore, the memories is freed. And in the Java case, of course, you basically have, we just allocate stuff on the fly. And then occasionally a garbage collector comes along and starts to run and just sort of looks through all the objects. There’s not the reference counting in the same way. It just kind of looks at all the objects is saying, which ones are visible, which ones have been thrown away and just kind of cleans it up. And yet Postgres is much more in that style of design. Oracle, I would say is more in the Pearl style, where they’ve got this undo segment where all the old rows go to, and I believe they kind of manage the references to that a little differently than, we do.

Robert Blumen 00:50:00 In your sides, there’s a term I came across in this section, storage space reuse. Is that anything different than what we’ve already talked about?

Bruce Momjian 00:50:11 Yeah, it is. When I’m talking about page pruning and auto vacuum, what they’re effectively doing is they’re taking data that they know is no longer useful and they’re basically freeing it up. So, a page that used to be 90% full now it’s 20% full because we freed up 70% that was just dead, right? And if the pages at the end of the table are all empty, we can truncate the table down. Right? So if you delete all the rows in the table, then vacuum will effectively shrink the file to zero because it knows there’s a whole bunch of empty rows at the end. In fact, the whole thing empty and it will just shrink it down to zero. But and the same thing with indexes will reduce the size of the page. Maybe, you know, if, if we’re about to split a page and we determine there’s a lot of dead rows on there, we’ll cut it down.

Bruce Momjian 00:51:09 So now maybe it’s 40% full instead of it being 90% full. What we typically don’t do is to free up all potential space to the operating system. So, for example, if you have a table and you deleted every other row in the table, okay? And it’s interspersed so every page has now 50% full, right? That remaining empty, 50% is ready for the next insert or the next update. But what we won’t do automatically is to basically shrink down that table because it potentially a table could be half the size, right? Because each page is 50%. So, if we got all the empty space together, it would all, it would be half the table. And then we’d have 50, you know, half the size, all full pages. We have a manual command called vacuum full that does that, which would basically compress the table down and return all that space to the operating system.

Bruce Momjian 00:52:12 But that’s not something we can do automatically because it locks the table. And obviously people can’t do that in production. So, if you’re doing a lot of big maintenance operations, and you’ve removed a lot of data from the table or, or maybe from an index and you basically like, I’m probably never going to need that space again, like I’m not going to be adding new rows or that empty space in the page is probably not going to be useful to me, then you might want to run vacuum full and pretty much all the databases have this problem. You can’t really be shrinking down stuff while people are in the database. You can’t unsplit a B3 page very easily. So, effectively the only way to do it is to lock it, create a new copy and then delete the old copy.

Bruce Momjian 00:53:00 We also have a re-index command, which does that for indexes. So, if you want to just rebuild an index, you can do the re-index. If you want to do the index and the table itself, your vacuum full would be the way to do that or cluster, which also pretty much does the same thing. But you get to the limitations of concurrency, that there are certain operations that are just so potentially disruptive to normal workloads that you have to push the sequel commands. And if you want to run them, you have to make sure you do it at a quad. It’s a time when there are very few people using the database.

Robert Blumen 00:53:35 From our discussion, I understand this is a feature which is intended to give developers or SQL query programmers, a good, intuitive experience and good database performance without them having to really think about it a lot. But is there anything that sequel developers do need to know in order to get the most out of MVCC?

Bruce Momjian 00:54:01 I, you know, I don’t think so. I mean when we used to do the locking yeah. When we had non MVCC systems, application programmers, either they needed to know about it, where they soon learned, they needed to know about it because their applications would not run right. And somebody would come to them and they would say, why did you write this code this way? And the person would say, well, because X, Y, Z. And they’ll basically, that would never, that was never going to work in our system. We have to do it this other way. With MVCC, I don’t think there’s anything that really an application figuring needs to know, needs to do differently. I think there are certain maintenance operations. Again, if you’re deleting 80% of a table, and you’re never going to use the rest of the space, you might want to do a vacuum full on that.

Bruce Momjian 00:54:54 But other than that, really not, it’s really very transparent. I think the only real caveat is the issue you brought up earlier. Either you need to do a, if you’re going to do select, and then you’re going to update the rows in the same transaction, you’re going to rely on synchronization between the data you get out of the select and the updates you do. You either have to run, select for update, or you have to run in serializable mode and be willing to retry when you get a transaction error on commit. Those are not specific to MVCC, but they are generally good practice in any concurrency system.

Robert Blumen 00:55:34 Bruce, I think we’ve covered some really good subtopics within this area. Is there anything that you want to add that we haven’t talked about?

Bruce Momjian 00:55:43 Probably the only thing, and I did a talk last night for Asia and I brought up this topic, but there’s this thing called write amplification, that we still I think struggle with in Postgres. And that is because of the way we do MVCC, Postgres has a tendency to issue somewhat more writes than other relational systems. Part of it’s because of the way we do MVCC because we’re have the old and new versions in the same page, hopefully in the same table. And we just sort of age them out, as you said, with garbage collection. So when that garbage collection happens, even though it’s happening in the background, it is issuing writes to the storage. When the transaction, when we are updating the hint bits of the, or the basically the bits that tell us which transactions are committed or aborted, we’re going to issue writes potentially for those, again, these are all background writes.

Bruce Momjian 00:56:43 They’re not happening in the foreground of the application, but they are writes and they are increasing the write volume. And as I said before when the transaction ID counter wraps around, we have to make sure that none of the old rows have transaction IDs that would now be duplicated. So, we have to issue a freeze operation. So, there’s a sense that we have a number of ways, we do things that are a little more write heavy than other databases. That’s not a problem for most people, but it is a problem for some people. And we continue to make incremental improvements on this. As I said, in Postgres13, we improved the way we handle duplicates in indexes and Postgres14, we improved the way that we do index cleanups, index pruning, basically on the fly to produce the number of page splits, which will greatly reduce the need for re-index, but we keep chipping away at it.

Bruce Momjian 00:57:39 And it’s just something that if you look back at Postgres like 92,93, and you look at the write profile there and you look at the profile of say a Postgres13 or 14, you’re going to see a much reduced write profile, but it’s still there. And I don’t know if there’s a great way to solve that without adding a whole lot of other negatives to the system. So, we have a lot of smart people looking at it. Obviously, we’re a very open project and people are giving opinions all the time. I don’t know if we need to do something drastic here, like a new way of doing things, or if our incremental approaches is acceptable currently seems to be acceptable, almost everyone. And we continue to make small improvements every year. But it is something you should be aware of that this MVCC does not come without costs. There is a cost in terms of having to have the two transaction that he’s on every row on having to update the hint bits, on having to handle the cleanup in the background and then having to do the freezing. These are, you know, write operations that do happen.

Robert Blumen 00:58:46 Thank you for that. Before we wrap up, would you like to point listeners anywhere that they can find you or any projects you’re involved with on the internet?

Bruce Momjian 00:58:56 Sure. My website, Momjian.us has 57 talks, 93-94 videos, and over 600 blog entries. So, I’ve got a lot of stuff there. I just sort of modernized the webpage a little bit to be a little fresher. Of course, the Postgres.org website has a huge amount of information about Postgres. And there is even a website called PG life, which I maintain, which gives you a snapshot of what’s happening right now in the community. And if you’re curious about what’s going on, you can find the link to that on my Postgres blog webpage.

Robert Blumen 00:59:34 Bruce, thank you so much for speaking to Software Engineering Radio. For Software Engineering Radio, this has been Robert Blumen. Thank you for listening.

[End of Audio]