Episode 536: Ryan Magee on Software Engineering in Physics Research : Software Engineering Radio

Ryan Magee, postdoctoral scholar research associate at LIGO Laboratory – Caltech, joins host Jeff Doolittle for a conversation about how software is used by scientists in physics research. The episode begins with a discussion of gravitational waves and the scientific processes of detection and measurement. Magee explains how data science principles are applied to scientific research and discovery, highlighting comparisons and contrasts between data science and software engineering, in general. The conversation turns to specific practices and patterns, such as version control, unit testing, simulations, modularity, portability, redundancy, and failover. The show wraps up with a discussion of some specific tools used by software engineers and data scientists involved in fundamental research.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Jeff Doolittle 00:00:16 Welcome to Software Engineering Radio. I’m your host, Jeff Doolittle. I’m excited to invite Ryan Magee as our guest on the show today for a conversation about using software to explore the nature of reality. Ryan Magee is a postdoctoral scholar research associate at LIGO Laboratory – Caltech. He is interested in all things gravitational waves, but at the moment he is mostly working to facilitate multi-messenger astrophysics and probes of the dark universe. Before arriving at Caltech, he defended his PhD at Penn State. Ryan occasionally has free time outside of physics. On any given weekend, he can be found trying new food, running, and hanging out with his deaf dog, Poppy. Ryan, welcome to the show.

Ryan Magee 00:00:56 Hey, thanks Jeff for having me.

Jeff Doolittle 00:00:58 So we’re here to talk about how we use software to explore the nature of reality, and I think just from your bio, it lifts up some questions in my mind. Can you explain to us a little bit of context of what problems you’re trying to solve with software, so that as we get more into the software side of things, listeners have context for what we mean when you say things like multi-messenger astrophysics or probes of the dark universe?

Ryan Magee 00:01:21 Yeah, sure thing. So, I work specifically on detecting gravitational waves, which were predicted around a hundred years ago by Einstein, but hadn’t been seen up until recently. There was some solid evidence that they might exist back in the seventies, I believe. But it wasn’t until 2015 that we were able to observe the impact of these signals directly. So, gravitational waves are really exciting right now in physics because they offer a new way to observe our universe. We’re so used to using various types of electromagnetic waves or light to take in what’s going on and infer the types of processes that are occurring out in the cosmos. But gravitational waves let us probe things in a new direction that are often complementary to the information that we might get from electromagnetic waves. So the first major thing that I work on, facilitating multi-messenger astronomy, really means that I’m interested in detecting gravitational waves at the same time as light or other types of astrophysical signals. The hope here is that when we detect things in both of these channels, we’re able to get more information than if we had just made the observation in one of the channels alone. So I’m very interested in making sure that we get more of those types of discoveries.

Jeff Doolittle 00:02:43 Interesting. Is it somewhat analogous maybe to how humans have multiple senses, and if all we had was our eyes we’d be limited in our ability to experience the world, but because we also have tactile senses and auditory senses, that gives us other ways to understand what’s happening around us?

Ryan Magee 00:02:57 Yeah, exactly. I think that’s a perfect analogy.

Jeff Doolittle 00:03:00 So gravitational waves, let’s maybe get a little more of a sense of what that means. What is their source, what caused these, and then how do you measure them?

Ryan Magee 00:03:09 Yeah, so gravitational waves are these really weak distortions in space time, and the most common way to think about them is as ripples in space time that propagate through our universe at the speed of light. So they’re very, very weak and they’re only caused by the most violent cosmic processes. We have a couple of different ideas on how they might form out in the universe, but right now the only measured way is whenever we have two very dense objects that wind up orbiting one another and eventually colliding into one another. And so you might hear me refer to these as binary black holes or binary neutron stars throughout this podcast. Now, because they’re so weak, we need to come up with these very advanced ways to detect these waves. We have to rely on very, very sensitive instruments. And at the moment, the best way to do that is through interferometry, which basically relies on using laser beams to help measure very, very small changes in length.

Ryan Magee 00:04:10 So we have a number of these interferometer detectors around the earth at the moment, and the basic way that they work is by sending a light beam down two perpendicular arms where they hit a mirror, bounce back towards the source and recombine to produce an interference pattern. And this interference pattern is something that we can analyze for the presence of gravitational waves. If there isn’t a gravitational wave, we don’t expect there to be any type of change in the interference pattern because the two arms have the exact same length. But if a gravitational wave passes through the earth and hits our detector, it’ll have this effect of slowly changing the length of each of the two arms in a rhythmic pattern that corresponds directly to the properties of the source. As these two arms change very minutely in length, the interference pattern from their recombined beam will begin to change, and we can map this change back to the physical properties of the system. Now, the changes that we actually observe are incredibly small, and my favorite way to think about this is by considering the night sky. So if you want to think about how small these changes that we’re measuring are, look up at the sky and find the closest star that you can. If you were to measure the distance between earth and that star, the changes that we’re measuring are equivalent to measuring a change in that distance of one human hair’s width.

Jeff Doolittle 00:05:36 From here to, what is it? Proxima Centauri or something?

Ryan Magee 00:05:38 Yeah, exactly.

Jeff Doolittle 00:05:39 One human hair’s width difference over a three point something lightyear span. Yeah. Okay, that’s small.

Ryan Magee 00:05:45 This incredibly large distance and we’re just perturbing it by the smallest of amounts. And yet, through the genius of a number of engineers, we’re able to make that observation.
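As a rough check on the analogy above, the following sketch works out the fractional length change it implies. The hair width (about 0.1 mm) and the distance to Proxima Centauri (roughly 4.2 light-years) are assumed round figures, not values given in the conversation, and the function name is purely illustrative.

```python
# Order-of-magnitude check of the "hair's width over the distance to the
# nearest star" analogy. All inputs are assumed round numbers.

LIGHT_YEAR_M = 9.461e15   # metres in one light-year
HAIR_WIDTH_M = 1e-4       # assumption: a human hair is ~0.1 mm wide
DISTANCE_LY = 4.2         # assumption: approximate distance to Proxima Centauri


def fractional_change(delta_length_m: float, length_m: float) -> float:
    """Return the dimensionless ratio delta_L / L."""
    return delta_length_m / length_m


ratio = fractional_change(HAIR_WIDTH_M, DISTANCE_LY * LIGHT_YEAR_M)
print(f"fractional length change ~ {ratio:.1e}")
```

Under these assumptions the ratio comes out at a few parts in 10^21, which gives a sense of how small a change the detectors have to resolve.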

Jeff Doolittle 00:05:57 Yeah. If this wasn’t a software podcast, we could definitely geek out, I’m sure, on the hardened engineering in the physical world about this process. I imagine there’s a lot of challenges related to error and you know, a mouse could trip things up and things of that nature, which, you know, we might get into as we talk about how you use software to correct for those things, but clearly there’s a lot of angles and challenges that you have to face in order to even come up with a way to measure such a minute aspect of the universe. So, let’s shift gears a little bit then into how do you use software at a high level, and then we’ll kind of dig down into the details as we go. How is software used by you and by other scientists to explore the nature of reality?

Ryan Magee 00:06:36 Yeah, so I think the job of a lot of people in science right now is kind of at this interface between data analysis and software engineering, because we write a lot of software to solve our problems, but at the heart of it, we’re really interested in uncovering some type of physical truth or being able to place some type of statistical constraint on whatever we’re observing. So, my work really starts after these detectors have made all of their measurements, and software helps us to facilitate the types of measurements that we want to take. And we’re able to do this both in low latency, which I’m quite interested in, as well as in archival analyses. So, software is extremely useful in terms of figuring out how to analyze the data as we collect it in as rapid a way as possible, and in terms of cleaning up the data so that we get better measurements of physical properties. It really just makes our lives a lot easier.

Jeff Doolittle 00:07:32 So there’s software, I imagine, on both the collection side and then on the real-time side, and then on the analysis side, as well. So you mentioned for example, the low-latency immediate feedback as opposed to post data-retrieval analysis. What are the differences there as far as how you approach these things and where is more of your work focused — or is it in both areas?

Ryan Magee 00:07:54 So the software that I primarily work on is stream-based. So what we’re interested in doing is as the data goes through the collectors, through the detectors, there’s a post-processing pipeline, which I won’t talk about now, but the output of that post-processing pipeline is data that we wish to analyze. And so, my pipeline works on analyzing that data as soon as it comes in and continuously updating the broader world with results. So the hope here is that we can analyze this data looking for gravitational wave candidates, and that we can alert partner astronomers anytime there’s a promising candidate that rolls through the pipeline.

Jeff Doolittle 00:08:33 I see. So I imagine there’s some statistical constraints there where you may or may not have discovered a gravitational wave, and then in the archival world people can go in and try to basically falsify whether or not that truly was a gravitational wave, but you’re looking for that initial signal as the data’s being collected.

Ryan Magee 00:08:50 Yeah, that’s right. So we typically don’t broadcast our candidates to the world unless we have a very strong indication that the candidate is astrophysical. Of course, there are candidates that slip through that wind up being noise or glitches that we later have to go back and correct our interpretation of. And you’re right, these archival analyses also help us to provide a final say on a data set. These are often done months after we’ve collected the data and we have a better idea of what the noise properties look like, what the mapping between the physics and the interference pattern looks like. So yeah, there’s definitely a couple of steps to this analysis.

Jeff Doolittle 00:09:29 Are you also having to collect data about the real world environment around, you know, these interference laser configurations? For example, did an earthquake happen? Did a hurricane happen? Did somebody sneeze? I mean, is that data also being collected in real time for later analysis as well?

Ryan Magee 00:09:45 Yeah, and that’s a really great question and there’s a couple of answers to that. The first is that the raw data, we can actually see evidence of these things. So we can look in the data and see when an earthquake happened or when some other violent event happened on earth. The more rigorous answer is a little bit tougher, which is that, you know, at these detectors, I’m mainly talking about this one data set that we’re interested in analyzing. But in reality, we actually monitor hundreds of thousands of different data sets at once. And a lot of these never really make it to me because they’re often used by these detector characterization pipelines that help to monitor the state of the detector, see things that are going wrong, et cetera. And so those are really where I would say a lot of these environmental impacts would show up in addition to having some, you know, more difficult to quantify impact on the strain that we’re actually observing.

Jeff Doolittle 00:10:41 Okay. And then before we dig a little bit deeper into some of the details of the software, I imagine there’s also feedback loops coming back from those downstream pipelines that you’re using to be able to calibrate your own statistical analysis of the realtime data collection?

Ryan Magee 00:10:55 Yeah, that’s right. So there’s a couple of new pipelines that try to incorporate as much of that information as possible to provide some type of data quality statement, and that’s something that we’re working to incorporate on the detection side as well.

Jeff Doolittle 00:11:08 Okay. So you mentioned before, and I feel like it’s pretty evident just from the last couple minutes of our conversation, that there’s certainly an intersection here between the software engineering aspects of using software to explore the nature of reality and then the data science aspects of doing this process as well. So maybe speak to us a little bit about where you kind of land in that world and then what kind of distinguishes those two approaches with the people that you tend to be working with?

Ryan Magee 00:11:33 So I would probably say I am very close to the center, maybe just a touch more on the data science side of things. But yeah, it’s definitely a spectrum inside of science, that’s for sure. So I think something to remember about academia is that there’s a lot of structure in it that’s not dissimilar from companies that act in the software space already. So we have, you know, professors that run these research labs that have graduate students that write their software and do their analysis, but we also have staff scientists that work on maintaining critical pieces of software or infrastructure or database handling. There’s really a broad spectrum of work being carried out at all times. And so, a lot of people often have their hands in one or two piles at once. I think, you know, for us, software engineering is really the group of people that make sure that everything is running smoothly: that all of our data analysis pipelines are connected properly, that we’re doing things as quickly as possible. And I would say, you know, the data analysis people are more interested in writing the models that we’re hoping to analyze in the first place, so going through the math and the statistics and making sure that the software pipeline that we’ve set up is producing the exact number that we, you know, want to look at in the future.

Jeff Doolittle 00:12:55 So in the software engineering, as you said, it’s more of a spectrum, not a hard distinction, but give the listeners maybe a sense of the flavor of the tools that you and others in your field might be using, and what’s distinctive about that as it pertains to software engineering versus data science? In other words, is there overlap in the tooling? Is there distinction in the tooling and what kind of languages, tools, platforms are often being used in this world?

Ryan Magee 00:13:18 Yeah, I’d say Python is probably the dominant language at the moment, at least for most of the people that I know. There’s of course a ton of C, as well. I would say those two are the most common by far. We also tend to handle our databases using SQL and of course, you know, we have more front-end stuff as well. But I’d say that’s a little bit more limited since we’re not always the best about real-time visualization stuff, although we’re starting to, you know, move a little bit more in that direction.

Jeff Doolittle 00:13:49 Interesting. That’s funny to me that you said SQL. That’s surprising to me. Maybe it’s not to others, but it’s just interesting how SQL is kind of the way we deal with data. For some reason, I might’ve thought it was different in your world.

Ryan Magee 00:14:00 Yeah, it’s got a lot of staying power.

Jeff Doolittle 00:14:01 Yeah, SQL databases on variations in space time. Interesting.

Jeff Doolittle 00:14:09 Yeah, that’s really cool. So Python, as you mentioned, is pretty dominant and that’s both in the software engineering and the data science world?

Ryan Magee 00:14:15 Yeah, I would say so.

Jeff Doolittle 00:14:17 Yeah. And then I imagine C is probably more what you’re doing when you’re doing control systems for the physical instruments and things of that nature.

Ryan Magee 00:14:24 Yeah, definitely. The stuff that works really close to the detector is normally written in those lower-level languages as you might imagine.

Jeff Doolittle 00:14:31 Now, are there specialists perhaps that are writing some of that control software where maybe they aren’t as trained in the world of science but they are more pure software engineers, or are most of these people scientists who also happen to be software engineering capable?

Ryan Magee 00:14:47 That’s an interesting question. I would probably classify a lot of these people as mostly software engineers. That said, a huge majority of them have a science background of some sort, whether they went for a terminal masters in some type of engineering or they have a PhD and decided they just like writing pure software and not worrying about the physical implementations of some of the downstream stuff as much. So there is a spectrum, but I would say there’s a number of people that really focus entirely on maintaining the software stack that the rest of the community uses.

Jeff Doolittle 00:15:22 Interesting. So while they have specialized in software engineering, they still very often have a science background, but maybe their day-to-day operations are more related to the specialization of software engineering?

Ryan Magee 00:15:32 Yeah, exactly.

Jeff Doolittle 00:15:33 Yeah, that’s actually really cool to hear too because it means you don’t have to be a particle physicist, you know, the top tier in order to still contribute to using software for exploring fundamental physics.

Ryan Magee 00:15:45 Oh, definitely. And there are a lot of people also that don’t have a science background and have just found some type of staff scientist role where here “scientist” doesn’t necessarily mean, you know, they’re getting their hands dirty with the actual physics of it, but just that they are associated with some academic group and writing software for that group.

Jeff Doolittle 00:16:03 Yeah. Although in this case we’re not getting our hands dirty, we’re getting our hands warped. Minutely. Which it did occur to me before when you said we’re talking about the width of a human hair over the distance from here to Proxima Centauri, which I think kind of shatters our hopes for a warp drive because gosh, the energy to warp enough space around a physical object in order to move it through the universe seems pretty daunting. But again, that was a little far afield, but it’s disappointing, I’m sure, for many of our listeners.

Jeff Doolittle 00:16:32 So having no experience in exploring fundamental physics or science using software, I am curious from my perspective, mostly being in the business software world for my career, there are a lot of times where we talk about good software engineering practices, and this often shows up in different patterns or practices where we’re basically trying to make sure our software is maintainable, we want to make sure it’s reusable, you know, hopefully we’re trying to make sure it’s cost effective and it’s high quality. So there’s various patterns, you know, maybe you’ve heard of and maybe you haven’t, you know, single responsibility principle, open-closed principle, you know, various patterns that we use to try to determine if our software is going to be maintainable and of high quality, things of that nature. So I’m curious if there’s principles like that that might apply in your field, or maybe you even have different ways of looking at it or talking about it.

Ryan Magee 00:17:20 Yeah, I think they do. I think part of what can get confusing in academia is that we either use different vocab to describe some of that, or we just have a slightly more loosey goosey approach to things. We certainly strive to make software as maintainable as possible. We don’t want to have just a singular point of contact for a piece of code because we know that’s just going to be a failure mode at some point down the line. I imagine, like everyone in business software, we work very hard to keep everything in version control, to write unit tests to make sure that the software is functioning properly and that any changes aren’t breaking the software. And of course, we’re always interested in making sure that it is very modular and as portable as possible, which is increasingly important in academia because although we’ve relied on having dedicated computing resources in the past, we’re rapidly moving to the world of cloud computing, as you might imagine, where we’d like to use our software on distributed resources, which has posed a bit of a challenge at times just because a lot of the software that’s been previously developed has been designed to just work on very specific systems.

Ryan Magee 00:18:26 And so, the portability of software has also been a huge thing that we’ve worked towards over the last couple of years.

Jeff Doolittle 00:18:33 Oh, interesting. So there are definitely parallels between the two worlds, and I had no idea. Now that you say it, it sort of makes sense, but you know, moving to the cloud it’s like, oh we’re all moving to the cloud. There’s a lot of challenges with moving from monolithic to distributed systems that I imagine you’re also having to deal with in your world.

Ryan Magee 00:18:51 Yeah, yeah.

Jeff Doolittle 00:18:52 So are there any special or specific constraints on the software that you develop and maintain?

Ryan Magee 00:18:57 Yeah, I think we really need to focus on it being high availability and high throughput at the moment. So we want to make sure that when we’re analyzing this data at the moment of collection, that we don’t have any type of dropouts on our side. So we want to make sure that we’re always able to produce results if the data exists. So it’s really important that we have a couple of different contingency plans in place so that if something goes wrong at one site, that doesn’t jeopardize the entire analysis. To facilitate having this entire analysis running in low latency, we also make sure that we have a very highly parallelized analysis, so that we can have a number of things running at once with essentially the lowest latency possible.

Jeff Doolittle 00:19:44 And I imagine there’s challenges to doing that. So can you dig a little bit deeper into what are your mitigation strategies and your contingency strategies for being able to handle potential failures so that you can maintain your, basically your service level agreements for availability, throughput, and parallelization?

Ryan Magee 00:20:00 Yeah, so I had mentioned before that, you know, we’re in this stage of moving from dedicated compute resources to the cloud, but this is primarily true for some of the later analyses that we do — a lot of archival analyses. For the time being, whenever we’re doing something real time, we still have data from our detectors broadcast to central computing sites. Some are owned by Caltech, some are owned by the various detectors. And then I believe it’s also University of Wisconsin, Milwaukee, and Penn State that have compute sites that should be receiving this data stream in ultra-low latency. So at the moment, our plan for getting around any type of data dropouts is to simply run similar analyses at multiple sites at once. So we’ll run one analysis at Caltech, another analysis at Milwaukee, and then if there’s any type of power outage or availability issue at one of those sites, well then hopefully there’s just the issue at one and we’ll have the other analysis still running, still able to produce the results that we need.
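The redundancy described above can be sketched as follows. This is a minimal illustration, not the collaboration's actual code: the site names, the `first_available_result` helper, and the idea of modelling each site as a callable are all assumptions made for the example. In the setup Ryan describes, the analyses run concurrently at each site; the sketch only shows how a consumer could fall back to a second site when the first is unavailable.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in: each site is modelled as a callable that returns a
# result string, or raises an exception if the site is unavailable.
SiteAnalysis = Callable[[], str]


def first_available_result(sites: Dict[str, SiteAnalysis]) -> Optional[str]:
    """Return the result from the first site that responds, else None."""
    for name, run in sites.items():
        try:
            return f"{name}: {run()}"
        except Exception as exc:  # a real system would catch narrower errors
            print(f"{name} unavailable ({exc}); trying next site")
    return None


def healthy() -> str:
    """Simulate a site whose analysis is running normally."""
    return "candidate list ready"


def power_outage() -> str:
    """Simulate a site that has dropped offline."""
    raise ConnectionError("site offline")


result = first_available_result({"site_a": power_outage, "site_b": healthy})
print(result)
```

The design choice here is simply that no single site is a point of failure: as long as one of the redundant analyses is still running, a result is available.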

Jeff Doolittle 00:21:02 It sounds a lot like Netflix being able to shut down one AWS region and Netflix still works.

Ryan Magee 00:21:09 Yeah, yeah, I guess, yeah, it’s very similar.

Jeff Doolittle 00:21:12 , I mean pat yourself on the back. That’s pretty cool, right?

Jeff Doolittle 00:21:16 Now, I don’t know if you have chaos monkeys running around actually, you know, shutting things down. Of course, for those who know, they don’t actually just shut down an AWS region willy-nilly, like there’s a lot of planning and prep that goes into it, but that’s great. So you mentioned, for example, broadcast. Maybe explain a little bit for people who aren’t familiar with what that means. What is that pattern? What is that practice that you’re using when you broadcast in order to have redundancy in your system?

Ryan Magee 00:21:39 So we collect the data at the detectors, calibrate the data to have this physical mapping, and then we package it up into this proprietary data format called frames. And we ship these frames off to a number of sites as soon as we have them, basically. So we’ll collect a couple of seconds of data inside a single frame, send it to Caltech, send it to Milwaukee at the same time, and then once that data arrives there, the pipelines are analyzing it, and it’s this continuous process where data from the detectors is just immediately sent out to each of these computing sites.
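The fan-out described above can be sketched with in-memory queues. The `Frame` class below is a hypothetical stand-in for a few seconds of calibrated data and does not reflect the real frame format; the site names and the `broadcast` helper are likewise assumptions made for illustration.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for a few seconds of calibrated detector data."""
    start_time: int
    samples: tuple


def broadcast(frame: Frame, site_queues: Dict[str, "queue.Queue[Frame]"]) -> None:
    """Deliver the same frame to every site's queue."""
    for site_queue in site_queues.values():
        site_queue.put(frame)


# Two hypothetical computing sites, each with its own inbox.
sites = {"site_a": queue.Queue(), "site_b": queue.Queue()}
broadcast(Frame(start_time=0, samples=(0.1, -0.2, 0.05)), sites)
```

Each site receives its own copy of every frame, so the pipelines at different sites can analyze the same data independently of one another.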

Jeff Doolittle 00:22:15 So we’ve got this idea now of broadcast, which is essentially a messaging pattern. We’re sending information out and you know, in a true broadcast fashion, anyone could plug in and receive the broadcast. Of course, in the case you described, we have a couple known recipients of the data that we expect to receive the data. Are there other patterns or practices that you use to ensure that the data is reliably delivered?

Ryan Magee 00:22:37 Yeah, so when we get the data, we know what to expect. We expect to have data flowing in at some cadence and time. So to prevent — or to help mitigate against times where that’s not the case, our pipeline actually has this feature where if the data doesn’t arrive, it kind of just circles in this holding pattern waiting for the data to arrive. And if after a certain amount of time that never actually happens, it just continues on with what it was doing. But it knows to expect the data from the broadcast, and it knows to wait some reasonable length of time.
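The holding pattern described above resembles a blocking read with a timeout. The sketch below uses Python's standard `queue.Queue.get(timeout=...)`; the function names, the byte-string frames, and the timeout value are assumptions for illustration, not details of the actual pipeline.

```python
import queue
from typing import List, Optional


def next_frame(incoming: "queue.Queue[bytes]", timeout_s: float) -> Optional[bytes]:
    """Wait up to timeout_s for the next frame; return None if it never arrives."""
    try:
        return incoming.get(timeout=timeout_s)
    except queue.Empty:
        return None


def process_stream(incoming: "queue.Queue[bytes]", n_expected: int,
                   timeout_s: float = 0.05) -> List[str]:
    """Handle n_expected time slots, recording a gap where no data arrived."""
    log = []
    for _ in range(n_expected):
        frame = next_frame(incoming, timeout_s)
        # On a timeout the loop carries on rather than stalling forever.
        log.append("gap" if frame is None else "analyzed")
    return log


inbox: "queue.Queue[bytes]" = queue.Queue()
inbox.put(b"frame-0")
print(process_stream(inbox, n_expected=2, timeout_s=0.01))
```

Here the second slot times out because no data was queued for it, so the stream handler records a gap and moves on.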

Jeff Doolittle 00:23:10 Yeah, and that’s interesting because in some applications, for example business applications, you’re waiting and there’s nothing until an event occurs. But in this case there’s always data. There may or may not be an event, a gravitational wave detection event, but there is always data. In other words, it’s the state of the interference pattern, which may or may not show presence of a gravitational wave, but you’re always expecting data, is that correct?

Ryan Magee 00:23:35 Yeah, that’s right. There are times where the interferometer is not operating, in which case we wouldn’t expect data, but there’s other control signals in our data that help us to, you know, be aware of the state of the detector.

Jeff Doolittle 00:23:49 Got it, got it. Okay, so control signals along with the standard data streams, and again, this is, you know, these sound like a lot of standard messaging patterns. I’d be curious if we had time to dig into how exactly those are implemented and how similar those are to other, you know, technologies that people in the business side of the house might be familiar with, but in the interest of time, we probably won’t be able to dig too deep into some of those things. Well, let’s switch gears here a little bit and maybe speak a little bit to the volumes of data that you’re dealing with, the kinds of processing power that you need. You know, is old school hardware enough, do we need terabytes and zettabytes or what? Like, you know, if you can give us kind of a sense of the flavor of the compute power, the storage, the network transport, what are we looking at here as far as the constraints and the requirements of what you need to get your work done?

Ryan Magee 00:24:36 Yeah, so I think the data flowing in from each of the detectors is somewhere of the order of a gigabyte per second. The data that we’re actually analyzing is originally shipped to us at about 16 kilohertz, but it’s also packaged with a bunch of other data that can blow up the file sizes quite a bit. We typically use about one, sometimes two CPUs per analysis job. And here by “analysis job” I really mean that we have some search going on for a binary black hole or a binary neutron star. The signal space of these types of systems is really large, so we parallelize our entire analysis, but for each of these little segments of our analysis, we typically rely on about one to two CPUs, and this is enough to analyze all of the data that’s coming in in real time.
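One way to picture the parallelization described above is to split a large search space into independent chunks, each handled by its own small job. The sketch below is a loose illustration under stated assumptions: the integer "regions", the round-robin split, and the placeholder `run_job` are all hypothetical, and threads are used only to keep the example self-contained. Real analysis jobs would run as separate processes on their own CPUs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_jobs(items: Sequence[int], n_jobs: int) -> List[List[int]]:
    """Deal items round-robin into n_jobs roughly equal chunks."""
    return [list(items[i::n_jobs]) for i in range(n_jobs)]


def run_job(chunk: List[int]) -> int:
    """Placeholder for one analysis job; here it only counts its inputs."""
    return len(chunk)


# Hypothetical: 10 regions of the signal space, searched by 4 independent jobs.
regions = list(range(10))
jobs = split_into_jobs(regions, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    searched = list(pool.map(run_job, jobs))

print(jobs)
print(sum(searched), "regions searched in total")
```

Because no chunk depends on another, each job can stay small while the search as a whole still covers the full space.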

Jeff Doolittle 00:25:28 Okay. So not necessarily heavy on CPU, it might be heavy on the CPUs you’re using, but not high quantity. But it sounds like the data itself is, I mean, a gig per second. For how long are you capturing that gigabyte of data per second?

Ryan Magee 00:25:42 For about a year.

Jeff Doolittle 00:25:44 Oh gosh. Okay.

Ryan Magee 00:25:47 We take quite a bit of data and yeah, you know, when we’re running one of these analyses, even if the CPU is full, we’re not using more than a few thousand at a time. This is of course just for one pipeline. There’s many pipelines that are analyzing the data all at once. So there’s definitely several thousand CPUs in usage, but it’s not obscenely heavy.
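To put the figures mentioned above in perspective, here is a back-of-the-envelope estimate of what roughly a gigabyte per second adds up to over a year. It assumes a constant rate, decimal units (1 GB = 10^9 bytes), and uninterrupted operation, none of which is stated in the conversation, so it should be read as an upper-end order of magnitude only.

```python
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000, ignoring leap days


def total_bytes(rate_gb_per_s: float, duration_s: float) -> float:
    """Total bytes collected at a constant rate over the given duration."""
    return rate_gb_per_s * BYTES_PER_GB * duration_s


# Assumption: a steady 1 GB/s for a full year from a single detector.
year_total = total_bytes(1.0, SECONDS_PER_YEAR)
print(f"{year_total / BYTES_PER_PB:.1f} PB per detector per year")
# prints "31.5 PB per detector per year"
```

Since the detectors are not observing continuously, the real total would be lower, but the estimate shows why the raw stream is reduced before analysis.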

Jeff Doolittle 00:26:10 Okay. So if you’re gathering data over a year, then how long can it take for you to get some actual, maybe go back to the beginning for us real quick and then tell us how the software actually functions to get you an answer. I mean, you know, when did LIGO start? When was it operational? You get a year’s worth of a gigabyte per second, when do you start getting answers?

Ryan Magee 00:26:30 Yeah, so I mean LIGO probably first started collecting data, I never remember if it was the very end of the nineties when the data collection turned on or the very early 2000s. But in its current state, the advanced LIGO detectors, they started collecting data in 2015. And typically, what we’ll do is we’ll observe for some set period of time, shut down the detectors, perform some upgrades to make it more sensitive, and then continue the process all over again. When we’re looking to get answers to if there’s gravitational waves in the data, I guess there’s really a couple of time scales that we’re interested in. The first is this, you know, low latency or near real time, time scale. And at the moment the pipeline that I work on can analyze all of the data in about six seconds or so as it’s coming in. So, we can pretty rapidly identify when there is a candidate gravitational wave.

Ryan Magee 00:27:24 There’s a number of other enrichment processes that we do on each of these candidates, which means that from the time of data collection to the time of broadcast to the broader world, there’s maybe 20 to 30 seconds of additional latency. But overall, we still are able to make those statements pretty fast. On a higher time scale side of things, when we want to go back and look in the data and have a final say on, you know, what’s in there and we don’t want to have to worry about the constraints of doing this in near real time, that process can take a little bit longer. It can take of the order of a couple of months. And this is really a feature of a couple of things: maybe how we’re cleaning the data, making sure that we’re waiting for all of those pipelines to finish up; how we’re calibrating the data, waiting for those to finish up. And then also just tuning the actual detection pipelines so that they’re giving us the best results that they possibly can.

Jeff Doolittle 00:28:18 And how do you do that? How do you know that your error correction is working, and your calibration is working, and is software helping you to answer those questions?

Ryan Magee 00:28:27 Yeah, definitely. I don’t know as much about the calibration pipeline. It’s, it’s a complicated thing. I don’t want to speak too much on that, but it certainly helps us with the actual search for candidates and helping to identify them.

Jeff Doolittle 00:28:40 It has to be tricky though, right? Because your error correction can introduce artifacts, or your calibration can calibrate in a way that introduces something that may be a false signal. I’m not sure how familiar you are with that part of the process, but that seems like a pretty significant challenge.

Ryan Magee 00:28:53 Yeah, so the calibration, I don’t think it would ever have that large of an effect. When I say calibration, I really mean the mapping between that interference pattern and the distance that these mirrors inside of our detector are actually around.

Jeff Doolittle 00:29:08 I see, I see. So it’s more about ensuring that the data we’re collecting is corresponding to the physical reality and these are kind of aligned.

Ryan Magee 00:29:17 Exactly. And so our initial calibration is already pretty good and it’s these subsequent processes that help just reduce our uncertainties by a couple extra percent, but it would not have the impact of introducing a spurious candidate or anything like that in the data.

Jeff Doolittle 00:29:33 So, if I’m understanding this correctly, it seems like very early on after the data collection and calibration process, you’re able to do some initial analysis of this data. And so while we’re collecting a gigabyte of data per second, we don’t necessarily treat every gigabyte of data the same because of that initial analysis. Is that correct? Meaning some data is more interesting than others?

Ryan Magee 00:29:56 Yeah, exactly. So you know, packaged in with that gigabyte of data is a number of different data streams. We’re really just interested in one of those streams. You know, to help further mitigate the size of the files that we’re analyzing and creating, we downsample the data to two kilohertz as well. So we are able to reduce the storage capacity for the output of the analysis by quite a bit. When we do these archival analyses, I guess just to give a little bit of context, when we do the archival analyses over maybe five days of data, we’re typically dealing with candidate databases, well, let me be even more careful. They’re not even candidate databases but analysis directories that are somewhere on the order of a terabyte or two. So there’s, there’s clearly quite a bit of data reduction that happens between ingesting the raw data and writing out our final results.

Jeff Doolittle 00:30:49 Okay. And when you say downsampling, would that be equivalent to say taking an MP3 file that’s at a certain sampling rate and then reducing the sampling rate, which means you’ll lose some of the fidelity and the quality of the original recording, but you’ll maintain enough information so that you can enjoy the song or in your case enjoy the interference pattern of gravitational waves?

Ryan Magee 00:31:10 Yeah, that’s exactly right. At the moment, if you were to take a look at where our detectors are most sensitive to in the frequency space, you’ll see that our real sweet spot is somewhere around like 100 to 200 hertz. So if we’re sampling at 16 kilohertz, that is a lot of resolution that we don’t necessarily need when we’re interested in such a small band. Now of course we’re interested in more than just the 100 to 200 hertz region, but we still lose sensitivity pretty rapidly as you move to higher frequencies. So that extra frequency content is something that we don’t need to worry about, at least at the detection side, for now.
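
The downsampling step described here can be sketched in a few lines of Python. This is only an illustration of the general technique (low-pass filter, then keep every eighth sample), not the LIGO pipeline's code. The 16 kHz and 2 kHz rates come from the conversation; the filter length, the 800 Hz cutoff, the test tones, and all function names are assumptions chosen for the example.

```python
import math

def lowpass_taps(num_taps, cutoff_hz, sample_rate_hz):
    """Windowed-sinc FIR low-pass taps (Hamming window), scaled to unity DC gain."""
    fc = cutoff_hz / sample_rate_hz          # cutoff as a fraction of the sample rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def downsample(samples, factor, taps):
    """Low-pass filter the input, then keep every `factor`-th sample."""
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half                 # filter is centered on input sample i
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

IN_RATE = 16_000                 # Hz, original rate mentioned in the conversation
OUT_RATE = 2_000                 # Hz, target rate mentioned in the conversation
FACTOR = IN_RATE // OUT_RATE     # keep one sample in eight
taps = lowpass_taps(num_taps=129, cutoff_hz=800, sample_rate_hz=IN_RATE)  # assumed settings

def tone(freq_hz, n=IN_RATE):
    """One second of a unit-amplitude sine sampled at IN_RATE."""
    return [math.sin(2 * math.pi * freq_hz * i / IN_RATE) for i in range(n)]

in_band = downsample(tone(150), FACTOR, taps)    # inside the 100 to 200 Hz sweet spot
out_band = downsample(tone(3000), FACTOR, taps)  # above the new 1 kHz Nyquist frequency

# Ignore a few samples at each end, where the filter runs off the edge of the data.
print("150 Hz tone RMS after downsampling:", round(rms(in_band[16:-16]), 3))
print("3 kHz tone RMS after downsampling: ", round(rms(out_band[16:-16]), 5))
```

The low-frequency tone should survive essentially unchanged while the high-frequency tone is strongly suppressed, which is the trade being described: an eightfold reduction in samples at the cost of content above roughly 1 kHz.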

Jeff Doolittle 00:31:46 Interesting. So the analogy’s quite pertinent because you know, 16 kilohertz is CD quality sound. If, you know, you’re old like me and you remember CDs before we just had Spotify and whatever we have now. And of course even if you’re at 100, 200 there’s still harmonics and there’s other resonant frequencies, but you’re literally able to chop off some of those higher frequencies, reduce the sampling rate, and then you can deal with a much smaller dataset.

Ryan Magee 00:32:09 Yeah, exactly. To give some context here, when we’re looking for a binary black hole inspiral, we really expect the highest frequencies that, like, the standard emission reaches to be hundreds of hertz, maybe not above like 600, 800 hertz, something like that. For binary neutron stars, we expect this to be a bit higher, but still nowhere near the 16 kilohertz bound.

Jeff Doolittle 00:32:33 Right. Or even the 2 to 4k. I think that’s about the human voice range. We’re talking very, very low, low frequencies. Yeah. Although it’s interesting that they’re not as low as I might have expected. I mean, isn’t that within the human auditory range? Not that we could hear a gravitational wave. I’m just saying the hertz itself, that’s an audible frequency, which is interesting.

Ryan Magee 00:32:49 There’s actually a lot of fun animations and audio clips online that show what the power deposited in a detector from a gravitational wave looks like. And then you can listen to that gravitational wave as time progresses so you can hear what frequencies the wave is depositing power in the detector at. So of course, you know, it’s not natural sound, but you can convert it to sound and hear it, and it’s really nice.
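
As a rough sketch of how a signal in this frequency range can be turned into something listenable, the Python below writes a rising tone to WAV data using only the standard library. The waveform is a toy sweep, not a physical gravitational-wave model, and the sample rate, duration, frequency range, and function names are all assumptions made for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000            # Hz; assumed playback rate
DURATION_S = 2.0                # assumed clip length
F_START, F_END = 40.0, 400.0    # Hz; illustrative sweep, not a physical waveform

def toy_chirp(sample_rate, duration_s, f_start, f_end):
    """Sine whose frequency and loudness both rise over time (a stand-in for a chirp)."""
    n = int(sample_rate * duration_s)
    samples, phase = [], 0.0
    for i in range(n):
        frac = i / n
        freq = f_start * (f_end / f_start) ** frac    # exponential frequency sweep
        phase += 2 * math.pi * freq / sample_rate     # accumulate phase so the tone is smooth
        samples.append((0.2 + 0.8 * frac) * math.sin(phase))
    return samples

def to_wav_bytes(samples, sample_rate):
    """Encode floats in [-1, 1] as 16-bit mono PCM WAV data."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

wav_bytes = to_wav_bytes(toy_chirp(SAMPLE_RATE, DURATION_S, F_START, F_END), SAMPLE_RATE)
# To listen, write the bytes to a file, for example:
#   with open("chirp.wav", "wb") as f:
#       f.write(wav_bytes)
```

Because the frequencies involved already sit in the audible band, no pitch shifting is needed in this toy version; the samples are simply scaled and written out as audio.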

Jeff Doolittle 00:33:16 Yeah, that’s really cool. We’ll have to find some links for the show notes and if you can share some, that would be fun for I think listeners to be able to go and actually, I’ll put it in quotes, you can’t see me doing this but “hear” gravitational waves. Yeah. Sort of like watching a sci-fi movie and you can hear the explosions and you say, Well, okay, we know we can’t really hear them, but it’s, it’s fun. So large volumes of data, both collection time as well as in later analysis and processing time. I imagine because of the nature of what you’re doing as well, there’s also certain aspects of data security and public record requirements that you have to deal with, as well. So maybe speak to our listeners some about how that affects what you do and how software either helps or hinders in those aspects.

Ryan Magee 00:34:02 You had mentioned earlier with broadcasting that like a true broadcast, anybody can kind of just listen in to. The difference with the data that we’re analyzing is that it’s proprietary for some period set forth in, you know, our NSF agreements. So it’s only broadcast to very specific sites and it’s eventually publicly released later on. So, we do need to have different ways of authenticating the users when we’re trying to access data before this public period has commenced. And then once it’s commenced, it’s fine, anybody can access it from anywhere. Yeah. So to actually access this data and to make sure that, you know, we’re properly authenticated, we use a couple of different methods. The first method, which is maybe the easiest is just with SSH keys. So we have, you know, a protected database somewhere we can upload our public SSH key and that’ll allow us to access the different central computing sites that we might want to use. Now once we’re on one of these sites, if we want to access any data that’s still proprietary, we use X.509 certification to authenticate ourselves and make sure that we can access this data.

Jeff Doolittle 00:35:10 Okay. So SSH key sharing and then as well as public-private key encryption, which is pretty standard stuff. I mean X.509 is what SSL uses under the covers anyway, so it’s pretty standard protocols there. So does the use of software ever get in the way or create additional challenges?

Ryan Magee 00:35:27 I think maybe sometimes, you know, we’ve, we’ve definitely been making this push to formalize things in academia a little bit more so to maybe have some better software practices. So to make sure that we actually carry out reviews, we have teams review things, approve all of these different merges and pull requests, et cetera. But what we can run into, especially when we’re analyzing data in low latency, is that we’ve got these fixes that we want to deploy to production immediately, but we still have to deal with getting things reviewed. And of course this isn’t to say that review is a bad thing at all, it’s just that, you know, as we move towards the world of best software practices, you know, there’s a lot of things that come with it, and we’ve definitely had some growing pains at times with making sure that we can actually do things as quickly as we want to when there’s time-sensitive data coming in.

Jeff Doolittle 00:36:18 Yeah, it sounds like it’s very equivalent to the feature grind, which is what we call it in the business software world. So maybe tell us a little bit about that. What are those kinds of things that you might say, oh, we need to update, or we need to get this out there, and what are the pressures on you that lead to those kinds of requirements for change in the software?

Ryan Magee 00:36:39 Yeah, so when we’re going into our different observing runs, we always make sure that we are in the best possible state that we can be. The problem is that, of course, nature is very uncertain, the detectors are very uncertain. There is always something that we didn’t expect that will pop up. And the way that this manifests itself in our analysis is in retractions. So, retractions are basically when we identify a gravitational wave candidate and then realize, quickly or otherwise, that it is not actually a gravitational wave, but just some type of noise in the detector. And this is something that we really want to avoid, number one, because we really just want to announce things that we expect to be astrophysically interesting. And number two, because there’s a lot of people around the world that take in these alerts and spend their own valuable telescope time trying to find something associated with that particular candidate event.

Ryan Magee 00:37:38 And so, thinking back to previous observing runs, a lot of the times where we wanted to hot fix something were because we wanted to fix the pipeline to avoid whatever new class of retractions was showing up. So, you know, we can get used to the data in advance of the observing run, but if something unexpected comes up, we might find a better way to deal with the noise. We just want to get that implemented as quickly as possible. And so, I would say that most of the time when we’re dealing with, you know, rapid review approval, it’s because we’re trying to fix something that’s gone awry.

Jeff Doolittle 00:38:14 And that makes sense. Like you said, you want to prevent people from essentially going on a wild goose chase when they’re just going to be wasting their time and their resources. And if you discover a way to prevent that, you want to get that shipped as quickly as you can so that you can at least mitigate the problem going forward.

Ryan Magee 00:38:29 Yeah, exactly.

Jeff Doolittle 00:38:30 Do you ever go back and sort of replay or resanitize the streams after the fact if you discover one of these retractions had a significant impact on a run?

Ryan Magee 00:38:41 Yeah, I guess we resanitize the streams with these different noise-mitigation pipelines that can clean up the data. And this is normally what we wind up using in our final analyses that are maybe months along down the line. In terms of doing something in maybe medium latency, on the order of minutes to hours or so, if we’re just trying to clean things up, we normally just change the way we’re doing our analysis in a very small way. We just tweak something to see if we were correct about our hypothesis that a specific thing was causing this retraction.

Jeff Doolittle 00:39:15 An analogy keeps coming into my head as you’re talking about processing this data; it’s reminded me a lot of audio mixing and how you have all these various inputs but you might filter and stretch or correct or these kinds of things, and in the end what you’re looking for is this finished curated product that reflects, you know, the best of your musicians and the best of their abilities in a way that’s pleasing to the listener. And this sounds like there’s some similarities here between what you’re trying to do too.

Ryan Magee 00:39:42 There’s actually a remarkable number, and I probably should have led with this at some point, that the pipeline that I work on, the detection pipeline I work on is called GstLAL. And the name Gst comes from GStreamer and LAL comes from the LIGO Algorithm Library. Now GStreamer is an audio mixing software. So we are built on top of those capabilities.

Jeff Doolittle 00:40:05 And here we are creating a podcast where after this, people will take our data and they will sanitize it and they will correct it and they will publish it for our listeners’ listening pleasure. And of course we’ve also taken LIGO waves and turned them into equivalent sound waves. So it all comes full circle. Thank you by the way, Claude Shannon for your information theory that we all benefit so greatly from, and we’ll put a link to the show notes about that. Let’s talk a little bit about simulation and testing because you did briefly mention unit testing before, but I want to dig a little bit more into that and specifically too, if you can speak to are you running simulations beforehand, and if so, how does that play into your testing strategy and your software development life cycle?

Ryan Magee 00:40:46 We do run a number of simulations to make sure that the pipelines are operating as expected. And we do this during the actual analyses themselves. So typically what we do is we decide what types of astrophysical sources we’re interested in. So we say we want to find binary black holes or binary neutron stars, and we calculate for a number of these systems what the signal would look like in the LIGO detectors, and then we add it blindly to the detector data and analyze that data at the same time that we’re carrying out the normal analysis. And so, what this allows us to do is to search for these known signals at the same time that there are these unknown signals in the data, and it provides complementary information because by including these simulations, we can estimate how sensitive our pipeline is. We can estimate, you know, how many things we might expect to see in the true data, and it just lets us know if anything’s going awry, if we’ve lost any type of sensitivity to some part of the parameter space or not. Something that’s a little bit newer, as of maybe the last year or so, a number of really bright graduate students have added this capability to a lot of our monitoring software in low latency. And so now we’re doing the same thing there where we have these fake signals inside of one of the data streams in low latency and we’re able to in real time see that the pipeline is functioning as we expect — that we’re still recovering signals.
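
The injection idea described here can be illustrated with a toy Python sketch: add known signals to simulated noise, run a simple matched filter over the result, and count how many injections are recovered. Nothing below is the GstLAL implementation. The white Gaussian noise, the toy template, the amplitudes, the threshold of 6, and every function name are assumptions made for the example.

```python
import math
import random

def make_template(length=64):
    """Toy rising-frequency burst, scaled to unit energy (a stand-in for a real waveform)."""
    raw = [math.sin(2 * math.pi * (0.02 + 0.002 * i) * i) * math.sin(math.pi * i / length)
           for i in range(length)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

def matched_filter(data, template):
    """Correlate data with the template; in unit-variance white noise the output has unit variance."""
    m = len(template)
    return [sum(template[k] * data[i + k] for k in range(m))
            for i in range(len(data) - m + 1)]

def run_injection_campaign(amplitude, n_samples=6000, n_injections=10,
                           threshold=6.0, seed=1):
    rng = random.Random(seed)
    template = make_template()
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]   # simulated detector noise
    # Pick one injection time per segment so the injections never overlap.
    spacing = n_samples // n_injections
    times = [j * spacing + rng.randrange(0, spacing - len(template))
             for j in range(n_injections)]
    for t in times:
        for k, h in enumerate(template):
            data[t + k] += amplitude * h                     # add the fake signal to the noise
    snr = matched_filter(data, template)
    # An injection counts as recovered if the filter output crosses threshold at its time.
    recovered = sum(1 for t in times if abs(snr[t]) >= threshold)
    return recovered, n_injections

strong_recovered, strong_total = run_injection_campaign(amplitude=12.0)
weak_recovered, weak_total = run_injection_campaign(amplitude=2.0)
print("loud injections recovered: ", strong_recovered, "of", strong_total)
print("quiet injections recovered:", weak_recovered, "of", weak_total)
```

The fraction of injections recovered at each amplitude is a crude sensitivity estimate: if loud injections that used to be found start going missing, something in the pipeline has changed, which is the monitoring role described above.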

Jeff Doolittle 00:42:19 That sounds very similar to a practice that’s emerging in the software industry, which is testing in production. So what you just described, because initially in my mind I was thinking maybe before you run the software, you run some simulations and you sort of do that separately, but from what you just described, you’re doing this in real time and now you, you know, you injected a false signal, of course you’re able to, you know, distinguish that from a real signal, but the fact that you’re doing that, you’re doing that against the real data stream in real time.

Ryan Magee 00:42:46 Yeah, and that’s true, I would argue, even in these archival analyses. We don’t normally do any type of simulation in advance of the analysis, normally just concurrently.

Jeff Doolittle 00:42:56 Okay, that’s really interesting. And then of course the testing, as part of the simulation, is you’re using your test to verify that the simulation results in what you expect and everything’s calibrated correctly and all sorts of things.

Ryan Magee 00:43:09 Yeah, exactly.

Jeff Doolittle 00:43:11 Yeah, that’s really cool. And again, hopefully, you know, as listeners are learning from this, there is that bit of bifurcation between, you know, business software or streaming media software versus the world of scientific software and yet I think there’s some really interesting parallels that we’ve been able to explore here as well. So are there any perspectives of physicists generally, like just broad perspective of physicists that have been helpful for you when you think about software engineering and how to apply software to what you do?

Ryan Magee 00:43:39 I think one of the biggest things maybe impressed upon me through grad school was that it’s very easy, especially for scientists, to maybe lose track of the bigger picture. And I think that’s something that is really useful when designing software. Because I know when I’m writing code, sometimes it’s really easy to get bogged down in the minutiae, try to optimize everything as much as possible, try to make everything as modular and disconnected as possible. But at the end of the day, I think it’s really important for us to remember exactly what it is we’re trying to find. And I find that by stepping back and reminding myself of that, it’s a lot easier to write code that stays readable and more usable for others in the long run.

Jeff Doolittle 00:44:23 Yeah, it sounds like don’t lose the forest for the trees.

Ryan Magee 00:44:26 Yeah, exactly. It’s surprisingly easy to do because, you know, you’ll have this very broad physical problem that you’re interested in, but the more you dive into it, the easier it is to focus on, you know, the minutiae instead of the bigger picture.

Jeff Doolittle 00:44:40 Yeah, I think that’s very equivalent in business software where you can lose sight of what are we actually trying to deliver to the customer, and you can get so bogged down and focused on this, this operation, this method, this line of code and that. Now, there’s times where you need to optimize it. Mm-hmm. And I guess, you know, that’s going to be similar in, in your world as well. So then how do you distinguish that, for example, when, when do you need to dig into the minutiae and, and what helps you identify those times when maybe a bit of code does need a little bit of extra attention versus discovering yourself, oh shoot, I think I’m bogged down and coming back up for air? Like, what kind of helps you, you know, distinguish between those?

Ryan Magee 00:45:15 For me, you know, my approach to code is normally write something that works first and then go back and optimize it later on. And if I run into anything catastrophic along the way, then that’s a sign to go back and rewrite a couple of things or reorganize stuff there.

Jeff Doolittle 00:45:29 So speaking of catastrophic failures, can you speak to an incident where maybe you shipped something into the pipeline and immediately everybody had a like ‘oh no’ moment and then you had to scramble to try to get things back where they needed to be?

Ryan Magee 00:45:42 You know, I don’t know if I can think of an example offhand of where we had shipped it into production, but I can think of a couple of times in early testing where I had implemented some feature and I started looking at the output and I realized that it made absolutely no sense. And in the particular case I’m thinking of it’s because I had a normalization wrong. So, the numbers that were coming out were just not at all what I expected, but fortunately I don’t have like a real go-to answer of that in production. That would be a little more terrifying.

Jeff Doolittle 00:46:12 Well, and that’s fine, but what signaled to you that was a problem? Uh, like maybe explain what you mean by a normalization problem and then how did you discover it and then how did you fix it before it did end up going to production?

Ryan Magee 00:46:22 Yeah, so by normalization I really mean that we are making sure that the output of the pipeline is set to produce some specific value of numbers under a noise hypothesis. So that if we have actual, we like to assume Gaussian distributed noise in our detectors. So if we have Gaussian noise, we expect the output of some stage of the pipeline to give us numbers between, you know, A and B.

Jeff Doolittle 00:46:49 So similar to music, man: negative one to one, like a sine wave. Exactly right. You’re getting it normalized within this range so it doesn’t go outside of range and then you get distortion, which of course in rock and roll you want, but in physics we don’t.

Ryan Magee 00:47:00 Exactly. And normally, you know, if we get something outside of this range when we’re running in production, it’s indicative that maybe the data just doesn’t look so good right there. But you know, when I was testing in this particular patch, I was only getting stuff outside of this range, which indicated to me I had either somehow lucked upon the worst data ever collected or I had some type of typo in my code.

Jeff Doolittle 00:47:25 Occam’s razor. The simplest answer is probably the right one.

Ryan Magee 00:47:27 Unfortunately, yeah.

Jeff Doolittle 00:47:30 Well, what’s interesting about that is when I think about business software, you know, you do have one advantage, which is because you’re dealing with, with things that are physically real. Uh, we don’t need to get philosophical about what I mean by real there, but things that are physical, then you have a natural mechanism that’s giving you a corrective. Whereas, sometimes in business software if you’re building a feature, there’s not necessarily a physical correspondent that tells you if you’re off track. The only thing you have is ask the customer or watch the customer and see how they interact with it. You don’t have something to tell you. Well, you’re just out of, you’re out of range. Like what does that even mean?

Ryan Magee 00:48:04 I am very grateful for that because even the most difficult problems that I tackle, I can at least normally come up with some a priori expectation of what range I expect my results to be in. And that can help me narrow down potential problems very, very quickly. And I’d imagine, you know, if I was just relying on feedback from others that that would be a much longer and more iterative process.

Jeff Doolittle 00:48:26 Yes. And a priori assumptions are incredibly dangerous when you’re trying to discover the best feature or solution for a customer.

Jeff Doolittle 00:48:35 Because we all know the rule of what happens when you assume, which I won’t go into right now, but yes, you have to be very, very cautious. So yeah, that sounds like actually a significant advantage of what you’re doing, although it might be interesting to explore are there ways to get signals in business software that are maybe not exactly akin to but could provide some of those advantages. But that would be a whole other, whole other podcast episode. So maybe give us a little bit more detail. You mentioned some of the languages before that you’re using. What about platforms? What cloud maybe services are you using, and what development environments are you using? Give our listeners a sense of the flavor of those things if you can.

Ryan Magee 00:49:14 Yeah, so at the moment we package our software in Singularity. Every once in a while, we release Conda distributions as well, although we’ve been maybe a little bit slower on updating that recently. As far as cloud services go, there’s something known as the Open Science Grid, which we’ve been working to leverage. This is maybe not a true cloud service, it is still, you know, dedicated computing for scientific purposes, but it’s available to, you know, groups around the world instead of just one small subset of researchers. And because of that, it still functions similar to cloud computing in that we have to make sure that our software is portable enough to be used anywhere, and so that we don’t have to rely on shared file systems and having everything, you know, exactly where we’re running the analysis. We are working to, you know, hopefully eventually use something like AWS. I think that’d be really nice to be able to just rely on something at that level of distribution, but we’re not there quite yet.

Jeff Doolittle 00:50:13 Okay. And then what about development tools and development environments? What are you coding in, you know, day-to-day? What does a typical day of software coding look like for you?

Ryan Magee 00:50:22 Yeah, so, you know, it’s funny you say that. I think I always use VIM and I know a lot of my coworkers use VIM. Plenty of people also use IDEs. I don’t know if this is just a side effect of the fact that a lot of the development I do and my collaborators do is on these central computing sites that, you know, we have to SSH into. But there’s maybe not as high of a prevalence of IDEs as you might expect, although maybe I’m just behind the times at this point.

Jeff Doolittle 00:50:50 No, actually that’s about what I expected, especially when you talk about the history of the internet, right? It goes back to defense and academic computing and that was what you did. You SSHed through a terminal shell and then you go in and you do your work using VIM because, well what else you going to do? So that’s, that’s not surprising to me. But you know, again trying to give our listeners a flavor of what’s going on in that space and yeah, so that’s interesting that and not surprising that those are the tools that you’re using. What about operating systems? Are you using proprietary operating systems, custom flavors? Are you using standard off-the-shelf forms of Linux or something else?

Ryan Magee 00:51:25 Pretty standard stuff. Most of what we do is some flavor of Scientific Linux.

Jeff Doolittle 00:51:30 Yeah. And then is that these like community-built kernels or are these things that maybe you, you’ve custom prepared for what you’re doing?

Ryan Magee 00:51:37 That I’m not as sure on. I think there’s some level of customization, but I, I think a lot of it’s pretty off-the-shelf.

Jeff Doolittle 00:51:43 Okay. So there’s some standard Scientific Linux, maybe multiple flavors, but there’s sort of a standard set of, hey, this is what we kind of get when we’re doing scientific work and we can sort of use that as a foundational starting point. Yeah. That’s pretty cool. What about open source software? Are there any contributions that you make or others on your team make or any open source software that you use to do your work? Or is it mostly internal? Other, other than the Scientific Linux, which I imagine there, there might be some open source aspects to that?

Ryan Magee 00:52:12 Pretty much everything that we use, I think is open source. So all of the code that we write is open source under the standard GPL license. You know, we use pretty much any standard Python package you can think of. But we definitely strive to be as open source as possible. We don’t often get contributions from people outside of the scientific community, but we have had a handful.

Jeff Doolittle 00:52:36 Okay. Well listeners, challenge accepted.

Ryan Magee 00:52:40 [Laughs]

Jeff Doolittle 00:52:42 So I asked you previously if there were perspectives you found helpful from a, you know, a scientific and physicist’s standpoint when you’re thinking about software engineering. But is there anything that maybe has gotten in the way or ways of thinking you’ve had to overcome to transfer your knowledge into the world of software engineering?

Ryan Magee 00:53:00 Yeah, definitely. So, I think one of the best and arguably worst things about physics is how tightly it’s linked to math. And so, you know, as you go through graduate school, you’re really used to being able to write down these precise expressions for just about everything. And if you have some type of imprecision, you can write an approximation to some degree that is extremely well measurable. And I think one of the toughest things about writing this software, about software engineering and about writing data analysis pipelines is getting used to the fact that, in the world of computers, you sometimes have to make additional approximations that might not have this very clean and neat formula that you’re so used to writing. You know, thinking back to graduate school, I remember thinking that numerically sampling something was just so unsatisfying because it was so much nicer to just be able to write this clean analytic expression that gave me exactly what I wanted. And I just recall that there’s plenty of instances like that where it takes a little bit of time to get used to, but I think by the time, you know, you’ve got a couple of years’ experience with a foot in both worlds, you kind of get past that.

Jeff Doolittle 00:54:06 Yeah. And I think that’s part of the challenge is we’re trying to put abstractions on abstractions and it’s very challenging and complex for our minds. And sometimes we think we know more than we know, and it’s good to challenge our own assumptions and get past them sometimes. So. Very interesting. Well, Ryan, this has been a really fascinating conversation, and if people want to find out more about what you’re up to, where can they go?

Ryan Magee 00:54:28 So I have a website, rymagee.com, which I try to keep updated with recent papers, research interests, and my CV.

Jeff Doolittle 00:54:35 Okay, great. So that’s R-Y-M-A-G-E-E dot com. Rymagee.com, for listeners who are interested. Well, Ryan, thank you so much for joining me today on Software Engineering Radio.

Ryan Magee 00:54:47 Yeah, thank you again for having me, Jeff.

Jeff Doolittle 00:54:49 This is Jeff Doolittle for Software Engineering Radio. Thanks so much for listening. [End of Audio]