Episode 506: Rob Hirschfeld on Bare Metal Infrastructure : Software Engineering Radio

Rob Hirschfeld, CEO of RackN, discusses “bare metal as a service” with SE Radio host Brijesh Ammanath. This episode examines all things bare metal, starting with the basics before doing a deep dive into bare metal configuration, provisioning, common failures and challenges, achieving resiliency, and the benefits of this set up. The discussion explores standards and toolsets in the bare metal space, touching on PXE, IPMI, and Redfish before closing off with innovation and exciting new advances in the infrastructure space that promise to help developers achieve true end-to-end DevOps automation.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Brijesh Ammanath 00:00:16 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath, and today my guest is Rob Hirschfeld. Rob is CEO and co-founder of RackN, leaders in physical and hybrid DevOps software. He has been in the cloud and infrastructure space for nearly 15 years from working with early ESX betas to serving four terms on the OpenStack Foundation board and becoming an executive at Dell. As a co-founder of the Digital Rebar project, Rob is creating a new generation of DevOps orchestration to leverage the containers and service-oriented ops. He’s trained as an industrial engineer and is passionate about applying lean and agile processes to software delivery. Rob, welcome to Software Engineering Radio.

Rob Hirschfeld 00:01:03 Brijesh, it’s a pleasure to be here. I’m really looking forward to the conversation.

Brijesh Ammanath 00:01:06 Excellent. We will be talking about infrastructure as code with a specific focus on bare metal. We have covered infrastructure as code previously in episodes 268, 405, and 482. I would like to start our session by doing a quick refresher of the basics: Infrastructure as code, infrastructure as a service, and bare metal as a service — how are these different?

Rob Hirschfeld 00:01:29 Oh boy, that’s a great question to start with. Infrastructure as code, to me, is very different than infrastructure as a service and bare metal as a service. Infrastructure as code is this idea of being able to build automation (because that’s what we call software that runs and sets up infrastructure) but do it with code-like principles. So, modularity, reuse, collaboration, Git, having a CI/CD pipeline. Those are all development processes that need to be brought into our infrastructure processes, our operations teams. And infrastructure as code, to me, talks about doing exactly that, that change in mindset. We have a couple of tools that are called infrastructure as code tools (Terraform or Ansible come to mind most readily), but those are really tools that handle only a part of the process. It would be like looking at a single Python module: Hey, I can serve up a web page, but I can’t connect to a database.

Rob Hirschfeld 00:02:25 Infrastructure as code really talks about the process by which we’re developing, maintaining, and sustaining that automation. Infrastructure as a service, a lot of people equate that with VM hosting or a Cloud service; it really is very simply having an infrastructure that is API-driven. So, if you have compute, networking, and storage components that are able to be addressed through an API, that would be infrastructure as a service, to me. Bare metal as a service is a subclass of that, where you are talking about the physical layer of the infrastructure and enabling that to have an API in front of it that handles all the pieces. It’s much more complex than what people are used to for infrastructure as a service, because there’s a lot of RAID and BIOS and PXE booting. There are additional complexities in that that are worth exploring, and I’m assuming we’ll get to them.

Brijesh Ammanath 00:03:22 Absolutely. You also touched on tooling, which is a topic that we’ll come to later in the talk. But first, I want to just make sure that we have covered the basics and done a deep dive on bare metal. What specific use cases or workloads are most suitable for a bare metal server? Any examples you can recollect of clients benefiting from using bare metal?

Rob Hirschfeld 00:03:42 At the end of the day, every workload runs on bare metal. We love to talk about things like serverless or cloud; those services don’t exist without bare metal somewhere deep underneath the surface. So, at some point, every service can be run on bare metal. There are prices to be paid for running things directly on bare metal, meaning that you have to manage that infrastructure. And so, if you’re running... you know, we get a lot of people who are interested in, say, running a Kubernetes stack, which is a container orchestration system, directly on bare metal to eliminate the virtualization layer. So, let me step back a second. Typically, on bare metal, you run systems that abstract the bare metal away, so you don’t have to deal with management. That would be a virtualized system like VMware or KVM, and that’s what most of the clouds do when they offer you a server: they’re actually using a layer like that above the bare metal and offering that.

Rob Hirschfeld 00:04:37 So, that would be a typical infrastructure as a service system. Virtualization is always going to run on a bare metal substrate. And there are some places where you want a lot of performance, like a high-performance workload or a data analytics system. Those also typically run on bare metal because you don’t want to have any additional overhead on the system, or the workload that you’re doing just requires all the capacity of the system. So, you don’t need to virtualize it. Even so, some people still virtualize because it just makes it easier to manage systems. But we’ve gotten so good at managing bare metal now that the benefit of adding virtualization just to improve management is really dropping to zero. And then there is another class of bare metal that people are starting to care about, which is Edge infrastructure. At an Edge site, you’re typically deploying very small footprint devices, and it doesn’t make sense to virtualize them, or you don’t want to add the complexity of virtualizing them. And so we do see places where people are talking about bare metal and bare metal automation because they just don’t have the resources on the systems they’re deploying to add a virtualization layer. So there’s a broad range from that perspective.

Brijesh Ammanath 00:05:48 When would you not use bare metal?

Rob Hirschfeld 00:05:50 There are times when you might decide that you don’t want to manage the bare metal. So like I said before, you’re always using bare metal somewhere, but in a lot of cases, people don’t want to deal with the additional complexity of using bare metal. So in a lot of cases you would argue the other way around: when should I use bare metal, instead of when should I not. But the reason that you don’t is that being able to deliver infrastructure in a virtualized package really, really simplifies how you set up the systems. If you’re putting virtualization on top of that, then the person using the infrastructure doesn’t have to worry about setting up the RAID or BIOS. They don’t have to worry about the security of out-of-band management. They don’t have to worry about networking, because you can control the networking in a virtual machine a lot more.

Rob Hirschfeld 00:06:40 It really just provides you a much more controlled environment. So, you want to use these virtualized layers on top of bare metal to remove complexity from people in your organization and provide that abstraction. That’s typically what we see as a really good use for it. There’s another case where your servers just have a lot more capacity than you need. And so, the other benefit of virtualizing on top of bare metal is that you can actually oversubscribe the systems, and you can have 10, 20, 100 servers that are dedicated to different uses on a piece of bare metal and serve a lot more customers with that one piece of equipment. That is another place where the ability to share or partition work really is of value to a lot of companies.

Brijesh Ammanath 00:07:29 What is the difference between two offerings: first, bare metal with a hypervisor that you run yourself, and second, a dedicated host where the hypervisor is managed by the Cloud provider?

Rob Hirschfeld 00:07:40 We see that if you are running the whole thing yourself, even if you’ve virtualized it, there are some really significant benefits to being able to balance the workload that’s on that system, and to knowing that you’re not stuck with what they call a noisy neighbor. In a cloud provider situation, where you’re just getting a virtual machine without knowing what’s going on underneath, you could get virtual machines that are on systems that are very busy, that have somebody who’s really taxing the resources on that system and starving your virtual machine. And you don’t have any way to know that. You could also be in a situation where you’ve been assigned a slower or older system, something with slower memory. So the performance of your virtual machine could suffer based on circumstances that are completely outside of your control. And so there’s a pretty significant benefit, if you’re worried about performance or about consistency in the results, to actually having full control of the stack. And it can be cheaper. Cloud services are expensive services. They charge premiums for what they do. And so our customers definitely find that if they buy the hardware and they buy the virtualization layer, they can save a significant amount of money over the course of a year by basically having full control and ownership of that stack rather than renting a VM on a per-month or per-minute basis.

Brijesh Ammanath 00:09:04 Thank you. We’re going to dig deeper into bare metal infrastructure as a service. So moving on to bare metal provisioning, what makes bare metal provisioning difficult?

Rob Hirschfeld 00:09:15 There are a lot of things that make bare metal a challenge. I’m going to try and break them into a couple of pieces. One of them is just the fact that the servers themselves have a lot of moving parts in them. So when you are managing a server, it has multiple network interfaces. It has multiple storage devices. It usually has some type of RAID controller. It has firmware for the device. It actually has firmware for the RAM. It has firmware for the drives. It has firmware for the out-of-band management. It has its own out-of-band management controller, which means that there’s a separate interface for the systems that you use to set the firmware or control its power and things like that. And so all of those pieces together translate into the fact that you can’t ignore that aspect of the system. You actually have to build the systems to match how they’re configured and what their capabilities are, and setting all that stuff up is much more involved. We’ve automated it, but it requires a lot more information, a lot more skills, a lot more knowledge.

Rob Hirschfeld 00:10:22 And so bare metal itself becomes more challenging. And even if you took something as simple as a Raspberry Pi, it has those same limitations, and you have to understand how to deal with them and set up the operating system to match that environment. So that’s a piece of the puzzle. The other thing about it is that, beyond what’s within that machine, you also have external needs for controlling the machine. So we talk about something called PXE a lot, P-X-E. It’s the Preboot Execution Environment, which is actually running on the network interface cards of the server and handles that initial boot provision process. So in order to install software on a physical machine, you have to have a way to have that machine boot and talk to the network, which means talking to your DHCP server. Your DHCP server has to understand how to answer the request for this PXE provisioning, and it has to send in an infrastructure.

Rob Hirschfeld 00:11:15 You actually then send a series of OSes to boot in sequence. So for what we do at Digital Rebar, there are four distinct boot provision cycles that go into doing that process. And so you’re literally sending a boot loader and then another boot loader and another boot loader until you get up to installing an operating system, and all of that requires infrastructure. And the PXE process has actually been around for over two decades. It’s well-established, but there are new processes that are coming when people use UEFI, the new firmware that’s coming out, or that’s embedded in servers now. And that actually has a slightly different process that skips some boot loader components but has different configuration requirements. If I’m not making people’s heads spin yet, then either you’re used to doing this sort of sequential boot process and what I’m saying makes sense, or you’re thinking, all right, I’m never going to want to do that.

Rob Hirschfeld 00:12:12 And that’s exactly why people install virtualization. But there’s a huge “but” here: it is all pretty well figured-out ground now, and the need to be like RackN and understand how the boot provision process works and things like that has really diminished. So nowadays you can stand up a simple service that will automate that full process for you, manage the BIOS, RAID, and firmware, and do all that configuration. You have to be aware that it’s happening on your behalf, but you don’t really have to understand the nuances of a multi-stage PXE boot provisioning process.

Brijesh Ammanath 00:12:48 So if I’m able to summarize it, the way I understood it, the challenges are around the differences in the bare metal servers themselves, as well as the different ways of controlling the boot process and the configuration of the servers. Is that a right summary?

Rob Hirschfeld 00:13:03 That is right. That is exactly what makes it challenging. I would actually add there’s one more thing here that is also hard. Installing the operating systems themselves, the actual operating system process of mapping onto that infrastructure, is also challenging from that perspective. Each operating system has different ways that it adapts to the infrastructure that it is being installed on. Debian and Ubuntu have a preseed process; Red Hat, CentOS, and the like have something called a Kickstart process that does all this configuration. Windows has its own specific thing. And a lot of our customers choose not to do any of that. They’ll build a pre-baked image, write that image directly to disk, and skip a lot of that configuration process. But that is another place where people often stumble in building bare metal infrastructure, because they have to figure out all of those pieces. Even with VMs, you have to figure it out, but a lot of it’s sort of baked in for VMs.
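
To make the Kickstart idea above concrete, here is a minimal Python sketch that fills per-machine values into an unattended-install file. The template contents, the `render_kickstart` helper, and the choice of `install_disk` as the only variable are assumptions for illustration; a real Kickstart file would carry site-specific partitioning, networking, and package choices, and this is not how any particular product generates it.

```python
from string import Template

# Hypothetical, deliberately small Kickstart template. The directives are
# ordinary Kickstart commands, but the selection here is illustrative only.
KICKSTART_TEMPLATE = Template("""\
lang en_US.UTF-8
keyboard us
timezone UTC
rootpw --lock
ignoredisk --only-use=$install_disk
clearpart --all --initlabel
autopart
reboot

%packages
@core
%end
""")


def render_kickstart(install_disk: str) -> str:
    """Return the Kickstart text for one machine (hypothetical helper)."""
    return KICKSTART_TEMPLATE.substitute(install_disk=install_disk)


if __name__ == "__main__":
    print(render_kickstart("sda"))
```

The point of the sketch is that the installer answers depend on the hardware (here, which disk to install onto), which is why this step has to be mapped onto each machine.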

Brijesh Ammanath 00:14:05 You also mentioned UEFI, is that a newer standard to PXE and what are the advantages it offers?

Rob Hirschfeld 00:14:12 So UEFI BIOS is actually what is embedded in all of the computers’ motherboards to run the operating systems. And this has been around for about 10 years now, but it’s only slowly coming in as a standard. The alternative to UEFI that people would be used to is Legacy BIOS, which is what used to run servers. If you have a desktop, most desktops now run UEFI BIOS by default, exclusively. In the data center world, UEFI BIOS actually changed some of the ways that systems are addressed, and it still trips people up. There have also been security problems: a whole bunch of security issues announced with UEFI BIOS have had to be patched. And so people who had existing data centers often put servers back in Legacy mode. UEFI BIOS also has a slightly different PXE process, and it can skip the Legacy PXE and switch into iPXE more quickly, or even skip into a higher-level boot loader beyond that. And it’s worth noting that, for all that we’re talking about, this is very server heavy, but network switches have similar challenges and similar processes. And so, bootstrapping a switching infrastructure is also a bare metal provisioning and install process that requires another stack of automation and logic.
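
As a rough illustration of how a provisioning service might choose the next boot stage for Legacy versus UEFI clients, here is a small Python sketch. The architecture codes are the DHCP option 93 values from RFC 4578, and `undionly.kpxe` and `ipxe.efi` are the usual iPXE binary names; the function name, the script URL, and the decision order are assumptions for illustration, not a description of Digital Rebar.

```python
# DHCP option 93 (client system architecture) values from RFC 4578.
ARCH_BIOS_X86 = 0     # Legacy BIOS PXE client
ARCH_EFI_BC = 7       # commonly reported by x86-64 UEFI firmware
ARCH_EFI_X86_64 = 9   # also reported by x86-64 UEFI firmware

# Hypothetical URL of the script served once the client runs iPXE.
IPXE_SCRIPT_URL = "http://provisioner.example/boot.ipxe"


def select_boot_file(client_arch: int, user_class: str = "") -> str:
    """Pick what a network-booting client should load next."""
    if user_class == "iPXE":
        # The client is already running iPXE, so hand it a script
        # instead of chain-loading yet another boot loader.
        return IPXE_SCRIPT_URL
    if client_arch == ARCH_BIOS_X86:
        return "undionly.kpxe"   # iPXE build for Legacy PXE ROMs
    if client_arch in (ARCH_EFI_BC, ARCH_EFI_X86_64):
        return "ipxe.efi"        # iPXE build for UEFI firmware
    raise ValueError(f"unsupported client architecture: {client_arch}")


if __name__ == "__main__":
    print(select_boot_file(ARCH_BIOS_X86))
    print(select_boot_file(ARCH_EFI_X86_64))
    print(select_boot_file(ARCH_EFI_BC, user_class="iPXE"))
```

Each call corresponds to one hop in the chain of boot loaders: firmware first, then iPXE, then a script that fetches the installer.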

Brijesh Ammanath 00:15:30 What kind of effort and lead time do you need to add more compute or RAM or storage to a bare metal setup?

Rob Hirschfeld 00:15:37 You know, interestingly, a lot of the time that we work in data centers, people don’t modify existing servers as much as they modify the footprint they buy for new servers. It is much less common in my experience for somebody to, say, add a couple of sticks of RAM or new drives into a system. They might replace failing ones, but typically they don’t go in and modify them. That said, if you were doing that, what you would look at would be this: adding additional RAM doesn’t necessarily cause a lot of overhead in the system. You reboot it, you know, and you can identify the new RAM. Adding drives to a system can be very disruptive, and even network cards can be disruptive, because those devices can change the enumeration of the systems that you have in place. And so, we talked about this preseed and Kickstart process and configuring all these pieces.

Rob Hirschfeld 00:16:38 When all those are connected into a bare metal server, they have a bus order in which they’re actually connected and then identified, and they have unique identifiers, and they also have a sequence. Depending on how the operating system sees them, it can actually change the way they’re listed to the operating system. And going from Legacy BIOS to UEFI BIOS is a good example of this. I mentioned that that changes things. It changes, in some cases, the way the drives are enumerated in a system. So you might have a system that’s working great in Legacy mode, switch the BIOS to UEFI mode, and then the drive enumeration is different. And the operating system no longer works, or drives that were attached are no longer attached in the places you expected them to be. And that’s incredibly disruptive. We see that change quite a bit. As companies no longer support Legacy BIOS, enterprises are having forced migrations to UEFI BIOS, and flipping that switch actually makes it look like they got new drives or added drives or rewired their drive infrastructure. And that’s incredibly disruptive from that perspective. It’s one of the reasons why people typically don’t modify systems in place. They typically buy whole new systems and treat them as a converged unit.
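
The enumeration problem described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each disk exposes a hardware serial number that stays the same while its enumeration-dependent name moves around; the `Disk` type, the serial values, and the helper name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Disk:
    name: str    # enumeration-dependent name, e.g. "sda"
    serial: str  # hardware serial number, assumed stable


def find_disk_by_serial(disks: List[Disk], serial: str) -> Disk:
    """Look up a disk by serial number instead of by its position."""
    for disk in disks:
        if disk.serial == serial:
            return disk
    raise LookupError(f"no disk with serial {serial!r}")


# The same two physical drives, listed in a different order after a
# (hypothetical) switch from Legacy mode to UEFI mode.
legacy_order = [Disk("sda", "SER-BOOT"), Disk("sdb", "SER-DATA")]
uefi_order = [Disk("sda", "SER-DATA"), Disk("sdb", "SER-BOOT")]

if __name__ == "__main__":
    # Position-based logic would pick different drives in each case;
    # the serial-based lookup finds the same physical boot drive.
    print(find_disk_by_serial(legacy_order, "SER-BOOT").name)
    print(find_disk_by_serial(uefi_order, "SER-BOOT").name)
```

Automation that says "install onto the first disk" breaks when the order changes, while automation keyed on a stable identifier keeps pointing at the same physical drive.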

Brijesh Ammanath 00:17:52 So if I understood you correctly, what you’re saying is that the sequencing of the drives itself could change, which could have an impact in terms of the hardware running properly.

Rob Hirschfeld 00:18:04 The way the operating system addresses that hardware. That’s exactly right. It can also do things like change the boot order of the network interfaces, and depending on how you’ve mapped your network interfaces, that means that the MAC address that you’ve registered for a server can change, and that can confuse the DHCP server that’s then running the IP systems underneath your servers. And so those types of sequence changes can cause disruptions too. With the way infrastructure gets built, and this is true for Cloud as much as bare metal, the order of operations, the sequence of things, you know, identifiers and addresses get coded into the systems. And it can be very difficult to unwind those types of problems. We’ve had experiences where people made what they thought would be a very small change in a server configuration in the BIOS, or a patch to the BIOS, which changed the order in which their network interfaces came online.

Rob Hirschfeld 00:18:59 And so a different NIC came up first, and then that tried to PXE boot the server. This is a very down-in-the-weeds story, but it illustrates the point: when that NIC came up first, the DHCP server thought it was a new server and told it to re-image the server, which was not well received by the operating team. And so building that type of resilience into the system is actually a big part of what we’ve done over time. Actually, in that specific case, we built a whole fingerprinting system into Digital Rebar so that when servers come up, we don’t rely on which MAC address has asked for the image. Instead, we can fingerprint the systems and look at serial numbers baked deep into the hardware to identify and map which server is which, so that we don’t get faked out if somebody makes a change like that, which happens more than you might expect. And when it does, rewriting somebody’s disks is never a popular thing, unless they wanted it done.
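
The fingerprinting idea can be sketched as follows. This is not Digital Rebar’s actual algorithm, which the conversation does not describe in detail; it is a minimal Python illustration that assumes a machine can report a system serial, a board serial, and its MAC addresses, and that a match on hardware serials should outweigh whichever NIC happened to boot first. All names, weights, and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Fingerprint:
    system_serial: str
    board_serial: str
    macs: FrozenSet[str]


def match_score(known: Fingerprint, seen: Fingerprint) -> int:
    """Score how strongly a booting machine resembles a known one."""
    score = 0
    if known.system_serial and known.system_serial == seen.system_serial:
        score += 2
    if known.board_serial and known.board_serial == seen.board_serial:
        score += 2
    if known.macs & seen.macs:
        score += 1  # MACs are supporting evidence only
    return score


def identify(inventory: Dict[str, Fingerprint], seen: Fingerprint,
             threshold: int = 2) -> Optional[str]:
    """Return the best-matching known machine name, or None if new."""
    best_name, best_score = None, 0
    for name, known in inventory.items():
        score = match_score(known, seen)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def boot_action(inventory: Dict[str, Fingerprint], seen: Fingerprint) -> str:
    """Only unknown hardware is provisioned; known machines boot locally."""
    if identify(inventory, seen) is None:
        return "provision-new"
    return "boot-local-disk"


if __name__ == "__main__":
    inventory = {"web-01": Fingerprint("SN123", "BD456",
                                       frozenset({"aa:bb:cc:00:00:01"}))}
    # Same server, but a different NIC came up first after a BIOS change.
    seen = Fingerprint("SN123", "BD456", frozenset({"aa:bb:cc:00:00:02"}))
    print(boot_action(inventory, seen))
```

In the story above, a MAC-only lookup treats the second NIC as a brand-new server; with the serial-based match, the machine is recognized and left alone.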

Brijesh Ammanath 00:20:01 Agreed. It does sound very disruptive.

Rob Hirschfeld 00:20:05 Yeah. There’s a lot of defensive technology in any operational system and infrastructure as code system. You want to have automation that does positive things. You also want to have automation that stops before it does harmful or destructive things. Both are important.

Brijesh Ammanath 00:20:22 Agreed. How do you achieve resiliency and fault tolerance in a bare metal set up?

Rob Hirschfeld 00:20:28 It can be really challenging to have resilience. Some of the protocols that we depend on, like DHCP, TFTP boot, and out-of-band management, aren’t necessarily designed with resilience in mind. And so what we’ve ended up doing is actually building HA components for DHCP infrastructure, and then being able to reset and restart these processes. Some of the protocols that are being used are very hard to change. They’ve been around for a long time, and the designers didn’t think through a lot of the resilience aspects when they were just worried about how you PXE boot a server. As a matter of fact, PXE booting a server is incredibly limited from a software capability standpoint. So it really requires you to think through, externally, how you encourage that system to be built in a really sustainable way. One of the things I can say that we do that you might not think of out of the box as HA resiliency, but that has proven to be the simplest over time, is that our infrastructure as code systems are all set up as an immutable artifact set.

Rob Hirschfeld 00:21:40 So part of what we do to make things very resilient is we make it incredibly easy to recreate your environment and have all the artifacts that went into building that environment version controlled and then bundled together in a very packaged way. And so, it’s important to be able to come back from a situation where you say, oh, my boot provision system is offline, I’m stuck. That’s a big problem. You can build a multi-node HA cluster, and we support that, with a consensus algorithm that will keep it all up. That’s great. In some cases, it’s really nice to just be able to say, yeah, something happened, I’m going to rebuild that system from scratch and everything will be just fine. Take a backup, have backups going of the infrastructure, and be able to recover. Sometimes that’s actually the simplest and best component for this algorithm.

Rob Hirschfeld 00:22:32 It is worth noting that a lot of what our customers have been able to do, and what we advocate, is being much more dynamic in how you manage these environments. So the wrong answer for being more resilient is to turn off the automation and provisioning systems and just pretend like your servers never need to be reprovisioned or reconfigured. That is the absolute wrong way to go about building resilience in your system. It’s much better to go in and say, you know what, I want my bare metal infrastructure to be very dynamic and be updated every month and rebooted and patched and reviewed. We found that the most resilient systems here are the ones where the bare metal infrastructure is actually the most dynamic, and they are constantly reprovisioning and repaving and resetting the systems, patching the BIOS, and keeping things up to date. The more dynamic they are, and the more turnover they have in that system from an operating system perspective, rebuilding and resetting all that, the more resilient the data centers are as a whole. It does put more stress on the provisioning infrastructure around that, but the overall system is much, much stronger as a consequence.

Brijesh Ammanath 00:23:44 I can see some infrastructure as code and some agile principles being applied over here. One of the principles in agile is that the more often you release, the more resilient your system is, and you’re pretty much bringing something similar over here.

Rob Hirschfeld 00:23:59 That’s exactly right. We’re calling that process infrastructure pipelines. Some people would call it a continuous infrastructure pipeline. And the idea here is, when you’re dealing with bare metal systems (we’ve talked about this a couple of times already, and it’s worth reinforcing), the thing that makes bare metal challenging is I don’t have one API that does all the work. I actually have to walk through a series of steps, especially if you then look at building the operating system, installing platforms on top of the operating system, and then bringing those into clusters. That’s an integrated workflow that has to operate end to end. So it’s very much like how we’ve seen CI/CD pipelines really, really help development processes from an agile perspective, where you can make those incremental changes, and then each change is going to automatically flow all the way through into production delivery. If you do that at the bare metal layer, or even at the virtualized infrastructure layer, you have dramatic results from being able to make small, quick changes and then watch those get implemented very quickly through the system. So you’re exactly right. That agile mindset of small, quick, constantly testing, refining, executing translates into really, really dynamic, much more resilient infrastructure as a whole.
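
The "series of steps that has to operate end to end" can be illustrated with a very small pipeline runner in Python. This is a sketch under stated assumptions, not the design of any real product: the stage names, the dictionary used as machine state, and the stop-on-first-failure behavior are all chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Machine = Dict[str, str]
Stage = Tuple[str, Callable[[Machine], None]]


def configure_firmware(machine: Machine) -> None:
    machine["firmware"] = "configured"


def install_os(machine: Machine) -> None:
    # Each stage can depend on the state left behind by earlier ones.
    if machine.get("firmware") != "configured":
        raise RuntimeError("firmware must be configured before OS install")
    machine["os"] = "installed"


def configure_platform(machine: Machine) -> None:
    machine["platform"] = "configured"


def join_cluster(machine: Machine) -> None:
    machine["cluster"] = "joined"


# Hypothetical stage list covering provisioning through cluster join.
PIPELINE: List[Stage] = [
    ("configure-firmware", configure_firmware),
    ("install-os", install_os),
    ("configure-platform", configure_platform),
    ("join-cluster", join_cluster),
]


def run_pipeline(stages: List[Stage], machine: Machine) -> List[str]:
    """Run stages in order; stop at the first failure and record it."""
    completed: List[str] = []
    for name, step in stages:
        try:
            step(machine)
        except Exception as exc:
            machine["failed_stage"] = name
            machine["error"] = str(exc)
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    state: Machine = {}
    print(run_pipeline(PIPELINE, state))
    print(state)
```

Recording the failed stage on the machine state is one simple way to get the visibility into a running process that an operational platform needs.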

Brijesh Ammanath 00:25:14 We will now move to the next section, which is about standards and toolsets, but I do want to continue the conversation about the infrastructure pipeline. So on the infrastructure pipeline, how is the tooling? Is it mature? And do you have a mature toolset similar to what we have for CI/CD pipelines?

Rob Hirschfeld 00:25:34 What RackN builds is a product called Digital Rebar, and that has been in use running data centers that have thousands of servers and tens and hundreds of sites, global footprints. And so we’re very comfortable with that process and with being able to bring components into that process. More generally, it is something that we’ve seen companies trying to build themselves, either with a lot of bash scripts, right, where they’re sort of trying to cobble together pieces (and I’ll talk about what the pieces are in a second), or they’re sort of trying to stuff it in at the end of the CI/CD pipeline, where they’ll call out to a Terraform script or an Ansible script and they’ll try and run those things together. That’s a starting point. The challenge is that it really doesn’t become an operational platform. It’s important, when you’re dealing with infrastructure, to really have visibility and insight into the processes as they’re running.

Rob Hirschfeld 00:26:28 And it’s also really important that the process is run from a data center. You don’t want to run infrastructure pipelines from a desktop system, because they have to be available all the time. The state of them has to be available back into the systems. We do see a lot of excitement around some really good tools that we leverage too in building our pipelines. Things like Terraform or Pulumi are infrastructure as code tools that sort of wrap the Cloud APIs and provide a slightly more consistent experience for programmatically interfacing to a Cloud in a generic way. We can talk more generally about how those aren’t as consistent as we would like; the goal of an infrastructure pipeline is that it doesn’t really care what infrastructure you’re running underneath. It should be an abstraction. And then we see a lot of configuration, which is a very different operation, where you’re actually working inside of the system, inside of the operating system, installing software and configuring firewalls and adding user accounts and things like that. Typically people use something like Ansible, Chef, Puppet, or Salt for that. Those types of processes are also important to have in the pipeline and should be connected together so that you can go straight from provisioning into configuration, and then run that as a seamless process.

Brijesh Ammanath 00:27:43 I was going to ask you about Terraform and whether that’s applicable for bare metal, but you’ve already answered my question.

Rob Hirschfeld 00:27:49 Terraform and bare metal is an interesting challenge. Terraform really is a driver for other APIs. It doesn’t do anything by itself. It’s a front end for APIs, and then it stores some state. And the way it stores state can be a challenge from a pipeline perspective. I’m happy to dig deeper into that, but you can use Terraform. I mean, one of the things that we’ve done is taken our API for bare metal as a service and wrapped it in Terraform, so you can use a Terraform provider to do that work. What we found, though, was that people really wanted the end-to-end pipeline pieces. And so if you’re building a pipeline and Terraform is providing, say, provisioning in that pipeline, like we use it for Cloud interfacing, then if you have a way to do it that doesn’t require you to call into Terraform, it’s not as important to that process. And from an infrastructure as code perspective, we’ve really stepped above the Terraform aspect and asked: how do people want to build data center infrastructure? How do they want to build clusters? How do they want to do the configuration after the systems are provisioned, and how do they want to do the controls leading into the decision to build a cluster? Those operations are really the conversations that we have more from an infrastructure as code perspective, not the “how do I turn on the VMs in another system” question.

Brijesh Ammanath 00:29:11 Does bare metal have any API? What is the API of the server itself?

Rob Hirschfeld 00:29:16 The servers have traditionally had something called IPMI. The variance in this is very, very large. Most enterprise-class servers have out-of-band management, or BMC, which is another acronym that people use for that. The vendors have their own brand names for it. For Dell it’s DRAC, for HP it’s iLO, and there’s a whole bunch of acronyms behind all those names. Fundamentally, those use proprietary protocols; the Legacy ones use something called IPMI, which is an IP-based management interface. So it’s network-based access to turn the machine on or off. With IPMI, there are some basics that work sort of everywhere, but once you get past the basics, every server is different. And then there’s a new standard coming around slowly called Redfish. That has a little bit more consistency than IPMI, but vendors still have their own overlays and implementations of it. And so it’s handy to have some convergence on APIs, but the servers themselves are different.
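
For a sense of what a Redfish power-control call looks like, here is a minimal Python sketch that only builds the HTTP request and does not send it. The `ComputerSystem.Reset` action and the `ResetType` field come from the Redfish standard; the host name, the system ID, and the helper name are hypothetical, authentication is omitted, and, as noted above, vendors differ in which systems, paths, and reset types they actually expose.

```python
import json
import urllib.request


def build_reset_request(bmc_host: str, system_id: str,
                        reset_type: str = "ForceRestart") -> urllib.request.Request:
    """Build (but do not send) a Redfish reset request for one system.

    The action path follows the conventional Redfish layout; a robust
    client would discover the target URL from the system resource.
    """
    url = (f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical BMC address and system ID; nothing is sent on the network.
    request = build_reset_request("bmc-web-01.example", "1")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Because the request is constructed separately from being sent, the same function can be inspected in tests or wrapped with whatever credentials and TLS settings a given BMC requires.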

Rob Hirschfeld 00:30:18 And so it can be very hard to automate against it. And then you have a whole band of hardware, like the edge servers, that, you know, might not have any out-of-band management interface. And so, you’re stuck with only being able to PXE boot it. Some servers use another protocol that sort of rides on top of their main networking that you can use to do power controls and things like that. It’s unfortunately all over the map from that perspective and can be very hard to automate, because you have to know how to reach the server. You have to be in the network that has the out-of-band management on it. You have to have the credentials. And please, please, please, everybody, if you’re listening to this, make sure that you set passwords, ideally unique per server, on all of your out-of-band management interfaces.

Rob Hirschfield 00:31:06 If you’re attaching those to the internet and you’re not changing the passwords, you are exposing your server to the internet and it will be hacked and taken down. So those are very easy ingress points for people. These are challenges. The customers that we work with are very careful about those interfaces and how they’re exposed, about not leaving them at the defaults, and, you know, about making sure they have certificates. There’s a whole bunch of security that goes into improving those APIs because they are incredibly powerful when it comes to owning and managing a server.
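
To make the Redfish discussion above more concrete, here is a minimal Python sketch of what an out-of-band power control request can look like. The ComputerSystem.Reset action and the ResetType values come from the DMTF Redfish standard; the BMC address, the system ID "1", and the helper name build_reset_request are assumptions made for illustration. As the conversation notes, vendors differ in their system IDs and in which reset types they support. The sketch only builds the request; it does not send it, and it leaves out authentication and certificate handling, which a real deployment would need.

```python
import json
import urllib.request

# A subset of the reset types defined by the Redfish schema.
# A given BMC may support only some of them.
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "GracefulRestart", "ForceRestart"}


def build_reset_request(bmc_host: str, system_id: str, reset_type: str) -> urllib.request.Request:
    """Build (but do not send) a Redfish power-control request for one server."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported reset type: {reset_type}")
    url = f"https://{bmc_host}/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset"
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical BMC address and system ID; real values depend on vendor and site.
req = build_reset_request("bmc.example.internal", "1", "ForceRestart")
print(req.get_method(), req.full_url)
```

Keeping the request construction separate from sending it makes it easy to check in tests that automation targets the intended server before anything is power-cycled.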

Brijesh Ammanath 00:31:40 I would like you to explain what you mean by out-of-band.

Rob Hirschfield 00:31:44 So when you take a piece of bare metal, really any system, because virtual machines have the same concept, it’s worth understanding how the controls work. If I take a regular server and install an operating system on it, and I start using that server, the normal way to configure that server is what we would call in-band, where I talk to a network interface on the server, usually through something like SSH or through its web port. And then I log into the server and I start doing things with the server, and I can even do reboots and things like that. We call that a soft reboot, where you’re asking the operating system to restart. That would be in-band control. Our software, like most software, has an agent that you can run on the system. And if you need to make changes to the system, you can ask that agent to do that work for you.

Rob Hirschfield 00:32:30 And that would be in-band control. And it’s the primary way that most systems are managed. And it’s a really good, secure way to do it. But sometimes that doesn’t work. If your operating system crashed or the operating system isn’t installed yet, or you might not have the access credentials to that system, you need another way to get access to it. And that’s what out-of-band management is. So in out-of-band management, there is a back door. It’s not exactly like an operating system back door. It’s a network access that talks to the motherboard of the server as a separate service, the monitoring system, the management system. And through that, you can control the server. You can stop and restart it. You can update the BIOS and change the configuration settings. You can literally do all of the setting actions on the systems. And it’s important to understand that these control mechanisms are actually the way you configure the server predominantly; there are no buttons or dials on the server.

Rob Hirschfield 00:33:33 The server usually has an on-off button and that’s about it. If you want to modify a server, you’re either using the out-of-band management port or you’re rebooting it, pushing F2 to get into the BIOS configuration, and using a keyboard and mouse, or mostly keyboard, to set whatever you want on those settings. That’s the difference with out-of-band management. It’s worth noting that if you’re dealing with a VM and you’re talking to the hypervisor control plane, that’s effectively out-of-band management too. So, if I’ve installed VMware and I’m talking to VMware, that is out-of-band management for a VM. If I was talking to a Cloud and talking to the Cloud’s API, that’s out-of-band management for the Cloud instance.
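
The in-band versus out-of-band distinction described above can be sketched as a small decision function. This is only an illustration: the Server fields, the function name choose_reboot_path, and the returned strings are hypothetical and not part of any product mentioned in the conversation. It prefers in-band control when the operating system is reachable and falls back to the out-of-band controller when it is not.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    os_reachable: bool   # can we log in over SSH or reach an agent on the OS?
    bmc_reachable: bool  # can we reach the out-of-band management port?


def choose_reboot_path(server: Server) -> str:
    """Pick a control path for a reboot, preferring in-band when it works."""
    if server.os_reachable:
        # In-band: the operating system itself performs a soft reboot.
        return "in-band"
    if server.bmc_reachable:
        # Out-of-band: the management controller restarts the machine.
        return "out-of-band"
    # Neither path works, so someone has to go to the machine.
    return "manual"


healthy = Server("node-a", os_reachable=True, bmc_reachable=True)
crashed = Server("node-b", os_reachable=False, bmc_reachable=True)
print(choose_reboot_path(healthy), choose_reboot_path(crashed))
```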

Brijesh Ammanath 00:34:14 Thank you. I’d also like you to touch on DevOps automation. How does DevOps automation work with bare metal?

Rob Hirschfield 00:34:22 Yeah. DevOps automation from our perspective is really very much the same thing as what I would consider infrastructure as code automation. And it’s this idea that I’m building processes to control the system. With bare metal it’s really the same once you have that machine bootstrapped and installed, and we have an API that lets you do that. So your DevOps tooling can talk to your bare metal APIs or your Cloud APIs to provision a system. That’s the provisioning part of the DevOps automation, usually Terraform, Pulumi, something like that. And then there’s the configuration side of it, where the DevOps tooling would be Chef, Puppet, Ansible, Salt, your favorite bash scripts or PowerShell scripts, actually running in-band on the system. A lot of people think of DevOps automation as sort of that part of the process where you’re actually on the system, installing software, configuring it, making all those pieces go, but it’s really a continuum.

Rob Hirschfield 00:35:23 When I talk about DevOps, I would fall back to the idea of the DevOps processes, more where people are looking at getting teams to communicate together and then building that pipeline and that automation. Sometimes we get very tied into, like, “oh, my DevOps tools, you know, Ansible is my DevOps automation tool.” Then you’re really only looking at one piece of how that works. It’s super important to have automation tools that do the work you need to do. You certainly don’t want to log in and do anything by hand. You just also need to understand that the individual parts of your pipeline are important tools that need to work well. And then you have to take a step back and figure out how to connect them together. So with the DevOps tooling, when people look at that: every DevOps automation component I have, I should have something that calls it and something that it calls. That’s what makes a pipeline.
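
The idea of connecting individual automation tools into a pipeline can be sketched in a few lines of Python. The stage names and the state dictionary here are hypothetical stand-ins: in practice the provisioning stage might wrap a tool such as Terraform or a bare metal API, and the configuration stage a tool such as Ansible. The point of the sketch is only that each stage is called by the pipeline and hands its result to the next stage.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Stage = Callable[[State], State]


def provision(state: State) -> State:
    """Stand-in for the provisioning step (install an operating system)."""
    return {**state, "os": "installed"}


def configure(state: State) -> State:
    """Stand-in for the in-band configuration step; needs a provisioned machine."""
    if state.get("os") != "installed":
        raise RuntimeError("cannot configure a machine that is not provisioned")
    return {**state, "software": "configured"}


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Run the stages in order, feeding each stage's output to the next one."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline([provision, configure], {"machine": "node-a"})
print(result)
```

Running the configuration stage on its own fails, which is the kind of ordering a pipeline exists to enforce.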

Brijesh Ammanath 00:36:15 In this last section, I’d like to close off the show, talking about what’s in the future. What are some of the exciting new ideas and innovations in the infrastructure space that you would like our listeners to know about?

Rob Hirschfield 00:36:27 Infrastructure is really exciting. There’s a lot going on that people haven’t been paying attention to because we’ve been so wrapped up in Cloud. So, I like the opportunity to sort of have people step back and say, wow, what is going on in the infrastructure space? Because there’s a lot of innovation here. One of the things that we’re seeing, and you can access it in Cloud infrastructure too, is more and more ARM processors. Intel and AMD processor styles have really dominated the market for the last 20 years. Cell phones and other tech like that have been using ARM processors, but in a very captive way. We’re starting to see ARM become available for data center use and enterprise use. And so I see that from a power management perspective, from a price performance perspective, and also from an edge utility perspective, we’re going to see a lot more servers using ARM architecture chips.

Rob Hirschfield 00:37:19 It’s going to require dual compiling. And there are some challenges around it. But I think that the footprint of that architecture is going to be very powerful for people, especially as we’ve gotten better at bare metal management. You could have 10 ARM servers and manage those for less than it would cost you to put 10 comparable virtual machines on an AMD or Intel class machine. So there are incredibly powerful stories for that. The other thing that we’re tracking that is interesting is something called a SmartNIC. Sometimes these are called supervisory controllers or IPUs, where they’re basically a whole separate computer, often with an ARM chip in it, that runs inside of your primary server. And that second computer can then override the networking and the storage. It can actually run services like the hypervisor for the server that you’re talking to. And so it’s basically the supervisory system. It has its own life cycle, its own controls, but then it is able to provide security and monitor the traffic going in and out.

Rob Hirschfield 00:38:25 It can offload some of the compute processing, like by running the hypervisor, so that (Amazon does this with all of their servers) the server that’s running the virtual machines only runs virtual machines, and the coordination and control of those virtual machines is all done on these SmartNICs. It’s been offloaded for those control systems. That capability of having that type of supervisory control in a system really changes how we would look at a server. It might mean that you get more performance out of it. It might mean that you can create a layer of security in the systems, which is really important. It might mean that you can bridge in virtual devices. So you might be able to create a server, and we have partners that are doing exactly this, that has, you know, a hundred GPU instances in it instead of just one or two or maybe eight. You can actually change the physical characteristics of a server in a dynamic way.

Rob Hirschfield 00:39:26 And so it really changes the way we think about how servers get built. That’s something that’s called converged infrastructure; composable infrastructure is another term for it. And so we’re seeing those types of operations really change how we’re defining the systems. The other thing that those two lead to is a real growth in Edge computing and Edge infrastructure. And in those cases, we’re getting out of traditional data centers and we’re putting computational power into the environment. People talk about things like smart farms or factories; wind farms are real popular examples, or smart cities where every intersection could have a little data center at it that manages the traffic flowing through that intersection. People are getting excited about augmented reality or virtual reality, which is going to require you to have very low latency processing close to where you are. And those environments all would be prime locations where you would say, I need more processing power closer to where I am.

Rob Hirschfield 00:40:29 I’m going to distribute my data center so that it’s local. And that change, where we have to be able to manage and run that infrastructure and power that infrastructure and secure that infrastructure, actually has the potential to really rewrite how data centers are thought of today, where we’re used to big buildings with big cooling and rows and rows of servers and, you know, people with crash carts running around to manage them. We could be moving, and I think we have to be moving, into a world where, while we have that, we also have a lot more 5, 10, 20 machine data centers, powered by very low power ARM systems or secured in a municipal location. Or Walmart has been talked about, like every Walmart could be a data center that runs the whole shopping center around it. We’re moving into a place where we really can decentralize how computation is run. And these other innovations I talked about are key to helping build that. And so, we’re seeing infrastructure, infrastructure management, and then infrastructure as code techniques to manage all of that infrastructure as the future. These are really exciting new ways to think about how we’re building all these things together.

Brijesh Ammanath 00:41:49 Sounds super exciting. So just to summarize, you touched on ARM processors, SmartNICs, IPUs, converged infrastructure and Edge. What does IPU stand for?

Rob Hirschfield 00:42:02 IPU stands for Infrastructure Processing Unit. Some people are calling these things DPUs. There are all sorts of names for these different processing units that we’re adding on to the primary interface, in part because the word SmartNIC is very limiting. It sounds like it’s only a network interface, but the IPU is designed to look at it more as a storage and security and a virtual hypervisor control system. I don’t think the final name on this is set. I think that we’re going to continue to have different vendors trying to come up with their own branded marketing around what this is going to be. So it’s important that people sort of scratch beneath the surface: What does that actually mean? Is that like something else? And think through what they are. Fundamentally, it’s this idea that I have a supervisory computer monitoring, and being maybe the storage interface or the bus interface for, what we’ve traditionally called the main computer. And it’ll also take over what we spent a lot of time talking about: our out-of-band management, our baseboard management controllers, which are BMCs. Those are usually not considered SmartNICs or IPUs. They’re just not wired into the systems enough. They’re just for power management and patching.

Brijesh Ammanath 00:43:20 Obviously, bare metal infrastructure as a service is a very powerful offering with an evolving ecosystem. But if there was one thing a software engineer should remember from the show, what would it be?

Rob Hirschfield 00:43:32 When software engineers are approaching automation, a lot of the automation tools have been designed with a very narrow focus, to accomplish a very narrow scope of work. And I think that we need software engineers to think like software engineers in Ops, DevOps and automation contexts, and really encourage software engineering practice: reuse, modularity, pipelining, where they have dev, test and prod cycles, Git commits and source code controls. That thinking is essential in building really resilient automation. And it’s been missing. I’ve been in the Ops space for decades now, and we haven’t had the APIs or the tools until recently to really start thinking about the software engineering process for automation, and really bringing that to bear, and it’s time. And so what I would hope is that a software engineer listening to this and getting involved in site reliability engineering or automation doesn’t give up there and just start crafting bespoke scripts or one-off modules, but actually goes and looks for ways that they can take more of a platform approach to the automation and create these repeatable processes and infrastructure pipelines that we’ve proven have incredible ROI for customers when they get out of the “do it in a way that only works for me” one-off scripts and very narrowly defined automation layers.

Rob Hirschfield 00:45:12 So I would hope that they look at it as a software engineering problem and a systems problem instead.

Brijesh Ammanath 00:45:18 Was there anything I missed that you’d like to mention?

Rob Hirschfield 00:45:21 This has been a fairly thorough interview. We have covered bare metal pieces. We’ve covered infrastructure as code. I do think there’s one thing that’s worth pointing out. These different types of infrastructure are really not that different. And so I like that we’ve come in and explored the differences between all these systems. At the end of the day, they’re still composed of very similar parts, and we should be able to have much more unified processes where we look at infrastructure much more generically. And so I do think it’s important to sort of reflect back on all of this variation and say, okay, wait a second, I can actually create more uniform processes, and see that happening. And it’s worth noting that we went into very deep detail on a lot of these things, and the details are important. In some ways it’s like knowing how a CPU works. You can use infrastructure without having to worry about some of these nuances. It’s useful information to have because when systems are operating, you understand it better. But at the end of the day, you can work at a higher level of abstraction and then keep going. And I would encourage people to remember that they have the choice to dig into the details, and they should, and also they can enjoy abstractions that make a lot of that complexity go away.

Brijesh Ammanath 00:46:44 People can follow you on Twitter, but how else can people get in touch?

Rob Hirschfield 00:46:49 I am zehicle on Twitter and I’m very active there. That is a great way to do it. They are welcome to reach out to me through RackN and visit the RackN website to do that. You can contact me via LinkedIn. Those are the primary places that I am active, and I do love a good conversation and Q & A on Twitter. So, I would highly, highly suggest that one; if you want to reach me, that’s the easiest way.

Brijesh Ammanath 00:47:13 We have a link to your Twitter handle in the show notes. Rob, thank you for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.

Rob Hirschfield 00:47:24 Thank you Brijesh. [End of Audio]