Network Automation Expert Beginners (ipspace.net)
62 points by zdw on Jan 19, 2023 | 19 comments


I've been building network controllers my entire career and the "hard part" has never been interacting with devices. Like who cares if you're using Ansible? The hard part is configuring these things: every vendor has a different way of configuring the same 5 solutions to the same problem, and even with a single-vendor network you may have to tailor your configs based on the device type that you're configuring.

IMO some higher-level abstractions need to emerge. Half the time I just need a layer 2 network (please don't ask me to configure it one way for point-to-point vs more than two endpoints; if there's an optimization to be made, do it for me). The other half of the time I need a layer 3 network that can peer with BGP (don't ask me what underlay or overlay protocols to use, I only care about the VLANs I'm terminating on).
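
A minimal sketch of what such an abstraction could look like, assuming a made-up `L2Intent` type and two hand-written renderers. The class, the function names, and the inventory format are all hypothetical, and the IOS-style and Junos-style lines are simplified illustrations rather than complete configs. The requester states only the VLAN and the endpoints; the point-to-point vs multipoint decision and the vendor syntax are made by the compiler:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L2Intent:
    """What the requester cares about: a name, a VLAN ID, and the endpoints."""
    name: str
    vlan_id: int
    endpoints: tuple  # (device, port) pairs


def choose_service(intent: L2Intent) -> str:
    # The optimization is picked for the user, never asked of them.
    return "point-to-point" if len(intent.endpoints) == 2 else "multipoint"


def render_ios(intent: L2Intent, port: str) -> list:
    # Simplified IOS-style access port configuration (illustrative only).
    return [
        f"vlan {intent.vlan_id}",
        f" name {intent.name}",
        f"interface {port}",
        " switchport mode access",
        f" switchport access vlan {intent.vlan_id}",
    ]


def render_junos(intent: L2Intent, port: str) -> list:
    # Simplified Junos-style "set" commands (illustrative only).
    return [
        f"set vlans {intent.name} vlan-id {intent.vlan_id}",
        f"set interfaces {port} unit 0 family ethernet-switching "
        f"vlan members {intent.name}",
    ]


RENDERERS = {"ios": render_ios, "junos": render_junos}


def compile_intent(intent: L2Intent, inventory: dict) -> dict:
    """Turn one intent into per-device config lines.

    `inventory` maps device name -> platform key in RENDERERS.
    """
    configs = {}
    for device, port in intent.endpoints:
        render = RENDERERS[inventory[device]]
        configs.setdefault(device, []).extend(render(intent, port))
    return configs
```

Calling `compile_intent(L2Intent("cust-a", 301, (("sw1", "Gi1/0/1"), ("sw2", "ge-0/0/1"))), {"sw1": "ios", "sw2": "junos"})` returns one list of lines per device, so the vendor differences stay inside the renderers.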

Slightly related: cloud providers expose APIs for networks to set up peerings between the cloud provider's network and the network's end customers. These APIs aren't standardized at all, and are a complete PITA for networks that try to integrate their SDN solution with multiple cloud providers. This is one area of complexity that the big guys could clean up without a lot of work.


Agreed - this is more or less the problem statement that OpenConfig was designed to solve (minus the optimisation part). However, the chicken-and-egg problem for vendors is always the same: few people are asking for it, because until every vendor supports it, only a few people will use it.


Well, industry tried before with NETCONF, good luck...


NETCONF doesn’t abstract anything though. All it did was provide a programmatic method for configuring devices.


I don't know about network automation much. Is NAPALM still a thing, is it viable?


> Instead of an in-depth discussion of architectures, data structures, software development methodologies, and challenges of modifying the state of a distributed system, those motivational talks often resulted in a cargo cult of expert beginners focused on low-level tools.

We may still be in the early innings of the network automation game.

One factor that makes network automation hard is that networking is still lagging behind other resources (such as storage, compute, etc.) in openness and modularity.

Networks, in many places, run on proprietary gear from Cisco, etc. that has a custom user interface (modern Cisco devices do have bash).

Many networks are still managed using older protocols such as SNMP although newer protocols such as OpenConfig seem to be gaining ground.

Many companies are trying to disaggregate networks (commodity hardware + Linux) but it still has not taken off that much. Many interesting projects (such as switchdev) are languishing. Big companies such as Facebook, Microsoft, Amazon etc are building their own Network OSes (FBoss, SONiC, DentOS, etc) rather than consolidating under a common standard.

Lack of resources does not help. Even Wikipedia did not have articles on basic networking concepts until recently. The textbooks that teach this subject talk about older abandoned protocols.

We will see improved automation techniques in networking as time progresses.


> The problem is that the more automation we push, the fewer people know how to use the “old school” way to administer stuff.

Yes, that is the irony of automation: when something goes wrong you need a higher level of understanding to fix the problem. No matter how good the automation is, something will go wrong, because there are situations that automation doesn't know how to handle and that you can't prepare for.


I wish I were able to work in a large network environment and see for myself why network automation is needed. Sure, it’s useful, but to what degree? The information highway and a real highway share one trait: not much flexibility is needed, or at least that’s the case for small/mid-size networks.


I work for an ISP in a small country and we have to manage more than 100k network devices. Network automation is absolutely a must even on our rather small scale.


How does network automation help? A few use cases please?


The broader question is probably “what do you mean by network automation”

DHCP could well be described as network automation. Same for LLDP and ARP.


At a certain point it's not “useful” but mandatory, simply to keep functioning. Even small failure and change rates are untenable as device and port counts increase to Very Large numbers. How do you maintain millions of devices? Think on the order of hundreds to thousands of automated remediation workflows for every exception that requires a person in network operations to intervene.
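
A back-of-the-envelope illustration of why small rates stop being small. The 2% figure is an assumption picked for the example (each device needing one manual touch every fifty years on average), not a measured rate:

```python
def interventions_per_day(devices: int, annual_rate: float) -> float:
    """Expected manual interventions per day, if each device needs
    attention `annual_rate` times per year on average."""
    return devices * annual_rate / 365


# Same (assumed) 2% annual rate at three fleet sizes.
for devices in (1_000, 100_000, 1_000_000):
    per_day = interventions_per_day(devices, 0.02)
    print(f"{devices:>9,} devices: {per_day:.2f} interventions per day")
```

With that assumed rate, a thousand devices need a person a couple of times a month, while a million devices generate roughly 55 interventions every day.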

Once you move beyond simple deployment and break/fix there's a huge step function in the change count as well. Networking is so far behind on concepts like “continuous deployment” or software updates it's not funny. And unlike everyone's favorite stateless container of the day, networks take coordination to perform intrusive maintenance.


I mostly agree with this sentiment - working for systems integrators over the years, it is very hard to convince me that most corporate customers need network automation solutions, when on average they deploy a new network maybe once every 5 years, and when the deployment is done, it structurally changes very little until the next hardware replacement.

However, there are definitely businesses (managed service providers, ISPs, and obviously anyone with a large, growing data centre environment) where consistency and accuracy in large-scale deployments are important.


Most managed service providers and ISPs have developed their own simple "automation" tools for managing customers' networks. It's easy when the config is mostly the same across all customers.


Exactly, and the change count (add a new customer with 50 sites, remove an old customer with 100 sites) is big enough to warrant investing time/money in these tools.


Templates, sure. I’m not sure that the type of automation software developers are used to applies.


Everything I see in this area tends to be "here's how to configure ansible to log into a couple of switches and do a show run".

Whoopy do.

This is a refreshing blog, at least confirming my suspicions that nobody doing network automation actually says anything about it.

My team (of 2) run an internal mixed-vendor network of about 150 switches and routers. New sites are deployed from templates, that's easy enough, although it's not "automation". Testing and rollback is pretty tricky when you're talking about physical bits of hardware.

I’ve written our own internal stuff to manage part of the network (managing firewall/nat rules on fortigates, for example), and obviously there's a great deal of useful open tools out there to report the current state of the network (rancid, librenms, portmappers). But actually making atomic, attributed changes is tricky and relies on the person, say, changing an interface from vlan 301 to vlan 351 to record the reason in the change log (the description says what should be on the end, plus a jira reference for more about it).

Currently I’m trying to automate vlan creation on our larger sites. We tend to have a standard dual pair distribution (not stacked - that’s a single point of failure I’ve seen take us out time and time again) vrrp based site with say 30 access switches hanging off.

Creating a new vlan involves 1) adding to ipam (vlan and the ip range, then adding the individual hosts)

2) creating the vlans on the core switch

3) adding the vlans to the interface between the two cores

4) adding the l3 interface (including setting the right priority for which core I want to be the normal master), setting spanning tree and pim priorities to match

5) configuring dhcp

6) adding vlans to specific trunk ports to the access switches

7) adding vlans to the access switches and their uplinks

And then often

8) adding vlans to vm trunk interfaces

9) creating networks on vms

It’s a cesspit of manual errors, and takes an age, almost as bad as punching holes in the firewalls.

Automation for automating bgp adjacencies? Meh. There’s very little time saving. For setting vlans on access ports on agile deployments? Again we have web based tools for that to let end users do it, but that's not a pr/test/commit/apply/test/rollback workflow.

Currently my CI process pulls the show run from the Ciscos and works out what needs to happen to perform the steps, if any, then puts out the instructions as an artifact. I guess the next step would be a manual push of the artifact from the CI, and then testing that the IPs actually are resolvable, maybe creating a test VM on the network (in the case the vlan is pushed to a VM host).
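
A rough sketch of that "pull show run, work out what needs to happen" step, covering only VLAN creation and trunk membership. The function names are made up for illustration, and it assumes IOS-style syntax where VLANs appear as top-level `vlan N` lines in the running config (which is not the case on every platform or VTP mode):

```python
import re


def existing_vlans(show_run: str) -> set:
    """VLAN IDs defined by top-level 'vlan N' lines (ranges not handled)."""
    return {int(m.group(1))
            for m in re.finditer(r"^vlan (\d+)\s*$", show_run, re.MULTILINE)}


def parse_vlan_list(text: str) -> set:
    """Expand a list such as '1,300-310' into a set of VLAN IDs."""
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            ids.update(range(int(low), int(high) + 1))
        else:
            ids.add(int(part))
    return ids


def trunk_allowed(show_run: str, interface: str):
    """Allowed VLANs on a trunk, or None when no allowed list is set
    (the trunk then carries all VLANs)."""
    allowed = None
    in_block = False
    for line in show_run.splitlines():
        if not line.startswith(" "):
            # Top-level line: are we entering the interface we want?
            in_block = line.strip() == f"interface {interface}"
        elif in_block:
            m = re.match(
                r"\s+switchport trunk allowed vlan (?:add )?([\d,\-]+)", line)
            if m:
                allowed = (allowed or set()) | parse_vlan_list(m.group(1))
    return allowed


def plan_vlan(show_run: str, vlan_id: int, name: str, trunks: list) -> list:
    """Config lines still needed; an empty list means nothing to do."""
    steps = []
    if vlan_id not in existing_vlans(show_run):
        steps += [f"vlan {vlan_id}", f" name {name}"]
    for interface in trunks:
        allowed = trunk_allowed(show_run, interface)
        if allowed is not None and vlan_id not in allowed:
            steps += [f"interface {interface}",
                      f" switchport trunk allowed vlan add {vlan_id}"]
    return steps
```

Because `plan_vlan` returns an empty list when the device already matches, the same CI job can be rerun after a push as a cheap "did it apply" check.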

Trouble is the testing and rollback process is extremely tricky, especially with Cisco IOS based kit. Network kit can’t be easily or cheaply virtualised (I tried using terraform to deploy a small copy of our network onto an AWS bare-metal system, but you need $5/hour of bare metal to run gns3). We can't afford to have an identical network setup, so any test system would not scale to the size where automation would tease out problems.

I'd love to see how other people do it, but when I talk to the department which looks after our external networking they just ssh into switches and make changes and hope for the best. Online as I say it's all people with ansible running "show version" and then saying "the rest is left as an exercise to the reader". Vendors want to sell proprietary management systems which don't seem to be the kinds of things you can drive from a gitops workflow. Maybe some SDN overlay network can be automated in some fashion, but you're basically learning a new CLI rather than implementing a test-based automation system.


I wouldn’t let a script take over steps 2, 3, 4, 6, 7 and 8.

Firewall rule management is where automation can shine.


> I'd love to see how other people do it

I worked in this space for over a decade, doing some pretty complex things along the way, like mpls cutovers involving multiple devices and qos deployments, across tens of thousands of devices. Time savings and eliminating the errors of manual deployments were the big drivers. I ended up refining a set of tools that I used and releasing them as a set of perl cpan modules, Mnet [0].

The goal was to be able to make automation scripts that can ssh into devices, parse command outputs, generate reports and debug logs, prepare and push new configs, and record and replay tests. The sample script in the readme serves as a quick walkthrough and a good starting project template. You can run your scripts on one device, do offline replays during development, or concurrently batch process a list of devices.
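
To show the record/replay idea in isolation, here is a small Python sketch. It is not Mnet's API (Mnet is perl); the `Session` class and the `run_command` callable are hypothetical, with `run_command` standing in for whatever actually sends a command over ssh:

```python
import json


class Session:
    """Run commands live and record them, or replay a recorded transcript."""

    def __init__(self, run_command=None, replay_file=None):
        if (run_command is None) == (replay_file is None):
            raise ValueError("pass exactly one of run_command or replay_file")
        self._run = run_command
        self.transcript = {}
        if replay_file is not None:
            with open(replay_file) as handle:
                self.transcript = json.load(handle)

    def command(self, text: str) -> str:
        if self._run is None:
            # Replay mode: a KeyError here means the script sent a
            # command that was not part of the recorded session.
            return self.transcript[text]
        output = self._run(text)
        self.transcript[text] = output
        return output

    def save(self, path: str) -> None:
        """Write the recorded command/output pairs for later replay."""
        with open(path, "w") as handle:
            json.dump(self.transcript, handle, indent=2)
```

The parsing and config-generation code only ever calls `session.command(...)`, so the same script runs unchanged against a live device or against a saved transcript during development and regression testing.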

Reporting and testing are key to successful deployments. First you'll want to report on the current configs, flagging anomalies or anything unexpected. Early in a project you are looking for one-offs in reports, to make sure your script exits if it encounters something it can't handle. You'll also want to identify the different scenarios you'll encounter. You can have the script generate an output config for these different scenarios, for review.

You can manually run the script on a device and have it record the ssh session and config output to a file, which the script can replay as a regression test. You can add a --deploy option to your script along with code to push the new config to a device, reporting on success or failure. Rollback procedures may vary by project; often we handled errors manually, using script output of the prior and new configs along with the detailed log outputs.

We'd deploy to a couple of devices first, then a few devices for every scenario we coded for, then larger numbers as we gained confidence, deploying only to devices that had no errors in the reports. Following the above practices we didn't end up with many rollbacks. One of our most complex scripts, for qos deployments, averaged maybe a dozen errors per batch of a thousand devices, mostly transient offline devices or authentication errors.
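
The staged rollout described above can be sketched roughly like this. The record layout (`name`, `scenario`, `errors`) and the default wave sizes are assumptions for the example, not something the original tooling prescribes:

```python
def rollout_waves(devices, pilot=2, per_scenario=3, batch=1000):
    """Split report-clean devices into deployment waves.

    `devices` is a list of dicts with 'name', 'scenario' and 'errors'
    (a list of problems found by the read-only report run).
    Returns a list of waves, each a list of device names.
    """
    # Only devices whose report came back clean are eligible.
    remaining = [d for d in devices if not d["errors"]]

    # Wave 1: a couple of devices.
    waves = [remaining[:pilot]]
    remaining = remaining[pilot:]

    # Wave 2: a few devices for every scenario the script handles.
    taken = {}
    sample, rest = [], []
    for device in remaining:
        count = taken.get(device["scenario"], 0)
        if count < per_scenario:
            taken[device["scenario"]] = count + 1
            sample.append(device)
        else:
            rest.append(device)
    waves.append(sample)

    # Later waves: everything else, in larger batches.
    for start in range(0, len(rest), batch):
        waves.append(rest[start:start + batch])

    return [[d["name"] for d in wave] for wave in waves if wave]
```

Devices with report errors never appear in any wave, which matches the "deploy only to devices that had no errors" rule; they stay on a list for a human to look at.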

I had some experience with hp openview and ansible, but much preferred my tools. The test record/replay functions are so important to doing anything non-trivial, and were missing in these other systems. The perl code I have is also a lot faster, and perl is well suited to dealing with text-based network device configs.

The vlan creation project you mentioned is do-able, but definitely has some meat on it, and is worth the effort if it would be used often enough. You'd need to identify the minimum inputs needed, and what extra info needs to be harvested from the network. A single script can log into multiple devices, as long as it has credentials available. Perhaps you have a hardcoded list of core switches, or follow routing tables from the targeted access switch to the connected core devices. I'm not sure how much variability you might encounter besides the vm trunks and vms stuff you mentioned, such as ios vs nexus devices using slightly different commands. You don't need to handle everything, but you might want the script to identify those scenarios it can't handle. You'd probably want to check that the new vlan doesn't exist on the access or core switches, that the access port is not already assigned a vlan or l3 address, that the l3 address is not already in use elsewhere, that the uplinks can be identified and are already trunked, etc. Perhaps after the deployment you run show commands to check spanning tree, routes, etc. Ideally the deployment engineer can run the script read-only, verify everything looks good, then --deploy and get an 'ok' back. If done well the script serves to document the process.
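
Those pre-checks and the read-only default could be laid out something like this. It is a sketch only: the `state` dict stands in for data a real script would harvest with show commands, `push` stands in for the code that sends config to devices, and every name here is hypothetical:

```python
import argparse


def precheck(state, vlan_id, access_port, gateway_ip):
    """Return a list of problems; an empty list means safe to proceed."""
    problems = []
    if vlan_id in state["core_vlans"] or vlan_id in state["access_vlans"]:
        problems.append(f"vlan {vlan_id} already exists")
    if state["port_vlan"].get(access_port) is not None:
        problems.append(f"{access_port} is already assigned a vlan")
    if gateway_ip in state["l3_addresses"]:
        problems.append(f"{gateway_ip} is already in use")
    if not state["uplinks"]:
        problems.append("no uplinks identified")
    for uplink, is_trunk in state["uplinks"].items():
        if not is_trunk:
            problems.append(f"uplink {uplink} is not trunked")
    return problems


def main(argv, state, push):
    parser = argparse.ArgumentParser(description="create a vlan (sketch)")
    parser.add_argument("--vlan", type=int, required=True)
    parser.add_argument("--port", required=True)
    parser.add_argument("--gateway", required=True)
    parser.add_argument("--deploy", action="store_true",
                        help="push changes; without it the run is read-only")
    args = parser.parse_args(argv)

    problems = precheck(state, args.vlan, args.port, args.gateway)
    if problems:
        # Refuse to continue on anything the script cannot handle.
        for problem in problems:
            print("ERROR:", problem)
        return 1
    if not args.deploy:
        print("read-only: checks passed, rerun with --deploy to push")
        return 0
    push(args)
    print("ok")
    return 0
```

Making read-only the default means a mistyped invocation can only ever produce a report, and the engineer has to opt in to changing anything.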

Feel free to email me, my contact info is in cpan. I'm game to help.

[0] https://metacpan.org/pod/Mnet



