- Lead sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities
- Provide guidance to other team members on managing end-to-end availability and performance of mission-critical services, on building automation to prevent problem recurrence, and on building automated responses for non-exceptional service conditions.
- Building network and systems automation software for managing a multi-tenant cloud infrastructure
- Debugging complex problems across the full stack and creating solid solutions, including identifying root causes and delving deeper into Root Cause Analysis efforts on network incidents. A strong network background is good to have.
- Automating work across a variety of infrastructure needs such as testing, failover, policy modifications and deployment.
- Writing, updating, and using documentation, including runbooks/playbooks. The ability to respond consistently by regularly creating runbooks/playbooks, with an eye towards additional automation opportunities in the environment, is a must-have skill.
- 7-10+ years of experience designing and building distributed software systems.
- BS/MS degree in Computer Science or a related area (or equivalent experience)
- Demonstrated ability to write code in a mainstream systems programming language such as C, C++, Go, Python, Java, or Rust.
- Demonstrated ability to use, design, and implement maintainable APIs, including use of tools such as Git, NetBox, CloudVision Portal, SaltStack, VictoriaMetrics, SNMP, and HashiCorp Vault.
- Practical experience with asynchronous programming, type safety, threading models, and state machines.
- Understanding of underlying Linux internals: kernel scheduling, memory management, and networking subsystems.
- Understanding of networking protocols such as IP, IPv6, BGP, HTTP, ICMP, and tunneling protocols (VXLAN, Geneve, GRE) in a multi-vendor environment, as implemented on platforms such as Arista, Cumulus, Cisco, HP, Palo Alto, and others.
- Understanding of data persistence (SQL or similar).
- Understanding of secure communication protocols (mutual-TLS, IPsec, or similar).
- Demonstrated ability to reach cross-functional consensus without having all the details
- Experience in a Hyperscale Cloud Service Provider (public-facing or not)
- Experience with high-level compiled languages such as Go or Java
- Experience with Kubernetes and/or distributed task scheduling
- Experience with host security services and security principles such as TPM, TXT, SecureBoot
- Knowledge of SRE principles (observability, SLOs, SLIs, logging, etc.)
- Knowledge of software interface design & documentation for less technical end-users
Santa Clara, CA - United States of America
Rust Job Details
Position: Sr. Network Site Reliability Engineer
Location: Santa Clara, California (Hybrid role)
Duration: 6-12+ months, contract-to-hire (CTH)
Sr. Network Site Reliability Engineer - Hybrid role
Design, build, and operate scalable software systems to manage the client's network infrastructure
What we need to see:
Ways to stand out from the crowd: