At the heart of Netflix technology is the Cloud Computing platform, which serves as the distributed systems foundation for Netflix application development. We are building a job and resource scheduling engine for container-based workloads on top of the public cloud that powers Netflix. This system manages both service and batch jobs across multiple regions of the world. At this scale, we launch over 1 million containers per week on thousands of underlying container hosts, and we leverage the elastic cloud to optimize efficiency through advanced bin packing and capacity bursting.
We architect our system to be highly available, fault tolerant, and distributed from the ground up. We invest deeply in reliability improvements to support the scale and business criticality of our container applications. Operational automation, testing, and performance improvement are critical to the success of the container platform. We are looking to expand the team with software developers who can not only advance the functionality of the platform but also keep a strong focus on the operational challenges of keeping the platform reliable as it continues to scale.
For more information on the Netflix container platform, see our recent techblog post and our most recent public presentation.
What we are building:
- We extend Linux and container runtimes to provide isolation and deep integration with Amazon EC2 networking and security. We integrate the container execution environment with other critical Netflix infrastructural systems.
- We provide advanced scheduling across both service and batch jobs (capacity management, bin packing for efficiency, anti-colocation for high availability, cross workload optimization, etc.).
- We focus on driving a consistent and fault-tolerant control plane. We drive all parts of the system to be operationally resilient and capable of worldwide scale in support of all Netflix users.
Skills we are looking for:
- Passion and demonstrated experience in improving the reliability and operational automation of complex, multi-tier systems. SRE experience is a big plus.
- Experience beyond usage of container management platforms (Mesos, Swarm and Kubernetes) and container runtimes (Docker and rkt). Specifically, we are looking for developers who have extended and improved these platforms.
- Experience with addressing performance issues across the whole stack from applications to operating systems.
- Good understanding of OS fundamentals, Linux internals and shell programming.
- Experience building business-critical, large-scale systems with extreme availability.
- Ability to program in the core project languages, Java and Golang.
Netflix offers a unique culture that values freedom and responsibility. You can learn more on our jobs page.