Site Reliability Engineer

Collective Health

Chicago, IL, US
  • Job Type: Full-Time
  • Function: IT
  • Post Date: 02/14/2021
  • Company Address: 85 Bluxome St, San Francisco, CA, 94107

About Collective Health

Collective Health is the world's first cloud-based employer self-insurance platform. We enable employers to sponsor employee health care on their own terms by extending the benefits of self-insurance to companies with 100 or more employees, and providing a level of customer service and care that has not been available in the health insurance industry before.

Job Description

We all depend on healthcare throughout our lifetimes, for ourselves and for our families and friends, but it is notoriously difficult to navigate and understand. Healthcare comprises 20% of the US economy, and we think it should work better for all of us. At Collective Health we believe it’s time for a new day in healthcare, where as members we are informed and empowered to make the right care choices when the decisions are urgent and critical.

Site Reliability Engineering at Collective Health is a discipline combining software and systems engineering skills. We apply modern infrastructure, systems, software, architecture, and development practices to give our customers a more reliable healthcare management experience. By designing solutions for reliability, automating and simplifying to reduce toil, and normalizing a robust incident response procedure that resolves uncovered problems, we unlock development velocity so that we can deliver reliable services that make a real difference in healthcare.

Embedded in an engineering team, Site Reliability Engineers gain deep localized functional and technical domain knowledge, which they use to build solutions and improve outcomes for their embedded team. As a broader team of Site Reliability Engineers, we collaborate and identify themes and solutions to benefit Collective Health at large, engage in regular knowledge sharing activities and retrospectives, and relentlessly support one another in order to gain knowledge, remove barriers, and grow as individuals and a team.

Responsibilities:

  • Measure and monitor availability, latency, and efficiency to build an overall picture of system health
  • Scale systems through automation, and evolve systems by advocating for changes that improve reliability and velocity
  • Engage in and improve the development lifecycle of applications—from concept and design, through commit to production deployment, and beyond into operation and iteration
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
  • Practice sustainable incident response and blameless postmortems

Minimum qualifications:

  • BS degree in Computer Science or a related technical field involving systems engineering and/or coding, or equivalent practical experience
  • Passionate about solving challenging problems
  • Experience in one or more of the following programming languages: Java, Go (golang), Python, C, C++, Perl, Ruby or shell scripting
  • Expertise in management and use of relational databases
  • Experience in diagnosing and resolving incidents that involve application, OS, network, infrastructure, partners, people, and process
  • Experience with algorithms, data structures, complexity analysis and software design
  • Methodical problem-solving approach, coupled with strong communication skills and an ability to own and drive projects to completion

Desired qualifications:

  • Experience with Linux internals and/or network administration (e.g., filesystems, system calls, signals, process states, TCP/IP, routing, AWS VPCs, Firewalls, AWS Security Groups, IP Block Management)
  • Experience with batch (e.g., daily, weekly, monthly) workloads and how they influence design and reliability concerns
  • Experience in troubleshooting and resolving performance issues in relational databases such as Postgres
  • Experience in architecture of solutions that integrate with third party providers
  • Expertise in debugging and optimizing systems, and automating routine tasks
  • Interest and expertise in designing, analyzing and troubleshooting distributed systems and APIs
  • Experience in container build, management, and orchestration
  • Ability to use localized technical and functional understanding to critically think about, prioritize, and advocate for efforts that will be most beneficial for a team

Collective Health is a technology company simplifying employer healthcare to make health insurance work for everyone. With more than 200,000 members and over 45 enterprise clients—including Pinterest, Red Bull, Restoration Hardware, Activision Blizzard, and more—our technical and customer experience teams are reinventing the healthcare experience for forward-thinking employers and their people across the U.S.

Collective Health is headquartered in San Francisco, CA, with additional offices in Chicago, IL, and Lehi, UT. Founded in 2013, Collective Health is backed by the SoftBank Vision Fund, DFJ Growth, PSP Investments, NEA, GV, G Squared, Founders Fund, Maverick Ventures, Mubadala Ventures, Sun Life, and other leading investors. For more information, visit us at

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Disclaimer: Local Candidates Only
This company does NOT accept candidates from outside recruiting firms. Agency contacts are not welcome.