Site Reliability Engineering

Globality

London, GB
  • Job Type: Full-Time
  • Function: IT
  • Post Date: 06/08/2021
  • Website: www.globality.com/en-us
  • Company Address: 8 Homewood Pl., Menlo Park, California 94025, US

About Globality

Globality’s vision is to unlock world-class services.

Using innovative AI technology built upon a constantly expanding knowledge foundation with millions of data points, Globality ensures a level playing field so companies get the best service providers at the right price for every project. This inclusive approach also replaces time spent searching with time spent doing, giving your business an immediate return on investment.

Job Description

At Globality, we’re proud to embody the core values of innovation, collaboration, and trust in both our culture and product.

We’re creating ground-breaking technology utilizing a world-class, AI-powered Platform that revolutionizes how businesses buy and sell services. We are an open, inclusive, and diverse organization and our employees are at the heart of the great products we create.

We’ve raised over $310M and are supported by an impressive group of prominent investors, including Al Gore and SoftBank Vision Fund. Our co-founders, Joel Hyatt and Lior Delgo, are seasoned entrepreneurs who bring extensive business-building experience to our organization. Our board includes Dennis Nally (former Global Chairman of PwC) and Ron Johnson (former SVP of Apple).

We’re excited to deliver the best in both innovative technologies and customer-focused experiences to realize our mission of creating a more inclusive global economy. Come help us build something great!

Role Summary:

Site Reliability Engineers (SREs) work as a unit within the broader Production Engineering team and are responsible for keeping all customer-facing services and other Globality production systems running smoothly.

SREs are a blend of pragmatic technical operators and tooling craftspeople who apply sound engineering principles, operational discipline, and mature automation to our production environment and the Globality codebase. We have a DevOps-driven culture, and the team is particularly interested in improving insight into our product stack, automation tooling, and scalability.

Globality's product stack is unique and brings unique challenges: it is an AI-powered microservices platform that revolutionizes how businesses buy and sell services. The experience of our team feeds back into other engineering groups within the company, driving continuous product improvement.

As an SRE you will:

  • Be part of the team responsible for managing an enterprise-grade AI-driven data and messaging platform.
  • Protect the health of the Production environment.
  • Be on the (non-overnight) on-call rotation to respond to Globality availability incidents and provide support for other customer-impacting incidents.
  • Use your on-call shift to prevent incidents from ever happening.
  • Run our infrastructure with tools like Spinnaker, Terraform, and Kubernetes.
  • Help make our monitoring alert on symptoms rather than on outages.
  • Document every action so your findings turn into repeatable actions…and then into automation.
  • Work with the Infrastructure and QA/TestEng teams to make the deployment process as efficient and boring as possible.
  • Design, build, and maintain core production infrastructure pieces.
  • Work with the architects to implement the baseline technologies, policies, and practices needed to build a high-velocity, high-security, strong-compliance platform that allows Globality to scale to support exponential growth.
  • Keep a keen eye on security issues in every project you work on, and contribute to improving the security of systems already in place.
  • Debug production issues across services and levels of the stack.
  • Help plan the growth of Globality's infrastructure.
  • Establish strong relationships with other teams in order to positively influence them in their pursuit of automation and toil reduction, and to keep the rest of our team apprised of upcoming initiatives.

You may be a fit for this role if you:

  • Think deeply about edge cases, points of failure, failure modes, and systemic behaviors.
  • Embrace a DevOps philosophy.
  • Know your way around Linux and the command line.
  • Feel comfortable working toward a seamless end-to-end CI/CD pipeline, with a goal of delivering code into production as swiftly as possible, while working with the QA/TestEng and Infrastructure teams to ensure that code is production-worthy.
  • Have strong programming skills in Python, Go, Ruby, or similar languages.
  • Maintain “production grade” adherence to best practices, even for the lowliest tools and scripts.
  • Embrace collaboration and are comfortable with communicating asynchronously.
  • Are driven to document, document, document so you don't need to learn (or teach) the same thing twice.
  • Have an enthusiastic, driven, go-for-it attitude, and are compelled to fix broken things and improve less-than-ideal things.
  • Have experience with Drone.io, Jenkins, Docker, Kubernetes, Terraform, Elasticsearch, or similar technologies.
  • Have experience using the advanced tools of AWS, GCP, or other cloud providers.

Projects you could work on:

  • Improve production infrastructure automation with Ansible or Terraform.
  • Broaden the scope of our metrics collection or improve our metrics-driven monitoring.
  • Work with the QA / Test Engineering team to fully pipeline our internal tools.
  • Work with Test Engineering on scale testing initiatives.
  • Reduce the noise-to-signal ratio in our alerting.
  • Develop a relationship with a product group, define their SLOs, help analyze our metrics data against those SLOs, and improve their reliability.

Leveling of Site Reliability Engineers at Globality

Areas of expertise/contribution for up-leveling:

Technical:

  • Use Ansible to efficiently manage our infrastructure
  • Further our "Infrastructure as Code" mission using Terraform and CI/CD-focused automation
  • Administration of a variety of high-availability clusters.
  • Firm grasp of Metrics and Monitoring systems, Grafana visualization implementation, and delivery of well-targeted alerting with Slack/PagerDuty integrations.
  • Logging infrastructure
  • Backend storage management and scaling
  • Disaster Recovery and High Availability strategy
  • Script / tool authoring
  • Knowledge of Globality product stack and service interoperations
  • Contributing to code in Globality

Execution:

  • Team organization and planning
  • Issue, Epic, OKR/KPI leadership and completion

Collaboration and Communication:

  • Creating blog posts / Confluence articles
  • Completing Root Cause Analysis (RCA) investigations
  • Contributions to handbook, runbooks, general documentation
  • Leading and contributing to designs for issues, epics, KPIs
  • Improving team practices in handoffs of work and incidents

Influence and Maturity:

  • Involvement in hiring process – developing/reviewing questionnaires, involved in interviews, qualifying candidates
  • Knowledge sharing, mentoring
  • Accountability, self-awareness, handling conflict in the team and receiving feedback
  • Maintaining good relationships with other engineering teams in Globality that help improve the product

Levels for Site Reliability Engineer

Site Reliability Engineer I

Are early-career Site Reliability Engineers who are expected to work toward:

Technical:

  1. General knowledge of at least 4 of the areas of technical expertise with deep knowledge in at least 1 area
  2. Are able to write basic scripts and alter existing scripts

Execution:

  1. Provides timely responses to requests from Globality teammates, reacts to alerts from monitoring, and escalates appropriately when needed.
  2. Proposes ideas and solutions within the Production Engineering team to reduce the workload through automation.
  3. Executes solutions within the production ecosystem to reach specific goals agreed upon within the team.
  4. Executes configuration change operations at the infrastructure level.
  5. Actively looks for opportunities to improve the availability and performance of the system by applying the knowledge gained from monitoring and observation.

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation or in runbooks, explaining the ‘why’ and ‘how’, not stopping with the ‘what’.
  2. Does not allow outdated/deprecated information to go un-flagged.

Influence and Maturity:

  1. Shares gained knowledge readily with the team, either by creating issues that provide enough context for anyone to understand them or by writing Confluence articles.
  2. Contributes to the hiring process by being part of the interview team to evaluate SRE candidates for team fit

Site Reliability Engineer II

Are experienced Site Reliability Engineer I’s who meet the following criteria:

Technical:

  1. General knowledge of 5+ of the areas of technical expertise with deep knowledge in at least 2 areas.
  2. Are able to write well-crafted scripts and basic tools

Execution:

  1. Provides emergency response, either by being on call or by reacting to symptoms surfaced by monitoring, and sees issues through to resolution or escalates as appropriate.
  2. Proposes ideas and solutions within the infrastructure team to reduce the workload through automation.
  3. Plans, designs, and executes solutions within the infrastructure team to reach specific goals agreed upon within the team.
  4. Plans and executes configuration change operations at both the application and the infrastructure level.
  5. Actively looks for opportunities to improve the availability and performance of the system by applying the knowledge gained from monitoring and observation.

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation or in runbooks, explaining the ‘why’ and ‘how’, not stopping with the ‘what’.
  2. Does not allow outdated/deprecated information to go un-corrected.

Influence and Maturity:

  1. Shares gained knowledge readily with the team, either by creating issues that provide enough context for anyone to understand them or by writing Confluence articles.
  2. Contributes to the hiring process by being part of the interview team to qualify SRE candidates

Senior Site Reliability Engineer I/II

Are experienced Site Reliability Engineer IIs who meet the following criteria:

Technical:

  1. Deep knowledge in 2+ areas of expertise and general knowledge of all areas of expertise. Capable of mentoring SRE-Is in all areas and other SREs in their area of deep knowledge.
  2. Are able to design and build tools to improve the management of the production environment and/or infrastructure
  3. Are able to contribute small improvement PRs to the Globality codebase to resolve issues

Execution:

  1. Identifies significant projects that result in substantial cost savings or revenue
  2. Identifies changes for the product architecture from the reliability, performance, and availability perspective with a data-driven approach.
  3. Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making Globality cheaper to run.
  4. Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of the resulting incidents.
  5. Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:

  1. Knows a domain really well and radiates that knowledge through recorded demos, discussions in ProdEng design meetings, or Incident/Root-Cause Reviews.
  2. Performs and runs blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

Influence and Maturity:

  1. Sets an example for the team of SREs through positive and inclusive leadership and discussion of work.
  2. Contributes to the hiring process by being part of the interview team to qualify SRE candidates
  3. Shows ownership of a major part of the infrastructure.
  4. Trusted to de-escalate conflicts inside the team

Staff Site Reliability Engineer

Are Senior SREs who meet the following criteria:

Technical:

  1. Able to conceptualize, design, and create innovative solutions that push Globality's technical abilities ahead of the curve
  2. Deep knowledge of Globality and of 4 areas of expertise, with enough knowledge of every area of expertise to mentor and guide other team members in those areas.
  3. Contributes to Globality codebase to resolve issues and add new functionality
  4. Makes significant modifications to open-source tooling, or builds major tooling from scratch, to deliver a best-of-breed implementation of our production ecosystem.

Execution:

  1. Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
  2. Measures the risk of introduced features to plan ahead and improve the infrastructure.
  3. Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems
  4. Leads significant project work for KPI level goals for the team

Collaboration and Communication:

  1. Works with engineers across the whole company, influencing design to create features that will work well in multi-region/multi-cloud, massively scaled implementations
  2. Runs RCAs and epic level planning meetings to get meaningful work scheduled into the plan

Influence and Maturity:

  1. Writes in-depth documentation that shares knowledge and radiates Globality technical strengths
  2. Has a high level of self-awareness
  3. Trusted to de-escalate conflicts inside and outside the team
  4. Routinely has an impact on the broader Engineering organization
  5. Helps to develop other team members into more senior levels and leaders in the team

 

We are an equal opportunity employer and a participant in the E-Verify program. We believe diversity makes teams better and that discrimination based on race, gender, or anything else is self-defeating.

Disclaimer: Local Candidates Only
This company does NOT accept candidates from outside recruiting firms. Agency contacts are not welcome.