Site Reliability Engineer

HR POD - Hiring Talent Globally • Lahore, Pakistan

30+ days ago

Job description

Requirements:

5+ years of experience in an SRE, DevOps, or infrastructure engineering role.
Strong experience with AWS or GCP, including services such as EC2, Lambda, S3, and RDS (AWS) or GKE (GCP).
Experience with automation tools like Terraform.
Proficient in at least one scripting language (Python, Bash, Go, etc.).
Solid understanding of Linux systems, networking, and cloud-based architectures.
Experience working with container orchestration platforms like Kubernetes.
Proficient with CI/CD pipelines, preferably with cloud-native tools (e.g., GitHub).
Ability to troubleshoot complex, distributed systems and provide solutions in high-pressure environments.
Ability to communicate effectively with both technical and non-technical stakeholders.

Nice to have:

Exposure to Execution Management Systems (EMS) / Portfolio Management Systems (PMS).

Experience with client-impact triage, working cross-functionally with account managers or product teams.

Proficiency with Datadog or similar observability platforms.

Knowledge of serverless architectures (e.g., AWS Lambda, GCP Cloud Functions).

Familiarity with RDBMS and NoSQL databases, such as RDS, CloudSQL, and DynamoDB.

Prior experience in fintech, trading platforms, or 24/7 financial infrastructure.

Strong understanding of API integrations and how infrastructure issues might manifest in client environments.

Excellent problem-solving and communication skills, with the ability to translate technical incidents into clear client updates.

Experience working with client-facing teams.

Responsibilities:

Ensure the reliability, availability, and performance of production systems, particularly during weekends.

Take ownership of monitoring, troubleshooting, and incident response during weekends and off-hours.

Troubleshoot and resolve critical issues in a fast-paced, high-availability environment.

Automate manual processes and workflows, reducing operational overhead.

Work closely with engineering teams to design and deploy scalable, fault-tolerant infrastructure solutions on AWS or GCP.

Improve observability using monitoring, logging, and alerting systems (e.g., CloudWatch, Datadog).

Lead post-incident reviews, contribute to the continuous improvement of system reliability, and follow up on strategic fixes.

Develop and update runbooks, incident response playbooks, and documentation.

Work closely with Engineering, Product, and Client teams to proactively identify infrastructure pain points that could affect the user experience.

Monitor alert channels, logs, and infrastructure load for the entire stack.

Set up automation for alerting.
