Senior Engineer, IT Infrastructure - 1-year renewable contract
SimplyGo Pte. Ltd. Toa Payoh Temporary
Who We Are
We are SimplyGo Pte. Ltd. (SPL), a wholly owned subsidiary of the Land Transport Authority, and a key player in the Singapore public transport ecosystem. We develop transit ticketing and travel card-related products and services that simplify journeys and create value for our commuters.
We believe great ideas come from people who are curious, driven, and eager to learn. Join us if you are looking for an organisation that allows you to make a meaningful impact and provides the support you need to grow.
What You Will Do
At SPL, you will play a key role in driving impactful initiatives and seeing the real-world results of your contributions. As a Senior Engineer, IT Infrastructure, you will take on a hybrid role spanning Network, Systems, and FinOps engineering.
In this position, you will be responsible for end-to-end monitoring, observability, and service assurance, ensuring reliable and high-performing systems across applications, databases, infrastructure, and network services. You will oversee alert management, incident triage, and support, while maintaining clear visibility of service health through reporting.
Additionally, you will drive cost governance initiatives and collaborate across teams to ensure operational efficiency and alignment.
Key Tasks you will be involved in
- Monitoring, Observability & Service Assurance
- Manage and operate centralized monitoring dashboards and observability platforms across applications, databases, infrastructure (compute and storage), and network environments (on-premises and cloud) supporting 24/7 services.
- Continuously track system health using metrics, logs, and alerts to proactively identify anomalies, performance degradation, and potential issues.
- Alerting, Triage & Coordination
- Respond to alerts and anomalies by conducting initial triage, impact assessment, and cross-system event correlation.
- Coordinate and escalate issues to the appropriate teams (Application, Cloud/Infrastructure, Network, Database) to ensure timely resolution in line with SLAs.
- Monitoring Strategy, Design & Governance
- Define and implement monitoring strategies, including frameworks, alert thresholds, escalation policies, and observability standards.
- Collaborate with engineering teams to onboard systems into monitoring platforms and establish meaningful metrics and alerts.
- Continuously review and refine monitoring frameworks to reduce noise and improve signal-to-noise ratio.
- Service Health & Reporting
- Develop and maintain real-time service health dashboards for operational monitoring and reporting.
- Track and analyze system, network, and cloud availability, performance trends, and recurring incident patterns.
- Support reporting needs for service management, senior leadership, and key stakeholders.
- Incident Support & Service Reliability
- Support major incident management by providing system visibility, diagnostics, and cross-team coordination.
- Identify recurring issues and reliability gaps, driving improvements in system stability, monitoring coverage, and response times.
- FinOps, Cost Monitoring & Governance
- Monitor and manage cloud and infrastructure costs, including AWS usage (compute, storage, data transfer), as well as network and connectivity expenses.
- Implement cost allocation, tagging strategies, and budget monitoring with alerting mechanisms.
- Analyze cost drivers and identify optimization opportunities.
- Cost Reporting & Optimisation
- Prepare and present cost reports, dashboards, forecasts, and trend analyses.
- Partner with engineering teams to optimize resource utilization, recommending rightsizing and cost-saving initiatives.
- Ensure a balanced approach between cost efficiency, performance, and reliability.
- Cross-Team Coordination
- Serve as the central coordination point across Application, Cloud/Infrastructure, Network, and Database teams to align monitoring insights with operational actions.
- Continuous Improvement & SRE Evolution
- Drive initiatives to enhance observability, expand monitoring coverage, and automate alerting and response workflows.
- Contribute to the adoption of Site Reliability Engineering (SRE) practices, including SLIs, SLOs, and error budgets.
- Documentation
- Maintain documentation for monitoring architecture and dashboards, alerting rules and escalation procedures, and cost governance models and reports.
- Operational Support
- Participate in major incident response, critical service monitoring, and provide after-hours support (including weekends and public holidays) as required.
What You’ll Gain
At SimplyGo, our people are at the heart of everything we do.
We ASPIRE to create a supportive and inclusive workplace where you can thrive.
Here’s what you can look forward to:
Professional
- Continuous growth through training, mentorship, and development.
- Career pathways with room to advance and expand your skills.
- A culture of collaboration that celebrates innovation, respect, and teamwork.
- An attractive annual compensation package that values your contribution.
- Comprehensive benefits to support your lifestyle and wellbeing.
- Flexibility with hybrid work options and supportive arrangements.
- Wellbeing initiatives designed to keep you healthy, balanced, and engaged.
How to Apply
Send us your resume with a short note telling us what excites you most about this opportunity to [email protected] by 1 May 2026.
This role might be for you if you have:
- Training in Computer Science, Information Technology, Engineering, or a related field.
- 3 to 5 years of experience in IT operations, system monitoring, NOC, service assurance, or cloud/infrastructure operations.
- Hands-on experience with monitoring and observability platforms.
- Experience working in hybrid environments (on-premises and AWS cloud).
- Strong understanding of system and network monitoring concepts, as well as application and infrastructure health metrics.
- Proficiency with tools such as CloudWatch, Grafana, Prometheus, Splunk, ELK Stack, or similar platforms.
- Ability to analyze and interpret logs, metrics, and alerts effectively.
- Experience with AWS Cost Explorer, budgeting, and tagging strategies, with a good understanding of cloud cost structures and optimization techniques.
- Familiarity with ITIL processes (Incident, Problem, and Change Management), service level management, and observability practices.
- AWS certifications (Associate level or above) or AWS FinOps Certified Practitioner are preferred.
- Strong analytical thinking and problem-solving abilities.
- Ability to correlate events across complex, distributed systems.
- Effective communication and coordination skills, with the ability to perform under pressure during incidents.
- A proactive, detail-oriented, and highly organized approach.
- Strong sense of ownership and accountability, with the ability to work independently and drive initiatives.
- Willingness to provide after-hours support, including weekends and public holidays, for monitoring and critical incidents.