VetJobs - The Leading Military Job Board

Job Information

Microsoft Corporation Site Reliability Engineer II in Hyderabad, India

Do you have a passion for high-scale services and working with some of Microsoft’s most critical cloud capabilities? We’re looking for a Senior Site Reliability Engineer with the right mix of software development, cloud experience, and passion for quality to envision, design, and deliver solutions for Microsoft's cloud infrastructure.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Are you looking to be at the forefront of Microsoft’s cloud computing transformation? Are you looking to work in an agile environment that ships frequently while maintaining a focus on long-term bets? Do you want to work with state-of-the-art distributed systems that perform near-real-time detection on petabyte-scale telemetry, using machine learning and traditional software to deliver on Cloud Availability and Safety goals? Do you want to make an impact in a team of talented engineers delivering world-class software solutions?

Microsoft Cloud Operations & Innovation (CO+I) is the engine that powers Microsoft cloud services through the operation of our unified global datacenters, enabling 30% of Microsoft revenue through Commercial Cloud ($38 billion in FY20 Q1). The Cloud Infrastructure Health team in CO+IE is focused on improving Customer Availability, Datacenter Safety, and Capacity, and on helping optimize the utilization of datacenter resources using telemetry and insights. Our systems analyze petabyte-scale telemetry data from datacenter critical environments and secondary signals, both in near real time and offline, to produce time-sensitive insights that directly impact Cloud Operations.

Our team is looking for an experienced, competent, and motivated Senior SRE. The Site Reliability Engineering (SRE) team provides leadership, direction, and accountability for application architecture, system design, and end-to-end implementation. As a Site Reliability Engineer, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design. Collaboration skills will be required to work closely with other engineering teams to ensure services and systems are highly stable and performant, meeting the expectations of our customers and users.

The SRE participates in the service design aspects of the Cloud Infrastructure Health (CIH) system and takes primary responsibility for developing code, scripts, systems, and/or tools that reduce operational burden by automating complex and repetitive tasks, such as onboarding system capabilities to newer datacenters and maintaining system capabilities at existing sites. The SRE enables feature teams to increase the velocity at which they can safely deploy changes to production and monitor the effects of those changes across the footprint. The SRE analyzes telemetry data to develop capacity planning models, identify patterns and trends that drive continuous improvement, and highlight opportunities to deploy automation to monitor and manage CIH services across sites. The SRE also participates in on-call rotations to resolve live site incidents, minimize customer impact, and document solutions and insights that inform ongoing improvements to infrastructure, code, tools, and/or processes that prevent the recurrence of similar issues.

Responsibilities

  • Design, develop, and deliver software that reduces operational burden by automating complex and repetitive tasks, such as onboarding system capabilities to newer datacenters and maintaining system capabilities at existing sites

  • Own deployment, availability, reliability, performance and customer escalation targets for Critical Environment Telemetry solutions

  • Proactively identify and reduce issues through design, testing, and implementation of software-based solutions

  • Collaborate with Engineering and Program Management partners to translate customer, business, and technical requirements into architectural designs and feature releases

  • Drive efficiencies through software improvement and root cause analysis, resulting in improved service delivery, maturity, and scalability

  • Work within a highly skilled team of engineers to deliver revolutionary improvements to the system and scale them

Qualifications

Required/Minimum Qualifications:

  • Bachelor of Computer Science or equivalent industry experience

  • 5+ years of professional experience, including 3+ years involving service operations, datacenter operations, monitoring, and reliability improvement.

  • Proven ability to collaborate across teams & organizations.

  • Experience in managing distributed systems and/or cloud platforms a plus.

  • Publications and/or certifications related to cloud technologies a plus.

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.

  • These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check

  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications:

  • Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Doctorate Degree in Computer Science, Information Technology, or related field.

  • Familiarity with one or more general purpose programming languages including but not limited to: Java, C/C++, C#, Python, JavaScript, PowerShell

  • Experience with the Microsoft cloud and/or stack, including O365, Azure, Windows, or other Microsoft software/services

  • Experience leveraging cloud architecture, applying site reliability principles, and/or demonstrating sensitivity to operational concerns

  • Demonstrated ability to debug, fix, and optimize code

  • Full-stack troubleshooting skills across network, application, hardware, management fabric, and distributed services layers

  • Excellent communication skills, both verbal and written

  • #COIcareers

  • #COIEngCareers

  • #COIE_DIODEcareers

Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).
