November 2, 2020

Massive AWS outage was caused by adding new servers to Kinesis

Red-faced Amazon says it will apply lessons learned to improve the reliability of its services

Amazon Web Services (AWS) has revealed the actual cause of the massive outage that impacted thousands of online sites and services, including Amazon's own services, last week.

According to the company, the outage was not driven by any memory issue in the network. Rather, it was triggered by the addition of new servers to the Amazon Kinesis real-time data processing service.

Adding new capacity caused all servers in the Kinesis system to exceed the maximum number of ‘threads’ allowed by an operating system (OS) configuration. To communicate with one another, servers in the Kinesis front-end fleet create threads for each of the other servers in the fleet. The Kinesis system already has “thousands of servers”, according to AWS, and when new machines were added, the thread count on each server rose past the limit allowed by the OS configuration.

This issue resulted in a series of other problems that eventually took down thousands of websites and services, including those of big companies such as Adobe, Flickr, Roku, Twilio and Autodesk.

AWS’s own services were also affected, including ACM, Amplify Console, AppStream2, AppSync, Athena, Batch, CodeArtifact, CodeGuru Profiler, CodeGuru Reviewer, CloudFormation, CloudMap, CloudTrail, Connect, Comprehend, DynamoDB, Elastic Beanstalk, EventBridge, GuardDuty, IoT Services, Lambda, LEX, Macie, Managed Blockchain, Marketplace, MediaLive, MediaConvert, Personalize, RDS Performance Insights, Rekognition, SageMaker and Workspaces.

The multi-hour outage affected the US-East-1 region, according to the company. The problem was fixed by rebooting the entire Kinesis service, which took a while to complete.

Amazon has apologised for the outage and said it would apply lessons learned to further improve the reliability of its services. In the short term, the company plans to move to servers with more powerful CPUs and more memory, which will reduce the number of servers and, with it, the thread count across the fleet. It is also carrying out tests to increase thread count limits in the OS configuration. AWS believes the measure will give an additional safety margin by providing more threads per server. The company also plans to introduce a number of other changes to “radically improve the cold-start time for the front-end fleet”.
“We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet,” AWS said. “In the medium term, we will greatly accelerate the cellularisation of the front-end fleet to match what we’ve done with the back-end.”
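
The thread-limit failure described above can be illustrated with a minimal sketch. It assumes a simplified model in which each front-end server holds one OS thread per peer; the function names, the limit of 4,096 threads and the baseline of 50 threads are hypothetical values chosen for illustration, not figures published by AWS.

```python
# Simplified, hypothetical model: every front-end server keeps one OS thread
# for each of the other servers in the fleet. All numbers are illustrative.

def threads_per_server(fleet_size: int) -> int:
    """Peer threads a single server needs in a fleet of `fleet_size` servers."""
    return max(fleet_size - 1, 0)


def exceeds_limit(fleet_size: int, os_thread_limit: int,
                  baseline_threads: int = 0) -> bool:
    """True if one server's total thread count would pass the OS limit.

    `baseline_threads` stands for threads the server uses for other work.
    """
    return baseline_threads + threads_per_server(fleet_size) > os_thread_limit


def max_fleet_size(os_thread_limit: int, baseline_threads: int = 0) -> int:
    """Largest fleet in which every server still fits under the limit."""
    return max(os_thread_limit - baseline_threads + 1, 0)


if __name__ == "__main__":
    limit, baseline = 4096, 50  # assumed values, not AWS's configuration
    for size in (4000, 4047, 4048):
        print(size, exceeds_limit(size, limit, baseline))
```

Because the per-server thread count grows with the size of the fleet, adding capacity pushes every server towards the limit at the same time. In this model, running fewer, larger servers or raising the OS limit both widen the margin, which matches the short-term measures AWS describes.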

Author: Kundaliya

Date: 2020-12-01

