Without observability, Kubernetes is impossible

diginomica.com

Nica Fee, Tue, 11/17/2020 - 05:18

Summary:
True observability makes the benefits of Kubernetes clusters available at full capacity when powering microservices architectures. New Relic’s Nočnica Fee outlines 7 steps to unlock the possibilities

When talking about the need for observability on cloud platforms, which seems to be happening more and more these days, I’m often asked ‘when will this problem be fixed?’ People seem to assume that the problems with tracking the state of Kubernetes, serverless functions, or other aspects of today’s complex cloud architectures are related to bugs or blind spots in the tooling. In reality, observability problems are fundamental to the nature of these complex new ways of running software.

A word about the definition of observability: it isn’t a single product, it’s not synonymous with logging or metrics, and it’s not a simple feature your team can check off. Observability is a measurement of how long, on average, your team spends trying to understand a problem. If you glance at a dashboard and have an immediate idea what’s causing problems, you have great observability. If, however, it takes hours to figure out issues, and outages often end with you manually restarting everything, observability is the first problem you should solve.

Because microservices increase the surface area and frequency of software changes, engineers need better visibility to understand the performance of their cloud-native applications and infrastructure. The great benefits of Kubernetes clusters are only available at full capacity to those who maintain real observability. Although reliability, performance, and efficiency are often portrayed as mutually exclusive, it’s possible to make gains in all three with well-managed Kubernetes. That requires knowing not only when things are performing poorly, but also when you have underutilized resources.

With these 7 steps to Kubernetes observability, you’ll be better prepared to explore, visualize and troubleshoot your entire Kubernetes environment:

1. Understand the overall health and capacity of your nodes
View metrics for applications deployed on Kubernetes. Infrastructure monitoring sounds like a somewhat old-fashioned concept, but if anything it’s more critical on Kubernetes. When analyzing unexpected behaviors and performance issues, the first troubleshooting step is evaluating your cluster’s overall health.

2. Track the dynamic behavior of your cluster, like autoscaling events
Monitor all Kubernetes events to get useful context about dynamic behaviors, including new deployments, autoscaling and health checks. The control plane of Kubernetes determines the real-world performance of your cluster, so tracking dynamic events is critical. You must track API server, etcd, and scheduler stats to know what’s going on with your cluster as a whole.

3. Correlate log data from all services running on Kubernetes
See logs for your clusters in the context of your broader environment to speed troubleshooting. When exploring a critical issue, many developers find themselves moving from logs, to monitoring of overall metrics, over to a tracing tool. Not only does this result in a scattered user experience, it can also make the data very difficult to correlate. Put simply: when we see a spike in response time metrics, it can be hard to find logging from the slowest responses, or to connect distributed traces with relevant logging. Open-source observability projects like OpenTelemetry are actively working to develop ‘logs in context’ to connect logging data with other monitoring tools.

These connections allow engineers to correlate causes, seeing which incident triggered a certain behavior. See point 5 for an exploration of seeing behaviors that, while not necessarily causally linked, are at least correlated in time.

4. Understand how your microservices communicate with each other
Communication between the nodes and pods within a cluster is often harder to track than behavior within a single node. Link Kubernetes metadata with performance data and distributed traces for your applications, whether they are instrumented via New Relic agents, open source tools like Prometheus, StatsD or Zipkin, or standards like OpenTelemetry deployed in Kubernetes clusters. This gives you insight into error rates, transaction times and throughput so you can better understand their performance.

5. Understand service performance through integrated telemetry data
When asked to define ‘observability’ I often use the shorthand “Observability is how quickly you can understand problems with your system.” By that definition, it’s clear that the speed with which you can read metrics from a dashboard has a direct impact on observability. Put simply, all monitoring has a user experience, and the better that user experience is for your engineers, the more quickly they can understand and resolve problems.

6. Correlate performance information with business intelligence
Tracking the value of particular customers, their parent organization, or their usage level of your product may not seem like critical data to have during an outage, but business intelligence can be the key to figuring out a problem. No system, no matter how well architected, successfully treats all users equally, so attributes like parent organization or user geography can reveal patterns that weren’t obvious any other way. This can also help root out some false alarms: during one of the last outages I worked on, repeated alarms for errors in the user experience were quickly resolved when we realized the user in question was a contractor working for us and experimenting with different unsupported requests to our API.

7. Track requests from source to destination
Distributed tracing has grown to be a tool that most engineers expect to have available while researching problems. In an ideal world you’d see every request start in a front end or mobile app and move through your entire system. In reality, even the best systems must do sampling and don’t always cover every step of a request’s path. Still, distributed tracing, the measurement of timing information from all parts of your stack, is incredibly useful when chasing an intermittent bug or trying to improve performance.

It’s critical that any solution lets you view your Prometheus monitoring data alongside telemetry data from other sources for unified visibility, and removes the overhead of managing storage and availability of Prometheus so you can focus on deploying and scaling your software.

In a competitive technology marketplace, Kubernetes is a path to differentiation on uptime, performance, and efficiency. If you wish to deliver better performance than your competitors, an efficiently orchestrated cluster is one of the ways you’ll get there. To reach these performance goals you must maintain consistent insight into how your cluster is really performing. This is especially critical for maintaining efficiency, since close monitoring will help show you when you have excess capacity that can be put to better use.
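
As a rough illustration of the capacity point above, the following Python sketch classifies nodes by how much of their allocatable CPU and memory is in use, so that both overloaded and underutilized nodes stand out. It is a minimal sketch under stated assumptions, not a description of any particular tool: the NodeSample type, its field names, and the 20% and 85% thresholds are hypothetical and chosen only for illustration, and the usage figures would come from whatever metrics source you already run.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class NodeSample:
    """One point-in-time reading for a node (hypothetical shape)."""
    name: str
    cpu_allocatable_millicores: int
    cpu_used_millicores: int
    memory_allocatable_bytes: int
    memory_used_bytes: int


def utilization(used: int, allocatable: int) -> float:
    """Return used/allocatable as a fraction, or 0.0 if nothing is allocatable."""
    if allocatable <= 0:
        return 0.0
    return used / allocatable


def classify_node(sample: NodeSample, low: float = 0.20, high: float = 0.85) -> str:
    """Label a node by its busiest resource.

    The thresholds are illustrative assumptions, not recommended values.
    """
    cpu = utilization(sample.cpu_used_millicores, sample.cpu_allocatable_millicores)
    mem = utilization(sample.memory_used_bytes, sample.memory_allocatable_bytes)
    peak = max(cpu, mem)
    if peak >= high:
        return "saturated"
    if peak < low:
        return "underutilized"
    return "healthy"


def summarize(samples: Iterable[NodeSample]) -> Dict[str, str]:
    """Map each node name to its classification."""
    return {s.name: classify_node(s) for s in samples}


if __name__ == "__main__":
    gib = 2 ** 30
    readings = [
        NodeSample("node-a", 4000, 3600, 16 * gib, 4 * gib),
        NodeSample("node-b", 4000, 400, 16 * gib, 1 * gib),
        NodeSample("node-c", 4000, 2000, 16 * gib, 8 * gib),
    ]
    for name, status in summarize(readings).items():
        print(f"{name}: {status}")
```

Classifying on the busiest resource, rather than an average of the two, is a deliberate simplification: a node that is short on memory is constrained even if its CPU is idle.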


Author: Nica Fee

Date: 2020-11-17

URL: https://diginomica.com/without-observability-kubernetes-impossible
