Kubernetes clusters for AI/ML apps

Tuesday, March 29, 2022

Rafay Systems has announced an expansion of its solution for operating Kubernetes clusters, which lets teams launch and manage Kubernetes clusters for AI and ML apps. The company also announced a new GPU dashboard that displays critical GPU metrics so developers can monitor and improve performance.

Quickly launch and easily manage production-grade Kubernetes clusters for AI and machine learning applications at scale with Rafay.

Rafay Systems, the platform provider for Kubernetes Operations, announced the expansion of the industry's only turnkey solution for operating Kubernetes clusters with GPU support at scale by adding powerful new metrics and dashboards for deeper visibility into GPU health and performance.

The Rafay Kubernetes Operations Platform (KOP) now features a fully integrated GPU Resource Dashboard that visualizes critical GPU metrics so developers and operations teams can seamlessly monitor, operate, and improve performance for GPU-based container workloads, all from one unified platform.

Manage & Launch Kubernetes clusters for Artificial Intelligence & Machine Learning apps with Rafay

Kubernetes has rapidly become the preferred orchestration layer for enterprises that need the ability to provision and operate GPU-enabled, AI, and machine learning applications in the cloud and at edge/remote locations.

According to Gartner's 2022 report Emerging Technologies: Edge Technologies Offer Strong Area of Opportunity, Adopter Survey Findings, "The primary objectives for respondent organizations investing in and adopting edge technologies are to improve employees productivity (41%) and automate business processes (39%). This aligns with existing Gartner research (see Emerging Technologies: Use-Case Patterns in Edge AI) that edge AI is being used to improve business processes, delivering automation and productivity gains that translate into measurable ROI, such as cost savings."

However, as enterprises rapidly increase the number of AI and machine learning workloads, they must address challenges such as visibility and monitoring to prevent significant delays in application deployment and the wasted cost of idle or underperforming GPUs in their clusters.

For example, a factory that increasingly relies on real-time video detection applications powered by AI needs a standardized approach for cross-functional teams to manage the IT infrastructure and applications. The following challenges often result in operational fragility and a lack of repeatability that hinder productivity:

Flawed or overly restrictive access and visibility for developers and operational personnel who need GPU metrics on demand to tune and optimize GPU workloads.

The struggle of hiring or training a team of experts and spending months to develop, operate and maintain a customized monitoring infrastructure to scrape and centrally aggregate GPU metrics.

The complexity of developing and maintaining an integration with corporate single sign-on (SSO) systems to provide role-based access to metrics and dashboards.

Accounting for the organization's GPU-enabled workloads that are developed and maintained by external entities (e.g., partners and ISVs). These entities also need visibility into GPU metrics to ensure the workloads are performing optimally.

Rafay KOP solves these challenges by providing enterprises and trusted external entities with a zero-touch experience for automated, centralized aggregation of critical GPU operational metrics across the entire fleet of Kubernetes clusters. Rafay's Zero-Trust Access Service with SSO integration enables seamless role-based access, ensuring that only authorized developers, external partners, and operational personnel can securely view GPU metrics from the console.

"Rafay makes spinning up GPU-enabled Kubernetes clusters incredibly simple. In just a few steps an enterprise's deep learning and inference projects can be fully operational. Not only do we provide the fastest path to powering environments for AI and machine learning applications, but the combination of capabilities in Rafay KOP enables scalable edge/remote use cases with support for zero-trust access, policy management, GPU monitoring, and more across an entire fleet of thousands of clusters," explained Mohan Atreya, SVP Product and Solutions at Rafay Systems.

The new GPU Resource Dashboard, which streamlines the orchestration of GPU-based container workloads, is fully integrated into Rafay KOP. Teams can also take advantage of many additional benefits of the SaaS platform today, including:

AI/ML Application Deployment Automation: Rafay KOP allows organizations to avoid spending months or years developing a custom platform just to provision and manage GPU-enabled Kubernetes clusters for bare metal, virtualized, and cloud environments.

AI/ML Cluster and Workload Standardization and Consistency: Rafay KOP's Cluster Blueprints standardize and govern cluster and workload configurations across a fleet. Enterprises can detect, be notified of, and/or block configuration changes to Kubernetes clusters.

Unleash the power of AI and machine learning applications at the edge with Rafay KOP: https://rafay.co/start/