
    How Slack achieved operational excellence for Spark on Amazon EMR using generative AI

January 15, 2026


    At Slack, our data platform processes terabytes of data each day using Apache Spark on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), powering the insights that drive strategic decision-making across the organization.

As our data volume expanded, so did our performance challenges. With traditional monitoring tools, we couldn't effectively manage our systems when Spark jobs slowed down or costs spiraled out of control. We were stuck searching through cryptic logs, making educated guesses about resource allocation, and watching our engineering teams spend hours on manual tuning that should have been automated.

That's why we built something better: a detailed metrics framework designed specifically for Spark's unique challenges. This visibility system gives us granular insights into application behavior, resource usage, and job-level performance patterns that we never had before. With it, we've achieved 30–50% cost reductions and 40–60% faster job completion times: real operational efficiency that directly translates to better service for our users and significant savings for our infrastructure budget.

In this post, we walk you through exactly how we built this framework, the key metrics that made the difference, and how your team can implement similar monitoring to transform your own Spark operations.

    Why comprehensive Spark monitoring matters

    In enterprise environments, poorly optimized Spark jobs can waste thousands of dollars in cloud compute costs, block critical data pipelines affecting downstream business processes, create cascading failures across interconnected data workflows, and impact service level agreement (SLA) compliance for time-sensitive analytics.

The monitoring framework we built captures over 40 distinct metrics across five key categories, providing the granular insights needed to prevent these issues.

    How we ingest, process, and act on Spark metrics

    To address the challenges of managing Spark at scale, we developed a custom monitoring and optimization pipeline—from metric collection to AI-assisted tuning. It begins with our in-house Spark listener framework, which captures over 40 metrics in real time across Spark applications, jobs, stages, and tasks while pulling critical operational context from tools such as Apache Airflow and Apache Hadoop YARN.
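As a small illustration of how such a listener is wired into a job (the class path below is a placeholder, not our production package; the broker and topic keys match the emitter configuration shown later), Spark's standard spark.extraListeners setting registers the listener, and custom configuration keys tell it where to emit metrics:

import org.apache.spark.SparkConf

// Placeholder class path and endpoint values, for illustration only
val conf = new SparkConf()
  .set("spark.extraListeners", "com.slack.dataeng.SparkTaskListener")
  .set("spark.custom.listener.kafkaBroker", "BROKER_HOST:9092")
  .set("spark.custom.listener.kafkaTopic", "spark-metrics")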

An Apache Airflow-orchestrated Spark SQL pipeline transforms this data into actionable insights, surfacing performance bottlenecks and failure points. To integrate these metrics into the developer tuning workflow, we expose a metrics tool and a custom prompt through our internal analytics Model Context Protocol (MCP) server. This enables seamless integration with AI-assisted coding tools such as Cursor or Claude Code.

Our Spark monitoring solution, spanning metric collection through AI-assisted tuning, is built from the tools described in the sections that follow.

    The result is fast, reliable, deterministic Spark tuning without the guesswork. Developers get environment-aware recommendations, automated configuration updates, and ready-to-review pull requests.

    Deep dive into Spark metrics collection

    At the center of our real-time monitoring solution lies a custom Spark listener framework that captures thorough telemetry across the Spark lifecycle. Spark’s built-in metrics are often coarse, short‑lived, and scattered across the user interface (UI) and logs, which leaves four critical gaps:

1. No consistent historical record
    2. Weak linkage from applications to jobs to stages to tasks
    3. Limited context (user, cluster, team)
    4. Poor visibility into patterns such as skew, spill, and retries

    Our expanded listener framework closes these gaps by unifying and enriching telemetry with environment and configuration tags, building a durable, queryable history, and correlating events across the execution graph. It explains why tasks fail, pinpoints where memory or CPU pressure occurs, compares intended configurations to actual usage, and produces clear, repeatable tuning recommendations so teams can baseline behavior, minimize waste, and resolve issues faster. The following architecture diagram illustrates the flow of the Spark metrics collection pipeline.

    Spark metrics ingestion architecture diagram

    Spark listener

    Our listener framework captures Spark metrics at four distinct levels:

    1. Application metrics: Overall application success/failure rates, total runtime, and resource allocation
    2. Job-level metrics: Individual job duration and status tracking within an application
    3. Stage-level metrics: Stage execution details, shuffle operations, and memory usage per stage
    4. Task-level metrics: Individual task performance for deep debugging scenarios

The following Scala example shows how SparkTaskListener extends the SparkListener class to capture detailed task-level metrics:

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd, SparkListenerTaskStart}

class SparkTaskListener(conf: SparkConf) extends SparkListener {
  // Track which stage each task belongs to, and which job each stage belongs to,
  // so task metrics can be correlated back up the execution graph
  val taskToStageId = new mutable.HashMap[Long, Int]()
  val stageToJobID = new mutable.HashMap[Int, Int]()
  // Emitter and getEmitter are part of our in-house listener framework
  private val emitter: Emitter = getEmitter(conf)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Record the stage-to-job mapping as soon as a job is submitted
    jobStart.stageIds.foreach(stageId => stageToJobID += stageId -> jobStart.jobId)
  }

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    taskToStageId += taskStart.taskInfo.taskId -> taskStart.stageId
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val taskInfo = taskEnd.taskInfo
    val taskMetrics = taskEnd.taskMetrics
    val jobId = stageToJobID(taskToStageId(taskInfo.taskId))
    val metrics = Map[String, Any](
      "event_type" -> "task_metric",
      "job_id" -> jobId,
      "task_id" -> taskInfo.taskId,
      "duration" -> taskInfo.duration,
      "executor_run_time" -> taskMetrics.executorRunTime,
      "memory_bytes_spilled" -> taskMetrics.memoryBytesSpilled,
      "bytes_read" -> taskMetrics.inputMetrics.bytesRead,
      "records_read" -> taskMetrics.inputMetrics.recordsRead
      // additional metrics.....
    )
    // convertToJson serializes the metrics map; the emitter streams it to Kafka
    emitter.report(convertToJson(metrics))
  }
}
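Application-level metrics can be captured with the same pattern. The following is a minimal sketch, not our exact production listener; the event name and fields here are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}

// Illustrative application-level listener; Emitter, getEmitter, and convertToJson
// are the same in-house framework pieces used by SparkTaskListener above
class SparkApplicationListener(conf: SparkConf) extends SparkListener {
  private val emitter: Emitter = getEmitter(conf)
  private var appStartTime: Long = 0L

  override def onApplicationStart(applicationStart: SparkListenerApplicationStart): Unit = {
    appStartTime = applicationStart.time
  }

  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    val metrics = Map[String, Any](
      "event_type" -> "application_metric",   // illustrative event name
      "app_name" -> conf.get("spark.app.name", "unknown"),
      "total_runtime_ms" -> (applicationEnd.time - appStartTime)
      // resource allocation and final status would be added here
    )
    emitter.report(convertToJson(metrics))
  }
}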

    Real-time streaming to Kafka

    These metrics are streamed in real time to Kafka as JSON-formatted telemetry using a flexible emitter system:

import java.nio.charset.StandardCharsets
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
import org.apache.spark.SparkConf

class KafkaEmitter(conf: SparkConf) extends Emitter {
  // Broker and topic are supplied through custom Spark configuration keys
  private val broker = conf.get("spark.custom.listener.kafkaBroker", "")
  private val topic = conf.get("spark.custom.listener.kafkaTopic", "")
  // Build the producer once; String keys, byte-array payloads
  private val producer: Producer[String, Array[Byte]] = {
    val props = new Properties()
    props.put("bootstrap.servers", broker)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    new KafkaProducer[String, Array[Byte]](props)
  }
  // Send each JSON metrics payload to the configured Kafka topic
  override def report(str: String): Unit =
    producer.send(new ProducerRecord[String, Array[Byte]](topic, str.getBytes(StandardCharsets.UTF_8)))
}

    From Kafka, a downstream pipeline ingests these records into an Apache Iceberg table.
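One possible shape for this ingestion job, sketched here with placeholder broker, topic, table name, and a trimmed schema (the production pipeline and the full set of 40+ fields are not shown), is a Structured Streaming query that parses the JSON payloads and appends them to an Iceberg table:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object SparkMetricsIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-metrics-ingest").getOrCreate()

    // Only a handful of the 40+ metric fields are modeled here
    val schema = new StructType()
      .add("event_type", StringType)
      .add("job_id", LongType)
      .add("task_id", LongType)
      .add("duration", LongType)
      .add("memory_bytes_spilled", LongType)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "BROKER_HOST:9092") // placeholder broker
      .option("subscribe", "spark-metrics")                   // placeholder topic
      .load()

    // Parse the JSON telemetry emitted by the listeners and append it to Iceberg
    raw.select(from_json(col("value").cast("string"), schema).as("m"))
      .select("m.*")
      .writeStream
      .format("iceberg")
      .outputMode("append")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/spark-metrics") // placeholder
      .toTable("analytics.spark_task_metrics")                // placeholder table
      .awaitTermination()
  }
}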

    Context-rich observability

Beyond standard Spark metrics, our framework captures essential operational context (see the sketch after this list):

    • Airflow integration: DAG metadata, task IDs, and execution timestamps
    • Resource tracking: Configurable executor metrics (heap usage, execution memory)
    • Environment context: Cluster identification, user tracking, and Spark configurations
    • Failure analysis: Detailed error messages and task failure root causes
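As one sketch of how this context can travel with the job (the Airflow-related keys here are hypothetical; only the broker and topic keys appear in our emitter code above), the orchestrator sets custom Spark configuration at submission time and the listener reads the same keys back to tag every record it emits:

// Hypothetical configuration keys, for illustration only; the Airflow operator
// would set them when submitting the job, and the listener reads them back here
def contextTags(conf: org.apache.spark.SparkConf): Map[String, Any] = Map(
  "dag_id" -> conf.get("spark.custom.listener.airflow.dagId", "unknown"),
  "airflow_task_id" -> conf.get("spark.custom.listener.airflow.taskId", "unknown"),
  "cluster" -> conf.get("spark.custom.listener.cluster", "unknown")
)
// Each listener callback can merge these tags into its metrics map, for example:
//   emitter.report(convertToJson(metrics ++ contextTags(conf)))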

    The combination of thorough metrics collection and real-time streaming has redefined Spark monitoring at scale, laying the groundwork for powerful insights.

    Deep dive into Spark metrics processing

    When raw metrics—often containing millions of records—are ingested from various sources, a Spark SQL pipeline transforms this high-volume data into actionable insights. It aggregates the data into a single row per application ID, significantly reducing complexity while preserving key performance signals.
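As a rough illustration of this step (the table and column names are placeholders, not our production schema, and the real pipeline computes skew per stage), a simplified query in this spirit collapses task-level telemetry into one row per application while preserving skew, spill, and failure signals:

// Simplified aggregation over the Iceberg telemetry table; names are illustrative
// and `spark` is an active SparkSession
val appSummary = spark.sql("""
  SELECT
    application_id,
    MAX(shuffle_bytes_read) / PERCENTILE_APPROX(shuffle_bytes_read, 0.75) AS max_skew_ratio,
    SUM(memory_bytes_spilled)                                             AS total_spill_bytes,
    SUM(CASE WHEN task_status = 'FAILED' THEN 1 ELSE 0 END)               AS failed_task_count,
    MAX(duration)                                                         AS max_task_duration_ms
  FROM analytics.spark_task_metrics
  WHERE event_type = 'task_metric'
  GROUP BY application_id
""")
// Persist the one-row-per-application summary for downstream analysis and the MCP tool
appSummary.writeTo("analytics.spark_app_summary").createOrReplace()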

For consistency in how teams interpret and act on this data, we apply the Five Pillars of Spark Monitoring, a structured framework that turns raw telemetry into clear diagnostics and repeatable optimization strategies, as outlined below.

Pillar 1: Application metadata and orchestration details
• Metrics: YARN metadata (application, attempt, allocated memory, compute cluster, final job status, run duration) and Airflow metadata (DAG, task, owner)
• Key purpose/insight: Correlate performance patterns with teams and infrastructure to identify inefficiencies and ownership.
• Driving events: Airflow metadata and YARN metadata on Amazon EMR on EC2

Pillar 2: User-specified configuration
• Metrics: Given memory (driver, executor), dynamic allocation (min/max/initial executor count), cores per executor, and shuffle partitions
• Key purpose/insight: Compare the given configuration against actual performance to detect over- and under-provisioning and optimize costs. This is where significant cost savings often hide.
• Driving events: Spark events

Pillar 3: Performance insights
• Metrics: Maximum skew ratio (max shuffle_total_bytes_read per stage compared with the 75th percentile across tasks), total spill, and Spark stage/task retries and failures
• Key purpose/insight: This is where the real diagnostic power lies. These metrics identify the three primary blockers of Spark performance: skew, spill, and failures.
• Driving events: Spark events

Pillar 4: Execution insights
• Metrics: Spark job/stage/task count and Spark job/stage/task duration
• Key purpose/insight: Understand runtime distribution, identify bottlenecks, and highlight execution outliers.
• Driving events: Spark events task_metric, stage_metric, and job_metric

Pillar 5: Resource usage and system health
• Metrics: Peak JVM heap memory and max GC overhead %
• Key purpose/insight: Reveal memory inefficiencies and JVM-related pressure for cost and stability improvements. Comparing these against the given configurations helps identify waste and optimize resources.
• Driving events: Spark events task_metric, stage_metric, and executor_metric

    AI-powered Spark tuning

    The following architecture diagram illustrates the use of agentic AI tools to analyze the aggregated Spark metrics.

    AI-powered Spark tuning diagram

To integrate these metrics into a developer's tuning workflow, we build a custom Spark metrics tool and a custom prompt that any agent can use. We use our existing analytics service, a homegrown web application where users can query our data warehouse, build dashboards, and share insights. The backend is written in Python using FastAPI, and we expose an MCP server from the same service by using FastMCP. By exposing the Spark metrics tool and custom prompt through the MCP server, we make it possible for developers to connect their preferred AI-assisted coding tools (Cursor, Claude Code, and more) and use data to guide their tuning.

    Because the data exposed by the analytics MCP server might be sensitive, we use Amazon Bedrock in our Amazon Web Services (AWS) account to provide the foundation models to our MCP clients. This keeps our data more secure and facilitates compliance because it never leaves our AWS environment.

    Custom prompt

To create our custom prompt for AI-driven Spark tuning, we design a structured, rule-based format that encourages more deterministic and standardized output. The prompt defines the required sections (application overview, current Spark configuration, job health summary, resource recommendations, and summary) for consistency across analyses. We include detailed formatting rules, such as wrapping values in backticks, avoiding line breaks, and enforcing strict table structures to maintain clarity and machine readability. The prompt also embeds explicit guidance for interpreting Spark metrics and mapping them to recommended tuning actions based on best practices, with clear criteria for status flags and impact explanations. By tightly controlling this input-output flow, the prompt keeps the AI's recommendations traceable, reproducible, and actionable based on the provided data, and it reduces the risk of hallucinations.
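The following is an illustrative skeleton in that style; the section names mirror the list above, but the exact wording, tables, and status-flag criteria of our production prompt are not reproduced here:

You are analyzing Spark application metrics. Produce exactly these sections, in order:
1. Application overview
2. Current Spark configuration
3. Job health summary
4. Resource recommendations
5. Summary
Formatting rules:
- Wrap every configuration and metric value in backticks.
- Do not use line breaks inside table cells; keep the required table structures exactly as specified.
- For each recommendation, state its status flag, the metric that drives it, and the expected impact.
Base every recommendation only on the metrics provided; do not infer values that are not present.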

    Final results

    The screenshots in this section show how our tool performed the analysis and provided recommendations. The following is a performance analysis for an existing application.

    performance analysis for an existing application

    The following is a recommendation to reduce resource waste.

    recommendation to reduce resource waste

    The impact

    Our AI-powered framework has fundamentally changed how Spark is monitored and managed at Slack. We’ve transformed Spark tuning from a high-expertise, trial-and-error process into an automated, data-backed standard by moving beyond traditional log-diving and embracing a structured, AI-driven approach. The results speak for themselves, as shown in the following table.

Metric                     Before                        After                          Improvement
Compute cost               Non-deterministic             Optimized resource use         Up to 50% lower
Job completion time        Non-deterministic             Optimized                      Over 40% faster
Developer time on tuning   Hours per week                Minutes per week               >90% reduction
Configuration waste        Frequent over-provisioning    Precise resource allocation    Near-zero waste

    Conclusion

    At Slack, our experience with Spark monitoring shows that you don’t need to be a performance expert to achieve exceptional results. We’ve shifted from reacting to performance issues to preventing them by systematically applying five key metric categories.

    The numbers speak for themselves: 30–50% cost reductions and 40–60% faster job completion times represent operational efficiency that directly impacts our ability to serve millions of users worldwide. These improvements compound over time as teams build confidence in their data infrastructure and can focus on innovation rather than troubleshooting.

Your organization can achieve similar outcomes. Start with the basics: implement comprehensive monitoring, establish baseline metrics, and commit to continuous optimization. Tuning Spark performance doesn't require expertise in every parameter, but it does require a strong monitoring foundation and a disciplined approach to analysis.

    Acknowledgments

We want to thank everyone who contributed to this incredible journey: Johnny Cao, Nav Shergill, Yi Chen, Lakshmi Mohan, Apun Hiran, and Ricardo Bion.


    About the authors

Nilanjana Mukherjee

    Nilanjana is a staff software engineer at Slack, bringing deep technical expertise and engineering leadership to complex software challenges. She specializes in building high-performance data systems, focusing on data pipeline architecture, query optimization, and scalable data processing solutions.

Tayven Taylor

    Tayven is a software engineer I on Slack’s Data Foundations team, where he helps maintain and optimize large-scale data systems. His work focuses on Spark and Amazon EMR performance, cost optimization, and reliability improvements that keep Slack’s data platform efficient and scalable. He’s passionate about creating tools and systems that make working with data faster, smarter, and more cost-effective.

Mimi Wang

    Mimi is a staff software engineer on Slack’s Data Platform team, where she builds tools to facilitate data-driven decision-making at Slack. Recently she has been focusing on using AI to lower the barrier to entry for non-technical users to derive value out of data. Previously, she was on the Slack Security team focusing on a customer-facing real-time anomaly detection pipeline.

Rahul Gidwani

Rahul is a senior staff software engineer at Salesforce specializing in search infrastructure. He works on Slack's data lake development and processing pipelines and contributes to open-source projects such as Apache HBase and Druid. Outside of work, Rahul enjoys rock climbing.

Prateek Kakirwar

    Prateek is a senior engineering manager at Slack leading the AI-first transformation of data engineering and analytics. With over 20 years of experience building large-scale data platforms, AI systems, and metrics frameworks, he focuses on scalable architectures that enable trusted, self-service analytics across the organization. He holds a master’s degree from the University of California, Berkeley.

Avijit Goswami

    Avijit is a principal specialist solutions architect at AWS specializing in data and analytics. He helps customers design and implement robust data lake solutions. Outside the office, you can find Avijit exploring new trails, discovering new destinations, cheering on his favorite teams, enjoying music, or testing out new recipes in the kitchen.


