Overview

This template provides a production‑ready Apache Hadoop stack as a Monk runnable. You can:
  • Run it directly to get a managed Hadoop cluster with sensible defaults
  • Inherit it in your own runnable to seamlessly add distributed data processing and storage to your stack
Apache Hadoop is an open-source framework for distributed storage and processing of large datasets using the MapReduce programming model. It includes HDFS (Hadoop Distributed File System) for storage and YARN for resource management.

What this template manages

  • Hadoop NameNode (HDFS master) with HTTP interface on port 9870
  • Hadoop DataNode (HDFS storage) on port 9864
  • Resource Manager (YARN) on port 8088
  • NodeManager (YARN compute)
  • History Server for job tracking on port 8188
  • Persistent volumes for NameNode, DataNode, and HistoryServer data

Quick start (run directly)

  1. Load templates
monk load MANIFEST
  2. Run the Hadoop stack
monk run hadoop/stack
  3. Customize configuration (optional)
Running directly uses the defaults defined in this template’s variables. To customize:
  • Preferred: inherit and override variables as shown below.
  • Alternative: fork/clone and edit the variables in stack.yml, then monk load MANIFEST and run.
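As a minimal sketch of the preferred approach, an inherit that only pins the image tag might look like the following. The variable name image_tag comes from this template; the namespace and runnable name are placeholders:

```yaml
namespace: myapp

# Minimal override: inherit the stack and pin the Hadoop image tag.
# `image_tag` is defined in this template's variables; any other
# defaults are left untouched.
hadoop:
  defines: process-group
  inherits: hadoop/stack
  variables:
    image_tag: "3.2.1-hadoop3.2.1-java8"
```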
Once started, access the web interfaces:
  • NameNode: http://localhost:9870
  • ResourceManager: http://localhost:8088
  • HistoryServer: http://localhost:8188

Configuration

Key variables you can customize in this template:
variables:
  image_tag: "3.2.1-hadoop3.2.1-java8"    # Hadoop version/image tag
  cluster_name: "Monk SuperCluster"        # HDFS cluster name
Additional configuration through environment variables (defined in hadoop-common):
  • HDFS settings: WebHDFS enabled, permissions, replication
  • YARN settings: Resource limits, memory, CPU cores
  • MapReduce settings: Memory allocation, compression codecs
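A sketch of overriding one of these settings through inheritance is shown below. The exact variable names are defined in hadoop-common in stack.yml; hdfs_replication here is an assumption used for illustration, so check the template for the real name before relying on it:

```yaml
namespace: myapp

hadoop:
  defines: process-group
  inherits: hadoop/stack
  variables:
    # Hypothetical variable name - confirm the actual HDFS/YARN
    # variable names in stack.yml before using this in production.
    hdfs_replication: "2"
```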
Data is persisted under ${monk-volume-path}/hadoop-1 on the host:
  • /namenode - NameNode metadata
  • /datanode - HDFS data blocks
  • /historyserver - Job history and logs
Inherit the Hadoop stack in your application and declare connections. Example for a data processing application:
namespace: myapp
hadoop-cluster:
  defines: process-group
  inherits: hadoop/stack
  variables:
    cluster_name:
      value: <- secret("hadoop-cluster-name") default("MyApp Cluster")

data-processor:
  defines: runnable
  containers:
    processor:
      image: myorg/data-processor
      environment:
        - <- `HADOOP_NAMENODE=${hdfs_namenode_host}`
        - <- `YARN_RESOURCEMANAGER=${yarn_rm_host}`
  variables:
    hdfs_namenode_host:
      value: <- connection-hostname("namenode") default("localhost")
    yarn_rm_host:
      value: <- connection-hostname("resourcemanager") default("localhost")

app:
  defines: process-group
  runnable-list:
    - myapp/hadoop-cluster
    - myapp/data-processor
Then run your app:
monk secrets add -g hadoop-cluster-name="Production Cluster"
monk run myapp/app

Ports and connectivity

The Hadoop stack exposes the following services:
  • NameNode HTTP: TCP 9870 - Web UI and REST API
  • NameNode RPC: TCP 9000 - HDFS client connections
  • DataNode: TCP 9864 - Data transfer and HTTP
  • ResourceManager: TCP 8088 - YARN web UI and REST API
  • HistoryServer: TCP 8188 - Job history web UI
From other runnables in the same process group, use connection-hostname("<connection-name>") to resolve service hosts.
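As an illustration, a runnable that declares its own connection to the NameNode could look like this sketch. The target runnable and service names are assumptions based on the component names in this README; verify them against hadoop/stack:

```yaml
data-processor:
  defines: runnable
  connections:
    namenode:
      # Assumed identifiers - check hadoop/stack for the actual
      # runnable and service names it exposes.
      runnable: hadoop/hadoop-name-node
      service: namenode
  variables:
    hdfs_host:
      value: <- connection-hostname("namenode") default("localhost")
```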

Persistence and configuration

  • NameNode data: ${monk-volume-path}/hadoop-1/namenode:/hadoop/dfs/name
  • DataNode data: ${monk-volume-path}/hadoop-1/datanode:/hadoop/dfs/data
  • HistoryServer data: ${monk-volume-path}/hadoop-1/historyserver:/hadoop/yarn/timeline
All data is persisted to the host volumes and will survive container restarts. Ensure the host volumes are writable by the container user (typically UID 1000).

Features

  • Distributed file system (HDFS) with configurable replication
  • MapReduce processing framework
  • YARN resource management with configurable memory and CPU limits
  • Scalable and fault-tolerant architecture
  • Supports batch processing workloads
  • WebHDFS REST API enabled by default
  • Job history tracking and log aggregation
  • See other templates in this repository for complementary services
  • Combine with monitoring tools (prometheus-grafana/) for observability
  • Integrate with Apache Spark, Hive, or other Hadoop ecosystem tools
  • Use with object storage (MinIO) for backup and archival

Troubleshooting

  • If NameNode fails to start: Check that the volume path is writable and that ports 9870 and 9000 are not in use.
  • If DataNode cannot connect to NameNode: Verify network connectivity and that the NameNode is fully started (check logs).
  • If jobs fail with memory errors: Adjust YARN memory settings in the configuration variables (yarn_conf_nodemanager_resource_memory_mb, mapred_conf_map_memory_mb, etc.).
  • If changing cluster_name on existing data: This may cause NameNode to reject DataNodes. Either reset volumes or keep the same cluster name.
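To avoid that mismatch, pin the cluster name explicitly when inheriting so it stays stable across redeploys. A sketch, reusing the cluster_name variable from this template:

```yaml
hadoop-cluster:
  defines: process-group
  inherits: hadoop/stack
  variables:
    # Keep this value fixed once data exists under the volumes;
    # changing it can make the NameNode reject existing DataNodes.
    cluster_name: "Production Cluster"
```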
Check logs for detailed error messages:
monk logs -l 500 -f hadoop/stack
View logs for individual components:
monk logs -l 500 -f hadoop/hadoop-name-node
monk logs -l 500 -f hadoop/hadoop-data-node
monk logs -l 500 -f hadoop/hadoop-resource-manager-node