## Overview

This template provides a production-ready Apache Hadoop stack as a Monk runnable. You can:

- Run it directly to get a managed Hadoop cluster with sensible defaults
- Inherit it in your own runnable to seamlessly add distributed data processing and storage to your stack
## What this template manages
- Hadoop NameNode (HDFS master) with HTTP interface on port 9870
- Hadoop DataNode (HDFS storage) on port 9864
- Resource Manager (YARN) on port 8088
- NodeManager (YARN compute)
- History Server for job tracking on port 8188
- Persistent volumes for NameNode, DataNode, and HistoryServer data
## Quick start (run directly)

- Load templates
- Run Hadoop stack
- Customize configuration (optional)

The template ships with sensible defaults defined as variables. To customize:

- Preferred: inherit and override variables as shown below.
- Alternative: fork/clone and edit the variables in `stack.yml`, then `monk load MANIFEST` and run.
Once the stack is running, the web interfaces are available at:

- NameNode: http://localhost:9870
- ResourceManager: http://localhost:8088
- HistoryServer: http://localhost:8188
## Configuration

Key variables you can customize in this template:

- HDFS settings: WebHDFS enabled, permissions, replication
- YARN settings: resource limits, memory, CPU cores
- MapReduce settings: memory allocation, compression codecs
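As a sketch of what an override looks like when inheriting, the two memory variables named in the Troubleshooting section below could be tuned like this (the runnable name `hadoop/stack` and the values shown are assumptions — check the template's namespace and defaults in `stack.yml`):

```yaml
# Hypothetical inheriting runnable; `hadoop/stack` and the values are
# assumptions -- verify the actual names in the template's stack.yml.
namespace: my-app

hadoop:
  defines: runnable
  inherits: hadoop/stack
  variables:
    # Variable names taken from the Troubleshooting section of this README
    yarn_conf_nodemanager_resource_memory_mb: 8192
    mapred_conf_map_memory_mb: 2048
```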
Data is persisted under `${monk-volume-path}/hadoop-1` on the host:

- `/namenode` - NameNode metadata
- `/datanode` - HDFS data blocks
- `/historyserver` - Job history and logs
## Use by inheritance (recommended for apps)

Inherit the Hadoop stack in your application and declare connections. Example for a data processing application:
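A minimal sketch of such an application follows. The runnable name `hadoop/stack`, the service names, and the container image are assumptions — adjust them to match the actual template:

```yaml
namespace: my-app

data-processor:
  defines: runnable
  containers:
    processor:
      image: my-org/processor:latest   # hypothetical application image
  connections:
    namenode:
      runnable: hadoop/stack           # assumed name of the Hadoop runnable
      service: namenode                # assumed service name exposed by the stack
    resourcemanager:
      runnable: hadoop/stack
      service: resourcemanager
```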
## Ports and connectivity

The Hadoop stack exposes the following services:

- NameNode HTTP: TCP `9870` - web UI and REST API
- NameNode RPC: TCP `9000` - HDFS client connections
- DataNode: TCP `9864` - data transfer and HTTP
- ResourceManager: TCP `8088` - YARN web UI and REST API
- HistoryServer: TCP `8188` - job history web UI
Use `connection-hostname("<connection-name>")` to resolve service hosts.
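For instance, a consuming runnable could derive an HDFS URL from a declared connection. This is a sketch: the connection name `namenode` is an assumption, and the RPC port is taken from the list above:

```yaml
variables:
  hdfs-url:
    type: string
    # Resolves the host of the runnable behind the "namenode" connection
    # and joins the pieces into a single string.
    value: <- "hdfs://" connection-hostname("namenode") ":9000" concat-all
```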
## Persistence and configuration

- NameNode data: `${monk-volume-path}/hadoop-1/namenode:/hadoop/dfs/name`
- DataNode data: `${monk-volume-path}/hadoop-1/datanode:/hadoop/dfs/data`
- HistoryServer data: `${monk-volume-path}/hadoop-1/historyserver:/hadoop/yarn/timeline`
## Features
- Distributed file system (HDFS) with configurable replication
- MapReduce processing framework
- YARN resource management with configurable memory and CPU limits
- Scalable and fault-tolerant architecture
- Supports batch processing workloads
- WebHDFS REST API enabled by default
- Job history tracking and log aggregation
## Related templates

- See other templates in this repository for complementary services
- Combine with monitoring tools (`prometheus-grafana/`) for observability
- Integrate with Apache Spark, Hive, or other Hadoop ecosystem tools
- Use with object storage (MinIO) for backup and archival
## Troubleshooting

- If NameNode fails to start: check that the volume path is writable and that ports 9870 and 9000 are not in use.
- If DataNode cannot connect to NameNode: verify network connectivity and that the NameNode is fully started (check logs).
- If jobs fail with memory errors: adjust YARN memory settings in the configuration variables (`yarn_conf_nodemanager_resource_memory_mb`, `mapred_conf_map_memory_mb`, etc.).
- If changing `cluster_name` on existing data: this may cause the NameNode to reject DataNodes. Either reset the volumes or keep the same cluster name.