# Apache Hadoop

> Ready-to-run Apache Hadoop container stack you can run directly or inherit to integrate distributed data processing into your infrastructure.

## Overview

This template provides a production‑ready Apache Hadoop stack as a Monk runnable. You can:

* Run it directly to get a managed Hadoop cluster with sensible defaults
* Inherit it in your own runnable to add distributed data processing and storage to your stack

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets using the MapReduce programming model. It includes HDFS (Hadoop Distributed File System) for storage and YARN for resource management.

## What this template manages

* Hadoop NameNode (HDFS master) with HTTP interface on port 9870
* Hadoop DataNode (HDFS storage) on port 9864
* ResourceManager (YARN) with web UI on port 8088
* NodeManager (YARN compute)
* HistoryServer for job tracking on port 8188
* Persistent volumes for NameNode, DataNode, and HistoryServer data

## Quick start (run directly)

1. Load templates

```bash theme={null}
monk load MANIFEST
```

2. Run Hadoop stack

```bash theme={null}
monk run hadoop/stack
```

3. Customize configuration (optional)

Running directly uses the defaults defined in this template's `variables`. To customize:

* Preferred: inherit and override variables as shown below.
* Alternative: fork or clone the template, edit the `variables` in `stack.yml`, then reload with `monk load MANIFEST` and run the stack.

Once started, access the web interfaces:

* NameNode: `http://localhost:9870`
* ResourceManager: `http://localhost:8088`
* HistoryServer: `http://localhost:8188`
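
To confirm from a script that the interfaces are reachable, a small sketch like the following may help. It assumes the stack runs with the default ports listed above and that `curl` is installed; the function names `hadoop_ui_urls` and `check_hadoop_uis` are illustrative and not part of the template.

```bash
#!/usr/bin/env bash
# Sketch: list the Hadoop web UI URLs and probe them with curl.
# Assumption: default ports from this template; host defaults to localhost.

# Print one "name url" pair per line for the given host.
hadoop_ui_urls() {
  local host="${1:-localhost}"
  echo "NameNode http://${host}:9870"
  echo "ResourceManager http://${host}:8088"
  echo "HistoryServer http://${host}:8188"
}

# Probe each URL and report whether it answered within 5 seconds.
check_hadoop_uis() {
  local name url
  while read -r name url; do
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK   ${name} ${url}"
    else
      echo "DOWN ${name} ${url}"
    fi
  done < <(hadoop_ui_urls "${1:-localhost}")
}

# Usage (requires a running stack): check_hadoop_uis localhost
```

Nothing is contacted until `check_hadoop_uis` is called, so the file can be sourced safely into other scripts.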

## Configuration

Key variables you can customize in this template:

```yaml theme={null}
variables:
  image_tag: "3.2.1-hadoop3.2.1-java8"    # Hadoop version/image tag
  cluster_name: "Monk SuperCluster"        # HDFS cluster name
```

Additional settings are passed as environment variables, defined in `hadoop-common`:

* HDFS settings: WebHDFS enabled, permissions, replication
* YARN settings: Resource limits, memory, CPU cores
* MapReduce settings: Memory allocation, compression codecs
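
As a rough sizing aid for the YARN and MapReduce memory settings (for example `yarn_conf_nodemanager_resource_memory_mb` and `mapred_conf_map_memory_mb`), the sketch below estimates how many map containers fit on one NodeManager. The numbers are hypothetical, not this template's defaults, and the function name is illustrative. The real scheduler also rounds requests and shares memory with other containers, so treat the result as an upper bound.

```bash
# Sketch: upper bound on concurrent map containers per NodeManager.
# Both values passed in are hypothetical examples, not template defaults.
max_map_containers() {
  local nodemanager_mb="$1"   # memory YARN may hand out on one node
  local map_mb="$2"           # memory requested per map task
  if [ "$map_mb" -le 0 ]; then
    echo "map memory must be positive" >&2
    return 1
  fi
  echo $(( nodemanager_mb / map_mb ))
}

max_map_containers 8192 2048   # prints 4
```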

Data is persisted under `${monk-volume-path}/hadoop-1` on the host:

* `/namenode` - NameNode metadata
* `/datanode` - HDFS data blocks
* `/historyserver` - Job history and logs

## Use by inheritance (recommended for apps)

Inherit the Hadoop stack in your application and declare connections. Example for a data processing application:

```yaml theme={null}
namespace: myapp
hadoop-cluster:
  defines: process-group
  inherits: hadoop/stack
  variables:
    cluster_name:
      value: <- secret("hadoop-cluster-name") default("MyApp Cluster")

data-processor:
  defines: runnable
  containers:
    processor:
      image: myorg/data-processor
      environment:
        - <- `HADOOP_NAMENODE=${hdfs_namenode_host}`
        - <- `YARN_RESOURCEMANAGER=${yarn_rm_host}`
  variables:
    hdfs_namenode_host:
      value: <- connection-hostname("namenode") default("localhost")
    yarn_rm_host:
      value: <- connection-hostname("resourcemanager") default("localhost")

app:
  defines: process-group
  runnable-list:
    - myapp/hadoop-cluster
    - myapp/data-processor
```

Then run your app:

```bash theme={null}
monk secrets add -g hadoop-cluster-name="Production Cluster"
monk run myapp/app
```

## Ports and connectivity

The Hadoop stack exposes the following services:

* **NameNode HTTP**: TCP `9870` - Web UI and REST API
* **NameNode RPC**: TCP `9000` - HDFS client connections
* **DataNode**: TCP `9864` - Data transfer and HTTP
* **ResourceManager**: TCP `8088` - YARN web UI and REST API
* **HistoryServer**: TCP `8188` - Job history web UI

From other runnables in the same process group, use `connection-hostname("<connection-name>")` to resolve service hosts.
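
The NameNode and ResourceManager HTTP ports also serve REST endpoints: WebHDFS paths live under `/webhdfs/v1`, and the ResourceManager exposes cluster information under `/ws/v1/cluster`. A minimal sketch, assuming the default ports and a reachable host; the helper names are illustrative.

```bash
# Sketch: build REST URLs for the default ports used by this template.

# Build a WebHDFS URL for an HDFS path and operation.
webhdfs_url() {
  local host="$1" path="$2" op="$3"
  echo "http://${host}:9870/webhdfs/v1${path}?op=${op}"
}

# Build the YARN ResourceManager cluster-info URL.
yarn_cluster_info_url() {
  echo "http://${1}:8088/ws/v1/cluster/info"
}

# Usage (requires a running stack):
#   curl --silent "$(webhdfs_url localhost / LISTSTATUS)"
#   curl --silent "$(yarn_cluster_info_url localhost)"
```

From another runnable, the host argument would be the value resolved by `connection-hostname` rather than `localhost`.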

## Persistence and configuration

* **NameNode data**: `${monk-volume-path}/hadoop-1/namenode:/hadoop/dfs/name`
* **DataNode data**: `${monk-volume-path}/hadoop-1/datanode:/hadoop/dfs/data`
* **HistoryServer data**: `${monk-volume-path}/hadoop-1/historyserver:/hadoop/yarn/timeline`

All data is persisted to the host volumes and will survive container restarts. Ensure the host volumes are writable by the container user (typically UID 1000).
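
A pre-flight check along these lines can catch permission problems before the stack starts. It assumes you know the host directory that `${monk-volume-path}` resolves to and pass it as the first argument; the function name is illustrative. It only tests writability for the user running the script, which may differ from the container user.

```bash
# Sketch: report whether each Hadoop data directory exists and is writable.
# Argument: the host path that ${monk-volume-path} resolves to (assumption:
# you look this up for your own Monk installation).
check_hadoop_volumes() {
  local base="$1" sub dir status=0
  for sub in namenode datanode historyserver; do
    dir="${base}/hadoop-1/${sub}"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "writable: ${dir}"
    else
      echo "missing or not writable: ${dir}"
      status=1
    fi
  done
  return "$status"
}
```

The function returns a non-zero status if any directory fails the check, so it can gate a deployment script.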

## Features

* Distributed file system (HDFS) with configurable replication
* MapReduce processing framework
* YARN resource management with configurable memory and CPU limits
* Scalable and fault-tolerant architecture
* Supports batch processing workloads
* WebHDFS REST API enabled by default
* Job history tracking and log aggregation

## Related templates

* See other templates in this repository for complementary services
* Combine with monitoring tools (`prometheus-grafana/`) for observability
* Integrate with Apache Spark, Hive, or other Hadoop ecosystem tools
* Use with object storage (MinIO) for backup and archival

## Troubleshooting

* **If NameNode fails to start**: Check that the volume path is writable and that ports 9870 and 9000 are not in use.
* **If DataNode cannot connect to NameNode**: Verify network connectivity and that the NameNode is fully started (check logs).
* **If jobs fail with memory errors**: Adjust YARN memory settings in the configuration variables (`yarn_conf_nodemanager_resource_memory_mb`, `mapred_conf_map_memory_mb`, etc.).
* **If you change `cluster_name` on existing data**: The NameNode may reject DataNodes. Either reset the volumes or keep the original cluster name.
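
For the port check, one option on a Linux host is to filter the output of `ss -ltn`. The sketch below is an illustration rather than part of the template: `busy_ports` is a hypothetical helper that reads the listing on stdin and prints any of the given ports that already have a listener.

```bash
# Sketch: print which of the given ports appear as LISTEN sockets in
# `ss -ltn` output read from stdin.
busy_ports() {
  local listing port
  listing="$(cat)"
  for port in "$@"; do
    # Column 4 is the local address; the port is the part after the last colon.
    if printf '%s\n' "$listing" |
       awk -v p="$port" '$1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                         END { exit !found }'; then
      echo "$port"
    fi
  done
}

# Usage on a Linux host: ss -ltn | busy_ports 9870 9000
```

Empty output means none of the listed ports are taken.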

Check logs for detailed error messages:

```bash theme={null}
monk logs -l 500 -f hadoop/stack
```

View logs for individual components:

```bash theme={null}
monk logs -l 500 -f hadoop/hadoop-name-node
monk logs -l 500 -f hadoop/hadoop-data-node
monk logs -l 500 -f hadoop/hadoop-resource-manager-node
```
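
To narrow a long log down to likely problems, the output of the commands above can be piped through a small filter. This is a sketch that assumes the Hadoop logs use log4j-style level names (`ERROR`, `FATAL`) and Java exception class names; `hadoop_log_errors` is an illustrative helper, not a Monk command.

```bash
# Sketch: keep only ERROR/FATAL lines and Java exception lines from log text
# read on stdin. Assumes log4j-style level names in the Hadoop logs.
hadoop_log_errors() {
  # `|| true` keeps the exit status zero when nothing matches.
  grep -E ' (ERROR|FATAL) |Exception' || true
}

# Usage: monk logs -l 500 -f hadoop/hadoop-name-node | hadoop_log_errors
```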
