Querying Apache Hadoop Resource Manager with Python.

I was recently asked to write a script that would monitor the running applications on the Apache Hadoop Resource Manager.

I wandered over to the Apache Hadoop Cluster Application Statistics API. The API lets you query most of the information that you see in the web UI: the status of the cluster, cluster metrics, scheduler information, information about nodes in the cluster, and information about applications on the cluster.
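As a rough sketch of how those areas map onto the REST interface, the helper below builds the URL for each one. It is written for Python 3, and the `ENDPOINTS` table and `endpoint_url` function are illustrative names of my own, not part of any Hadoop client library. The host is a placeholder.

```python
# Illustrative sketch (Python 3). ENDPOINTS and endpoint_url are
# hypothetical names, not part of any Hadoop client library.
ENDPOINTS = {
    "info": "/ws/v1/cluster/info",            # status of the cluster
    "metrics": "/ws/v1/cluster/metrics",      # metrics on the cluster
    "scheduler": "/ws/v1/cluster/scheduler",  # scheduler information
    "nodes": "/ws/v1/cluster/nodes",          # nodes in the cluster
    "apps": "/ws/v1/cluster/apps",            # applications on the cluster
}


def endpoint_url(resource_manager, name):
    """Return the full URL for one of the endpoints listed above."""
    return resource_manager.rstrip("/") + ENDPOINTS[name]


if __name__ == "__main__":
    # Placeholder host; substitute your own Resource Manager address.
    for name in ENDPOINTS:
        print(endpoint_url("http://resourcemanager:8088", name))
```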

I started by querying the cluster info.

# Written for Python 2 (urllib2 became urllib.request in Python 3)
import urllib2
import json

resource_manager = 'http://resourcemanager:8088'

info_url = resource_manager+"/ws/v1/cluster/info"

request = urllib2.Request(info_url)

# To receive XML instead, change the Accept header to 'application/xml'
# (the json.loads call below would then need an XML parser instead).
request.add_header('Accept', 'application/json')

response = urllib2.urlopen(request)
data = json.loads(response.read())

print json.dumps(data, sort_keys=True, indent=4, separators=(',', ': '))

This returns the following:

{
    "clusterInfo": {
        "haState": "ACTIVE",
        "hadoopBuildVersion": "2.6.0-cdh5.7.0 from c00978c67b0d3fe9f3b896b5030741bd40bf541a by jenkins source checksum b2eabfa328e763c88cb14168f9b372",
        "hadoopVersion": "2.6.0-cdh5.7.0",
        "hadoopVersionBuiltOn": "2016-03-23T18:36Z",
        "id": 1478120586043,
        "resourceManagerBuildVersion": "2.6.0-cdh5.7.0 from c00978c67b0d3fe9f3b896b5030741bd40bf541a by jenkins source checksum deb0fdfede32bbbb9cfbda6aa7e380",
        "resourceManagerVersion": "2.6.0-cdh5.7.0",
        "resourceManagerVersionBuiltOn": "2016-03-23T18:43Z",
        "rmStateStoreName": "org.apache.hadoop.yarn.server.resourcemanager.recovery.NullRMStateStore",
        "startedOn": 1478120586043,
        "state": "STARTED"
    }
}
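The `startedOn` field is a timestamp in milliseconds since the Unix epoch, which is not very readable on its own. A minimal Python 3 sketch for turning it into a UTC datetime might look like this. The `started_on_utc` name is hypothetical, and the sample dictionary is trimmed from the response above.

```python
from datetime import datetime, timezone


def started_on_utc(cluster_info_response):
    """Convert clusterInfo.startedOn (ms since epoch) to an aware UTC datetime."""
    millis = cluster_info_response["clusterInfo"]["startedOn"]
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)


if __name__ == "__main__":
    # Trimmed copy of the response shown above.
    sample = {"clusterInfo": {"startedOn": 1478120586043, "state": "STARTED"}}
    print(started_on_utc(sample).isoformat())
```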

Now on to what I actually need to do: querying the Resource Manager about running applications. The Cluster Applications API allows you to obtain a collection of resources, each of which represents an application. Multiple query parameters can be specified to filter the data that is returned. For a list of parameters, see Cluster_Applications_API.
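If you need more than one filter, building the query string by hand gets error-prone. Here is a small Python 3 sketch that assembles the URL with `urllib.parse.urlencode`. The `apps_url` helper is a hypothetical name of mine; `states`, `user` and `limit` are parameters from the Cluster Applications API, but check the parameter list for your Hadoop version before relying on them.

```python
from urllib.parse import urlencode


def apps_url(resource_manager, **params):
    """Build a Cluster Applications API URL from keyword filters.

    Filters set to None are skipped; the rest are sorted by name so the
    resulting URL is stable.
    """
    query = urlencode({k: v for k, v in sorted(params.items()) if v is not None})
    url = resource_manager.rstrip("/") + "/ws/v1/cluster/apps"
    return url + "?" + query if query else url


if __name__ == "__main__":
    # Placeholder host and example filter values.
    print(apps_url("http://resourcemanager:8088",
                   states="RUNNING", user="hdfs", limit=10))
```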

However, I just need the information on running applications, which looks something like this:

import urllib2
import json

resource_manager = 'http://resourcemanager:8088'

info_url = resource_manager+"/ws/v1/cluster/apps?states=running"

request = urllib2.Request(info_url)

# To receive XML instead, change the Accept header to 'application/xml'
# (the json.loads call below would then need an XML parser instead).
request.add_header('Accept', 'application/json')

response = urllib2.urlopen(request)
data = json.loads(response.read())

print json.dumps(data, sort_keys=True, indent=4, separators=(',', ': '))

This returns something like:

{
    "apps": {
        "app": [
            {
                "allocatedMB": 24576,
                "allocatedVCores": 3,
                "amContainerLogs": "http://resourcemanager:8042/node/containerlogs/container_1478120586043_15232_01_000001/hdfs",
                "amHostHttpAddress": "resourcemanager:8042",
                "applicationTags": "",
                "applicationType": "MAPREDUCE",
                "clusterId": 1478120586043,
                "diagnostics": "",
                "elapsedTime": 18009,
                "finalStatus": "UNDEFINED",
                "finishedTime": 0,
                "id": "application_1478120586043_15232",
                "logAggregationStatus": "NOT_START",
                "memorySeconds": 431865,
                "name": "SELECT 1 AS `number_of_records...TIMESTAMP))(Stage-1)",
                "numAMContainerPreempted": 0,
                "numNonAMContainerPreempted": 0,
                "preemptedResourceMB": 0,
                "preemptedResourceVCores": 0,
                "progress": 54.07485,
                "queue": "root.hdfs",
                "runningContainers": 3,
                "startedTime": 1479156085020,
                "state": "RUNNING",
                "trackingUI": "ApplicationMaster",
                "trackingUrl": "http://resourcemanager:8088/proxy/application_1478120586043_15232/",
                "user": "hdfs",
                "vcoreSeconds": 51
            }
        ]
    }
}
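For monitoring, the full dump is usually more than you want. The Python 3 sketch below reduces a parsed response like the one above to one line per application. `summarize_apps` is a hypothetical helper, and I am assuming that a query with no matching applications can come back with a null or missing `apps` value, so the sketch treats that case as an empty list.

```python
def summarize_apps(response):
    """Return one summary line per application in a parsed apps response."""
    # Assumption: "apps" may be null or absent when nothing matches.
    apps = (response.get("apps") or {}).get("app") or []
    lines = []
    for app in apps:
        lines.append(
            "{id} {user} {queue} {progress:.1f}% {allocatedMB} MB".format(**app)
        )
    return lines


if __name__ == "__main__":
    # Trimmed copy of the response shown above.
    sample = {"apps": {"app": [{
        "id": "application_1478120586043_15232",
        "user": "hdfs",
        "queue": "root.hdfs",
        "progress": 54.07485,
        "allocatedMB": 24576,
    }]}}
    for line in summarize_apps(sample):
        print(line)
```

Passing it the `data` dictionary from the script above should work the same way, since the extra fields in each application are simply ignored by the format string.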

Straightforward and simple to use.