Apache Knox made easy!

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications?

Apache Knox overview

Apache Knox is an application gateway for interacting in a secure way with the REST APIs and the user interfaces of one or more Hadoop clusters. Out of the box it provides:

On the other hand, it is not an alternative to Kerberos for strong authentication of an Hadoop cluster, nor a channel for acquiring or exporting large volumes of data.

We can define the benefits of the gateway in four different categories:

  • Enhanced security through the exposure of REST and HTTP services without revealing the details of the Hadoop cluster, the filter on the vulnerabilities of web applications or the use of the SSL protocol on services that do not have this possibility;
  • Centralized control through the use of a single gateway, which facilitates auditing and authorizations (with Apache Ranger);
  • Simplified access thanks to the encapsulation of services with Kerberos or the use of a single SSL certificate;
  • Enterprise integration through leading market solutions (Microsoft Active Directory, LDAP, Kerberos, etc) or using custom solutions (Apache Shiro, etc).

Apache Knox example of architecture

Kerberos encapsulation

Encapsulation is mainly used for products that are incompatible with the Kerberos protocol. The user provides his username and password via the use of the HTTP Basic Auth protocol.

Kerberos encapsulation kinematics

Simplified management of client certificates

The user relies solely on the Apache Knox certificate, so we centralize the certificates of the different services on the Apache Knox servers and not on all the clients. Very useful when revoking and creating a new certificate.

Certificate Management via Apache Knox

Apache Ranger integration

Apache Knox includes an Apache Ranger agent to check the permissions of users who want to access cluster ressources.

Apache Knox plugin operation kinematics for Apache Ranger

Hadoop URLs VS Knox URLs

Using Apache Knox URLs obscures the cluster architecture and allows users to remember only one URL.

Service Hadoop URL Apache Knox URL
WebHDFS http://namenode-host:50070/webhdfs https://knox-host:8443/gateway/default/webhdfs
WebHCat http://webhcat-host:50111/templeton https://knox-host:8443/gateway/default/templeton
Apache Oozie http://oozie-host:11000/oozie https://knox-host:8443/gateway/default/oozie
Apache HBase http://hbase-host:60080 https://knox-host:8443/gateway/default/hbase
Apache Hive http://hive-host:10001/cliservice https://knox-host:8443/gateway/default/hive

Customizing Apache Knox

To configure the Apache Knox gateway, we need to modify the topology type files. These files consist of three components: Providers, HA Provider and Services.

Topology

You can find the topology files in the topologies directory /conf/topologies. The name of the topology file will dictate the Apache Knox URL, if the file is named sandbox.xml, the URL will be: https://knox-host:8443/gateway/sandbox/webhdfs.

Here is an example of a topology file:

<topology>
  <gateway>
    <!-- Authentication provider -->
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      [...]
    </provider>

    <!-- Identity assertion provider -->
    <provider>
      <role>identity-assertion</role>
      <name>HadoopGroupProvider</name>
      <enabled>true</enabled>
      [...]
    </provider>

    <!-- Authorization provider -->
    <provider>
      <role>authorization</role>
      <name>XASecurePDPKnox</name>
      <enabled>true</enabled>
      [...]
    </provider>

    <!-- HA provider -->
    <provider>
      <role>ha</role>
      <name>HaProvider</name>
      <enabled>true</enabled>
      [...]
    </provider>
  </gateway>

  <!-- Services -->
  <service>
    <role>WEBHDFS</role>
    <url>http://webhdfs-host:50070/webhdfs</url>
  </service>
</topology>

Be careful, if you use Apache Ambari, the topologies admin, knoxsso, manager and default must be modified via the web interface, otherwise in case of restart of the service the files will be overwritten.

Provider

Providers add new features (authentication, federation, authorization, identity-assertion, etc) to the gateway that can be used by different services, usually one or more filters that are added to one or more topologies.

The Apache Knox gateway supports federation by adding HTTP header. Federation provides a quick way to implement single sign-on (SSO) by propagating user and group information. Use it only in a highly controlled network environment.

The default authentication provider is Apache Shiro (ShiroProvider). It is used for authentication to an Active Directory or LDAP. For authentication via Kerberos, we will use HadoopAuth.

There are five main identity-assertion providers:

  • Default: this is the default provider for easy mapping of names and / or groups of users. It is responsible for establishing the identity passed on to the rest of the cluster;
  • Concat: it is the provider that allows the composition of a new name and / or groups of users by concatenating a prefix and / or a suffix;
  • SwitchCase: this provider resolves the case where the ecosystem requires a specific case for the names and / or groups of users;
  • Regex: it allows incoming identities to be translated using a regular expression;
  • HadoopGroupProvider: this provider looks for the user within a group to greatly facilitate the definition of permissions (via Apache Ranger).

In case you use the HadoopGroupProvider provider, you are only required to use groups for setting up permissions (via Apache Ranger), a JIRA (KNOX-821: Identity Assertion Providers must be able to be chained together) was opened to be able to chain together several identity providers.

Services

Services, in turn, add new routing rules to the gateway. The services are located in /data/services directory. Here is an example of the files that make up the HIVE service:

-bash-4.1$ ls -lahR hive/
hive/:
total 12K
drwxr-xr-x  3 knox knox 4.0K Nov 16  2017 .
drwxr-xr-x 84 knox knox 4.0K Nov 16 17:07 ..
drwxr-xr-x  2 knox knox 4.0K Nov 16  2017 0.13.0

hive/0.13.0:
total 16K
drwxr-xr-x 2 knox knox 4.0K Nov 16  2017 .
drwxr-xr-x 3 knox knox 4.0K Nov 16  2017 ..
-rw-r--r-- 1 knox knox  949 May 31  2017 rewrite.xml
-rw-r--r-- 1 knox knox 1.1K May 31  2017 service.xml

In the hive directory, we find the directory 0.13.0 (which corresponds to the version of the service). And inside this folder we find the rewrite.xml and service.xml files.

Personalized services

Let’s assume a REST application available at http://rest-application:8080/api/. Just create a new service and add it to the topology file.

<topology>
  [...]
  <!-- Services -->
  <service>
    <role>{{ service_name | upper }}</role>
    <url>http://rest-application:8080</url>
  </service>
</topology>

service.xml file, the value of {{ knox_service_version }} must be equal to the name of the parent folder

<service role="{{ service_name | upper }}" name="{{ service_name | lower }}" version="{{ knox_service_version }}">
  <policies>
    <policy role="webappsec"/>
    <policy role="authentication"/>
    <policy role="federation"/>
    <policy role="identity-assertion"/>
    <policy role="authorization"/>
    <policy role="rewrite"/>
  </policies>
  <routes>
    <route path="/{{ service_name | lower }}">
      <rewrite apply="{{ service_name | upper }}/{{ service_name | lower }}/inbound/root" to="request.url"/>
    </route>
    <route path="/{{ service_name | lower }}/**">
      <rewrite apply="{{ service_name | upper }}/{{ service_name | lower }}/inbound/path" to="request.url"/>
    </route>
    <route path="/{{ service_name | lower }}/?**">
      <rewrite apply="{{ service_name | upper }}/{{ service_name | lower }}/inbound/args" to="request.url"/>
    </route>
    <route path="/{{ service_name | lower }}/**?**">
      <rewrite apply="{{ service_name | upper }}/{{ service_name | lower }}/inbound/pathargs" to="request.url"/>
    </route>
  </routes>
</service>

rewrite.xml file where we re-write the URL (adding here /api):

<rules>
  <rule dir="IN" name="{{ service_name | upper }}/{{ service_name | lower }}/inbound/root" pattern="*://*:*/**/{{ service_name | lower }}">
    <rewrite template="{$serviceUrl[{{ service_name | upper }}]}/api/{{ service_name | lower }}"/>
  </rule>
  <rule dir="IN" name="{{ service_name | upper }}/{{ service_name | lower }}/inbound/path" pattern="*://*:*/**/{{ service_name | lower }}/{path=**}">
    <rewrite template="{$serviceUrl[{{ service_name | upper }}]}/api/{{ service_name | lower }}/{path=**}"/>
  </rule>
  <rule dir="IN" name="{{ service_name | upper }}/{{ service_name | lower }}/inbound/args" pattern="*://*:*/**/{{ service_name | lower }}/?{**}">
    <rewrite template="{$serviceUrl[{{ service_name | upper }}]}/api/{{ service_name | lower }}/?{**}"/>
  </rule>
  <rule dir="IN" name="{{ service_name | upper }}/{{ service_name | lower }}/inbound/pathargs" pattern="*://*:*/**/{{ service_name | lower }}/{path=**}?{**}">
    <rewrite template="{$serviceUrl[{{ service_name | upper }}]}/api/{{ service_name | lower }}/{path=**}?{**}"/>
  </rule>
</rules>

For more details on setting up custom services behind Apache Knox, here is a link to an article.

Tips and tricks

Setting up SSL

# Requirements
$ export KEYALIAS=gateway-identity
$ export PRIVATEKEYALIAS=gateway-identity-passphrase
$ export KEYPASS=4eTTb03v
$ export P12PASS=$KEYPASS
$ export JKSPASS=$KEYPASS

# Save old Keystore
$ mv /usr/hdp/current/knox-server/data/security/keystores /usr/hdp/current/knox-server/data
$ mkdir /usr/hdp/current/knox-server/data/security/keystores
$ chown knox:knox /usr/hdp/current/knox-server/data/security/keystores
$ chmod 755 /usr/hdp/current/knox-server/data/security/keystores

# Generate Knox Master Key
$ /usr/hdp/current/knox-server/bin/knoxcli.sh create-master --force
> Enter master secret: $JKSPASS
> Enter master secret again: $JKSPASS

# Generate the Keystore
$ openssl pkcs12 -export -in /etc/security/.ssl/`hostname -f`.cer -inkey /etc/security/.ssl/`hostname -f`.key -passin env:KEYPASS \
    -out /usr/hdp/current/knox-server/data/security/keystores/gateway.p12 -password env:P12PASS -name $KEYALIAS
$ keytool -importkeystore -deststorepass $JKSPASS -destkeypass $KEYPASS \
    -destkeystore /usr/hdp/current/knox-server/data/security/keystores/gateway.jks \
    -srckeystore /usr/hdp/current/knox-server/data/security/keystores/gateway.p12 \
    -srcstoretype PKCS12 -srcstorepass $P12PASS -alias $KEYALIAS

# Import the parents certificates
$ keytool -import -file /etc/security/.ssl/enterprise.cer -alias enterprise \
    -keystore /usr/hdp/current/knox-server/data/security/keystores/gateway.jks -storepass $JKSPASS

# Save the private key in the Keystore
$ /usr/hdp/current/knox-server/bin/knoxcli.sh create-alias $PRIVATEKEYALIAS --value $JKSPASS

When backing up the private key in the keystore, the $PRIVATEKEYALIAS value must be gateway-identity-passphrase.

The password must match the one used when generating the master key (Apache Knox Master secret). That’s why we use the same password everywhere ($JKSPASS).

Common mistakes

First of all, a clean restart of Apache Knox solves come problems and purges the logs before restarting the query that is problematic.

# Stop Apache Knox via Ambari

# Delete topologies and services deployments
$ rm -rf /usr/hdp/current/knox-server/data/deployments/*.topo.*

# Delete JCEKS keystores
$ rm -rf /usr/hdp/current/knox-server/data/security/keystores/*.jceks

# Purge logs
$ rm -rf /var/log/knox/*

# Kill zombie process
$ ps -ef | grep `cat /var/run/knox/gateway.pid`
$ kill -9 `cat /var/run/knox/gateway.pid`

# Start Apache Knox via Ambari

Bulky answer

If your queries fail quickly and return a 500 error, the request response may be too large (8Kb by default). You will find something like this in the gateway-audit.log file:

dispatch|uri|http://<hostname>:10000/cliservice?doAs=<user>|success|Response status: 500 

Just modify the topology to add a parameter in the service in question:

<param name="replayBufferSize" value="32" />

Apache Solr

If you have enabled auditing Apache Solr in Apache Ranger (xasecure.audit.destination.solr=true), it is possible that in case of problems with Apache Solr no left space filesystem, Apache Knox will not work anymore.

Apache Hive connections often disconnected via Apache Knox

To correct this problem, you must add to the gateway-site configuration file (these values must be modified according to your environment).

gateway.httpclient.connectionTimeout=600000 (10 min)
gateway.httpclient.socketTimeout=600000 (10 min)
gateway.metrics.enabled=false
gateway.jmx.metrics.reporting.enabled=false
gateway.httpclient.maxConnections=128

How to check my deployment?

Apache Knox provides a client to check several things in the deployment of your instance.

Certificate validation

Validate the certificate used by Apache Knox instance, to verify that you have the certificates of your company.

$ openssl s_client -showcerts -connect {{ knoxhostname }}:8443

Topology validation

Validation that the description of a cluster (clustername equals the topology) follows the correct formatting rules.

$ /usr/hdp/current/knox-server/bin/knoxcli.sh validate-topology [--cluster clustername] | [-- path "path/to/file"]

Authentication and authorization via LDAP

This command tests the ability of a cluster configuration (topology) to authenticate a user with ShiroProvider settings. The --g parameter lists the groups of which a user is a member. The --u and --p parameters are optional, if they are not provided, the terminal will ask you for one.

$ /usr/hdp/current/knox-server/bin/knoxcli.sh user-auth-test [--cluster clustername] [--u username] [--p password] [--g]

Topology / LDAP binding

This command tests the ability of a cluster configuration (topology) to authenticate the user provided only with ShiroProvider settings.

$ /usr/hdp/current/knox-server/bin/knoxcli.sh system-user-auth-test [--cluster c] [--d]

Gateway test

Using the HTTP Basic access authentication method:

$ export $KNOX_URL=knox.local
$ curl -vik -u admin:admin-password 'https://$KNOX_URL:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'

Using the Kerberos authentication method:

$ export $KNOX_URL=knox.local
$ export $KNOX_USER=knoxuser
$ kdestroy
$ kinit $KNOX_USER

$ curl -i -k --negotiate -X GET -u 'https://$KNOX_URL:8443/gateway/kdefault/webhdfs/v1/?op=LISTSTATUS'

In this case, we use the kdefault topology, which uses a HadoopAuth authentication provider.

Direct reading

The /usr/hdp/current/knox-server/data/deployments/ directory contains the different folders corresponding to your different topologies. Whenever you update your topology, a new directory is created based on {{ topologyname }}.topo.{{ timestamp }}.

In the %2F/WEB-INF/ subdirectory you will find the file rewrite.xml which is a concatenation of the topology file and services files. In this file, you can check that your rewrite rules have been taken into account.

$ cd /usr/hdp/current/knox-server/data/deployments/
$ ll
total 0
drwxr-xr-x. 4 knox knox 31 22 janv. 16:04 admin.topo.168760a5c28
drwxr-xr-x. 4 knox knox 31 24 janv. 18:53 admin.topo.16880fef4b8
drwxr-xr-x. 4 knox knox 31 22 janv. 16:04 default.topo.168760a5c28
drwxr-xr-x. 5 knox knox 49 22 janv. 16:04 knoxsso.topo.168760a5c28
drwxr-xr-x. 5 knox knox 49 22 janv. 16:04 manager.topo.15f6b1f2888
$ cd default.topo.168760a5c28/%2F/WEB-INF/
$ ll
total 104
-rw-r--r--. 1 knox knox 70632 22 janv. 16:04 gateway.xml
-rw-r--r--. 1 knox knox  1112 22 janv. 16:04 ha.xml
-rw-r--r--. 1 knox knox 19384 22 janv. 16:04 rewrite.xml
-rw-r--r--. 1 knox knox   586 22 janv. 16:04 shiro.ini
-rw-r--r--. 1 knox knox  1700 22 janv. 16:04 web.xml

Apache Knox Ranger plugin debug

This configuration makes it possible to see what is transmitted to Apache Ranger via the Apache Knox plugin. You must modify the gateway-log4j.properties file as below. Restart your Apache Knox instances and check the longs in ranger.knoxagent.log file.

ranger.knoxagent.logger=DEBUG,console,KNOXAGENT
ranger.knoxagent.log.file=ranger.knoxagent.log
log4j.logger.org.apache.ranger=${ranger.knoxagent.logger}
log4j.additivity.org.apache.ranger=false
log4j.appender.KNOXAGENT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.KNOXAGENT.File=${app.log.dir}/${ranger.knoxagent.log.file}
log4j.appender.KNOXAGENT.layout=org.apache.log4j.PatternLayout
log4j.appender.KNOXAGENT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n %L
log4j.appender.KNOXAGENT.DatePattern=.yyyy-MM-dd

Improve response times via Apache Knox

If you have much higher response times through Apache Knox, you can change these different settings in the gateway.site configuration file.

gateway.metrics.enabled=false 
gateway.jmx.metrics.reporting.enabled=false 
gateway.graphite.metrics.reporting.enabled=false

Conclusion

In conclusion, Apache Knox is a powerful tool, with Apache Ranger audits, to filter and audit all access to your environment(s). But it also allows you to be used as a classic gateway in front of your various personalized services by means of configuration. For example, you can add Kerberos authentication in front of your REST APIs that do not have it. Now, it’s up to you to play and make your feedback in the comment section.

Sources

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.