Enabling Wire Encryption and Kerberos on HDFS Pipelines

Info

This topic does not apply to MemSQL Helios.

In advanced HDFS Pipelines mode, you can encrypt your pipeline’s connection to HDFS and you can authenticate your pipeline using Kerberos. MemSQL supports Hadoop’s Data Transfer Protocol (DTP), which encrypts your pipeline’s connection to HDFS.

This topic assumes you have already set up your HDFS cluster to use wire encryption, Kerberos, or both. For information on how to set up wire encryption, see the DTP section in the Hadoop Secure Mode documentation. For information on how to set up your HDFS cluster to use Kerberos, see the Kerberos discussion in the Hadoop Secure Mode documentation.

To create an advanced HDFS pipeline, first set the advanced_hdfs_pipelines engine sync variable to true on the master aggregator. Then, run a CREATE PIPELINE statement and pass in JSON attributes in the CONFIG clause. These attributes specify how to encrypt your pipeline’s connection to HDFS, how to authenticate your pipeline using Kerberos, or both.
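
As a minimal sketch of the first step, the shell helper below prints a statement that sets an engine sync variable, so that the output can be piped to a MySQL-compatible client connected to the master aggregator. The SET GLOBAL form, the helper name sync_var_sql, and the connection details in the usage comment are assumptions rather than something this topic specifies; verify them against your own deployment.

```shell
#!/bin/sh
# Hypothetical helper: print a statement that sets an engine sync variable.
# Assumption: sync variables can be set with SET GLOBAL on the master aggregator.
sync_var_sql() {
  # $1 = variable name, $2 = value
  printf 'SET GLOBAL %s = %s;\n' "$1" "$2"
}

# Usage (host, port, and user are placeholders):
#   sync_var_sql advanced_hdfs_pipelines true | mysql -h master-agg-host -P 3306 -u root -p
```

Printing the statement first makes it easy to review what will run before sending it to the master aggregator.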

Info

With advanced HDFS pipelines, you can enable debug logging. To do so, set the pipelines_extractor_debug_logging engine sync variable to true. This setting allows your pipeline to return error messages to the client application.

Wire Encryption

If encrypted DTP is enabled in your HDFS cluster, you can encrypt your pipeline’s connection to HDFS. To do this, build the CONFIG JSON that you will pass to CREATE PIPELINE as follows:

  1. Set dfs.encrypt.data.transfer to true.
  2. Set the attributes dfs.encrypt.data.transfer.cipher.key.bitlength, dfs.encrypt.data.transfer.algorithm, and dfs.data.transfer.protection. Set these attributes’ values as they are specified in your hdfs-site.xml file. You can find a copy of this file on each node in your HDFS cluster.
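
To look up the values mentioned in the steps above, something like the following shell function could be used. It is only a sketch: the function name hdfs_prop is hypothetical, the file path in the usage comment is a placeholder, and the parsing assumes the conventional hdfs-site.xml layout in which each property has a <name> element followed by a <value> element. It is not a general XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one property from a Hadoop XML config file.
# Assumption: each <name> element is followed by its <value> element.
hdfs_prop() {
  # $1 = path to hdfs-site.xml, $2 = property name
  awk -v want="$2" '
    /<name>/ {
      # Remember the most recent property name seen.
      name = $0
      sub(/.*<name>[ \t]*/, "", name)
      sub(/[ \t]*<\/name>.*/, "", name)
    }
    /<value>/ && name == want {
      # Print the value that belongs to the requested name, then stop.
      value = $0
      sub(/.*<value>[ \t]*/, "", value)
      sub(/[ \t]*<\/value>.*/, "", value)
      print value
      exit
    }
  ' "$1"
}

# Usage (the path is a placeholder):
#   hdfs_prop /path/to/hdfs-site.xml dfs.data.transfer.protection
```

If the property is not present in the file, the function prints nothing, which is a hint that the attribute may not need to be set in the CONFIG JSON.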

The following example creates a pipeline that uses encrypted DTP to communicate with HDFS.

CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
	"dfs.encrypt.data.transfer": true,
	"dfs.encrypt.data.transfer.cipher.key.bitlength": 256,
	"dfs.encrypt.data.transfer.algorithm": "rc4",
	"dfs.data.transfer.protection": "authentication"
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';

Authenticating with Kerberos

You can create an HDFS pipeline that authenticates with Kerberos. Prior to doing so, perform the following installation steps on every MemSQL leaf node. These steps use EXAMPLE.COM as the default realm and host.example.com as the fully qualified domain name (FQDN) of the KDC server.

Info

Perform the following steps on every MemSQL leaf node (referred to below as the “node”). The exception is step 3, which you perform on the KDC server only.

  1. Install version 1.8 or later of the Java Runtime Environment (JRE). The JRE version installed should match the JRE version installed on the HDFS nodes.

  2. Tell MemSQL the path where the JRE binary files have been installed. An example path is /usr/bin/java/jre1.8.2_12/bin. Specify the path using one of the two following methods:

    Method 1: Add the path to your operating system’s PATH environment variable.

    Method 2: Set the engine variables java_pipelines_java_path and java_pipelines_java_home to the path.

  3. On the KDC server, create a MemSQL service principal (e.g. memsql/host.example.com@EXAMPLE.COM) and a keytab file containing the MemSQL service principal.

  4. Securely copy the keytab file containing the MemSQL service principal from the KDC server to the node. You should use a secure file transfer method, such as scp, to copy the keytab file to your node. The file location on your node should be consistent across all nodes in the cluster.

  5. Ensure that the Linux service account used to run MemSQL on the node can access the copied keytab file. This can be accomplished by changing file ownership or permissions. If this account cannot access the keytab file, you will not be able to complete the next step because your master aggregator will not be able to restart after applying configuration updates.

  6. When authenticating with Kerberos, MemSQL needs to authenticate as a client, which means you must also install a Kerberos client on your node.

    The following command installs the client on Debian-based Linux distributions.

    sudo apt-get update && sudo apt-get install krb5-user

    The following command installs the client on RHEL/CentOS:

    sudo yum install krb5-workstation

  7. Configure your Kerberos client to connect to the KDC server. In your node’s /etc/krb5.conf file, set your default realm, Kerberos admin server, and other options to those defined by your KDC server.

  8. Make sure your node can connect to the KDC server using the KDC server’s FQDN, which is found in the /etc/krb5.conf file. This might require configuring network settings or updating /etc/hosts on your node.

  9. Ensure that your node can access every HDFS datanode, using the FQDN or IP by which the HDFS namenode accesses the datanode. The FQDN is typically used.

  10. Specify the path of your keytab file in the kerberos.keytab attribute of your CONFIG JSON that you will pass to your CREATE PIPELINE statement.

  11. In your CONFIG JSON, add the attributes dfs.datanode.kerberos.principal and dfs.namenode.kerberos.principal. Set these attributes’ values as they are specified in your hdfs-site.xml file. You can find a copy of this file on each node in your HDFS cluster.
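
Before creating the pipeline, it can help to confirm that the keytab from steps 4 and 5 is usable on the node. The sketch below is one possible pre-flight check, not part of MemSQL: the function name check_keytab is hypothetical, and the paths and principal in the usage comment are the placeholder values used in this topic. The klist and kinit commands come from the Kerberos client installed in step 6.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm the current user can read the keytab file.
check_keytab() {
  # $1 = path to the keytab file copied from the KDC server
  if [ ! -f "$1" ]; then
    echo "keytab not found: $1" >&2
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "keytab not readable by $(id -un): $1" >&2
    return 1
  fi
  echo "keytab readable: $1"
}

# Usage, run as the Linux service account that runs MemSQL (values are placeholders):
#   check_keytab /path/to/kerberos.keytab
#   klist -k -t /path/to/kerberos.keytab    # list the principals stored in the keytab
#   kinit -k -t /path/to/kerberos.keytab memsql/host.example.com@EXAMPLE.COM
#   klist                                   # confirm that a ticket was issued
```

If kinit succeeds under the service account, the keytab, the /etc/krb5.conf settings, and connectivity to the KDC server are likely in order.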

Example CREATE PIPELINE Statement Using Kerberos

The following example demonstrates how to create an HDFS pipeline that authenticates using Kerberos. Assume that port 8020 is the HDFS endpoint.

CREATE PIPELINE my_pipeline
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/path/to/files'
CONFIG '{
	"hadoop.security.authentication": "kerberos",
	"kerberos.user": "memsql/host.example.com@EXAMPLE.COM",
	"kerberos.keytab": "/path/to/kerberos.keytab",
	"dfs.client.use.datanode.hostname": true,
	"dfs.datanode.kerberos.principal": "datanode_principal/_HOST@EXAMPLE.COM",
	"dfs.namenode.kerberos.principal": "namenode_principal/_HOST@EXAMPLE.COM"
}'
INTO TABLE `my_table`
FIELDS TERMINATED BY '\t';