Testing the Oracle SQL Connector for Hadoop HDFS
By David WORMS
Jul 15, 2013
- Categories
- Data Engineering
- Tags
- Database
- File system
- Oracle
- HDFS
- CDH
- SQL
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL. For an Oracle user, HDFS files and the Hive tables are hidden behind external tables.
This article describes how to install and use the Oracle SQL Connector for Hadoop. I am only covering the integration with HDFS. Another article describes how to further configure the SQL connector to integrate with Hive.
Minimum Requirements
We assume that we have a Hadoop cluster as well as another server with Oracle already installed. The supported Hadoop distributions are Cloudera CDH3 and CDH4 or Apache Hadoop 1.0 (formerly 0.20.2). The same version of Hadoop must be installed on the Hadoop cluster and on the Oracle server. The minimum Oracle Database release is 11g release 2 (11.2.0.2 or 11.2.0.3) for Linux. In our case, all the servers run CentOS 6.4 with CDH4 and Oracle 11.2.0.
Make sure to disable SELinux on both the Hadoop and the Oracle servers. Using sed, the following command makes a backup and updates the configuration in place; the change takes effect after the next reboot.
sed -i.bck 's/SELINUX=enforcing/SELINUX=disabled/' /etc/sysconfig/selinux
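If you don't want to wait for a reboot, SELinux enforcement can also be turned off for the running session; a quick sketch (the permanent change above is still needed):
# display the current SELinux mode
getenforce
# switch to permissive mode (rules logged but no longer enforced) until the next reboot
setenforce 0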
Prepare the Oracle database
Database creation
The Oracle system must already be installed on your system.
We will create a database "BIG". The following settings are suitable for a development node on a virtual machine. Using the graphical Oracle client: select the "Data Warehouse" template; set both the global database name and the SID to "BIG"; deactivate Oracle Enterprise Manager; use a database password; store the data files on the filesystem; do not activate archiving; set the SGA memory size to 2048 MB and the PGA memory size to 1024 MB; leave the rest empty. We save the database template under the name BIG and generate the database creation scripts inside $ORACLE_BASE/admin/BIG/scripts.
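The generated scripts can then be run from the shell; a minimal sketch, assuming the graphical client produced a "BIG.sh" wrapper in the scripts directory (the exact file name may differ in your installation):
cd $ORACLE_BASE/admin/BIG/scripts
sh BIG.sh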
Environment
To facilitate the connection with the sqlplus CLI client, set the ORACLE_SID variable to the Oracle instance ID (SID).
echo "export ORACLE_SID=BIG" >> ~/.bash_profile
. ~/.bash_profile
env | grep ORA
ORACLE_SID=BIG
ORACLE_BASE=/home/oracle/app/oracle
ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
Your PATH variable should also reflect the Oracle installation:
export PATH=$PATH:$ORACLE_HOME/bin:$ORACLE_HOME/OPatch
The values may differ based on your Oracle installation. In my installation, all the variables are defined inside "~/.bash_profile".
User creation
The database should be started. If it isn't, refer to the "Starting the Database" section below.
You can now log in with the Oracle client with the sysdba privilege (user sys) without a password and create a user "BIG" with the "XXXXXX" password and the "CONNECT" and "RESOURCE" roles.
sqlplus "/ as sysdba"
SQL*Plus: Release 11.2.0.3.0 Production on Fri May 17 12:07:56 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> show user;
USER is "SYS"
SQL> CREATE USER "BIG" IDENTIFIED BY "toto123" PROFILE "DEFAULT" DEFAULT TABLESPACE "USERS" TEMPORARY TABLESPACE "TEMP";
User created.
SQL> GRANT "CONNECT", "RESOURCE" to "BIG";
Grant succeeded.
Alternatively, we could also connect to the database with a remote TOra client. Using the "instant client" connection, set the username to "system" and leave the schema empty to log in.
Here, we granted broad roles to our user; you should be more restrictive. Oracle SQL Connector for HDFS only requires the Hadoop user to have "CREATE SESSION" and "CREATE TABLE", "EXECUTE" on the UTL_FILE PL/SQL package, as well as "READ" and "EXECUTE" on the "OSCH_BIN_PATH" directory created during the installation of Oracle SQL Connector for HDFS.
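For reference, the minimal privileges listed above translate into the following statements for our "BIG" user (a sketch; the "OSCH_BIN_PATH" directory only exists once the connector is installed, see below):
SQL> GRANT CREATE SESSION, CREATE TABLE TO big;
SQL> GRANT EXECUTE ON UTL_FILE TO big;
SQL> GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO big;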
SQL Connector installation
From the official documentation:
Oracle SQL Connector for HDFS is installed and configured on the system where Oracle Database runs. If Hive tables are used as data sources, then Oracle SQL Connector for HDFS must also be installed and running on the system where Hive is installed.
You must install Hadoop on the Oracle Database system and minimally configure it for Hadoop client use only. The installation procedure is described below.
Install Hadoop
This step must only be executed on your Oracle server. Only the Hadoop client is required on this server, and its configuration must match that of your Hadoop cluster.
One easy way is to use the yum repository. In this example we deploy the Cloudera repository, but Hortonworks works the same way.
# Yum repository
cat > /etc/yum.repos.d/cloudera-cdh4.repo <<REPO
[cloudera-cdh4]
name = Cloudera CDH, Version 4
baseurl = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
REPO
yum update
Now that the repository is available, we install the Hadoop client package and apply your cluster configuration. Note that the "mapred-site.xml" and "yarn-site.xml" configuration files are not relevant here; only "core-site.xml" and "hdfs-site.xml" need to be imported.
yum install hadoop-client
# Import your Hadoop configuration
node=a_hadoop_hostname
scp root@${node}:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml
scp root@${node}:/etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml
Ensure that the Oracle Database has access to HDFS. Run the following command using your Oracle Database account and you should see the directory listing of “/user”.
hadoop fs -ls /user
Download and install the SQL connector
This step must be run on your Oracle server and, if using Hive, on your Hadoop cluster nodes as well.
The download must be done manually since it requires accepting the OTN license agreement. Place the resulting file "oraosch-2.1.0.zip" into "/tmp/oraosch-2.1.0.zip".
cd /tmp
unzip oraosch-2.1.0.zip
unzip orahdfs-2.1.0.zip
mv orahdfs-2.1.0 /usr/lib/
Configure the SQL connector on your Oracle server
Edit the "./bin/hdfs_stream" file inside your "orahdfs" folder and set the OSCH_HOME variable to the correct path.
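In our installation, the edited part of "/usr/lib/orahdfs-2.1.0/bin/hdfs_stream" would look like the sketch below; the PATH line is an assumption, there only to make sure the script can find the hadoop command installed by the hadoop-client package:
# excerpt of /usr/lib/orahdfs-2.1.0/bin/hdfs_stream
export OSCH_HOME=/usr/lib/orahdfs-2.1.0
# assumption: the hadoop client binary lives in /usr/bin
export PATH=/usr/bin:$PATH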
Update your environment variables (vi ~/.bash_profile):
export OSCH_HOME=/usr/lib/orahdfs-2.1.0
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"
Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In our example:
SQL> CREATE OR REPLACE DIRECTORY osch_bin_path AS '/usr/lib/orahdfs-2.1.0/bin';
SQL> GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO big;
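You can verify that the directory object points to the right location by querying the data dictionary:
SQL> SELECT directory_name, directory_path FROM all_directories WHERE directory_name = 'OSCH_BIN_PATH';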
Configure the SQL connector on your Hadoop cluster
Contrary to what the official documentation says, this step is not only required if you plan on using Hive. It is also needed if you wish to declare new external tables from the Hadoop cluster (the "ExternalTable" command described below). In that case, repeat the download, unzip and environment steps from above on the cluster node running the command, as sketched below.
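A quick sketch of that setup on a cluster node (same archive and paths as on the Oracle server; adjust them to your layout):
cd /tmp
unzip oraosch-2.1.0.zip
unzip orahdfs-2.1.0.zip
mv orahdfs-2.1.0 /usr/lib/
# make the connector jars visible to the hadoop command
echo 'export OSCH_HOME=/usr/lib/orahdfs-2.1.0' >> ~/.bash_profile
echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OSCH_HOME/jlib/*"' >> ~/.bash_profile
. ~/.bash_profile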
Creating a table referencing an HDFS file
First, we need to create an external directory and grant access to our user.
SQL> CREATE OR REPLACE DIRECTORY ext_hdfs_test_dir AS '/home/oracle/ext_hdfs_test_dir';
SQL> GRANT READ, WRITE ON DIRECTORY ext_hdfs_test_dir TO big;
Now we can declare the Hadoop data file as a table.
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf ./hdfs.xml -createTable
Where the content of the “hdfs.xml” file is:
<?xml version="1.0"?>
<configuration>
<property>
<name>oracle.hadoop.connection.url</name>
<value>jdbc:oracle:thin:@//100.100.100.52:1521/BIG</value>
</property>
<property>
<name>oracle.hadoop.connection.user</name>
<value>BIG</value>
</property>
<property>
<name>oracle.hadoop.exttab.sourceType</name>
<value>text</value>
</property>
<property>
<name>oracle.hadoop.exttab.tableName</name>
<value>BIG.EXT_HDFS_TEST</value>
</property>
<property>
<name>oracle.hadoop.exttab.defaultDirectory</name>
<value>ext_hdfs_test_dir</value>
</property>
<property>
<name>oracle.hadoop.exttab.dataPaths</name>
<value>/user/big/oracle/hdfs.csv</value>
</property>
<property>
<name>oracle.hadoop.exttab.columnNames</name>
<value>ID_MI,ID_MT,HOST,TS,VAL_MOYENNE,VAL_MIN,VAL_MAX,FIABILITY</value>
</property>
<property>
<name>oracle.hadoop.exttab.fieldTerminator</name>
<value>,</value>
</property>
</configuration>
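The "oracle.hadoop.exttab.dataPaths" property points to a CSV file that must already exist in HDFS before running the -createTable command above; a minimal upload sketch, assuming a local sample file named "hdfs.csv" with the eight comma-separated columns declared in the configuration:
hadoop fs -mkdir -p /user/big/oracle
hadoop fs -put hdfs.csv /user/big/oracle/hdfs.csv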
You can test the connection with the following query.
SQL> SELECT COUNT(*) FROM big.ext_hdfs_test;
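Since the HDFS file is now exposed as a regular external table, it can also be joined with database-resident data; a sketch, assuming a hypothetical BIG.HOSTS table that stores metadata about the hosts appearing in the CSV:
SQL> SELECT h.datacenter, COUNT(*)
     FROM big.ext_hdfs_test e
     JOIN big.hosts h ON h.hostname = e.host
     GROUP BY h.datacenter;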
Additional notes
Starting the Database
lsnrctl start
sqlplus "/ as sysdba"
...
SQL> startup
Disable Oracle Database Vault
Database Vault needs to be disabled, or you will encounter the error "KUP-04094: preprocessing cannot be performed if Database Vault is installed". The procedure to disable it is described below.
- Check if Oracle Database Vault Is Enabled or Disabled:
sqlplus sys as sysdba
Enter password: password
SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
TRUE
- Stop all Oracle services
SQL> SHUTDOWN IMMEDIATE
SQL> EXIT
If you have installed the enterprise manager database control, then:
$ emctl stop dbconsole
$ lsnrctl stop [listener_name]
- Relink:
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dv_off ioracle
- Restart Oracle services:
lsnrctl start
sqlplus "/ as sysdba"
SQL> startup
SQL> SELECT * FROM V$OPTION WHERE PARAMETER = 'Oracle Database Vault';
...
Oracle Database Vault
FALSE