Adaltas works with open source web technologies. Our focus is on rich Internet applications based on HTML5, the server-side Node.js stack, NoSQL storage and big data processing, notably on the Hadoop platform.
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other database-resident data. If required, you can also load data into the database using SQL.
The HDFS files and the Hive tables are defined inside Oracle as external tables.
I will list the different tools and libraries available to us developers to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in more detail in a follow-up article.
I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing, April 2013, I am comparing the Cloudera distribution 4.2.0 and the Hortonworks Data Platform 2.0.0.
This isn’t the first time I have faced this situation. Today, I finally found the time and energy to look for a solution. Let’s say that, in your Mocha test, you need to test an expected “uncaughtException” event, the Node.js technique to catch the uncatchable. Easy, just register an “uncaughtException” listener on the process event emitter. Well, not so easy, and not so complicated either.
A few days ago, I explained how to set up a cluster of virtual machines with static IPs and Internet access, suitable to host your Hadoop cluster locally for development. At the time I made use of VMware. I’m coming back to the same topic, but this time using the VirtualBox manager.
I decided to give VirtualBox a chance as an alternative to VMware for multiple reasons. The installation of CentOS partially failed at the end and I needed to reboot the machine. There were no real consequences, but it is not something I appreciate. VirtualBox is free and open source, while VMware isn’t open source and is even sold commercially on OS X. Another feature I was interested in is the ability to choose the IP address range of my internal network, since I have limited memory for that sort of thing. After many trials, I managed to install the VMware tools only once, and don’t ask me how I did it: another painful experience. Finally, I have the sweet, still hypothetical idea of scripting the virtual machine provisioning and installation process. If I’m not wrong, that shouldn’t be a problem with VirtualBox.
Apache Hadoop is of course available for download on its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is a daunting task. Below is a list of the main distributions including Hadoop. This follows an article published a few days ago about the Hadoop ecosystem.
As I am about to install and test Ambari, this article is the occasion to illustrate how I set up my development environment with multiple virtual machines. Ambari, the deployment and monitoring tool for Hadoop clusters, will be the subject of a yet-to-be-written article. My virtual environment is VMware, but VirtualBox has the same network functionalities and should work as well.
What’s really important here is to assign each virtual machine a fixed IP address which won’t change over time. I personally work on a MacBook Pro laptop and I found it very frustrating to restart each of the Hadoop components whenever I received new IP addresses while switching between networks. Additionally, the setup should provide an Internet gateway.
I am going to show how to split a file stored as CSV inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our datacenter through syslog. The stream is dumped into HDFS files partitioned by minute. Oozie listens for newly created directories and, when one is ready, distributes its content across various Hive tables, one for each log category.
For example, we want ssh logs to go to the ssh table. If we cannot determine which category a log record belongs to, we dump it into an “xlogs” table. Later on, when appropriate new rules are added, we should be able to iterate through the “xlogs” table and dispatch its records across the appropriate tables.
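The routing idea described above can be sketched in a few lines of plain Python. This is only an illustration of the dispatch logic, not the Hive and Oozie implementation the article covers: the rule patterns, the `RULES` table and the `categorize` and `dispatch` helpers are all hypothetical names introduced for the example.

```python
import re
from collections import defaultdict

# Hypothetical rules: category (table) name -> pattern tested against the raw record.
RULES = {
    "ssh": re.compile(r"\bsshd\b"),
    "cron": re.compile(r"\bCROND?\b"),
}

# Records matching no rule land in this catch-all table.
FALLBACK = "xlogs"


def categorize(record, rules=RULES):
    """Return the first category whose pattern matches, else the fallback."""
    for category, pattern in rules.items():
        if pattern.search(record):
            return category
    return FALLBACK


def dispatch(records, rules=RULES):
    """Group records by category, mimicking one table per log category."""
    tables = defaultdict(list)
    for record in records:
        tables[categorize(record, rules)].append(record)
    return dict(tables)
```

Once a new rule is added, the same `dispatch` function can be run again over the content of the “xlogs” group with the extended rule set, which mirrors the later re-dispatch step mentioned above.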
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility of “hadoop fs” but, contrary to “getmerge”, the final merged file isn’t put into the local filesystem but inside HDFS.
Maven 3 isn’t so different from its previous version 2. You will migrate most of your projects quite easily between the two versions. That wasn’t the case a few years ago between versions 1 and 2. However, it took me some time to find out how to properly configure my proxy settings, and this article is the occasion to share the result and keep it for later reference.