Articles published in 2013
Tutorial for creating and publishing a new Node.js module
Categories: Front End | Tags: Learning and tutorial, License, Mocha, NPM, Travis CI, CoffeeScript, GitHub, JavaScript, Node.js, Unit tests
In this tutorial, I provide complete instructions for creating a new Node.js module, writing the code in CoffeeScript, publishing it on GitHub, sharing it with fellow Node.js developers through NPM…
By David WORMS
Dec 3, 2013
Catch 'uncaughtException' error in your mocha test
Categories: Node.js | Tags: DevOps, Mocha, JavaScript, Unit tests
This isn’t the first time I have faced this situation. Today, I finally found the time and energy to look for a solution. In your Mocha test, let’s say you need to test an expected “uncaughtException…
By David WORMS
Oct 27, 2013
Crawl your website, including its login form, with PhantomJS
Categories: Front End | Tags: Mocha, CoffeeScript, JavaScript, Node.js, Unit tests
With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls an Internet browser and manipulates the…
By David WORMS
Nov 27, 2013
Remote connection with SSH
Categories: Cyber Security | Tags: Automation, HTTP, SSH
While I was teaching Big Data and Hadoop, a student asked me about SSH and how to use it. I’ll discuss the protocol and the tools to benefit from it. Lately, I have been automating the deployment of Hadoop clusters…
By David WORMS
Oct 2, 2013
Components for CDH and HDP
Categories: Big Data | Tags: Flume, Hortonworks, Hadoop, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, HDP
I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…
By David WORMS
Sep 22, 2013
Splitting HDFS files into multiple Hive tables
Categories: Data Engineering | Tags: Flume, Pig, HDFS, Hive, Oozie, Python, SQL
I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…
By David WORMS
Sep 15, 2013
About the new BSD license and how it differs from other BSD licenses
Categories: Data Governance | Tags: License, Open source
As a non-restrictive open source license, the “new BSD license” is commonly used across the Node.js community. However, it is only one of the BSD licenses available, alongside the original “BSD…
By David WORMS
Aug 8, 2013
Kerberos and delegation tokens security with WebHDFS
Categories: Cyber Security | Tags: HTTP, HDFS, Big Data, Kerberos
WebHDFS is an HTTP REST server bundled with the latest version of Hadoop. What interests me in this article is digging into security with Kerberos and delegation tokens. I will cover…
By David WORMS
Jul 25, 2013
Maven 3 behind a proxy
Categories: Hack | Tags: Maven, Java, Proxy
Maven 3 isn’t so different from its previous version 2. You will migrate most of your projects quite easily between the two versions. That wasn’t the case a few years ago between versions 1 and…
By David WORMS
Jul 11, 2013
Testing the Oracle SQL Connector for Hadoop HDFS
Categories: Data Engineering | Tags: Database, File system, Oracle, HDFS, CDH, SQL
Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…
By David WORMS
Jul 15, 2013
Node CSV version 0.2.7
Categories: Hack | Tags: Pipeline, CoffeeScript, CSV, Node.js
While I’m releasing version 0.2.7 of the CSV parser for Node.js, I stop here to drop a few lines about what has made it into this release. We are now using the latest CoffeeScript, which is version 1.4.…
By David WORMS
Jul 9, 2013
State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, File system, MongoDB, Hadoop, Kafka, Mahout, Consensus, Data Science, File Format, PostgreSQL, Storage
Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to introduce. ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013
Oracle and Hive: how is data published?
Categories: Big Data | Tags: Oracle, Hive, Sqoop, Data Lake
In the past few days, I’ve published three related articles: a first one covering the options to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…
By David WORMS
Jul 6, 2013
Oracle to Apache Hive with the Oracle SQL Connector
Categories: Business Intelligence | Tags: Oracle, HDFS, Hive, Network
In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow-up article, I covered the Oracle SQL Connector, its installation and integration…
By David WORMS
May 27, 2013
Options to connect and integrate Hadoop with Oracle
Categories: Data Engineering | Tags: Database, Java, Oracle, R, RDBMS, Avro, HDFS, Hive, MapReduce, Sqoop, NoSQL, SQL
I will list the different tools and libraries available to us developers to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…
By David WORMS
May 15, 2013
The state of Hadoop distributions
Categories: Big Data | Tags: Hortonworks, Intel, Oracle, Hadoop, Cloudera
Apache Hadoop is of course available for download on its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is not an easy task and is a…
By David WORMS
May 11, 2013
Apache Hive Essentials How-to by Darren Lee
Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL
Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…
By David WORMS
Apr 23, 2013
Hadoop development cluster of virtual machines with static IP using VirtualBox
Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network
A few days ago, I explained how to set up a cluster of virtual machines with static IPs and Internet access suitable to host your Hadoop cluster locally for development. At the time I made use of VMware…
By David WORMS
Mar 14, 2013
Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…
By David WORMS
Mar 8, 2013
Virtual machines with static IP for your Hadoop development cluster
Categories: Infrastructure | Tags: Ambari, Hortonworks, Red Hat, VirtualBox, VM, VMware, Cloudera, Network
As I am about to install and test Ambari, this article is an opportunity to show how I set up my development environment with multiple virtual machines. Ambari, the deployment and monitoring…
By David WORMS
Feb 27, 2013
Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS, Storage
This is a command I use to concatenate the files stored in Hadoop HDFS that match a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…
By David WORMS
Jan 12, 2013