How Node CSV parser may save your weekend
Last Friday, an hour before my customer closed its doors for the weekend, a co-worker came to me. He had just finished exporting 9 CSV files from an Oracle database, which he wanted to import into Greenplum so that our customer could start testing on Monday morning.
The problem as he exposed it was quite simple. He needed a quick solution (less than an hour, coding included) to transform all the dates in the source CSV files into a format suitable for Greenplum. While Oracle exported dates in the form 'DD/MM/YYYY', Greenplum was picky enough to expect dates in the form 'YYYY-MM-DD'.
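To make the requirement concrete, here is the substitution on a made-up value (an illustration only, not the script we ended up writing):

// Illustration with a sample value: DD/MM/YYYY becomes YYYY-MM-DD
var value = '09/12/2011';
var fixed = value.replace(/(\d{2})\/(\d{2})\/(\d{4})/, '$3-$2-$1');
console.log(fixed); // prints '2011-12-09'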
Here are the files we needed to import:
-rw-r--r-- 1 ingres ingres 41G 2011-12-09 18:23 a.csv
-rw-r--r-- 1 ingres ingres 418M 2011-12-09 17:22 b.csv
-rw-r--r-- 1 ingres ingres 45G 2011-12-09 18:26 c.csv
-rw-r--r-- 1 ingres ingres 1,8G 2011-12-09 17:37 d.csv
-rw-r--r-- 1 ingres ingres 896M 2011-12-09 17:33 e.csv
-rw-r--r-- 1 ingres ingres 382M 2011-12-09 17:13 f.csv
-rw-r--r-- 1 ingres ingres 7,6G 2011-12-09 18:09 g.csv
-rw-r--r-- 1 ingres ingres 621M 2011-12-09 17:33 h.csv
-rw-r--r-- 1 ingres ingres 3,1G 2011-12-09 17:46 i.csv
Among those files, two are larger than 40 GB.
Since generating new files would probably have taken too long, the first question was whether or not the Greenplum import command would accept stdin data through a Unix pipe; luckily, the answer was yes. So my first suggestion was to cat each file, pipe it to Awk, which would reformat the dates line by line, and finally pipe it to the import tool. The command would look like:
cat a.csv | awk '...' | psql -p $port $database -c "COPY $table FROM STDIN WITH CSV;"
The solution seemed workable. The problem was that neither of us is an Awk expert, and I could not guarantee him that an hour would be enough to write and test the command.
Considering how familiar I am with Node these days, I told him that writing a small JavaScript script interacting with stdin and stdout would be a much safer bet. This is when we thought of using the CSV parser, the first library I wrote for Node a little more than a year ago.
We first had to configure the proxy and install Node on this server, which took about 10 minutes; for this we used NVM. Then we wrote the following JavaScript:
#!/usr/bin/env node

// transform.js: read CSV records from stdin, rewrite the date fields,
// and write the records back to stdout as CSV
var csv = require('./csv');

csv()
.fromStream(process.stdin)
.toStream(process.stdout)
.transform(function(data, index){
  // `data` is the array of fields of the current record
  for(var i = 0; i < data.length; i++){
    // Convert dates from DD/MM/YYYY (Oracle export) to YYYY-MM-DD (Greenplum)
    if(/(\d{2})\/(\d{2})\/(\d{4})/.test(data[i])){
      data[i] = data[i].replace(/(\d{2})\/(\d{2})\/(\d{4})/, "$3-$2-$1");
    }
  }
  return data;
});

// stdin is paused by default; resume it to start reading
process.stdin.resume();
Note that the use of the regular expression could have been optimized; for one thing, it is evaluated twice per field, once by test and once by replace. But our goal wasn't to speed up the import process, only to get it running as soon as possible.
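For the record, here is a minimal sketch of what such an optimization might look like; this is not the code we ran that night, and reformatDates is just a hypothetical name for the transform callback. The idea is to compile the pattern once and call replace() unconditionally, since it returns the field unchanged when nothing matches.

// Sketch only: compile the pattern once and skip the redundant test();
// replace() leaves the field untouched when the pattern does not match.
var DATE_RE = /(\d{2})\/(\d{2})\/(\d{4})/;

function reformatDates(data, index){
  for(var i = 0; i < data.length; i++){
    data[i] = data[i].replace(DATE_RE, "$3-$2-$1");
  }
  return data;
}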
We already had at our disposal a small Bash script which looped through the CSV files and launched the Greenplum import command. We tweaked it a bit so that the final version looked like:
#!/bin/bash
for csv in $(ls ~/export/*.csv)
do
  # Derive the table name from the file path: take the file name (5th path
  # component), drop the ".csv" extension, then strip the last two characters
  table=`echo $csv | cut -d'/' -f 5`
  table=`echo $table | cut -d'.' -f 1`
  table=`echo $table | sed 's/\(.*\)../\1/'`
  # Stream the file through the Node transform script into the Greenplum COPY command
  cat $csv | transform.js | psql -p 5433 my_database -c "COPY $table FROM STDIN WITH CSV;"
done
And indeed, on Monday morning, the CSV parser had handled the volume and all the CSV files had been ingested into Greenplum.