Node.js CSV version 4 - re-writing and performance
By David WORMS
Nov 19, 2018
Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focused on performance. It also comes with new functionalities as well as some cleanup in the option properties and the exported information. The official website is updated and the changelog contains the list of changes for this major release.
A massive undertaking
The Node.js CSV project was started on September 25th of 2010. This is quite old in our fast-evolving tech world. Since then, it has survived multiple Node.js evolutions such as the redesign of the Stream APIs. Over the years, the project was maintained with bug fixes, documentation, and support. With the help of the community, incremental features were provided to fit everyone's use cases. The quality of the test suite made us confident enough to get back to the project and dive into the code. However, there was one task which I never had the courage to initiate: rewriting the parser from the ground up to take advantage of the Buffer API and its promise of performance. A few days of holidays gave me the opportunity to undertake this work.
The rewrite started with a blank new project. While there is probably still room for improvement and further optimizations, I ran a few benchmarks to measure the performance impact of multiple implementations. This is how I came up with the resizable buffer class, which reuses the same internal buffer and resizes it to fit the input data set instead of instantiating a new buffer for each field; a minimal sketch of the idea follows the list below. Once ready, the next step was to write the parser. The process was broken down into multiple iterations, 13 exactly:
- Basic Buffer loop
- Add `__needMoreData`
- Add `__autoDiscoverRowDelimiter`
- Start working on `quote`, `escape`, `delimiter`, and `record_delimiter`
- Options `quote`, `escape`, `delimiter` and `record_delimiter` working
- Option `comment`
- Options `relax_column_count` and `skip_empty_lines`, as well as the info counters `count`, `empty_line_count` and `skipped_line_count`
- Options `skip_lines_with_empty_values`, `skip_lines_with_error`, `from` and `to`
- Option `columns`
- Option `trim`
- Option `relax`
- Options `objname`, `raw`, `cast`, and `cast_date`
- Rewrite info counters
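As a rough illustration of the approach, here is a minimal sketch of such a resizable buffer; the class name and details are illustrative, not the project's actual implementation:

```js
// Minimal sketch of a resizable buffer: one internal Buffer is reused
// and doubled when full, rather than allocating a new Buffer per field.
class ResizableBuffer {
  constructor(size = 64) {
    this.buf = Buffer.alloc(size)
    this.length = 0
  }
  append(byte) {
    if (this.length === this.buf.length) {
      // Grow by doubling and copy the existing content over
      const next = Buffer.alloc(this.buf.length * 2)
      this.buf.copy(next)
      this.buf = next
    }
    this.buf[this.length++] = byte
  }
  reset() {
    // Reuse the allocation for the next field: just rewind the cursor
    this.length = 0
  }
  toString(encoding = 'utf8') {
    return this.buf.toString(encoding, 0, this.length)
  }
}
```

Reusing a single allocation this way avoids pressuring the garbage collector inside the hot parsing loop.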
The implementation no longer uses CoffeeScript and is written directly in ES6 JavaScript. Don't get me wrong, I am still a big fan of CoffeeScript and we are still using it in the tests for its expressiveness. However, I needed fine control over the code, and using JavaScript as the main language will hopefully encourage more contributions.
Breaking changes
Overall, there are no major breaking changes. The modules are the same and the API for using them remains unchanged. There are however a few minor breaking changes to take into consideration, such as the `rowDelimiter` option being renamed to `record_delimiter`, some previously deprecated options being removed, and the available counters being regrouped into the new `info` property:
- Option `rowDelimiter` is now `record_delimiter`
- Drop the `record` event
- Normalize error messages as `{error type}: {error description}`
- State values are now isolated into the `info` object:
  - `count` is now `info.records`
  - `lines` is now `info.lines`
  - `empty_line_count` is now `info.empty_lines`
  - `skipped_line_count` is now `info.invalid_field_length`
- `context.count` in the `cast` function is now `context.records`
- Drop support for the deprecated options `auto_parse` and `auto_parse_date`
- In the `raw` option, the `row` property is renamed `record`
- Option `max_limit_on_data_read` is now `max_record_size`
- Default value of `max_record_size` is now `0` (unlimited)
- Drop emission of the `record` event, use the `readable` event and `this.read()` instead
The most impactful breaking change is probably the renaming of the `rowDelimiter` option to `record_delimiter`, because of its popular usage. Also, `max_record_size` is now unlimited by default and must be explicitly defined if needed.
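As an illustration, here is a minimal migration sketch for stream users based on the renamings listed above; the sample input is arbitrary:

```js
const parse = require('csv-parse')

// Version 3 style: parse(input, {rowDelimiter: '\n'}).on('record', handle)
// Version 4 style: record_delimiter, the `readable` event and this.read()
const parser = parse({record_delimiter: '\n'})
parser.on('readable', function() {
  let record
  while ((record = this.read()) !== null) {
    console.log(record)
  }
})
parser.on('end', function() {
  // Counters moved into the info object: `count` is now `info.records`
  console.log(parser.info.records)
})
parser.write('a,b\nc,d\n')
parser.end()
```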
New features
This new version also comes with new features. The new information object is a nice addition. It regroups a few counter properties which were previously available directly from the parser instance. Those properties have been renamed to be more expressive. The information object is directly available from the parser instance as `info`. For callback users, it is exported as the third argument of the callback function. It can also be made available for each record by activating the `info` option with the value `true`.
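For example, a callback user can now read the counters as follows (a minimal sketch; the exact values depend on the input):

```js
const parse = require('csv-parse')

parse('a,b\nc,d\n', function(err, records, info) {
  if (err) throw err
  // The info object is now exported as the third callback argument
  console.log(records)      // [['a', 'b'], ['c', 'd']]
  console.log(info.records) // number of parsed records, here 2
  console.log(info.lines)   // number of lines read from the input
})
```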
There are three new options: `info`, `from_line` and `to_line`:

- `info`: generate two properties, `info` and `record`, where `info` is a snapshot of the info object at the time the record was created and `record` is the parsed array or object; note that it can be used conjointly with the `raw` option.
- `from_line`: start handling records from the requested line number.
- `to_line`: stop handling records after the requested line number.
The `info` option is quite useful for debugging or for giving end users some feedback about their mistakes. The `from_line` and `to_line` options respectively filter the first and last lines of a data set. Speaking of lines, previous versions of the parser were easily confused when counting lines in input mixing line and record delimiters. It worked for most users for the simple reason that the two are usually the same. This is fixed with the new release.
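Here is a short sketch combining these options with the callback API; following the `info` option description above, each resulting item carries an `info` snapshot next to the parsed `record`:

```js
const parse = require('csv-parse')

const input = 'a,b\nc,d\ne,f\ng,h\n'
// Only handle the records located on lines 2 to 3 of the input,
// and attach an info snapshot to each of them with `info: true`
parse(input, {from_line: 2, to_line: 3, info: true}, function(err, records) {
  if (err) throw err
  for (const {info, record} of records) {
    console.log(info.lines, record)
  }
})
```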
Here is the new feature list extracted from the changelog:
- New options `info`, `from_line` and `to_line`
- `trim`: respect `ltrim` and `rtrim` when defined
- `delimiter`: may be a Buffer
- `delimiter`: handle multiple bytes/characters
- `callback`: export the info object as the third argument
- `cast`: catch errors thrown in user functions
- TypeScript: mark `info` as readonly with required properties
- `comment_lines`: count the number of commented lines with no records
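As an example of the improved `delimiter` option, the following sketch parses input with a multi-character delimiter, first as a string and then as a Buffer; the sample data is arbitrary:

```js
const parse = require('csv-parse')

// The delimiter may now span multiple characters...
parse('a||b||c\nd||e||f\n', {delimiter: '||'}, function(err, records) {
  if (err) throw err
  console.log(records) // [['a', 'b', 'c'], ['d', 'e', 'f']]
})

// ...and may also be provided as a Buffer
parse('a;;b\n', {delimiter: Buffer.from(';;')}, function(err, records) {
  if (err) throw err
  console.log(records) // [['a', 'b']]
})
```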
What’s coming next
The source code is backed by an extensive test suite. No test has been removed and new tests have been added to reinforce the guarantees of the parser. It is, however, possible that some behaviors are not covered by the tests and, in the next few weeks, we count on your feedback to fix any upcoming issues.
While I am not a big fan of ES6 Promises in the context of the parser, support has been requested multiple times and will come soon. It will also be implemented in the other CSV packages.
Another potential improvement is to extend the error objects with additional information, such as a unique code associated with each type of error. While at it, there is room to better normalize the error messages.
I am also planning to support the Flow static type checker. I have never used it before. It seems appropriate for the package and it will give me the occasion to try it out.
Finally, I am considering writing a command-line tool which will expose all the available options and provide multiple output formats (JSON, JSON lines, YAML, …).