
Node.js CSV version 4 - re-writing and performance
By David WORMS
Nov 19, 2018
Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete re-writing of the project focusing on performance. It also comes with new functionalities as well as some cleanup in the option properties and the exported information. The official website is updated and the changelog contains the list of changes for this major release.
A massive undertaking
The Node.js CSV project was started on September 25th, 2010. This is quite old in our evolving tech world. Since then, it has survived multiple Node.js evolutions such as the redesign of the Stream APIs. Over the years, the project was maintained with bug fixes, documentation, and support. With the help of the community, incremental features were provided to fit everyone's use cases. The quality of the test suite made us confident enough to get back to the project and dive into the code. However, there was one task which I never had the courage to initiate: rewriting the parser from the ground up to take advantage of the Buffer API and its promise of performance. A few days of holidays gave me the opportunity to undertake this work.
The re-writing started with a brand new project. While there is probably still room for improvements and further optimizations, I ran a few benchmarks to measure the performance impact of multiple implementations. This is how I came up with the `ResizableBuffer` class, which reuses the same internal buffer, resizing it to fit the input data set instead of instantiating a new buffer for each field. When ready, the next step was to write the parser. The process was broken down into multiple iterations, 13 to be exact:
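As an illustration, here is a minimal sketch of the resizable-buffer idea in plain Node.js: a single internal `Buffer` is grown geometrically and reused across fields instead of allocating a fresh buffer per field. The class and method names here are illustrative, not the project's actual internals.

```javascript
// Sketch of a resizable buffer: one allocation, grown on demand and
// reused between fields. Names are hypothetical, for illustration only.
class ResizableBuffer {
  constructor(size = 64) {
    this.buf = Buffer.alloc(size);
    this.length = 0; // number of meaningful bytes currently stored
  }
  append(byte) {
    if (this.length === this.buf.length) {
      // Double the capacity and copy the existing bytes over
      const next = Buffer.alloc(this.buf.length * 2);
      this.buf.copy(next, 0, 0, this.length);
      this.buf = next;
    }
    this.buf[this.length++] = byte;
  }
  toString(encoding = 'utf8') {
    return this.buf.toString(encoding, 0, this.length);
  }
  reset() {
    // Reuse the same allocation for the next field
    this.length = 0;
  }
}

// Collect bytes of the first field of a CSV input
const field = new ResizableBuffer(4);
for (const byte of Buffer.from('hello,world')) {
  if (byte === 0x2c) break; // stop at the comma delimiter
  field.append(byte);
}
console.log(field.toString()); // → 'hello'
```

The payoff is that a long parse performs a bounded number of allocations instead of one per field, which is exactly where a Buffer-based rewrite can gain over string concatenation.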
- Basic Buffer loop
- Add `__needMoreData`
- Add `__autoDiscoverRowDelimiter`
- Start working on `quote`, `escape`, `delimiter`, and `record_delimiter`
- Options `quote`, `escape`, `delimiter` and `record_delimiter` working
- Option `comment`
- Options `relax_column_count` and `skip_empty_lines` as well as info counts `empty_line_count` and `skipped_line_count`
- Options `skip_lines_with_empty_values`, `skip_lines_with_error`, `from`, `to`
- Option `columns`
- Option `trim`
- Option `relax`
- Options `objname`, `raw`, `cast`, and `cast_date`
- Rewrite info counters
The implementation no longer uses CoffeeScript and is written directly in ES6 JavaScript. Don't get me wrong, I am still a big fan of CoffeeScript and we are still using it in the tests for its expressiveness. However, I needed fine control over the code, and using JavaScript as the main language will hopefully encourage more contributions.
Breaking changes
Overall, there are no major breaking changes. The modules are the same and their API remains unchanged. There are, however, a few minor breaking changes to take into consideration, such as the rowDelimiter option being renamed to record_delimiter, some previously deprecated options being removed, and the available counters being regrouped into the new info property:
- Option `rowDelimiter` is now `record_delimiter`
- Drop the `record` event
- Normalize error messages as `{error type}: {error description}`
- State values are now isolated into the `info` object
  - `count` is now `info.records`
  - `lines` is now `info.lines`
  - `empty_line_count` is now `info.empty_lines`
  - `skipped_line_count` is now `info.invalid_field_length`
- `context.count` in the `cast` function is now `context.records`
- Drop support for deprecated options `auto_parse` and `auto_parse_date`
- In the `raw` option, the `row` property is renamed `record`
- Option `max_limit_on_data_read` is now `max_record_size`
- Default value of `max_record_size` is now `0` (unlimited)
- Drop emission of the `record` event; use the `readable` event and `this.read()` instead
The most impactful breaking change is probably the renaming of the rowDelimiter option to record_delimiter because of its popular usage. Also, max_record_size is now unlimited by default and must be explicitly defined if used.
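For readers migrating existing code, the renames can be pictured with a small hypothetical helper that maps version 3 option names to their version 4 equivalents. Neither this helper nor its name ships with the parser; it only restates the list above in code.

```javascript
// Hypothetical migration helper summarizing the v3 → v4 option renames
// listed above. Not part of the csv-parse API, illustration only.
const renamedOptions = {
  rowDelimiter: 'record_delimiter',
  max_limit_on_data_read: 'max_record_size',
};

function migrateOptions(v3Options) {
  const v4Options = {};
  for (const [key, value] of Object.entries(v3Options)) {
    // auto_parse and auto_parse_date were dropped in version 4
    if (key === 'auto_parse' || key === 'auto_parse_date') continue;
    v4Options[renamedOptions[key] || key] = value;
  }
  return v4Options;
}

console.log(migrateOptions({ rowDelimiter: '\n', columns: true }));
// → { record_delimiter: '\n', columns: true }
```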
New features
This new version also comes with new features. The new information object is a nice addition. It regroups a few counter properties which were previously available directly from the parser instance. Those properties have been renamed to be more expressive. The information object is directly available from the parser instance as info. For callback users, it is exported as the third argument of the callback function. It can also be made available for each record by activating the info option with the value true.
There are 3 new options which are info, from_line, and to_line:
- `info`: generate two properties, `info` and `record`, where `info` is a snapshot of the info object at the time the record was created and `record` is the parsed array or object; note, it can be used conjointly with the `raw` option.
- `from_line`: start handling records from the requested line number.
- `to_line`: stop handling records after the requested line number.
The info option is quite useful for debugging or for giving end users some feedback about their mistakes.
The from_line and to_line options respectively filter the first and last lines of a data set. Speaking of lines, previous versions of the parser could get confused when counting lines if the line delimiter differed from the record delimiter. It worked for most users for the simple reason that the two are usually the same. This is fixed in the new release.
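The window semantics of from_line and to_line can be pictured with a plain filter over 1-based line numbers. This is only an illustration of the behavior under that assumption; the real parser presumably applies the window while streaming rather than splitting the whole input in memory.

```javascript
// Illustrative only: keep the lines whose 1-based number falls inside
// the [fromLine, toLine] window, mimicking from_line/to_line semantics.
function lineWindow(input, fromLine, toLine) {
  return input
    .split('\n')
    .filter((line, index) => index + 1 >= fromLine && index + 1 <= toLine);
}

console.log(lineWindow('a,1\nb,2\nc,3\nd,4', 2, 3));
// → [ 'b,2', 'c,3' ]
```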
Here is the new feature list extracted from the changelog:
- New options `info`, `from_line` and `to_line`
- `trim`: respect `ltrim` and `rtrim` when defined
- `delimiter`: may be a Buffer
- `delimiter`: handle multiple bytes/characters
- `callback`: export the info object as the third argument
- `cast`: catch errors in user functions
- TypeScript: mark info as readonly with required properties
- `comment_lines`: count the number of commented lines with no records
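Supporting a multi-byte delimiter, as the new delimiter option does, essentially means comparing a slice of the input Buffer against the delimiter Buffer at each position. Here is a naive, non-streaming sketch of that matching logic; it is not the parser's actual implementation.

```javascript
// Naive multi-byte delimiter split over a Buffer, for illustration.
// Buffer#compare returns 0 when the two ranges hold the same bytes.
function splitOnDelimiter(input, delimiter) {
  const fields = [];
  let start = 0;
  for (let i = 0; i <= input.length - delimiter.length; i++) {
    if (input.compare(delimiter, 0, delimiter.length, i, i + delimiter.length) === 0) {
      fields.push(input.toString('utf8', start, i));
      start = i + delimiter.length;
      i = start - 1; // resume scanning right after the delimiter
    }
  }
  fields.push(input.toString('utf8', start));
  return fields;
}

console.log(splitOnDelimiter(Buffer.from('a::b::c'), Buffer.from('::')));
// → [ 'a', 'b', 'c' ]
```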
What’s coming next
The source code is backed by an extended test suite. No test has been removed and new tests have been added to reinforce the guarantees of the parser. It is, however, possible that some behaviors are not covered by the tests and, in the next few weeks, we count on your feedback to fix any upcoming issues.
While I am not a big fan of ES6 Promises in the context of the parser, support has been requested multiple times and will come soon. It will also be implemented in the other CSV packages.
Another potential improvement is to extend the error objects with additional information, such as a unique code associated with each type of error. While the messages have already been improved, there is room to normalize them further.
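One way such error codes could look is a small Error subclass carrying a machine-readable code alongside a message following the `{error type}: {error description}` convention mentioned earlier. The class and code names below are hypothetical, not the parser's actual ones.

```javascript
// Hypothetical error class: a stable, machine-readable code plus a
// normalized "{error type}: {error description}" message.
class CsvError extends Error {
  constructor(code, description) {
    super(`${code}: ${description}`);
    this.code = code; // stable identifier callers can branch on
  }
}

try {
  throw new CsvError('CSV_MAX_RECORD_SIZE', 'record exceeds the configured limit');
} catch (err) {
  console.log(err.code); // → 'CSV_MAX_RECORD_SIZE'
}
```

Branching on a code rather than parsing the message text keeps user code robust when the wording of the message evolves.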
I am also planning to support the Flow static type checker. I have never used it before. It seems appropriate for the package and it will give me the occasion to try it out.
Finally, I am considering writing a command line tool which will expose all the available options and provide multiple output formats (JSON, JSON line, YAML, …).