𝌠 μDSV

A faster CSV parser in 5KB (min)

Introduction


uDSV is a fast JS library for parsing well-formed CSV strings, either from memory or incrementally from disk or network.
It is mostly RFC 4180 compliant, with support for quoted values containing commas, escaped quotes, and line breaks¹.
The aim of this project is to handle the 99.5% use-case without adding complexity and performance trade-offs to support the remaining 0.5%.

¹ Line breaks (\n,\r,\r\n) within quoted values must match the row separator.

Features


What does uDSV pack into 5KB?

- RFC 4180 compliant
- Incremental or full parsing, with optional accumulation
- Auto-detection and customization of delimiters (rows, columns, quotes, escapes)
- Schema inference and value typing: string, number, boolean, date, json (see the sketch after this list)
- Defined handling of '', 'null', 'NaN'
- Whitespace trimming of values & skipping empty lines
- Multi-row header skipping and column renaming
- Multiple outputs: arrays (tuples), objects, nested objects, columnar arrays

Of course, _most_ of these are table stakes for CSV parsers :)
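
For a quick taste of schema inference and value typing, here is a minimal sketch using the API described below (the sample data is illustrative; the exact inferred types depend on your data):

```js
import { inferSchema, initParser } from 'udsv';

// numbers and booleans in the sample data come back as real JS values
let csvStr = 'name,active,score\nAlice,true,41.5\nBob,false,38';

let parser = initParser(inferSchema(csvStr));

let rows = parser.typedObjs(csvStr);
// e.g. [ { name: 'Alice', active: true, score: 41.5 }, { name: 'Bob', active: false, score: 38 } ]
```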

Performance


Is it Lightning Fast™ or Blazing Fast™?

No, those are too slow! uDSV has Ludicrous Speed™;
it's faster than the parsers you recognize and faster than those you've never heard of.

Most CSV parsers have one happy/fast path -- the one without quoted values, without value typing, and only when using the default settings & output format.
Once you're off that path, you can generally throw any self-promoting benchmarks in the trash.
In contrast, uDSV remains fast with any dataset and all options; its happy path is _every path_.

On a Ryzen 7 ThinkPad running Linux v6.4.11 and Node.js v20.6.0, a diverse set of benchmarks shows a 1x-5x performance boost relative to the popular, proven-fast Papa Parse.

For _way too many_ synthetic and real-world benchmarks, head over to /bench...and don't forget your coffee!

```
uszips.csv (6 MB, 18 cols x 34K rows)

Name                     Rows/s │ Throughput (MiB/s)
uDSV                     782K   │ 140
csv-simple-parser        682K   │ 122
achilles-csv-parser      469K   │ 83.8
d3-dsv                   433K   │ 77.4
csv-rex                  346K   │ 61.9
PapaParse                305K   │ 54.5
csv42                    296K   │ 52.9
csv-js                   285K   │ 50.9
comma-separated-values   258K   │ 46.1
dekkai                   248K   │ 44.3
CSVtoJSON                245K   │ 43.8
csv-parser (neat-csv)    218K   │ 39
ACsv                     218K   │ 39
SheetJS                  208K   │ 37.1
@vanillaes/csv           200K   │ 35.8
node-csvtojson           165K   │ 29.4
csv-parse/sync           125K   │ 22.4
@fast-csv/parse          78.2K  │ 14
jquery-csv               55.1K  │ 9.85
but-csv                  ---    │ Wrong row count! Expected: 33790, Actual: 1
@gregoranders/csv        ---    │ Invalid CSV at 1:109
utils-dsv-base-parse     ---    │ unexpected error. Encountered an invalid record. Field 17 o
```

Installation


```
npm i udsv
```

or

```html
<script src="./dist/uDSV.iife.min.js"></script>
```

API


A 150 LoC uDSV.d.ts TypeScript def.

Basic Usage


```js
import { inferSchema, initParser } from 'udsv';

let csvStr = 'a,b,c\n1,2,3\n4,5,6';

let schema = inferSchema(csvStr);
let parser = initParser(schema);

// native format (fastest)
let stringArrs = parser.stringArrs(csvStr); // [ ['1','2','3'], ['4','5','6'] ]

// typed formats (internally converted from native)
let typedArrs  = parser.typedArrs(csvStr);  // [ [1, 2, 3], [4, 5, 6] ]
let typedObjs  = parser.typedObjs(csvStr);  // [ {a: 1, b: 2, c: 3}, {a: 4, b: 5, c: 6} ]
let typedCols  = parser.typedCols(csvStr);  // [ [1, 4], [2, 5], [3, 6] ]
```

Nested/deep objects can be reconstructed from column naming via .typedDeep():

```js
// deep/nested objects (from column naming)
let csvStr2 = `
_type,name,description,location.city,location.street,location.geo[0],location.geo[1],speed,heading,size[0],size[1],size[2]
item,Item 0,Item 0 description in text,Rotterdam,Main street,51.9280712,4.4207888,5.4,128.3,3.4,5.1,0.9
`.trim();

let schema2 = inferSchema(csvStr2);
let parser2 = initParser(schema2);

let typedDeep = parser2.typedDeep(csvStr2);

/*
[
  {
    _type: 'item',
    name: 'Item 0',
    description: 'Item 0 description in text',
    location: {
      city: 'Rotterdam',
      street: 'Main street',
      geo: [ 51.9280712, 4.4207888 ]
    },
    speed: 5.4,
    heading: 128.3,
    size: [ 3.4, 5.1, 0.9 ],
  }
]
*/
```

CSP Note:

uDSV uses dynamically-generated functions (via new Function()) for its .typed*() methods.
These functions are lazy-generated and use JSON.stringify() code-injection guards, so the risk should be minimal.
Nevertheless, if you have strict CSP headers without unsafe-eval, you won't be able to take advantage of the typed methods and will have to do the type conversion from the string tuples yourself.
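
In that case, the native string tuples are still available, and a hand-rolled conversion is straightforward; a minimal sketch reusing the Basic Usage data (the per-column conversions are illustrative):

```js
import { inferSchema, initParser } from 'udsv';

let csvStr = 'a,b,c\n1,2,3\n4,5,6';

let parser = initParser(inferSchema(csvStr));

// native string tuples -- no dynamically-generated code involved
let stringArrs = parser.stringArrs(csvStr); // [ ['1','2','3'], ['4','5','6'] ]

// convert each tuple yourself (here, all three columns are numeric)
let typedArrs = stringArrs.map(r => [ +r[0], +r[1], +r[2] ]);
// [ [1, 2, 3], [4, 5, 6] ]
```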

Incremental / Streaming


uDSV has no inherent knowledge of streams.
Instead, it exposes a generic incremental parsing API to which you can pass sequential chunks.
These chunks can come from various sources, such as a Web Stream or Node stream via fetch() or fs, a WebSocket, etc.

Here's what it looks like with Node's fs.createReadStream():

```js
import fs from 'node:fs';
import { inferSchema, initParser } from 'udsv';

let stream = fs.createReadStream(filePath); // filePath: path to your CSV file

let parser = null;
let result = null;

stream.on('data', (chunk) => {
  // convert from Buffer
  let strChunk = chunk.toString();
  // on first chunk, infer schema and init parser
  parser ??= initParser(inferSchema(strChunk));
  // incremental parse to string arrays
  parser.chunk(strChunk, parser.stringArrs);
});

stream.on('end', () => {
  result = parser.end();
});
```

...and Web streams in Node, or Fetch's Response.body:

```js
import fs from 'node:fs';
import * as Stream from 'node:stream';
import { inferSchema, initParser } from 'udsv';

let stream = fs.createReadStream(filePath);

let webStream = Stream.Readable.toWeb(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();
```
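
Since Fetch's Response.body is already a Web stream, the same loop works for network data; a minimal sketch (the URL is illustrative):

```js
// Response.body is a ReadableStream of bytes; decode to text, then parse incrementally
let res = await fetch('https://example.com/data.csv'); // illustrative URL

let textStream = res.body.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();
```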

The above examples show accumulating parsers -- they will buffer the full result into memory.
This may not be something you want (or need), for example with huge datasets where you're looking to get the sum of a single column, or want to filter only a small subset of rows.
To bypass this auto-accumulation behavior, simply pass your own handler as the third argument to parser.chunk():

```js
// ...same as above

let sum = 0;

let reducer = (rows) => {
  for (let i = 0; i < rows.length; i++) {
    sum += rows[i][3]; // sum fourth column
  }
};

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.typedArrs, reducer); // typedArrs + reducer
}

parser.end();
```
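
The same non-accumulating pattern handles the row-filtering case mentioned above; a minimal sketch (the column index and predicate are illustrative):

```js
// ...same streaming setup as above

let filtered = [];

let filterRows = (rows) => {
  for (let i = 0; i < rows.length; i++) {
    // keep only rows whose fourth column exceeds some threshold (illustrative)
    if (rows[i][3] > 100)
      filtered.push(rows[i]);
  }
};

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.typedArrs, filterRows);
}

parser.end();
```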

TODO?


- handle #comment rows
- emit empty-row and #comment events?