simple-data-analysis

Easy-to-use and high-performance JavaScript library for data analysis.

Simple data analysis (SDA) in JavaScript


This repository is maintained by Nael Shiab, computational journalist and senior data producer for CBC News.

To install with NPM:

```
npm i simple-data-analysis
```

The documentation is available here.

Core principles


This project's goals are:

-   To offer a high-performance and convenient solution in JavaScript for data analysis. It's based on DuckDB and inspired by Pandas (Python) and the Tidyverse (R).

-   To standardize and accelerate frontend/backend workflows with a simple-to-use library working both in the browser and with NodeJS (and similar runtimes).

-   To ease the way for non-coders (especially journalists and web developers) into the beautiful world of data analysis and data visualization in JavaScript.

SDA is based on duckdb-node and duckdb-wasm. DuckDB is a high-performance analytical database system. Under the hood, SDA sends SQL queries to be executed by DuckDB.
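
To make the idea of a method call becoming a SQL query more concrete, here is a minimal sketch. The helper `buildSummarizeQuery` and the shape of its output are hypothetical and written for illustration only; this is not SDA's internal code, and the SQL that SDA actually generates may differ.

```javascript
// Hypothetical helper, not part of SDA: builds a SQL string that
// averages one column per combination of category columns.
function buildSummarizeQuery(table, { values, categories }) {
    // Quote identifiers so names with spaces or capitals stay valid.
    const quote = (name) => `"${name.replaceAll('"', '""')}"`
    const groupBy = categories.map(quote).join(", ")
    return (
        `SELECT ${groupBy}, AVG(${quote(values)}) AS "mean" ` +
        `FROM ${quote(table)} GROUP BY ${groupBy}`
    )
}

const query = buildSummarizeQuery("dailyTemperatures", {
    values: "t",
    categories: ["decade", "id"],
})
console.log(query)
```

A string like this is the kind of query a database engine such as DuckDB would then execute.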

You can also write your own SQL queries (see the customQuery method) or process your data with JavaScript (see the updateWithJS method).

Feel free to start a conversation or open an issue. Check how you can contribute.

About v2


Because v1.x.x versions weren't based on DuckDB, v2.0.1 is a complete rewrite of the library with many breaking changes.

To test and compare the performance of simple-data-analysis@2.0.1, we calculated the average temperature per decade and city with the daily temperatures from the Adjusted and Homogenized Canadian Climate Data. See this repository for the code.

We ran the same calculations with simple-data-analysis@1.8.1 (both NodeJS and Bun), Pandas (Python), and the Tidyverse (R).

In each script, we:

1. Loaded a CSV file (_Importing_)
2. Selected four columns, removed rows with missing temperature, converted date strings to date and temperature strings to float (_Cleaning_)
3. Added a new column _decade_ and calculated the decade (_Modifying_)
4. Calculated the average temperature per decade and city (_Summarizing_)
5. Wrote the cleaned-up data used to compute the averages to a new CSV file (_Writing_)
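
The five steps above can be sketched in plain JavaScript on a tiny in-memory CSV. This is only an illustration of the steps, not the benchmark code, which lives in the linked repository. The column names (`id`, `city`, `time`, `t`, `flag`) and the sample values are assumptions made up for this example.

```javascript
// Made-up sample data: the column names and values are assumptions.
const csv = `id,city,time,t,flag
1,Toronto,1991-01-01,-5.0,a
1,Toronto,1999-07-01,25.0,a
1,Toronto,2001-01-01,,m
2,Montreal,1995-01-01,-10.0,a
2,Montreal,1996-01-01,-8.0,a`

// 1. Importing: parse the CSV text into an array of objects.
const [header, ...lines] = csv.split("\n")
const columns = header.split(",")
const rows = lines.map((line) => {
    const cells = line.split(",")
    return Object.fromEntries(columns.map((name, i) => [name, cells[i]]))
})

// 2. Cleaning: keep four columns, drop missing temperatures, convert types.
const cleaned = rows
    .filter((row) => row.t !== "")
    .map(({ id, city, time, t }) => ({
        id,
        city,
        time: new Date(time),
        t: parseFloat(t),
    }))

// 3. Modifying: add a decade column computed from the date.
const withDecade = cleaned.map((row) => ({
    ...row,
    decade: Math.floor(row.time.getUTCFullYear() / 10) * 10,
}))

// 4. Summarizing: average temperature per decade and city.
const groups = new Map()
for (const { decade, city, t } of withDecade) {
    const key = `${decade}|${city}`
    const group = groups.get(key) ?? { decade, city, sum: 0, count: 0 }
    group.sum += t
    group.count += 1
    groups.set(key, group)
}
const averages = [...groups.values()].map(({ decade, city, sum, count }) => ({
    decade,
    city,
    mean: sum / count,
}))

// 5. Writing: serialize the cleaned rows back to CSV text.
const outputCsv = [
    "id,city,time,t,decade",
    ...withDecade.map((r) =>
        [r.id, r.city, r.time.toISOString().slice(0, 10), r.t, r.decade].join(",")
    ),
].join("\n")

console.log(averages)
```

In a real script, step 1 would read from disk and step 5 would write `outputCsv` to a file; both are kept in memory here so the sketch runs anywhere.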

Each script was run ten times on a MacBook Pro (Apple M1 Pro / 16 GB), and the durations were averaged.
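
Averaging over repeated runs can be sketched as follows. The helper name `averageDuration` and the placeholder task are hypothetical; the actual benchmark scripts are in the linked repository and may measure time differently.

```javascript
// Hypothetical helper: runs an async task several times and
// returns the mean duration in milliseconds.
async function averageDuration(task, runs = 10) {
    const durations = []
    for (let i = 0; i < runs; i++) {
        const start = performance.now()
        await task()
        durations.push(performance.now() - start)
    }
    return durations.reduce((sum, d) => sum + d, 0) / runs
}

// Placeholder workload standing in for a full data-processing script.
let calls = 0
const task = async () => {
    calls += 1
    let total = 0
    for (let i = 0; i < 100000; i++) total += Math.sqrt(i)
    return total
}

const result = averageDuration(task).then((mean) => {
    console.log(`Average over ${calls} runs: ${mean.toFixed(3)} ms`)
    return mean
})
```

Averaging several runs smooths out one-off effects such as a cold file cache or background activity on the machine.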

The charts displayed below come from this Observable notebook.

Small file


With _ahccd-samples.csv_:

-   74.7 MB
-   19 cities
-   20 columns
-   971,804 rows
-   19,436,080 data points

As we can see, simple-data-analysis@1.8.1 was the slowest, but simple-data-analysis@2.0.1 is now the fastest.

A chart showing the processing duration of multiple scripts in various languages

Big file


With _ahccd.csv_:

-   1.7 GB
-   773 cities
-   20 columns
-   22,051,025 rows
-   441,020,500 data points
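
The data point counts listed for both files are the number of rows multiplied by the number of columns, which is quick to verify:

```javascript
// Data points = rows × columns, using the figures listed for each file.
const files = [
    { name: "ahccd-samples.csv", rows: 971804, columns: 20 },
    { name: "ahccd.csv", rows: 22051025, columns: 20 },
]
const dataPoints = files.map((f) => ({
    name: f.name,
    dataPoints: f.rows * f.columns,
}))
console.log(dataPoints)
// ahccd-samples.csv: 19,436,080 and ahccd.csv: 441,020,500
```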

The file was too big for simple-data-analysis@1.8.1, so it's not included here.

Again, simple-data-analysis@2.0.1 is now the fastest option.

A chart showing the processing duration of multiple scripts in various languages

Note that DuckDB, which powers SDA, can also be used with Python and R.

SDA in an Observable notebook


Observable notebooks are great for data analysis in JavaScript. This example shows you how to use simple-data-analysis in one of them.

SDA in an HTML page


If you want to add the library directly to your webpage, you can use an npm-based CDN like jsDelivr.

Here's some code that you can copy and paste into an HTML file. For more methods, check the SimpleDB class documentation.

```html
<script type="module">
    // We import the SimpleDB class from the esm bundle.
    import { SimpleDB } from "https://cdn.jsdelivr.net/npm/simple-data-analysis/+esm"
    async function main() {
        // We start a new instance of SimpleDB
        const sdb = new SimpleDB()
        // We load daily temperatures for three cities.
        // We put the data in the table dailyTemperatures.
        await sdb.loadData(
            "dailyTemperatures",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/dailyTemperatures.csv"
        )
        // We compute the decade from each date
        // and put the result in the decade column.
        await sdb.addColumn(
            "dailyTemperatures",
            "decade",
            "integer",
            "FLOOR(YEAR(time)/10)*10" // This is SQL
        )
        // We summarize the data by computing
        // the average dailyTemperature
        // per decade and per city.
        await sdb.summarize("dailyTemperatures", {
            values: "t",
            categories: ["decade", "id"],
            summaries: "mean",
        })
        // We run linear regressions
        // to check for trends.
        await sdb.linearRegressions("dailyTemperatures", {
            x: "decade",
            y: "mean",
            categories: "id",
            decimals: 4,
        })
        // The dailyTemperature table does not have
        // the name of the cities, just the ids.
        // We load another file with the names
        // in the table cities.
        await sdb.loadData(
            "cities",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/cities.csv"
        )
        // We join the two tables based
        // on the ids and put the joined rows
        // in the table results.
        await sdb.join("dailyTemperatures", "cities", "id", "left", "results")
        // We select the columns of interest
        // in the table results.
        await sdb.selectColumns("results", [
            "city",
            "slope",
            "yIntercept",
            "r2",
        ])
        // We log the results table.
        await sdb.logTable("results")
        // We store the data in a variable.
        const results = await sdb.getData("results")
    }
    main()
</script>
```

And here's the table you'll see in your browser's console tab.