Welcome to our restaurant! This site was designed as an experiment to help people understand how to use simple tools to process data locally.
The file sales.csv contains a list of purchase receipts with all the articles they contain. Each purchase is identified by the field tx_id, and each article bought has a price item_price and a two-letter code identifying the country.
In this recipe, you are required to produce a list with the receipts, adding the full country name at the end of each row. The translation between country codes and labels can be found in countries.csv.
Result should look like:
SBXWHCG156,1.99,MV,Maldives
BEIGKQD194,93.87,MV,Maldives
BYXCGLP161,4.44,MW,Malawi
Solve the exercise by sending the antepenultimate (third-from-last) row, sorted by item_price descending.
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e1 -d 'FPQINMY120,0.04,SD,Sudan'
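One possible pipeline, sketched here against hypothetical miniatures of the two files (their layout is inferred from the sample output above — swap in the real sales.csv and countries.csv once downloaded):

```shell
# Miniature stand-ins for sales.csv and countries.csv (layout is an assumption)
cat > sales_sample.csv <<'EOF'
SBXWHCG156,1.99,MV
BEIGKQD194,93.87,MV
BYXCGLP161,4.44,MW
EOF
cat > countries_sample.csv <<'EOF'
MV,Maldives
MW,Malawi
EOF

# First pass (NR==FNR) stores code->name from countries; second pass appends
# the name to each receipt. Then sort by item_price (field 2) descending and
# pick the antepenultimate row.
awk -F, 'NR==FNR {name[$1]=$2; next} {print $0","name[$3]}' \
    countries_sample.csv sales_sample.csv \
  | sort -t, -k2,2 -nr \
  | tail -n 3 | head -n 1
```

join(1) would also work, but it requires both inputs sorted on the join key; the two-file awk idiom avoids that.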
Using awk, read the file sales.csv and produce a list of the top-ten countries with most sales. Sorting is not a good fit for awk, so you may want to pipe its output through the sort command.
Solve the exercise by sending the second row.
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e2a -d 'KM|595.81'
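A sketch of the division of labour, again over a hypothetical miniature of sales.csv: awk does the per-country aggregation, sort(1) does the ordering.

```shell
# Miniature sales_sample.csv standing in for sales.csv (layout is an assumption)
cat > sales_sample.csv <<'EOF'
SBXWHCG156,1.99,MV
BEIGKQD194,93.87,MV
BYXCGLP161,4.44,MW
EOF

# Sum item_price per country in awk, then delegate ordering to sort(1)
awk -F, '{total[$3] += $2} END {for (c in total) printf "%s|%.2f\n", c, total[c]}' \
    sales_sample.csv \
  | sort -t'|' -k2,2 -nr \
  | head -n 10
```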
How to prepare a Panda for cooking
Using Pandas, produce a list of the top-ten countries with most sales.
One easy way to get started is to use a Jupyter environment in Docker:
docker run -v $(pwd):/home/jovyan/ -p 8888:8888 -p 4040:4040 jupyter/scipy-notebook
Then navigate to the provided URL. ctrl+enter executes the current cell; use the a or b keys to add cells above or below. Alternatively you may use the VS Code Jupyter extensions, but it's a bit painful to install.
Solve the exercise by sending the second row.
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e2b -d 'KM|595.81'
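A minimal Pandas sketch, runnable from the shell. It assumes sales.csv has no header row (adjust names= if it does), and the inline sample data is a hypothetical stand-in — point read_csv at the real file instead:

```shell
python3 - <<'EOF'
import io
import pandas as pd

# Hypothetical miniature of sales.csv; replace io.StringIO(sample) with
# "sales.csv" to run on the real file.
sample = """SBXWHCG156,1.99,MV
BEIGKQD194,93.87,MV
BYXCGLP161,4.44,MW"""

sales = pd.read_csv(io.StringIO(sample), names=["tx_id", "item_price", "country"])
# Sum item_price per country and keep the ten largest totals
top = sales.groupby("country")["item_price"].sum().nlargest(10)
for code, total in top.items():
    print(f"{code}|{total:.2f}")
EOF
```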
The file pancake_orders.10M.csv.gz contains 10 million records, with the following columns:
┌──────────────┬────────┬────────────┬────────────────┬─────────────┐
│ ts           │ price  │ item_count │ source_country │ coupon_code │
│ timestamp    │ double │ int64      │ varchar        │ varchar     │
├──────────────┼────────┼────────────┼────────────────┼─────────────┤
│ 16:36:19.794 │ 3.85   │ 1          │ BR             │ 501826      │
│ 16:49:30.072 │ 6.36   │ 4          │ PY             │ 2bd108      │
│ 16:51:26.371 │ 6.36   │ 1          │ MF             │ cfaed6      │
└──────────────┴────────┴────────────┴────────────────┴─────────────┘
Data file can be found here: pancake_orders.10M.csv.gz. Do not decompress it, DuckDB is able to read compressed files on the fly.
You are required to get the top-ten countries with most sales, mixing the data with the countries.jsonl file. You can only use DuckDB, no external tooling.
Result should look like this:
country,total
Guinea-Bissau,297833.1000000053
"Cocos (Keeling) Islands",298836.7600000056
Namibia,299101.0800000052
...
Solve the exercise by sending the 4th row (Finland).
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e3 -d 'Namibia,299101.0800000052'
Optional: Try to query the CSV directly and also load the data into a some.duckdb file. Does the latter make a difference in performance?
Optional: Save the results as Parquet and repeat the performance experiment.
Repeat the DuckDB exercise, but this time using ClickHouse: use clickhouse local to query the orders and countries files in place, or load them into a server and query through clickhouse client.
Solve the exercise by sending the 5th row (Peru), like in the previous exercise.
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e4 -d 'Namibia,299101.0800000052'
Using jq only, read the file sales.csv and produce a list of the top-ten countries with most sales. Because reasons.
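One way to coerce jq into it, sketched over a hypothetical miniature of sales.csv: jq has no CSV reader, so slurp the raw lines (-R -s) and split them by hand.

```shell
# Miniature sales_sample.csv standing in for sales.csv (layout is an assumption)
cat > sales_sample.csv <<'EOF'
SBXWHCG156,1.99,MV
BEIGKQD194,93.87,MV
BYXCGLP161,4.44,MW
EOF

# Split lines into fields, group by country code (field 3), sum prices
# (rounded to cents to avoid float noise), sort descending, keep ten.
jq -R -s -r '
  split("\n")
  | map(select(length > 0) | split(","))
  | group_by(.[2])
  | map({code: .[0][2], total: ((map(.[1] | tonumber) | add) * 100 | round / 100)})
  | sort_by(-.total)
  | .[:10][]
  | "\(.code)|\(.total)"
' sales_sample.csv
```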
Food Allergy Warning: Please be advised that our food may have come in contact or contain unix nuts, terminal traces, bash fish or console peanuts.
🌾 Gluten-Free