author scharlottej13 <sarah@coiled.io> 2024-02-02 15:35:13 -0800
committer scharlottej13 <sarah@coiled.io> 2024-02-02 15:35:13 -0800
commit bb23f9245dcd631c33000657bd21c1fe532abfcc
tree 75e1898242594e7f14c238fafb78f4c7067aae98
parent 7ec23a39198016eb285cee324c0f967ffda8b084
Add attribution to Jacob's script
-rw-r--r-- README.md 3
-rw-r--r-- generate_data.py 3
2 files changed, 5 insertions, 1 deletion
diff --git a/README.md b/README.md
index 075e993..603dc34 100644
--- a/README.md
+++ b/README.md
@@ -4,9 +4,10 @@ Inspired by Gunnar Morling's [one billion row challenge](https://github.com/gunn
## Data Generation
-You can generate the dataset yourself using the [data generation script](generate_data.py). We've also hosted the dataset in a requester pays S3 bucket `s3://coiled-datasets-rp/1trc` in `us-east-1`.
+You can generate the dataset yourself using the [data generation script](generate_data.py), adapted from [Jacob Tomlinson's data generation script](https://github.com/gunnarmorling/1brc/discussions/487). We've also hosted the dataset in a requester pays S3 bucket `s3://coiled-datasets-rp/1trc` in `us-east-1`.
It draws a random sample of weather stations and normally distributed temperatures drawn from the mean for each station based on the values in [lookup.csv](lookup.csv).
+
## The Challenge
The main task, like the 1BRC, is to calculate the min, mean, and max values per weather station, sorted alphabetically.
diff --git a/generate_data.py b/generate_data.py
index fa30785..f5de1a8 100644
--- a/generate_data.py
+++ b/generate_data.py
@@ -1,3 +1,6 @@
+# This script was adapted from Jacob Tomlinson's 1BRC submission
+# https://github.com/gunnarmorling/1brc/discussions/487
+
import os
import tempfile
import coiled