author    Kai Wu <kaiwu@meta.com>    2024-09-30 13:14:51 -0700
committer    GitHub <noreply@github.com>    2024-09-30 13:14:51 -0700
commit    dd320f7755d582e38f228f97e3ef655765a88b22 (patch)
tree    1c496215f40104f99ccece9468398ad247999bf0
parent    7a279628aa737335664fb732ce9108a57fd48507 (diff)
add a line to link the eval reproduce recipe (#123)
-rw-r--r--    models/llama3_1/eval_details.md    3
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/models/llama3_1/eval_details.md b/models/llama3_1/eval_details.md
index 0637a62..f126f00 100644
--- a/models/llama3_1/eval_details.md
+++ b/models/llama3_1/eval_details.md
@@ -6,7 +6,8 @@ This document contains some additional context on the settings and methodology f
## Language auto-eval benchmark notes:
-For a given benchmark, we strive to use consistent evaluation settings across all models, including external models. We make every effort to achieve optimal scores for external models, including addressing any model-specific parsing and tokenization requirements. Where the scores are lower for external models than self-reported scores on comparable or more conservative settings, we report the self-reported scores for external models. We are also releasing the data generated as part of evaluations with publicly available benchmarks which can be found on [Llama 3.1 Evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f).
+For a given benchmark, we strive to use consistent evaluation settings across all models, including external models. We make every effort to achieve optimal scores for external models, including addressing any model-specific parsing and tokenization requirements. Where the scores are lower for external models than self-reported scores on comparable or more conservative settings, we report the self-reported scores for external models. We are also releasing the data generated as part of evaluations with publicly available benchmarks which can be found on [Llama 3.1 Evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). We have also developed an [eval reproduction recipe](https://github.com/meta-llama/llama-recipes/tree/b5f64c0b69d7ff85ec186d964c6c557d55025969/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to closely reproduce the reported Llama 3.1 benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and the datasets in the [3.1 evals collections](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f) on selected tasks.
+
### MMLU
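
For context on what the newly linked recipe builds on, a generic lm-evaluation-harness run from Python might look like the minimal sketch below. The checkpoint name, task selection, and few-shot count here are illustrative assumptions, not the recipe's configuration; the recipe and the accompanying 3.1 evals datasets should be consulted for the exact prompts, task configs, and settings used to reproduce the reported numbers.

```python
# Minimal, illustrative lm-evaluation-harness sketch (not the Meta recipe's exact setup).
# Assumptions: a Hugging Face checkpoint name, the stock "mmlu" task, and 5-shot prompting.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # assumed checkpoint
    tasks=["mmlu"],   # stock task; the recipe defines its own task configs
    num_fewshot=5,    # few-shot count varies per benchmark; 5 is illustrative
    batch_size=8,
)
print(results["results"])
```

The same run can be launched from the command line via the `lm_eval` entry point with equivalent `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags.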