From dd320f7755d582e38f228f97e3ef655765a88b22 Mon Sep 17 00:00:00 2001
From: Kai Wu
Date: Mon, 30 Sep 2024 13:14:51 -0700
Subject: add a line to link the eval reproduce recipe (#123)

---
 models/llama3_1/eval_details.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/models/llama3_1/eval_details.md b/models/llama3_1/eval_details.md
index 0637a62..f126f00 100644
--- a/models/llama3_1/eval_details.md
+++ b/models/llama3_1/eval_details.md
@@ -6,7 +6,8 @@ This document contains some additional context on the settings and methodology f
 
 ## Language auto-eval benchmark notes:
 
-For a given benchmark, we strive to use consistent evaluation settings across all models, including external models. We make every effort to achieve optimal scores for external models, including addressing any model-specific parsing and tokenization requirements. Where the scores are lower for external models than self-reported scores on comparable or more conservative settings, we report the self-reported scores for external models. We are also releasing the data generated as part of evaluations with publicly available benchmarks which can be found on [Llama 3.1 Evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f).
+For a given benchmark, we strive to use consistent evaluation settings across all models, including external models. We make every effort to achieve optimal scores for external models, including addressing any model-specific parsing and tokenization requirements. Where the scores are lower for external models than self-reported scores on comparable or more conservative settings, we report the self-reported scores for external models. We are also releasing the data generated as part of evaluations with publicly available benchmarks which can be found on [Llama 3.1 Evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). We have also developed a [eval reproduction recipe](https://github.com/meta-llama/llama-recipes/tree/b5f64c0b69d7ff85ec186d964c6c557d55025969/tools/benchmarks/llm_eval_harness/meta_eval_reproduce) that demonstrates how to closely reproduce the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and the datasets in [3.1 evals collections](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f) on selected tasks.
+
 
 ### MMLU
 
--
cgit v1.2.3-70-g09d2
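
For readers who want to try the reproduction flow described in the paragraph this patch adds, here is a minimal sketch for pulling and inspecting the released evals data with the Hugging Face `datasets` library. The repository and configuration names below are assumptions based on the linked 3.1 evals collection (check the collection page for the exact identifiers), and the gated meta-llama dataset repos require approved access plus a `huggingface-cli login` beforehand.

```python
# Minimal sketch (assumed identifiers): inspect the released Llama 3.1 evals
# data from the Hugging Face collection linked in the patch above.
# Requires `pip install datasets` and granted access to the gated repos.
from datasets import load_dataset

# Assumed repo/config names for the 8B Instruct MMLU details; other models and
# benchmarks in the collection use similar "<model>-evals__<task>__details"
# style identifiers, so adjust these to the task you want to reproduce.
REPO_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct-evals"
CONFIG = "Meta-Llama-3.1-8B-Instruct-evals__mmlu__details"

evals = load_dataset(REPO_ID, name=CONFIG)

# Print the available splits and one example row; the rows contain the prompts,
# model outputs, and scoring fields that the reproduction recipe compares
# against lm-evaluation-harness results.
print(evals)
first_split = next(iter(evals.values()))
print(first_split[0])
```

The linked recipe itself drives lm-evaluation-harness over these same datasets for the selected tasks; the sketch above only covers downloading and inspecting the published evaluation data.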