Oleksandr Shchur committed
Commit 218d801 · 1 Parent(s): 079b094
Highlight zero-shot models
app.py CHANGED
@@ -44,8 +44,16 @@ rename_cols = {
 selected_cols = list(rename_cols.keys())
 
 
-def …
-…
+def highlight_zeroshot(styler):
+    """Highlight training overlap for zero-shot models with bold green."""
+
+    def style_func(val):
+        if val == 0:
+            return "color: green; font-weight: bold"
+        else:
+            return "color: black"
+
+    return styler.map(style_func, subset=["Training corpus overlap (%)"])
 
 
 leaderboards = {}
@@ -54,7 +62,7 @@ for metric in ["WQL", "MASE"]:
     format_dict = {}
     for col in lb.columns:
         format_dict[col] = "{:.3f}" if col != "Training corpus overlap (%)" else "{:.1%}"
-    leaderboards[metric] = lb.reset_index().style.format(format_dict)
+    leaderboards[metric] = highlight_zeroshot(lb.reset_index().style.format(format_dict))
 
 
 with gr.Blocks() as demo:
@@ -71,7 +79,7 @@ with gr.Blocks() as demo:
 * **Average relative error**: Geometric mean of the relative errors for each task. The relative error for each task is computed as `model_error / baseline_error`.
 * **Average rank**: Arithmetic mean of the ranks achieved by each model on each task.
 * **Median inference time (s)**: Median of the times required to make predictions for the entire dataset (in seconds).
-* **Training corpus overlap (%)**: Percentage of the datasets used in the benchmark that were included in the model's training corpus.
+* **Training corpus overlap (%)**: Percentage of the datasets used in the benchmark that were included in the model's training corpus. Zero-shot models are highlighted in green.
 
 Lower values are better for all of the above metrics.
 
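For context, a minimal standalone sketch of what the new `highlight_zeroshot` helper does to a leaderboard table. The model names and numbers below are invented for illustration; only the column name and the styling logic come from the diff above, and `Styler.map` assumes pandas 2.1+ (older versions expose the same behaviour as `Styler.applymap`).

```python
import pandas as pd

# Toy leaderboard; the model names and values are made up for illustration.
lb = pd.DataFrame(
    {
        "Model": ["zero-shot-model-a", "zero-shot-model-b", "pretrained-model-c"],
        "WQL": [0.712, 0.845, 0.934],
        "Training corpus overlap (%)": [0.0, 0.0, 0.124],
    }
)

def highlight_zeroshot(styler):
    """Highlight training overlap for zero-shot models with bold green."""
    def style_func(val):
        # Zero overlap with the benchmark datasets => the model is zero-shot.
        if val == 0:
            return "color: green; font-weight: bold"
        else:
            return "color: black"
    # Apply the CSS rule cell by cell, but only to the overlap column.
    return styler.map(style_func, subset=["Training corpus overlap (%)"])

styled = highlight_zeroshot(
    lb.style.format({"WQL": "{:.3f}", "Training corpus overlap (%)": "{:.1%}"})
)
html = styled.to_html()  # zero-overlap cells now carry the green, bold CSS
```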
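The aggregation rules stated in the leaderboard description can also be spelled out with a small worked example. The per-task numbers below are invented; only the formulas (geometric mean of `model_error / baseline_error`, arithmetic mean of per-task ranks) come from the text.

```python
import numpy as np

# Hypothetical per-task results for one model (values invented for illustration).
model_error = np.array([0.40, 0.55, 0.25, 0.81])
baseline_error = np.array([0.50, 0.50, 0.50, 0.90])

# Average relative error: geometric mean of model_error / baseline_error.
relative_errors = model_error / baseline_error               # [0.8, 1.1, 0.5, 0.9]
avg_relative_error = np.exp(np.log(relative_errors).mean())  # ~0.79

# Average rank: arithmetic mean of the ranks the model achieves on each task.
ranks = np.array([1, 3, 1, 2])
avg_rank = ranks.mean()                                       # 1.75
```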