Running the benchmark

Requirements

Strictly Python 3.8, on Linux

Installation

First, install Progress-Gym using pip. Note that, in order to perform accelerated inference using vllm and resolve the resulting dependency conflicts, we strictly require Python 3.8 and vllm 0.4.0.post1. Therefore, make sure you have Python 3.8 installed, and run:

$ python3.8 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install progressgym

Alternatively, install by cloning the GitHub repository:

$ git clone https://github.com/ProgressAlign/ProgressAlign.git
$ cd ProgressAlign
$ python3.8 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install -r src/config/requirements-pip.txt
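
Because the Python and vllm versions are pinned so tightly, it can be worth verifying the environment before proceeding. Below is a minimal sanity-check sketch; it assumes vllm exposes its version via vllm.__version__:

import sys
import vllm

# The benchmark strictly requires Python 3.8 (see Requirements above).
assert sys.version_info[:2] == (3, 8), f"expected Python 3.8, got {sys.version}"

# vllm 0.4.0.post1 is the pinned version for accelerated inference.
assert vllm.__version__ == "0.4.0.post1", f"unexpected vllm version {vllm.__version__}"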

Basic Usage

Building Dataset

Note that you do not have to run this part yourself unless you want to reproduce the findings in the paper, as the HisText dataset has already been used to train the Human Proxy Model sequence.

Running

$ python ./build_dataset.py

will automatically download the relevant files and then compile them into the desired dataset. Specifically, it will:
  • Download and decompress raw data files from multiple sources, placing them in ./dataset/raw_downloads

  • Organize the downloaded data into a unified format; the resulting historical text dataset is saved as a series of files ./dataset/dataset_textseq/[century_num]/[year_num].json (see the sketch after this list)

  • Train a series of historical language models, saved under ./dataset/dataset_modelseq/[century_num], using the historical text dataset.
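
For readers who want to peek at the intermediate data, the sketch below loads one of the per-year text files. The century and year numbers are hypothetical placeholders, and no assumption is made about the internal JSON schema beyond it being valid JSON:

import json
from pathlib import Path

# Hypothetical example path: century 13, year 1254. Substitute the
# century/year numbers actually produced by build_dataset.py.
year_file = Path("./dataset/dataset_textseq/13/1254.json")

with year_file.open(encoding="utf-8") as f:
    records = json.load(f)

# Print a small sample to inspect the schema before relying on it.
print(type(records))
print(str(records)[:500])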

Running the Benchmarks

Running

$ python run_benchmark.py \
     --algorithms=ALGO1[,ALGO2[,ALGO3[,...]]] \
     --challenges=SUBTASK1[,SUBTASK2[,SUBTASK3[,...]]] \
     --output_filename=OUTPUT_FILENAME \
     [--output_dir=OUTPUT_DIR] (defaults to ./output/benchmark_results) \
     [-h | --help] \
     [...] (additional arguments are supplied to the algorithms when they are instantiated; only string values are supported)

will produce the benchmark results. Omitting

--challenges=SUBTASK1[,SUBTASK2[,...]]

will default to testing all challenges. Results will be saved to [OUTPUT_DIR]/[OUTPUT_FILENAME].json, with OUTPUT_DIR defaulting to ./output/benchmark_results. For each test, results are saved as a dictionary; see the score item for a comprehensive score of each examinee's performance on one subtask. A short sketch for inspecting these results follows.
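
As an illustration, the sketch below opens a results file and prints whatever per-subtask score entries it finds. The file name is a hypothetical placeholder, and the dictionary layout beyond the score item is an assumption:

import json

# Hypothetical output file; substitute your own OUTPUT_FILENAME.
with open("./output/benchmark_results/my_run.json", encoding="utf-8") as f:
    results = json.load(f)

# Look for the per-subtask "score" entries described above.
for key, value in results.items():
    if isinstance(value, dict) and "score" in value:
        print(f"{key}: score = {value['score']}")
    else:
        print(f"{key}: {value}")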

See the comments at the beginning of run_benchmark.py for detailed usage of the command-line arguments.
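
To illustrate the string-only extra arguments mentioned above: a flag such as --learning_rate=1e-4 reaches the algorithm as the keyword argument learning_rate="1e-4", so the algorithm must parse non-string types itself. The class below is purely hypothetical and not part of Progress-Gym; it only demonstrates the convention:

class MyAlgorithm:
    """Hypothetical algorithm class; not part of Progress-Gym."""

    def __init__(self, **kwargs):
        # Extra command-line arguments arrive as strings, e.g.
        # --learning_rate=1e-4 becomes learning_rate="1e-4".
        self.learning_rate = float(kwargs.get("learning_rate", "1e-4"))
        self.num_epochs = int(kwargs.get("num_epochs", "1"))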