Method
Here is the run-down of the overall method for this challenge:
1. Download the Dynamic Stock
and Flows (DSF) task: An executable copy of the DSF is available
for download to all participants. The DSF task environment requires
a Windows platform but uses a TCP/IP socket protocol to communicate
with external models. See the text-based socket protocol documentation.
In addition, a description of the dynamics of the task is provided
in the “Human performance data for model calibration”
section of this webpage. Participants are free to implement their
own version of the DSF task but whatever model they develop must ultimately
interact via the published TCP/IP socket protocol with our version
of the DSF task.
Finally, two versions of the DSF task are available, as explained
in the "DSF: The Dynamic Stock & Flows Task" section: the DSFForSockets.zip
file supports socket connections with external models; the second
includes the GUI used by human subjects, so that participants
in this challenge can experience the DSF task for themselves.
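Since the protocol is text-based over TCP/IP, a client can be sketched in a few lines. This is only an illustration: the host, port, and message format below are placeholders, not the published protocol, which participants should take from the socket protocol documentation.

```python
import socket

# Placeholder connection details: the real host, port, and message
# vocabulary are defined in the published DSF socket protocol documentation.
DSF_HOST = "localhost"
DSF_PORT = 9548

def format_decision(user_inflow, user_outflow):
    """Encode one decision as a newline-terminated text line (illustrative format)."""
    return f"{user_inflow} {user_outflow}\n"

def run_client(host=DSF_HOST, port=DSF_PORT):
    """Connect to the DSF task and exchange line-oriented text messages."""
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("r")
        for line in reader:  # each line would carry the current task state
            state = line.strip()
            # ... a model would choose its inflow/outflow from `state` here ...
            sock.sendall(format_decision(0.0, 0.0).encode("ascii"))
```

The same loop structure works for a model implemented in any language, since only the line-oriented socket exchange is fixed by the challenge.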
2. Create a model to interact with the task: Once participants
have established a connection to the DSF environment, they can calibrate
their models by running them against the "calibrating" protocols described
in the Human Performance Data section and comparing model performance
against human performance data in those conditions.
In this way, participants will be able to gauge whether their models
are capable of simulating the basic effects seen in human control
of the DSF task.
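One simple way to quantify that comparison during calibration is root-mean-squared error between the model's stock trajectory and the human average. RMSE is used here only as an illustrative measure, and the trajectories are made-up numbers, not challenge data:

```python
import math

def rmse(model_series, human_series):
    """Root-mean-squared error between model and human stock trajectories."""
    assert len(model_series) == len(human_series)
    return math.sqrt(
        sum((m - h) ** 2 for m, h in zip(model_series, human_series))
        / len(model_series)
    )

# Illustrative trajectories (invented numbers, not actual challenge data)
human = [4.0, 4.5, 4.2, 4.1]
model = [4.0, 4.4, 4.3, 4.0]
print(round(rmse(model, human), 3))  # prints 0.087
```

Tracking such a measure across calibration runs gives a concrete signal of whether model refinements are actually improving the fit.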
3. Refine your model as needed: Our past experience suggests
that this will lead to an iterative development process where models
are continually refined as they are run under different experimental
protocols and against different data sets. Modelers are free to experiment
with any variations of the task that are allowed under the existing
protocol and implementation, but no data will be provided for any
condition besides the “calibrating” protocols.
4. Model comparison: Model comparison begins only after participants
are satisfied with the performance they have achieved on the calibrating
data. At that point, but no later than May 15, 2009, participants
will submit an executable version of their model through the website
to be run against novel protocols. The DSF task supports several interesting
variants, including but not limited to: different inflow and outflow
functions, control delays, the addition of "noise" to the inflow and
outflow amounts, and another agent controlling the environmental inflows
and outflows. The choice of specific novel conditions will be entirely
at our discretion, and submitted models will be run under those
conditions.
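To make these variants concrete, a minimal simulation of the stock dynamics can be sketched. This assumes the commonly described update rule (the stock changes by the net environmental and user flows each step); the authoritative dynamics, including delays, are given in the task documentation, and the specific inflow function and noise parameter below are invented for illustration:

```python
import random

def step(stock, env_inflow, env_outflow, user_inflow, user_outflow, noise_sd=0.0):
    """One stock update; additive Gaussian noise illustrates the 'noise' variant."""
    noise = random.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
    return stock + env_inflow - env_outflow + user_inflow - user_outflow + noise

def env_inflow(t):
    """A hypothetical time-varying environmental inflow (linearly increasing)."""
    return 0.2 * t

stock = 4.0
for t in range(5):
    # The user outflow exactly cancels the environmental inflow here,
    # so the stock holds steady at its starting value.
    stock = step(stock, env_inflow(t), 0.0, user_inflow=0.0, user_outflow=0.2 * t)
print(stock)  # prints 4.0
```

Swapping in different `env_inflow` functions, a nonzero `noise_sd`, or a delayed application of the user's decision reproduces, in miniature, the kinds of transfer conditions described above.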
Our goal for this blind evaluation under the novel conditions is
not to hamstring participants, but to see how well their models generalize
without the benefit of continual tweaking or tuning, and to test the
predictiveness of the model for conditions in which no data were available.
Assessing robustness under the transfer condition is an important
factor to consider when we investigate the invariance of modeling
approaches. Again, the transfer experimental conditions and corresponding
data will not be known to modelers prior to evaluation. Their purpose
is to evaluate the generality and scalability of the model to a range
of conditions beyond those for which data are available in the model
development and calibration process.
We will rank all participants according to a quantitative measure of
goodness-of-fit to the transfer data. That said, goodness-of-fit measures
under the calibrating and transfer conditions are not the only factor
we will use in our comparison effort. In addition to their
models, participants will be required to submit written accounts of their
development efforts and detailed explanations of the mechanisms their
models implement. We recognize that it is difficult to explain the workings
of a cognitive model in a compact and understandable manner to people
who might be unfamiliar with the paradigm in which it was developed,
but it is exactly that level of detail that is required to understand
what has been accomplished and to judge the implications of the model’s
assumptions for its ability to model the task.
The top-ranking model according to the purely quantitative goodness-of-fit
criterion will automatically be invited to the symposium at ICCM. In
addition, we will invite, at our discretion, another two entries to
the symposium based on a mixture of quantitative fit, qualitative capture
of important effects in the data, theoretical soundness and cognitive
plausibility, as well as to showcase a diversity of modeling approaches.