My philosophy of modelling

Here, I link to some longer documents that I made in the past, for classes and research groups:

A syllabus for the class in biological modelling that I taught. A substantial section presents my philosophy.
A writeup I made for the Jornada Long-Term Ecological Research group, trying to get them oriented in the fundamentals of modelling.
An interim report on modelling the photosynthesis and transpiration of pecan orchards. A section lays out my modelling philosophy. You're welcome to browse the rest, too.
My 1987 book, Functional Ecology of Crop Plants (Croom Helm, London / Timber Press, Beaverton, OR; sorry, no full PDF available yet) - where an engineering view of how plants work or "should" work is blended with an evolutionary view of how they were naturally selected (and why agriculture overlooks the discrepant objectives of natural and artificial selection at its peril…as does biomedicine).

Quick summary:

I eschew verbal/conceptual models (block and arrow diagrams) until they lead to an explicit mathematical form, which forces one to put down what one really knows (or does not know). Some concepts don't lend themselves to explicit formulation, such as the role of biodiversity in some ecosystem functions - a much deeper and more careful formulation is needed before we design more bad experiments.
Models can be used in many ways. One gross classification is

for prediction (if one really knows the system), or
for developing hypotheses (if the parts of the system all work as we think they do, the system will show all these behaviors; let's go test them. Models of this type are potent ways to design experiments so that we attend to what's most important, rather than incorporating 10 to the googol treatments), or
for synthesis of our knowledge (e.g., let's put together what we know about how soil and atmospheric environments control plant water use; we can do a much better job now).

I am heavily in favor of mechanistic models vs. statistical models. Mechanistic models capture real understanding of causation (with any luck and effort). They can also be applied readily to new conditions and times. Statistical models are inherently biased, in the practitioner's eye, toward linear models. Most natural phenomena are linear only over short ranges - e.g., the stomatal conductance of leaves.
I go for models that are the least data-hungry, which means finding the most robust formulation of processes. Example: models of leaf photosynthetic rates are abundant and varied in form; most have limited accuracy and their numerous parameters must be redone for every new plant and every new condition. The model of Farquhar, von Caemmerer, and Berry (1980 ff.) captured all the complexity of the biochemistry in 3 parameters (for light-saturated rate), and 2 of these are essentially universal among all vascular plants. Can't do better than that!
Still, many realistic models grow to become complex. I have a rule of thumb, in addition to being able to get data/parameters for models: I stop adding processes when I forget what I was doing in the large. Naturally, this happens to all modellers.
Models have to be testable and tested, and not just against other models or for a few predictions among thousands. (Read: remote sensing models over millions of pixels, where measurements can never be made, must have a great many other features in their favor! Predicted spatial patterns can't be eyeballed as "looking like the observations" but must be analyzed with tough spatial statistics.)
Models can take a lot of time and effort, during construction and/or during execution, because of complexity on any of several levels:

Conceptual: the concepts are elaborate in form or are numerous.
Mathematical: even with simple concepts, the mathematical solutions may be very difficult to obtain - e.g., coupled nonlinear equations.
Computational: similar to mathematical complexity, but not identical to it: a few equations may have a beastly amount of computation involved, as in optimizing several physiological traits simultaneously (hence the need for simulated annealing or genetic algorithms).
Data-hungriness: plant growth models or patch-dynamic models may require only a few descriptors of plant resource use and/or dispersal, but for many individuals or species. Can we really get all this info?

I like solutions in closed form (i.e., one can write an explicit mathematical form for the answer), but I'll accept numerical solutions, even those that are truly black-box, such as neural networks, when necessary.
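
As a minimal sketch of the difference between a closed-form and a numerical solution, consider a Rubisco-limited photosynthesis rate of the Farquhar type (biochemical "demand") coupled to diffusion of CO2 through the stomata ("supply"). Setting demand equal to supply gives a quadratic in the assimilation rate, so the answer can be written explicitly, and a bisection search serves as an independent check. Everything named here is an assumption made for illustration: the function names, the lumping of the kinetic constants into a single km, the added respiration term rd and conductance g, and the numerical values, which are only of the magnitude commonly quoted near 25 °C and are not taken from the documents linked above.

```python
import math

def rubisco_limited_demand(ci, vcmax, km, gamma_star, rd):
    """Biochemical demand for CO2 (umol m-2 s-1) at internal CO2 ci (umol/mol)."""
    return vcmax * (ci - gamma_star) / (ci + km) - rd

def diffusive_supply(ci, ca, g):
    """Diffusive supply of CO2 (umol m-2 s-1) for conductance g (mol m-2 s-1)."""
    return g * (ca - ci)

def assimilation_closed_form(ca, g, vcmax, km, gamma_star, rd):
    """Explicit solution: eliminating ci gives A**2 - b*A + c = 0."""
    b = g * (ca + km) + vcmax - rd
    c = g * (vcmax * (ca - gamma_star) - rd * (ca + km))
    # The smaller root is the physically meaningful one (gamma_star < ci < ca).
    return (b - math.sqrt(b * b - 4.0 * c)) / 2.0

def assimilation_bisection(ca, g, vcmax, km, gamma_star, rd, tol=1e-10):
    """Numerical solution: bisect on ci until demand and supply agree."""
    lo, hi = gamma_star, ca  # demand < supply at lo, demand > supply at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        mismatch = (rubisco_limited_demand(mid, vcmax, km, gamma_star, rd)
                    - diffusive_supply(mid, ca, g))
        if mismatch > 0.0:
            hi = mid
        else:
            lo = mid
    return diffusive_supply(0.5 * (lo + hi), ca, g)

# Illustrative values only (assumptions, not measurements).
EXAMPLE = dict(ca=400.0, g=0.2, vcmax=60.0, km=710.0, gamma_star=42.75, rd=1.0)

if __name__ == "__main__":
    print("closed form:", assimilation_closed_form(**EXAMPLE))
    print("bisection:  ", assimilation_bisection(**EXAMPLE))
```

The closed form is shorter, faster, and shows directly how the answer depends on each parameter; the bisection version would survive a change in the process equations that destroys the quadratic structure.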

There are distinct elements in models that must be regarded carefully. Considering the common differential-equation models (or related difference-equation models) such as for plant growth, water movement, etc., we need to distinguish

state variables (the responses we want to track),
parameters (constants, of physical, physiological, developmental, or ecological origin, that occur in the equations describing the state variable changes),
driving variables (external to our system, and simply prescribed, such as wind and precipitation),
boundary conditions (in space - what happens at the limits),
initial conditions (where we start, in time), and
process equations (equations describing how everything changes in space and time). There's often a lot of sloppy thinking on these, to the extent that no real models are ever attempted (e.g., some prominent desertification 'models').
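
The elements listed above can be kept visibly separate in code. The following is a minimal sketch, assuming a hypothetical one-compartment "bucket" model of soil water stepped forward with a simple Euler scheme; the model, the names, and the numbers are all illustrative assumptions rather than any model described in this document. Because the bucket is lumped (it has no spatial extent), there are no spatial boundary conditions in the strict sense; the closest analogue is the capacity limit, above which water leaves as runoff.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Parameters:
    """Constants of the system (hypothetical values are supplied by the caller)."""
    w_max: float    # storage capacity, mm
    et_max: float   # evapotranspiration rate when the bucket is full, mm/day
    k_drain: float  # drainage coefficient, 1/day

def simulate(params, precip_mm_per_day, w0, dt=1.0):
    """Step the state variable (soil water storage, mm) through time.

    precip_mm_per_day is the driving variable: prescribed from outside.
    w0 is the initial condition.
    dt (days) must be small enough that storage cannot go negative.
    """
    w = w0
    storage = [w]
    totals = {"precip": 0.0, "et": 0.0, "drain": 0.0, "runoff": 0.0}
    for p in precip_mm_per_day:
        # Process equations: dW/dt = P - ET(W) - D(W)
        et = params.et_max * w / params.w_max
        drain = params.k_drain * w
        w_new = w + dt * (p - et - drain)
        # Capacity limit: anything above w_max leaves as runoff.
        w = min(w_new, params.w_max)
        runoff = w_new - w
        storage.append(w)
        totals["precip"] += p * dt
        totals["et"] += et * dt
        totals["drain"] += drain * dt
        totals["runoff"] += runoff
    return storage, totals

if __name__ == "__main__":
    params = Parameters(w_max=100.0, et_max=5.0, k_drain=0.05)
    rain = [0.0, 12.0, 0.0, 0.0, 30.0, 80.0, 0.0, 0.0, 5.0, 0.0]
    storage, totals = simulate(params, rain, w0=40.0)
    print(storage[-1], totals)
```

Returning the accumulated fluxes alongside the state makes a mass-balance check trivial, which is one cheap way to catch errors at every step.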

Design practices:

I insist on documentation, as narratives outside the code, and as copious comments inside the code.
I debug all code for every step - many nasty surprises lurk in even simple codes. Some people don't debug; don't touch their codes or rely on their results.
I admire the aims of object-oriented programming, but I don't do it. I find the interfaces tedious, especially if one really wants to avoid having to customize them for each purpose, defeating the whole intent. Truly big models do need OOP.
I don't like modelling packages that enforce certain structures. An example is Stella, which basically forces you to do stock-and-flow models and dissuades you heavily from doing spatially distributed models (unless you want to program in a ton of levels and lose track of what you're doing). Excel is sort-of general purpose for quick answers to simple problems, but its tool for solving nonlinear equations (Solver) is awkward to use within larger iterative schemes. Matlab, Mathematica, and Maple are good general-purpose packages, but you want to use scripts that you save, so that you don't forget all the changes you made. For these reasons, I prefer Fortran (an updated warhorse, now quite powerful); C, C++, etc. are good alternatives.

My models are not big in the sense of needing supercomputers. Models of some of the co-organizers are this big. I'd like us to discuss 'bigness' or complexity on the several levels I outline above.