Coding Guidelines¶
Model Execution¶
Regarding the execution of the model:
1. For a given scenario, the command to execute the model must not require alteration. For example, providing the model with the current date should not be necessary.
2. Any scenario-dependent inputs must be required as arguments provided on the command line, unless there is a single argument that clearly references the scenario (for example, providing the option --scenario=ExistingTrends fulfills this guideline). This argument will be referred to as the scenario-reference argument.
3. In the case that the scenario-reference argument is provided, the model code should internally determine all scenario-dependent inputs from this argument, and only this argument, in a single control-flow statement (for example, an if statement). This determination of inputs should be made as near as possible to the starting point of the model code.
4. For a given scenario, output data files generated by the model must always have the same name and data structure (i.e. be constant through time). The number of output data files should be as small as possible. No mutations should be applied to the output data, with the exception of:
- Mutations to make data identifiers more human-friendly/understandable.
- Mutations to include measurement units in the data identifiers.
For example, the following commands show adherence to guidelines 1-3:
ModelRun.exe --unemp_rate="ExistTrendsUnemploymentRate.csv" --AUDUSDExRate="ExistTrendsAUDtoUSDExchangeRate.xlsx"
python AnotherModelRun.py --scenario="Supercities" --urban_density="UrbanDensity.csv" --AUDUSDExRate="ExisTrendAUDtoUSDExchangeRate.xlsx" --urban_growth="Growth.har"
python AnotherModelRun.py --scenario="Supercities"
For the first example, it can be assumed that --unemp_rate and --AUDUSDExRate specify all scenario-dependent inputs to ‘Model’. The second and third examples collectively show (for AnotherModel) that if an argument with a clear and obvious connection to the scenario is provided (i.e. --scenario), then it is not necessary to provide the other arguments (i.e. --urban_density, --urban_growth etc.), assuming that the underlying code makes all scenario-dependent decisions based on this variable in one and only one control-flow statement. It should also be clear to the reader from the examples that there are many ways to interface with a model; compiling an executable (.exe) and running a Python script are just two examples.
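As a minimal sketch of guideline 3 (the file names and the ExistingTrends branch here are illustrative assumptions, not prescriptions), the scenario-reference argument might resolve all scenario-dependent inputs like so:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--scenario", required=True)
args = parser.parse_args()

# The one and only scenario-dependent control-flow statement, placed as
# near as possible to the starting point of the model code.
if args.scenario == "Supercities":
    urban_density_file = "UrbanDensity.csv"
    urban_growth_file = "Growth.har"
elif args.scenario == "ExistingTrends":
    urban_density_file = "ExistTrendsUrbanDensity.csv"  # assumed file name
    urban_growth_file = "ExistTrendsGrowth.har"         # assumed file name
else:
    raise ValueError("Unknown scenario: " + args.scenario)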
Model Documentation¶
Regarding the documentation of a model (and potentially the execution of scripts/models), the following points must be adopted:
1. The command to invoke the model, as outlined above in Model Execution, must be documented in what will be referred to as the documentation file. The documentation file could be a standalone file (e.g. README.txt) or a docstring at the start of the execution script. For users of graphical programs, I can appreciate that this may be difficult and/or unfamiliar to you. In this case, we will discuss on a one-on-one basis the programs you use, the process you follow to execute the model, etc.
2. Model inputs must be specified in the documentation file. In the documentation file, the origin of each input file must be obvious, which can be achieved by providing a URL. For example, the LUTO model may require a VURM output file, which could be found at: C:\NOL2Wk\iam\VURM\model\output\an_output_file.csv. If it is your (i.e. the model owner's) preference that this input file be copied to another directory, such as a ‘local’ directory (to run your model, for example), this must be documented in the documentation file with a URL as well.
3. If it is currently necessary to mutate the input data before the model is executed, any one of the following requirements must be adopted:
3.1. The mutations are included in the execution of the model. That is, the model itself performs the necessary data mutation(s) as part of its execution.
3.2. The complete set of mutations applied, and the data to which they are applied, must be documented in the documentation file.
3.3. A script that performs these mutations must be written and executable on the command line (for example, python mutate_input_data.py). This script should adhere to all of the Model Execution guidelines.
4. All model outputs that are used by another model must be documented in the documentation file. If in doubt, specify the file. If known, state the specific data series in the file that are used by other modellers. If any mutations are applied to the data (whilst adhering to guideline 4 of Model Execution), then one of 3.1, 3.2 and 3.3 must be adopted with respect to the output mutations.
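By way of illustration only, a documentation file in docstring form might look like the following sketch (the file names, URL placeholders and data series are hypothetical):
"""AnotherModelRun.py

Invocation:
    python AnotherModelRun.py --scenario="Supercities"

Inputs:
    UrbanDensity.csv -- origin: <URL of the source of this file>
    Growth.har       -- origin: <URL of the source of this file>

Outputs used by other models:
    ProjectedGrowth.csv -- the 'growth_rate' series is used by other modellers.
"""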
Structure of a Script¶
If the economic model is in script form (i.e. a single file designed to be run from the command line), then the code should be structured as sections in the following order:
1. Import statements. Any references to system libraries/third-party software should be at the top of the file. This allows for easy identification of any installs required to run the model.
2. Code to load input files. All references to data files should immediately follow the import statements. If the script language is Python, it is suggested that all data files are included in a list or dict. For example:
data_files = ['data_file_a.csv', 'data_file_b.xlsx', 'data_file_c.txt']
or
data_files = {"energy_data": 'data_file_a.csv',
              "unemp_rates": {"file": 'data_file_b.xlsx', "sheet_name": "unemp_rates"},
              "price_of_oil": 'data_file_c.txt'}
3. The commented line:
---------ALL INPUT DATA DECLARED ABOVE THIS LINE----------------
should appear immediately after all references to data files. This gives an unfamiliar reader certainty that no further analysis of the script is necessary to identify inputs.
4. The code to mutate the inputs into outputs (i.e. the core of the economic model).
5. The commented line:
---------ALL OUTPUT DATA DECLARED BELOW THIS LINE----------------
should follow. This gives an unfamiliar reader certainty that analysis of the script above this line is unnecessary to identify outputs.
6. Code to export data into file(s). A variable structured similarly to that for the input files (e.g. a list or dict in Python) is recommended.
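Putting the six sections together, a script skeleton might look like the following sketch (the file names and the placeholder calculation are hypothetical):
import pandas as pd

# Code to load input files.
data_files = {"energy_data": "data_file_a.csv",
              "price_of_oil": "data_file_c.txt"}
# ---------ALL INPUT DATA DECLARED ABOVE THIS LINE----------------
inputs = {name: pd.read_csv(path) for name, path in data_files.items()}

# Core of the economic model (a trivial placeholder calculation).
projected_demand = inputs["energy_data"] * 1.05

# ---------ALL OUTPUT DATA DECLARED BELOW THIS LINE----------------
output_files = {"projected_demand": "projected_demand.csv"}
projected_demand.to_csv(output_files["projected_demand"])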
General Structural Recommendations¶
This section is meant to serve as guidance on how to structure code so as to minimise ‘technical debt’. All of the guidelines provided here are programming-language-independent. The intention is that these are implemented gradually alongside fixing bugs in code, or when deadlines allow.
In this section, the programmer is the person who wrote the original code, a debugger is a person who may or may not be familiar with the code and is tasked with fixing it, and a reader is someone unfamiliar with the code who may or may not wish to debug it.
Data should be stored in one, and only one, location¶
Storing the same data in multiple locations creates uncertainty in the reader's mind regarding the programmer's intention. Consider the poor structure of the two classes below:
class A():
    def __init__(self):
        self.a_var = 42

    <a whole lot of methods>

class B():
    def __init__(self):
        self.child_obj = A()
        self.a_var = self.child_obj.a_var

    def print_a_var(self):
        print(self.a_var)

    <a whole lot of methods>

obj = B()
In this example, the reader could be uncertain whether the programmer's intention is that obj.a_var match obj.child_obj.a_var all the time, or only at initialisation. Both of these issues could be resolved with the use of retrieval methods (which, by convention, are named to start with get). Although in the example above this could easily be clarified by a comment, this may not be so simple in more complex circumstances (and comments themselves can sometimes be sources of confusion). If the programmer's intention is that the values remain matched, then the above approach is prone to error: any time one of obj.a_var and obj.child_obj.a_var is altered, the other value must also be altered. This process can easily be forgotten, particularly if the debugger is unfamiliar with the code. And even if an updating process is diligently applied, the reader can be forgiven for being skeptical that it is. The alternative approach is:
class A():
    def __init__(self):
        self.a_var = 42

    def get_a_var(self):
        return self.a_var

    <a whole lot of methods>

class B():
    def __init__(self):
        self.child_obj = A()

    def print_a_var(self):
        print(self.child_obj.get_a_var())

    <a whole lot of methods>

obj = B()
With this approach, the duplicated information is removed, and the intention of the programmer is clear. If the values are meant to be equivalent only at initialisation, a better approach is:
class A():
    def __init__(self):
        self.a_var = 42

    def get_a_var(self):
        return self.a_var

    <a whole lot of methods>

class B():
    def __init__(self):
        self.child_obj = A()
        self.b_var = self.child_obj.get_a_var()

    def print_b_var(self):
        print(self.b_var)

    <a whole lot of methods>

obj = B()
In this example, the reader has been assisted again by the use of a retrieval function, which clearly separates the two variables in the reader's mind. Furthermore, to entirely eliminate any notion that the variables are linked, the attribute of the B class is now called b_var.
Refactor early, refactor often¶
Refactoring is the process of simplifying code whilst maintaining functionality, with the consequence of reducing ‘technical debt’. The weaker the debugger's understanding of how the code operates, the more inefficient a bug-fix is likely to be, which has the effect of making the code base more complex, which in turn results in an even weaker understanding of the code base the next time the debugger approaches it. Bad structural design decisions breed bad structural design decisions: bad structural design needs to be addressed before it escalates out of control.
The purpose of a function should be obvious from its signature¶
The purpose of a function should be obvious from its signature, and it should not require any unnatural arguments, where unnatural arguments are either irrelevant to the function or can be deduced from the arguments already provided. Unnatural arguments confuse the reader (e.g. “I know it says ‘calculate_rectangle_area’, but if that's all it does, why do I need to provide it with the colour of the rectangle?”) and breed mistrust of the program(mer) (e.g. “I've given it a multi-dimensional array, so why do I need to tell it the length of each dimension? … can't it figure this out itself? … who wrote this $*&%#?”). It should also be clear what the function alters (e.g. “Why did the ‘change_age(human)’ function change human.name from ‘Joe Bloggs’ to ‘Jill Bloggs’?”). Nearly always, altering the arguments themselves, or other objects not obvious from the signature, is to be avoided. If altering the arguments themselves is necessary, you'll know precisely why it can't be done the recommended way. For the reasons discussed above, all of the function signatures below are bad:
def calculate_rectangle_area(width, height, colour):  # Why is colour necessary?
    ...

def sum_array(array, array_dimensions):  # array_dimensions can surely be calculated from array...?
    ...

def change_age(human, new_age):
    ...
    human.age = new_age  # Change age to new_age
    ...
    human.name = "Jill Bloggs"  # Why would the user ever expect this to happen...?
    ...
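For contrast, here is a sketch of the same functions with only natural arguments (the bodies are illustrative assumptions):
def calculate_rectangle_area(width, height):
    # Colour plays no part in the area calculation, so it is not requested.
    return width * height

def sum_array(array):
    # The dimensions are deduced from the array itself.
    return sum(sum(row) for row in array)

def change_age(human, new_age):
    # Alters only what the signature suggests, and nothing else.
    human.age = new_age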
Robustness to obvious and likely data format changes¶
Code should be able to handle obvious and foreseeable changes to input data. A common way this guideline is violated is by using index referencing unnecessarily. Consider df in the example below:
>>> import pandas as pd
>>> df = pd.DataFrame(data=[[float("nan"), 2, 3], [float("nan"), 5, 6]],
                      index=["foo", "bar"],
                      columns=pd.to_datetime(["2017", "2018", "2019"], format="%Y"))
>>> print(df)
     2017-01-01  2018-01-01  2019-01-01
foo         NaN           2           3
bar         NaN           5           6
Let's assume that the "bar" series is the only series of interest, and that only current and future values (time-wise) are of interest (the year at the time of writing is 2018). The bad way to retrieve the relevant data is:
>>> df2 = df.iloc[1, 1:]
>>> print(df2)
2018-01-01 5.0
2019-01-01 6.0
Name: bar, dtype: float64
This is a bad way to retrieve data because a different ordering or size of either the rows or columns could break this code: the relevant row and columns are identified by their position in the dataframe, not by the data itself. The better way to retrieve the data is:
>>> df2 = df.loc["bar", "2018":]
>>> print(df2)
2018-01-01 5.0
2019-01-01 6.0
Name: bar, dtype: float64
Tests, tests and more tests¶
- Test-driven development is a software-engineering methodology in which new development starts with writing a test. Its benefits include:
- the reliability of the code base is maintained over time, as any minor alteration in functionality is detected immediately.
- code is written as-needed (so no unnecessary ‘bells and whistles’ are written).
- natural arguments, and the usability of the function, are easy to identify, because writing the test first forces an ‘outside-in’ view of the function.
Because every function comes with the overhead of a test, the code base also tends to focus on truly necessary functionality.
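As a minimal sketch of the ‘test-first’ step (the fibonacci function and its expected output are assumptions for illustration, runnable with pytest):
# The test is written before the function it exercises exists.
def test_fibonacci_first_five():
    assert fibonacci(5) == [0, 1, 1, 2, 3]

# Written only afterwards, and only as much as is needed to pass the test.
def fibonacci(n):
    result = [0, 1]
    for i in range(2, n):
        result.append(result[i-1] + result[i-2])
    return result[:n]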
Input data should not be altered¶
Generally, a function should be able to be executed repeatedly using the same inputs to generate the same output. In the event of a program breakage or an exception being thrown, this instantly creates an environment where any bug discovered is repeatable (which goes a long way towards fixing the bug). In the context of multiple function executions, this linearises the process flow and therefore makes it easier to isolate the function in which a bug resides.
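A minimal sketch of the distinction (the normalise function is a hypothetical example):
# The function returns a new list rather than altering its input, so
# repeated calls with the same input always give the same output.
def normalise(values):
    total = sum(values)
    return [value / total for value in values]

raw = [2, 3, 5]
print(normalise(raw))  # [0.2, 0.3, 0.5]
print(raw)             # [2, 3, 5] -- unchanged, so any bug is repeatable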
Single Responsibility Principle¶
“Every module or class should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class.” This is explained with examples at the Wikipedia page ‘single responsibility principle’. It follows from the principle that it should not be necessary to have knowledge of the inter-relationships between methods in a class in order to use that class (because that would contradict the “encapsulation” requirement); see the Wikipedia page on modular programming. For example, consider a class tasked with the responsibility of transporting a human by car. Part of that class may look like:
class TransportByCar():
    def __init__(self, src, dest, human, car):
        self.src = src  # Source
        self.dest = dest  # Destination
        self.human = human
        self.car = car

    def putSeatbeltOn(self):
        ...

    def turnIgnitionOn(self):
        ...

    def driveToDestination(self):
        ...

    def turnIgnitionOff(self):
        ...

    def takeSeatbeltOff(self):
        ...
There exist inter-relationships between the methods in this class: to get meaningful (that is, road-legal) results, it is necessary that the methods of the class be executed in this order:
putSeatbeltOn(self)
turnIgnitionOn(self)
driveToDestination(self)
turnIgnitionOff(self)
takeSeatbeltOff(self)
If TransportByCar() is poorly structured, it will rely on the software instantiating the class to execute these methods in the correct order. A much better approach is to include the following method (in the TransportByCar class):
def gotoDestination(self):
    self.putSeatbeltOn()
    self.turnIgnitionOn()
    self.driveToDestination()
    self.turnIgnitionOff()
    self.takeSeatbeltOff()
which allows external software to execute the one method and get the expected result. Furthermore, if there is no likely reason for a user of the TransportByCar class to want to use any of the methods (other than gotoDestination()), then these methods should be hidden from/made private to the user (in Python, the convention is to name such functions with a leading underscore, e.g. _turnIgnitionOn()).
Program flow should be as linear as possible¶
Linear code is significantly easier to follow than code that branches. Two extensions of this principle are:
1. In the event that multiple types are acceptable for an input argument (to a function), the function should convert the input argument to one specific type immediately at execution. The following example converts the str form of names to the other acceptable type, a list of str:
def a_function(names):
    """``names`` can be either a `str` or a `list` of `str`."""
    if isinstance(names, str):
        names = [names]
    <the rest of the function>
The alternative method, which is bad practice, is to use if statements to handle the two cases whenever an operation involving names occurs.
2. if statements should be used like detours, not like intersections. Alternatively stated: as much as possible, rejoin forks. Consider the (bad) example:
if a == 1:
    do_something()
else:
    do_something_else()
return
and the (good) example:
if a == 1:
    make_a_like_not_a()
do_something_else()
return
In the first example, a reader is left uncertain whether the value of a makes a big or little difference to program execution (even if the function names were more descriptive): does the program attempt to solve the meaning of life if a == 1, and go to Mars if a != 1? Or does it simply change the print-out slightly? In the second example, it is clear that the a == 1 and a != 1 cases are not that different: structurally, the code returns to a linear path after a slight detour. Applying this approach generally results in more-readable code, and less code. Furthermore, let's assume that a == 1 on 50% of occasions. With the second example, the code in the if statement is executed 50% of the time, and the code outside is executed 100% of the time. This makes bug discovery in do_something_else() twice as likely in comparison to the first example, though this has to be balanced against the relative complexity of make_a_like_not_a() in contrast to do_something().
Type checking of input variables¶
- Type checking of an input variable should be:
- (a) performed immediately, and,
- (b) if it fails, either: (b.1) handled, or (b.2) an exception raised.
Type-checking of an input variable should occur immediately after the variable is initialised/provided because this: (a) avoids wasting time/resources on calculations that will eventually fail; and (b) reduces the search scope when finding a bug. In the event that the input data is mutated (an operation itself to be avoided), failing to adhere to this guideline can result in data being left in an unknown and/or broken state, which makes replication of the issue (for debugging purposes) difficult. For example, consider the (bad) code-segment:
def a_function(a, b):
    """``a`` should be a float, ``b`` a string."""
    <numerous operations using ``b``>
    if not isinstance(a, float):
        print("'a' is not a float.")
    <numerous operations using ``a`` and ``b``>
    return something
versus the (good) code-segment:
def a_function(a, b):
    """``a`` should be a float, ``b`` a string."""
    if not isinstance(a, float):
        try:
            a = float(a)
        except ValueError:
            raise ValueError("'a' is not a float and "
                             "could not be converted to a float.")
    <numerous operations using ``b``>
    <numerous operations using ``a`` and ``b``>
    return something
Consider the event in which a is not a float: in the first example, numerous operations are performed on b before this is discovered. This wastes time and resources, and if some of the operations permanently mutate data (for example, alter an input file, though this is bad practice as well), it may cause irreversible problems. Additionally, when attempting to discover why a is not a float (which itself relies on the error-prone process of a user looking through the print output), the programmer has to search amongst the numerous operations on b to see if there is an operation that has changed the type of a. In the second example, if a is an int, the code executes as desired. If a is a string (which cannot be cast to a float), this is discovered immediately on function execution, and the coder knows that the code that passes a to a_function is at fault.
In some circumstances it may be ‘overkill’ to implement type handling for every input argument. An alternative approach is to record the type assumption in the form of an assert statement (languages other than Python likely have a similar feature). If the assertion is incorrect, an assertion error is raised. For example:
def write_to_file(filename):
    assert isinstance(filename, str)
    ...
The balance between “lots of small functions” and “one big code chunk”¶
Psychologically, it is difficult for a reader to maintain their attention after scrolling or switching windows. Consequently, a single process (such as a function) should be as large as possible without requiring scrolling.
Regarding comments¶
- Comments should be written in 5 different forms:
- Single-line comments to refer to a single line of code. These comments should be on the same line as the code and refer only to that line of code.
- Single/multi-line comments used to refer to multiple lines of code. There should be no blank/empty line between the comment and the referenced code, and the end of the referenced code should be marked by a blank line.
- Single/multi-line comments/docstrings that describe the entire operation of a function or class. These multi-line comments should be immediately after the function/class declaration and be isolated from the function/class code by an empty/blank line.
- Single/multi-line comments that describe the purpose of a flow-control statement (e.g. if statements and for loops). These comments should be placed immediately before the control flow statement and be isolated by blank/empty lines.
- Single/multi-line comments that describe a flow-control statement’s trigger condition should be placed immediately after the trigger condition, and be isolated by blank/empty lines.
For example, the following code-segment adheres to this guideline:
def a_function(n, print_notice=True):
    """This function calculates the Fibonacci sequence
    for ``n`` numbers."""

    result = [None]*n  # Storage of results

    # Initialisation of sequence
    result[0] = 0
    result[1] = 1

    # Calculates the remainder of the sequence
    for i in range(2, n):
        result[i] = result[i-1] + result[i-2]

    if print_notice:
        # Notification triggered
        print("Calculation complete!")
    else:
        # No notification triggered
        pass

    return result