Australian Antarctic Data Centre

Analysis Tools

Data mining project - computer assisted discovery of scientific knowledge

A suite of analysis tools utilising data from the Data Centre's repository and elsewhere.

Computer-assisted discovery of scientific knowledge

Dr Ben Raymond and Dr Dave Watts, Australian Antarctic Division

Collecting soil samples at Thala Valley tip

Project background

The Australian Antarctic Data Centre provides an active repository for Australia's Antarctic scientific information. This has created a "critical mass" of information from which previously unknown patterns and relationships can emerge.
We approach data mining as a way of assisting scientists with the scientific process. Scientific hypotheses are conventionally formed following the observation of physical phenomena (e.g. the oft-told story of Newton's "discovery" of gravity after watching an apple fall). However, previously collected data can also provide insight into physical processes, giving scientists the opportunity to refine or develop hypotheses without necessarily making new physical observations.
We recently developed a method of searching for relationships between data sources within the data centre. The process involves identifying a data set of interest, and then searching for other data sets that can be used to predict these data. A "match" may indicate that there is a (physical) relationship between the two data sets, and this can be evaluated by the scientist. The search process is effectively nonlinear regression with variable selection, using multivariate adaptive regression splines (MARS) and classification/regression trees.
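The search described above can be illustrated with a much-simplified sketch. It is not the Data Centre's implementation: it assumes NumPy, uses only the forward (term-adding) part of a MARS-style fit with hinge functions, tries three fixed quantile knots per candidate, and runs on synthetic data. The names `forward_select`, `hinge_basis`, `fit_sse` and the candidate series are hypothetical.

```python
import numpy as np

def hinge_basis(x, knot):
    """Return the MARS-style hinge pair max(0, x - knot) and max(0, knot - x)."""
    return np.column_stack([np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)])

def fit_sse(design, y):
    """Least-squares fit of y on the design matrix; return the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def forward_select(candidates, y, max_terms=3, min_gain=0.1):
    """Greedily add the (data set, knot) hinge pair that most reduces the error.

    candidates: dict mapping a data set name to a 1-D array aligned with y.
    Stops when the best addition reduces the SSE by less than min_gain
    (as a fraction of the current SSE) or after max_terms additions.
    """
    design = np.ones((len(y), 1))  # start from an intercept-only model
    sse = fit_sse(design, y)
    selected = []
    for _ in range(max_terms):
        best = None
        for name, x in candidates.items():
            # Try a few knots at interior quantiles of each candidate series.
            for knot in np.quantile(x, [0.25, 0.5, 0.75]):
                trial = np.hstack([design, hinge_basis(x, knot)])
                trial_sse = fit_sse(trial, y)
                if best is None or trial_sse < best[0]:
                    best = (trial_sse, name, knot, trial)
        if best is None or sse - best[0] < min_gain * sse:
            break
        sse, name, knot, design = best
        selected.append((name, float(knot)))
    return selected, sse

# Synthetic (made-up) data: the target depends on temperature and on wind
# up to a threshold, and not at all on the "unrelated" series.
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(-10.0, 8.0, n)
wind = rng.uniform(0.0, 12.0, n)
unrelated = rng.normal(0.0, 1.0, n)
target = 50.0 - 1.5 * temperature + 4.0 * np.minimum(wind, 6.5) + rng.normal(0.0, 1.0, n)

selected, sse = forward_select(
    {"temperature": temperature, "wind": wind, "unrelated": unrelated}, target
)
print(selected)
```

A full MARS fit would also search knots more finely, allow interaction terms, and prune terms in a backward pass. The sketch only shows how "predicting one data set from others" turns into a shortlist of related data sets that a scientist can then evaluate.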

Example

We applied the method to various indicators from our State of the Environment reporting database, for example, the monthly fuel usage of the generator sets and boilers for Davis station. This fuel usage represents the fuel needed for both heating and powering the station. Heat generated by the electrical generators is the primary heat source for the station. During summer, this heat is often sufficient (or even in excess of requirements) and the boilers are generally not used. During winter the boilers are used to provide additional heat to maintain the station temperature.
Our algorithm found an initial shortlist of 13 predictor data sets. Those measured at Davis station were surface air temperatures (mean, lowest, and highest), mean lower stratospheric temperatures, mean mid-tropospheric temperatures, mean atmospheric pressure, electricity usage, mean wind speed, and the number of people on station. Those measured adjacent to Davis were the sea surface temperature, the sea surface temperature anomaly (relative to the long-term monthly average), and sea ice cover.
Of these 13 predictors, the MARS technique selected three: electricity usage, mean air temperature, and wind speed. The model error was 4.2 Ml^2, equivalent to 3.2% of the monthly fuel usage. The modelled effects of air temperature, wind speed, and electricity usage on fuel usage are shown in the figure below. Colder air temperatures increased fuel usage, as did higher wind speeds and higher electricity usage. Both electricity usage and wind speed showed a threshold effect: wind speeds above 6.5 m/s, or electricity usage above about 160 MWh, did not cause any further increase in fuel consumption. These results are in good agreement with the known physical processes (compare with the neural network model of fuel usage at Mawson station).

Fig 1: "Discovered" model of fuel usage at Davis station. Diagram: Ben Raymond.
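
The threshold effects reported for wind speed and electricity usage are the kind of shape a single MARS hinge term produces. The following sketch shows that shape only. The two knots (6.5 m/s and 160 MWh) come from the text. Everything else is an assumption made for illustration: the function names, the linear temperature term, every coefficient, and the arbitrary output units. None of these are the fitted Davis model.

```python
WIND_KNOT = 6.5    # m/s, threshold reported in the text
ELEC_KNOT = 160.0  # MWh, approximate threshold reported in the text

def hinge_below(x, knot):
    """MARS-style basis term max(0, knot - x): nonzero only below the knot."""
    return max(0.0, knot - x)

def fuel_usage_sketch(air_temp_c, wind_ms, elec_mwh,
                      base=100.0, temp_coef=-1.0, wind_coef=-2.0, elec_coef=-0.3):
    """Illustrative model shape in arbitrary units; all coefficients are made up."""
    return (base
            + temp_coef * air_temp_c                         # colder air -> more fuel
            + wind_coef * hinge_below(wind_ms, WIND_KNOT)    # flat above 6.5 m/s
            + elec_coef * hinge_below(elec_mwh, ELEC_KNOT))  # flat above 160 MWh

# Below the knot, more wind means more fuel; above it, nothing changes.
print(fuel_usage_sketch(-10.0, 3.0, 150.0) < fuel_usage_sketch(-10.0, 6.0, 150.0))
print(fuel_usage_sketch(-10.0, 8.0, 150.0) == fuel_usage_sketch(-10.0, 12.0, 150.0))
```

Because each hinge term is exactly zero once its input passes the knot, the modelled fuel usage stops responding to further increases. This matches the plateau described for wind speeds over 6.5 m/s and electricity usage above about 160 MWh.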
