Computer-assisted discovery of scientific knowledge
Dr Ben Raymond and Dr Dave Watts, Australian Antarctic Division
Collecting soil samples at Thala Valley tip
Project background
The Australian Antarctic Data Centre provides an active repository for Australia's Antarctic scientific information. This has created a "critical mass" of information from which previously unknown patterns and relationships can emerge.
We approach data mining as a method of assisting scientists in their pursuit of the scientific process. Scientific hypotheses are conventionally formed following the observation of physical phenomena (e.g. the oft-told story of Newton's "discovery" of gravity after watching an apple fall). However, previously collected data can also provide insight into physical processes, giving scientists the opportunity to refine or develop scientific hypotheses without necessarily needing to make physical observations.
We recently developed a method of searching for relationships between data sources within the data centre. The process involves identifying a data set of interest, and then searching for other data sets that can be used to predict these data. A "match" may indicate that there is a (physical) relationship between the two data sets, and this can be evaluated by the scientist. The search process is effectively nonlinear regression with variable selection, using multivariate adaptive regression splines (MARS) and classification/regression trees.
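The search described above can be illustrated with a much simpler stand-in. The sketch below is not the data centre's implementation: it replaces MARS and full regression trees with a single-split regression tree (a "stump"), and scores each candidate data set by the fraction of variance in the target that the best split explains. The function names and all data values are hypothetical, invented only for illustration.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def stump_score(x, y):
    """Fraction of the variance in y explained by the best single split on x.

    This is a depth-1 regression tree: the data are split at one threshold
    on x, and each side is predicted by its own mean.
    """
    total = sse(y)
    if total == 0:
        return 0.0
    pairs = sorted(zip(x, y))
    best = total
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        best = min(best, sse(left) + sse(right))
    return 1.0 - best / total


def rank_candidates(target, candidates):
    """Rank candidate data sets by how well each one predicts the target."""
    scores = {name: stump_score(series, target)
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical monthly values, invented purely for illustration.
fuel = [90, 85, 70, 60, 50, 45, 44, 52, 63, 72, 84, 91]
candidates = {
    "mean_air_temperature": [-17, -15, -10, -6, -2, 0, 1, -3, -7, -11, -15, -18],
    "unrelated_series": [3, 8, 1, 9, 4, 7, 2, 6, 5, 8, 1, 9],
}
ranking = rank_candidates(fuel, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A high score only flags a candidate for the scientist to evaluate; as noted above, a "match" may or may not reflect a physical relationship. A real search would also need cross-validation or a similar safeguard against overfitting, which this sketch omits.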
Example
We applied the method to various indicators from our State of the Environment reporting database, for example, the monthly fuel usage of the generator sets and boilers for Davis station. This fuel usage represents the fuel needed both to heat and to power the station. Heat produced by the electrical generators is the station's primary heat source. During summer, this heat is often sufficient (or even excess to requirements) and the boilers are generally not used. During winter, the boilers provide the additional heat needed to maintain the station temperature.
Our algorithm found an initial shortlist of 13 predictor data sets: surface air temperatures (mean, lowest, and highest), mean lower stratospheric temperatures, mean mid-tropospheric temperatures, mean atmospheric pressure, electricity usage, mean wind speed, and the number of people on station (all measured at Davis station), together with the sea surface temperature, the sea surface temperature anomaly (the anomaly with respect to the long-term monthly average), and sea ice cover (measured adjacent to Davis).
Of these 13 predictors, the MARS technique selected three: electricity usage, mean air temperature, and wind speed. The model error was 4.2 Ml^2, equivalent to 3.2% of the monthly fuel usage. The modelled effects of air temperature, wind, and electricity usage on fuel usage can be observed from the figure below. Colder air temperatures increased fuel usage, as did higher wind speeds and higher electricity usage. Both electricity usage and wind speed showed a threshold effect: increases in wind speed above 6.5 m/s, or in electricity usage above about 160 MWh, did not cause any further increase in fuel consumption. These results are in good agreement with the known physical processes (compare with the neural network model of fuel usage at Mawson station).
Fig 1: "Discovered" model of fuel usage at Davis station. Diagram: Ben Raymond.
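The threshold effect reported for wind speed is the kind of shape a MARS model builds from hinge basis functions of the form max(0, knot - x). The sketch below shows one such term. Only the 6.5 m/s knot comes from the text; the slope and plateau values are hypothetical placeholders, not fitted coefficients from the Davis model.

```python
def hinge_below(x, knot):
    """MARS-style hinge basis function: linear below the knot, zero above it."""
    return max(0.0, knot - x)


WIND_KNOT = 6.5   # m/s, the threshold reported in the text
WIND_SLOPE = 2.0  # hypothetical: fuel units per m/s below the threshold
PLATEAU = 10.0    # hypothetical: wind contribution at and above the threshold


def wind_effect(wind_speed):
    """Contribution of wind speed to modelled fuel usage.

    Rises with wind speed up to the knot, then stays constant.
    """
    return PLATEAU - WIND_SLOPE * hinge_below(wind_speed, WIND_KNOT)


for speed in (3.0, 5.0, 6.5, 8.0, 10.0):
    print(f"{speed:4.1f} m/s -> {wind_effect(speed):.1f}")
```

The same construction, with a knot near 160 MWh, would describe the plateau in the electricity usage term.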
References
- Raymond, B., Watts, D.J., Burton, H., and Bonnice, J. (2003) Data mining and Antarctic scientific data (submitted).




