See PDF Version for the schedule.
MultiStrategy Ensemble Learning, Ensembles of Bayesian Classifiers, and the Problem of False Discoveries
This talk covers an ensemble of my research contributions that I believe are likely to resonate with a current audience.
Ensemble Learning combines the predictions of multiple classifiers to enhance accuracy relative to any individual classifier. I will show that combining established ensemble learning techniques further enhances accuracy without computational overhead.
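As a rough illustration of the basic idea (not of the multi-strategy technique presented in the talk), the sketch below combines three hypothetical classifiers by majority vote. The labels and predictions are invented for the example, and the gain shown depends on the classifiers making their errors on different instances.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by most of the base classifiers."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predictions(per_classifier_predictions):
    """Combine per-classifier prediction lists instance by instance."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


def accuracy(predictions, truth):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


# Hypothetical data: each classifier is wrong on a different instance.
truth = [0, 1, 1, 0, 1, 0]
preds_a = [1, 1, 1, 0, 1, 0]  # wrong on instance 0
preds_b = [0, 0, 1, 0, 1, 0]  # wrong on instance 1
preds_c = [0, 1, 0, 0, 1, 0]  # wrong on instance 2

combined = ensemble_predictions([preds_a, preds_b, preds_c])
```

Here each base classifier gets five of six instances right, while the vote recovers all six. If the classifiers shared the same errors, voting would not help.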
Naive Bayes is a popular approach to classification learning due to its computational efficiency, its strong theoretical foundation and its capacity to predict probabilities rather than just the most probable outcome. I will present a simple extension that creates an ensemble of naive-Bayes-like classifiers, improving on the accuracy of naive Bayes without undue computational burden.
Finally, I will discuss false discoveries, a problem that plagues many modern pattern discovery systems. Quite simply, many state-of-the-art approaches to pattern discovery are prone to 'discover' patterns that do not exist. I will explain why this is so and discuss approaches to overcome the problem.
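One common source of false discoveries is the sheer number of candidate patterns that are tested. The sketch below works through the arithmetic under the simplifying assumption of independent tests on data that contains no real patterns; the Bonferroni correction shown is one standard remedy and is not necessarily the approach discussed in the talk. The function names are illustrative.

```python
def expected_false_discoveries(num_tests, alpha):
    """Expected number of spurious 'significant' patterns when none are real."""
    return num_tests * alpha


def prob_at_least_one_false_discovery(num_tests, alpha):
    """Chance of one or more spurious patterns, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance level that keeps the overall error rate <= alpha."""
    return alpha / num_tests


num_candidate_patterns = 1000  # hypothetical size of the search space
alpha = 0.05

spurious = expected_false_discoveries(num_candidate_patterns, alpha)
risk = prob_at_least_one_false_discovery(num_candidate_patterns, alpha)
corrected = bonferroni_threshold(num_candidate_patterns, alpha)
corrected_risk = prob_at_least_one_false_discovery(num_candidate_patterns, corrected)
```

With 1000 candidate patterns and a per-test level of 0.05, about 50 patterns are expected to pass by chance alone. Testing each pattern at the corrected level keeps the chance of any false discovery below 0.05.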
Geoff Webb holds a research chair in the Faculty of Information Technology at Monash University, where he heads the Centre for Research in Intelligent Systems. Prior to Monash he held appointments at Griffith University and then Deakin University, where he received a personal chair. His primary research areas are machine learning, data mining, and user modelling. He is known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous methods, algorithms and techniques for machine learning, data mining and user modelling. His commercial data mining software, Magnum Opus, is marketed internationally by Rulequest Research. Many of his learning algorithms are included in the widely used Weka machine learning workbench. He is editor-in-chief of the highest impact data mining journal, Data Mining and Knowledge Discovery, co-editor of the Encyclopedia of Machine Learning (to be published by Springer) and a member of the editorial boards of Machine Learning and ACM Transactions on Knowledge Discovery in Data.
2) Prof David Powers, Flinders University of South Australia
Minors as Miners: Modelling and Evaluating Ontological and Linguistic Learning
Growing up is in large measure learning about the world and our social and linguistic environment. We might call this data mining, although it is far more multimodal and immersive than most applications. This paper describes computational research into how children learn, with a particular focus on evaluation in both supervised and unsupervised paradigms.
Conversely, we gain additional insight into association mining by considering psycholinguistic experiments that quantify how the associations made by both adults and children relate to a variety of association measures. Learning and evaluation are not dealt with in isolation; rather, a program of formal and application-based evaluation is expounded and exemplified to show how to evaluate discovered patterns with and without a gold standard. In this context, some serious issues with current evaluation techniques and accuracy measures are identified, along with unbiased alternatives.
David Powers is Professor of Computer Science and Director of the Artificial Intelligence and Language Technology Laboratories at Flinders University. Since the 1970s, David has been focused on the idea of getting computers to communicate in everyday language, and to learn about the world like babies. This includes learning about the sound systems and grammars of languages as well as about the way meaning connects to the world. For this reason, much of David's focus has been on using real and simulated robots to ground meaning, and more recently the Thinking Head.
David has also worked on developing psychologically plausible models of child learning, using techniques from neuropsychology to monitor and understand the learning process. However, much of David's work concerns user-centric applications of his research, including several products in various stages of commercialization. Applications include controlling your home or your wheelchair by talking or thinking; searching the web by exploring the universe Star Trek style; and correcting typing, recognition and translation errors using syntactic and semantic information.
Volume, Velocity and Variety - Key Challenges for Mining Large Volumes of Multimedia Information
New challenges are emerging as both government and commercial organisations attempt to exploit the potentially important information in their ever-increasing volumes of collected data. This presentation will focus on some of the major challenges involved in processing and analysing large multimedia databases. It will discuss a range of data mining and visual analytic tools and techniques that DSTO has either developed or acquired to help organisations uncover potentially interesting patterns of behaviour, trends, links and associations in their data.
After completing his PhD in Mathematics at the University of Hertfordshire (UK) in 1987, Richard joined Logica Space and Defence Systems in London, where he worked as a mathematical modeller. In 1991 he emigrated to Australia to join the Defence Science and Technology Organisation, where he has spent the last seventeen years working on intelligence-related R&D. Richard is currently the Head of Intelligence Analysis Discipline at DSTO, a research group that provides IT-related scientific advice to the Australian Intelligence Community and allied agencies.
Jiuyong Li, Peter Christen, Vladimir Estivill-Castro and Artak Amirbekyan
Various organisations, such as hospitals, medical administrations and insurance companies, have collected large amounts of data over the years. However, the gold nuggets in these data are unlikely to be discovered if the data remain locked in the data custodians' storage. A major risk of sharing data among different organisations is revealing the private information of individuals in the data.
Data sanitisation alone is not enough to protect privacy. Data anonymisation is often used in data publishing to minimise the risk of privacy disclosure. Many models have been proposed for data anonymisation in the last few years. These models ensure that the probability of identifying an individual, or of learning her sensitive information, stays below a maximum threshold. In many cases, optimal anonymisation is computationally infeasible. Many efficient algorithms have been proposed to anonymise data for various applications. Significant progress has been made in data anonymisation, but many challenges remain. A typical challenge is balancing strong protection against good data utility.
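As one concrete instance of such a model, k-anonymity requires every combination of quasi-identifier values to be shared by at least k records, which bounds the re-identification probability at 1/k. The abstract does not name a specific model, so the choice of k-anonymity, the field names and the records below are assumptions made purely for illustration.

```python
from collections import Counter


def min_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return min_group_size(records, quasi_identifiers) >= k


def generalise_age(record, width=10):
    """Replace an exact age with a coarser range such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; 'age' and 'postcode' act as quasi-identifiers.
raw = [
    {"age": 31, "postcode": "5000", "diagnosis": "asthma"},
    {"age": 34, "postcode": "5000", "diagnosis": "diabetes"},
    {"age": 38, "postcode": "5000", "diagnosis": "asthma"},
    {"age": 42, "postcode": "5001", "diagnosis": "flu"},
    {"age": 47, "postcode": "5001", "diagnosis": "asthma"},
]
quasi = ["age", "postcode"]

generalised = [generalise_age(r) for r in raw]
raw_ok = is_k_anonymous(raw, quasi, k=2)
generalised_ok = is_k_anonymous(generalised, quasi, k=2)
```

The raw table fails the check because every exact age is unique, while the generalised table passes with k = 2. The coarser ages are the utility cost paid for the stronger protection.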
A major task in data sharing is data linkage (also called data matching or entity resolution), since useful information is normally spread across various data sources, possibly in several organisations. Several protocols and methods have been developed in the past decade to link separate data sets without requiring the data sources to reveal identifying information. Significant advances have been made in the automatic linking of large-scale and distributed data sets. However, many challenges still have to be solved before privacy-preserving data linkage can be applied to match large real-world data collections in practice.
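A very simple way to picture such a protocol is for both organisations to replace identifying fields with keyed hashes and to compare only those tokens. The sketch below assumes the two parties have agreed on a secret key out of band. It supports exact matching only, so it does not tolerate typing errors, and it is not one of the specific protocols covered in the tutorial. All names and records are hypothetical.

```python
import hashlib
import hmac


def normalise(name, dob):
    """Lower-case the name and collapse whitespace so trivial variants match."""
    return f"{' '.join(name.lower().split())}|{dob}"


def token(secret_key, name, dob):
    """Keyed hash of the identifying fields; the key stays with the data owners."""
    message = normalise(name, dob).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def tokenise(secret_key, records):
    """Map each record's token to its local, non-identifying record id."""
    return {token(secret_key, r["name"], r["dob"]): r["id"] for r in records}


def link(tokens_a, tokens_b):
    """Pair up record ids whose tokens are identical in both data sets."""
    shared = tokens_a.keys() & tokens_b.keys()
    return sorted((tokens_a[t], tokens_b[t]) for t in shared)


shared_key = b"example-key-agreed-out-of-band"  # assumption: pre-shared secret

hospital = [
    {"id": "H1", "name": "Ann  Lee", "dob": "1970-01-02"},
    {"id": "H2", "name": "Bob Ray", "dob": "1980-05-06"},
]
insurer = [
    {"id": "I7", "name": "ann lee", "dob": "1970-01-02"},
    {"id": "I9", "name": "Cy Day", "dob": "1990-09-09"},
]

matches = link(tokenise(shared_key, hospital), tokenise(shared_key, insurer))
```

Only the tokens and local record ids need to leave each organisation. Anyone who obtains the key could still mount a dictionary attack on the tokens, which is one reason practical protocols are considerably more involved.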
An alternative approach to data publishing is secure data exchange. Techniques based on Secure Multiparty Computation (SMC) have been widely used to compute aggregate results across multiple parties without revealing any party's private data. However, there are many challenges here. Many solutions have proved very difficult to implement, even for a task as simple as checking which of two numbers is larger. Data mining must be efficient and secure for very large datasets; it therefore seems that expertise is required to ensure that, in implementations, any information that is leaked is innocuous.
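To give a flavour of SMC-style aggregation, the sketch below simulates a secure sum based on additive secret sharing inside a single process. It assumes honest-but-curious parties and omits networking entirely; the modulus and function names are choices made for this example, and it is not a production protocol.

```python
import secrets

# Assumed public modulus; it must exceed any total the parties could produce.
MODULUS = 2**61 - 1


def make_shares(value, num_parties):
    """Split a private value into random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate the protocol: each party sees only one share from every party."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds up the j-th share received from every party and publishes it.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published partial sums reveal the total but not the individual inputs.
    return sum(partial_sums) % MODULUS


# Hypothetical private counts held by three organisations.
total = secure_sum([120, 45, 300])
```

Each individual share is uniformly random, so no single party learns another party's input from the shares it receives. Addition is the easy case; comparing two private numbers already needs a substantially more elaborate protocol.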
In this tutorial, we will discuss fundamental models and protocols, major technologies, current developments and research challenges in the above three directions.
1. Introduction to privacy and data sharing
2. Data anonymisation
3. Privacy preserving data linkage
4. SMC based data mining
The programme will be available here soon.