Topics and Abstracts
Topics
Topics of the 2008 KDubiq Summer School are:
Abstracts
Francesco Bonchi
Privacy and anonymity in location and movement-aware data analysis
Our everyday actions, the way people live and move, leave digital traces in the information systems of the organizations that provide services through ICTs. As a remarkable example, wireless phone networks gather highly informative traces about human activities in a territory, due to two factors: pervasiveness and positioning accuracy. The number of mobile phone users worldwide was estimated at 1.5 billion in 2005 and is still increasing at high speed. Moreover, the location technologies currently used by wireless operators are capable of providing a good estimate of user location, and better localization is expected in the near future, due to the integration of various positioning technologies (GPS-equipped mobile devices, Wi-Fi and Bluetooth for indoor positioning, sensors and sensor networks for ubiquitous computing, and so on).
A scenario of opportunities opens up: on one side, better and better location-based services can be delivered to the end user, and on the other side, ever more sophisticated forms of analytic knowledge can be discovered from the traces left behind by mobile users. As a concrete example, from the traces of our mobile phones or other location-aware devices it is possible to reconstruct how people move, and this knowledge may enable us to improve decision-making in mobility-related policies.
Unfortunately, making mobility data publicly available would put our own privacy at risk: our natural right to keep secret the places we visit, the places we live or work at, and the people we meet; all in all, the way everyone lives. Personal mobility data, such as those gathered by wireless networks, are extremely sensitive information; their disclosure may represent a brutal violation of the privacy protection rights established in a growing number of laws and regulations internationally.
In its introductory part, the course will introduce:
On this basis, the course shall focus on the issue of privacy and anonymity in location-aware and movement-aware data, presenting an original account of the newly emerging and active research areas of:
Philippe Bonnet
Testing Sensor Networks Applications
Sensor networks promise to allow scientific data acquisition via in situ sensing at unprecedented density and scale. In particular, long-term monitoring programs in the domains of earth sciences, biodiversity, or biology in remote or harsh environments such as the polar regions, deep-sea locations, or other planets are ideal targets.
In the first part of this talk, I will review the latest developments in the area of sensor network systems. The good news is that sensor networks are no longer hard to deploy, or unreliable.
In the second part of the talk, I will focus on the use of testbeds to experiment with sensor network applications in general and knowledge discovery in particular. I will first define testing in the context of sensor networks. This turns out to be an open research area: how to ensure that a sensor network application will meet its requirements in terms of functionality and performance. I will then review a representative subset of the existing testbeds and illustrate how they can be used to experiment with sensor network-based knowledge discovery.
Joao Gama
Learning from Data Streams
This talk focuses on learning in dynamic environments with distributed sources of continuous data and computing with resource constraints.
Learning in these environments faces new challenges: we need to continuously maintain a decision model consistent with the most recent data. Desirable properties of learning algorithms include the ability to maintain an anytime model, to modify the decision model whenever new information is available, to forget outdated information, to detect and react to changes in the underlying process generating the data, to monitor the learning process, and to manage the trade-off between the cost of updating a model and the resulting performance gains. In this talk we illustrate these ideas in two learning tasks, clustering and predictive learning, and present illustrative algorithms for each.
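The properties listed above (an anytime model, incremental updates, forgetting, and change detection) can be illustrated with a small sketch. It is not taken from the talk: it pairs a running mean, standing in for the decision model, with the Page-Hinkley test, a standard change detector for streams. The class names, the `delta` and `threshold` values, and the synthetic stream are all assumptions chosen for illustration.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.05, threshold=5.0):
        self.delta = delta          # tolerated deviation (assumed value)
        self.threshold = threshold  # alarm level (assumed value)
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0       # cumulative deviation from the running mean
        self.min_cum = 0.0   # smallest cumulative deviation seen so far

    def update(self, x):
        """Consume one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


class RunningMean:
    """Anytime model: its estimate is usable after every single update."""

    def __init__(self):
        self.n = 0
        self.value = 0.0

    def update(self, x):
        self.n += 1
        self.value += (x - self.value) / self.n


def run_stream(stream):
    """Process the stream in one pass, forgetting the model on each drift."""
    model, detector = RunningMean(), PageHinkley()
    drift_points = []
    for t, x in enumerate(stream):
        model.update(x)
        if detector.update(x):
            drift_points.append(t)
            model = RunningMean()  # forget outdated information
            detector.reset()
    return model.value, drift_points


# Synthetic stream (assumption): the mean jumps from 0 to 1 at position 100.
stream = [0.0] * 100 + [1.0] * 100
estimate, drift_points = run_stream(stream)
```

Resetting the model on an alarm is the crudest form of forgetting; sliding windows or exponential weighting are gentler alternatives that trade reaction speed against stability.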
Andreas Hotho
Ubiquitous Data Mining for Web 2.0
The lecture gives an overview of the currently emerging Web 2.0 with a special focus on folksonomies and social bookmarking. It begins with a formalization of folksonomies as hypergraphs, which results in two major challenges: (1) the problem of coping with the enormous size of the available data, due to the large number of users of systems like del.icio.us, and (2) the complex and rich structure of hypergraphs, requiring the introduction of new graph measures as well as new strategies for data analysis. To learn more about the structure of folksonomies we present methods to analyze hypergraphs based on standard graph measures, suitably adapted to hypergraphs, as well as projections of the folksonomy onto simpler graph structures. We focus on clustering measures and on the analysis of tag co-occurrence graphs. In a next step, we show how it is possible to introduce several notions of similarity between the nodes of a folksonomy (resources, users, tags) and how such measures can be used to mine for structures in the folksonomy. In particular, we show how clusters of resources can be identified and characterized. Finally, we present solutions for practical problems like the ranking of resources and tag recommendations in folksonomies. We show how statistical relations between tags can be mined to infer semantic relations between them, and how these relations can in turn be used to build new views on the data. For ranking, we adopt PageRank and FolkRank, a ranking algorithm developed for hypergraphs, as first approaches. The application of association rules on different projections of the hypergraph allows us to extract relations between all modes of a folksonomy, namely tags, users, and resources, with the goal of extracting more semantics and gaining insights into the behaviour of users.
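As a rough illustration of projecting a folksonomy onto a simpler graph, the sketch below builds a tag co-occurrence graph from (user, tag, resource) assignments. It assumes one common definition, in which two tags co-occur when the same user attaches both to the same resource; the lecture may use a different projection. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tag_cooccurrence(assignments):
    """Return edge weights of the tag co-occurrence graph.

    assignments: iterable of (user, tag, resource) triples, i.e. the
    hyperedges of the folksonomy. The weight of edge (a, b) is the number
    of posts (user, resource pairs) in which tags a and b appear together.
    """
    posts = defaultdict(set)
    for user, tag, resource in assignments:
        posts[(user, resource)].add(tag)

    weights = Counter()
    for tags in posts.values():
        # Sort so that each undirected edge has one canonical key.
        for a, b in combinations(sorted(tags), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy folksonomy.
assignments = [
    ("alice", "python", "r1"), ("alice", "code", "r1"),
    ("bob", "python", "r1"), ("bob", "code", "r1"),
    ("bob", "web", "r2"),
    ("alice", "web", "r2"), ("alice", "python", "r2"),
]
weights = tag_cooccurrence(assignments)
```

The resulting weighted graph is an ordinary graph over tags, so standard measures such as clustering coefficients can be applied to it directly, at the cost of discarding who tagged what.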
Murat Kantarcioglu
Privacy-preserving Distributed Data-Mining (PPDDM) for Ubiquitous Knowledge Discovery (tentative syllabus)
Course description: In this seminar, we will cover the basic aspects of privacy-preserving distributed data mining. We will discuss the adversarial models and the basic cryptographic techniques used in the privacy-preserving distributed data mining protocols. Also, we will show how to use the basic cryptographic techniques to build secure sub-protocols such as secure dot product. Finally, we will discuss how to securely combine the secure sub-protocols for building privacy-preserving distributed data mining protocols.
Learning outcomes: Participants of this seminar will learn:
a) Overview of cryptographic techniques
b) Introduction to secure multi-party models for PPDDM
c) Basic subprotocols for PPDDM algorithms
- Secure Set Operations
- Secure Addition
- Secure Dot Product
- Secure Comparison
d) Examples:
- Privacy-preserving Association Rule Mining on Horizontally partitioned data
- Privacy-preserving Association Rule Mining on Vertically partitioned data
- Privacy-Preserving ID3 construction
e) Applications of PPDDM for sensor networks, embedded systems and secure coprocessors.
f) Future Directions
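To give a flavour of the "Secure Addition" sub-protocol named in the syllabus, here is a minimal sketch based on additive secret sharing. It is not necessarily the protocol taught in the seminar. It assumes semi-honest parties, non-negative inputs whose total is below the modulus, and it simulates all parties inside one process; the function names and the modulus are illustrative choices.

```python
import secrets

MODULUS = 2**61 - 1  # assumed to exceed any sum we want to compute


def make_shares(value, n_parties, modulus=MODULUS):
    """Split value into n_parties random shares that sum to it mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate secure addition among len(private_values) parties.

    Party i splits its input into shares and sends share j to party j.
    Each party then publishes only the sum of the shares it received,
    so no single published number reveals an individual input.
    """
    n = len(private_values)
    # sent[i][j] is the share that party i sends to party j.
    sent = [make_shares(v, n, modulus) for v in private_values]
    published = [sum(sent[i][j] for i in range(n)) % modulus for j in range(n)]
    return sum(published) % modulus


total = secure_sum([12, 30, 5])
```

In a real deployment each party would run on its own machine and the shares would travel over private channels; the arithmetic stays the same.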
Hillol Kargupta
Algorithmic Foundations of Distributed Data Mining
Distributed data mining (DDM) deals with the problem of analyzing distributed, possibly multi-party data by paying attention to the computing, communication, storage, and human factors-related issues in a distributed environment. Unlike the conventional off-the-shelf centralized data mining products, DDM systems are based on fundamentally distributed algorithms that do not necessarily require centralization of data and other resources. This course will offer an exposure to the algorithmic foundation of DDM. It will first discuss distributed algorithms for computing statistical and algebraic primitives that are important for developing advanced distributed algorithms for data analysis. We will consider both deterministic and non-deterministic approaches. The following part of the course will focus on a few advanced DDM algorithms that make use of the distributed statistical and algebraic primitives discussed earlier. We will particularly explore distributed algorithms for clustering and classification. Both parts of the discussion will be followed by hands-on exercises in the class. We will use the DDM toolkit for instructional purposes. The course will end with discussions on how these algorithms can blend in various frameworks such as privacy-preserving data mining, resource-constrained mining in sensor networks, and data stream mining.
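One example of a distributed statistical primitive computed without centralizing data is the global mean. The sketch below uses randomized pairwise gossip averaging, a well-known non-deterministic approach; it is offered only as an illustration and is not drawn from the course material or the DDM toolkit. The function name, the fully connected topology, and the parameter values are assumptions.

```python
import random


def gossip_average(values, rounds=2000, seed=0):
    """Estimate the global mean at every node by pairwise averaging.

    In each round two distinct nodes are picked at random and both replace
    their local value with the pair's average. The sum over all nodes is
    preserved, so every local value drifts toward the global mean.
    Assumes any two nodes can talk to each other (complete graph).
    """
    rng = random.Random(seed)  # seeded only to make the demo repeatable
    state = [float(v) for v in values]
    n = len(state)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)
        avg = (state[i] + state[j]) / 2.0
        state[i] = state[j] = avg
    return state


# Hypothetical local measurements held by five nodes; the true mean is 40.
local_values = [10, 20, 30, 40, 100]
estimates = gossip_average(local_values)
```

Each node exchanges a single number per round and never sees the full dataset, which is what makes this kind of primitive attractive as a building block for distributed clustering and classification.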
Peter Marwedel
Embedded Systems in a Nutshell
According to many forecasts, the largest growth factor for applications of information technology will be in embedded systems. Embedded systems are information processing systems that are embedded into a surrounding environment such as cars, trains, airplanes, fabrication equipment, etc. This tutorial will provide an overview of some of the key areas in "embedded system design", essentially following an updated structure of our textbook covering the area. The tutorial will start with a description of the characteristics of embedded systems. We will then present some specification techniques, using different models of computation. We will show that specification techniques are far from ideal and that even very fundamental techniques in computer science can be questioned. Some key issues in the design of embedded system execution platforms will be presented next. These issues include efficiency, predictability and security. This presentation will be followed by an introduction to techniques for mapping applications to execution platforms. Finally, we will briefly cover evaluation and optimization techniques. The tutorial will comprise an integrated lab.
Rasmus Pedersen
Machine Learning in Wireless Sensor Networks using NXTMOTE
Embedded machine learning is an interesting interdisciplinary topic. It poses several difficult research challenges and offers promising application domains. In this tutorial the participants will see how a certain class of learning algorithms fits the embedded and distributed learning space. In embedded and distributed machine learning we are faced with challenges such as small processors, limited memory, low communication bandwidth, and limited battery supply. TinyOS is a small embedded operating system particularly well suited for energy-aware machine learning. We will introduce the support vector machine, TinyOS (language and toolchain) and the LEGO MINDSTORMS NXT. Within this wireless sensor network system, we work through examples with embedded support vector machines and NXT-based motes. The objective is to indicate how selected constraints for embedded learning (limited memory, for example) can map toward a specific algorithm, and thus we hope that the participants can apply similar techniques toward their favorite combination of algorithm and embedded system.
Michèle Sébag
Toward Behavioural Modelling of a Grid System: Mining the Logging and Bookkeeping files of the EGEE grid
The course will describe some approaches for handling complex datasets and clustering them. The motivating application is the mining of log files generated by grid systems. Grid systems are complex heterogeneous systems, and their self-management constitutes a highly challenging goal pertaining to the field of Autonomic Computing. A first step toward this goal is concerned with mining the Logging and BookKeeping files gathered by the grid broker, describing the lifecycle of the jobs submitted to the grid. Specifically, the point is to discover meaningful job categories, refining the coarse distinctions between e.g. "done" and "failed" jobs. Some critical aspects regard the size of the dataset (a few Gigabytes) and the heterogeneity of the jobs, e.g. including a variable number of episodes and related to diverse types of applications, contexts, users.
The first part of the course is concerned with the redescription of heterogeneous datasets. The course will survey, among others:
- dimensionality reduction approaches;
- static and/or dynamic propositionalization, mapping structured examples onto (fixed size) vectors.
The second part of the course is concerned with finding relevant clusters in the Logging and BookKeeping files, based on the chosen redescription. The course will survey new fundamental results about clustering and stability criteria (Meila 2005, 2006), and discuss how these criteria can be made tractable in the case of huge datasets.