Summer-school Program

Timetable	Participants	Module
Monday, June 18, 2018
09:00-09:00	Genoveva Vargas Solar CNRS LIG	Introduction to #BDAMDD
09:00-12:00	Michalis Vazirgiannis LIX, École Polytechnique	Graph based text Mining and the case of the Greek Web Archiving
14:00-17:00	Helena Galhardas Instituto Superior Técnico (IST), Lisbon	Data Quality Assessment and Improvement
Tuesday, June 19, 2018
09:00-10:30	Vassilis Christophides Inria Paris	IoT analytics
11:00-12:30	Khalid Belhajjame University Paris Dauphine	Computational Reproducibility: Workflows, Provenance and Scripts
14:00-17:00	Nicolas Travers CNAM-Paris	Conducting Experiments (TP)
Wednesday, June 20, 2018
09:00-12:00	Laurent d'Orazio University of Rennes	Big data management and a practical experience with TPC-H
14:00-17:00	Social event	MDD Olympic games (football, pétanque, table football, ping pong and billiards) starting at 14:30, or one of the hikes described below.
Hiking for newbies (short): 1:15 to 1:30, with two possibilities: Balade Jomier short / Marabout (see pictures below). This option leaves time to take part in the MDD Olympic Games.
Hiking for newbies (long): 3 hours, with one possibility: Balade Jomier long. It is possible to take part partially in the MDD Olympic Games (indoor activities).
Hiking for « pros » (for those who are used to hiking in the mountains): in the afternoon (contact us for details). We will start at the parking area (not shown on the map, but at the edge of the arrow) and go to the "ARRIVAL" arrow using the red path (Hike's map). The path marked GR5 is actually a small road that is easy to walk. We will probably use it for the return, and go out on the path that is closer to the lake. The optional extension is for those who want it.
I (Luc) will be organizing this hike (4 hours, or more if you want). Unfortunately I am not a mountain guide, so you should be comfortable with hiking in the mountains, despite potential rain or other problems. The route goes from the lakes (lac d'Amont) to the refuge de la dent Parrachée via the refuge de la Fournache, passing on the right of the lake. Overall there are no major difficulties (a few snowfields, but not dangerous; we have tried), but you should be in good shape for a 4-hour hike. If you want to do more, we could go from the refuge to le grand Chatelard (around 1.5 hours more), and it is wonderful (we tested).
You should have good shoes. Either you are confident in your own shoes, or you can get trekking shoes at the CAES Paul Langevin (go in the morning, following "magasin" from the desk). Tell us right now if you want to do the trek, so that we can ask for a picnic tomorrow. Do not forget to pick up shoes at the Magasin during the morning.
Thursday, June 21, 2018
08:30-12:30	Vincent Leroy University of Grenoble Alpes	Data analysis on Apache Spark (with a break at 10:00)
14:00-17:00	Mustapha Lebbah University of Paris 13	Scalable Machine learning

Helena Galhardas

Helena Galhardas is an Assistant Professor in the Department of Information Systems and Computer Engineering of the Instituto Superior Técnico (IST), University of Lisbon, and a researcher at INESC-ID. Her current research topics focus on data cleaning and transformation, data profiling, and information extraction. Helena received her PhD in Informatics from the University of Versailles in 2001, with a thesis on "Data Cleaning: Model, Declarative Language and Algorithms".

Data Quality Assessment and Improvement

This lecture will be divided into two parts. The first part will introduce the notion of data quality. The main tasks of a data quality process will be presented: (i) data quality improvement, and (ii) data quality assessment. Regarding data quality improvement, we will focus on the task of approximate duplicate detection. In particular, we will detail how to ensure accuracy and efficiency when identifying approximate duplicates in large data sets.

In the second part of the lecture, we will detail data fusion, which aims at consolidating several records that refer to the same real-world entity (i.e., approximate duplicates). Finally, we will give an overview of the methods used for data quality assessment. The main tasks required for profiling a data set will be presented, as well as the main underlying techniques.

Nicolas Travers

Nicolas Travers has been an Assistant Professor at the Conservatoire National des Arts et Métiers (CNAM) since 2007. He works on query languages, data modeling and query optimization on various types of data. His main topics are: XQuery optimization in a distributed environment, continuous queries on XML flows (graph factorization & indexes), distributed datalog queries for multi-structured data management, Digital Score Library management, and recommendation systems for micro-blogging.

Conducting Experiments (TP)

This tutorial on conducting Computer Science experiments aims to give students a clearer view of what it means to conduct experiments properly in research. We will explore research methods in computer science, developing case studies from empirical approaches, demonstrations, quantitative and qualitative simulations or results. The peer-review process will also be covered in order to understand the whole process of paper writing.

PhD students will apply the tutorial to existing papers, proposing relevant experiments for a given topic. They will then review each other's proposals in a peer-review process. Finally, the best papers and proposed experiments will be discussed.

Vassilis Christophides

Prof. Vassilis Christophides was recently appointed to an advanced research position at Inria Paris. Previously, he worked as Distinguished Scientist at Technicolor, R&I Center in Paris and at the Computer Science Department of the University of Crete. He studied Electrical Engineering at the National Technical University of Athens (NTUA), Greece (July 1988), and received his DEA in computer science from the University PARIS VI (June 1992) and his Ph.D. from the Conservatoire National des Arts et Métiers (CNAM) of Paris (October 1996). His main research interests span Data Management and Web Information Systems, Big Data Processing and Analysis, as well as Personal Information Systems. He has published over 120 articles in high-quality international conferences, journals and workshops. Overall, he has received 5584 citations according to Google Scholar, with an h-index of 38. He was awarded the 2004 SIGMOD Test of Time Award and the Best Paper Award at the 2nd and 6th International Semantic Web Conference in 2003 and 2007. He served as General Chair of the joint EDBT/ICDT Conference in 2014 in Athens and as Area Chair for the ICDE “Semi-structured, Web, and Linked Data Management” track in 2016 in Helsinki, Finland.

IoT analytics

The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors embedded in the environment and from wearable or mobile user devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. However, the high data speed (Velocity) in conjunction with the low data quality (Veracity) of IoT data streams challenges traditional Machine-Learning (ML) approaches, which assume that a good-quality training set is available a priori to learn models that may be effectively applied to new data collected under very similar conditions. As previous work has observed, data quality issues are detrimental to data analysis. Good-quality training data are typically the result of thorough data pre-processing comprising data aggregation/integration, data cleaning/normalization, data dimensionality reduction, etc. The offline nature of these data engineering tasks is nowadays one of the biggest technical barriers to supporting high-value data analytics in real time for various IoT settings (e.g., residential, industrial, urban, etc.). Furthermore, existing techniques for data quality management are usually agnostic of the analytical process that is to be applied to the data. For this reason, analysts either clean everything, which is impossible for Big Data, or clean random subsets and hope for the best.

In this talk we will present representative use cases of IoT analytics along with the involved algorithmic and data quality issues.

Khalid Belhajjame

Khalid Belhajjame is an associate professor at the University Paris-Dauphine. Before moving to Paris, he was a researcher for several years at the University of Manchester, and prior to that a Ph.D. student at the University of Grenoble. His research interests lie in the areas of information and knowledge management. He has made key contributions in the areas of pay-as-you-go data integration, e-Science, scientific workflow management, provenance tracking and exploitation, and semantic web services. Most of his research proposals were validated against real-world applications from the fields of astronomy, biodiversity and life sciences.

Computational Reproducibility: Workflows, Provenance and Scripts

Over the last two decades, we have witnessed a paradigm shift in the way scientists conduct their experiments and analyses. Increasingly, scientists in areas such as the life sciences, astronomy and biodiversity perform their experiments and analyses using workflows and scripts. A key issue that has been actively investigated by the scientific research community, but also by publishers and government bodies, is the reproducibility of the experiments and analyses conducted by scientists, with a view to enabling their understanding, verification and ultimately reuse. This talk presents elements of solutions that were developed to improve the reproducibility of scientific workflows and scripts.

Michalis Vazirgiannis

Dr. Vazirgiannis is a Professor in LIX, École Polytechnique. He has worked as a researcher in several places: in the Knowledge & DB Lab (group, N.T.U. Athens), in GMD-IPSI (currently Fraunhofer - IPSI), Germany, in Fern-Universitaet Hagen, in project VERSO (later GEMO) in INRIA/Paris, in IBM India Research Laboratory and in Max Planck Institut für Informatik (Saarbruecken, Germany) in the group of G. Weikum. M. Vazirgiannis held a Marie Curie Intra-European fellowship (2006-2007) in the area of "P2P Web Search", hosted by INRIA FUTURS. His current research interests are in the area of big data mining, aiming at harnessing the potential of machine learning algorithms for large-scale data sets including text and graphs. More specifically, his current work is on graph degeneracy for large-scale graph mining, graph-based text retrieval, learning models from time series data and text mining for the web (e.g. advertising, news streams).

Graph based text Mining and the case of the Greek Web Archiving

Presentation of material from our international tutorials on Graph of Words and its applications to keyword extraction, summarization and document classification.

Greek Web archiving: we have been archiving the entire Greek Web for years (each snapshot is 14 TB of text). We built systems for collection (including focused collection) and also developed, for the first time, large-scale linguistic resources (dictionary, word embeddings, ...). The techniques are language-independent, as we use neural networks.

Laurent d'Orazio

Laurent d'Orazio has been a Professor at Univ. Rennes, CNRS, IRISA since 2016. He received his PhD degree in computer science from Grenoble National Polytechnic Institute in 2007. He was an Associate Professor at Blaise Pascal University and LIMOS CNRS, Clermont-Ferrand from 2008 to 2016. His research interests include (big) data algorithms and architectures, and distributed and parallel databases. He has published papers in Information Systems, Sigmod Record, and Concurrency and Computation: Practice and Experience. He has served on Program Committees for BPM and for workshops affiliated with VLDB, EDBT, etc., and on Reviewing Committees for Transactions on Parallel and Distributed Systems and Concurrency and Computation: Practice and Experience. He is or has been involved (sometimes as a coordinator) in research projects such as the NSF MOCCAD project (since 2013), the ANR SYSEO project (797 000 euros funding, 2010-2015) and the STIC ASIA GOD project (30 000 euros funding, 2013-2015).

Big data management and a practical experience with TPC-H

This course is an introduction to big data management and a practical experience with TPC-H. Attendees will get a taste of storing data on a distributed file system (namely HDFS) using different layouts (row, column, hybrid), and will execute queries with different engines (Spark, Tez, MapReduce).

Vincent Leroy

Vincent Leroy is an Associate Professor in Computer Science at Université Grenoble Alpes, and a member of the SLIDE team. His research focuses on data analytics in the context of large-scale distributed systems. Before joining UGA, he was a post-doctoral researcher at Yahoo! Barcelona. Vincent received his Ph.D. from INSA Rennes and INRIA in 2010, and his habilitation from Université Grenoble Alpes in 2017.

Data analysis on Apache Spark

This course will provide students with an introduction to data analysis on Apache Spark. The presentation will cover the RDD API and some functionalities of the MLlib library, with both batch and stream processing. The course will be followed by a Lab session on Twitter data analysis.

Requirements: Scala / SBT, Scala IDE

Mustapha Lebbah

Mustapha Lebbah is currently Associate Professor at the University of Paris 13 and a member of the Machine Learning Team A3, LIPN. His main research is centered on machine learning and data mining (unsupervised learning, self-organizing maps, probability and statistics, scalable machine learning and data science). He graduated from USTO University, where he received his engineering diploma in 1998. Thereafter, he gained an MSc (DEA) in Artificial Intelligence from Paris 13 University in 1999. In 2003, after three years at Renault R&D, he received his PhD degree in Computer Science from the University of Versailles. He received the "Habilitation à Diriger des Recherches" (accreditation to lead research) degree in Computer Science from Paris 13 University in 2012. He is a member of the French group on "complex data mining", and has been Secretary of the French Classification Society since November 2012.

Scalable Machine learning

This course will provide students with an introduction to scalable Machine Learning. I will describe recent work on scalable clustering. Through this course, I would like to teach students that building scalable models is not strictly a matter of computer engineering: the traditional steps of modeling and estimation remain essential. The course will be followed by a short Lab session using spark-notebook.io.

Masses de données distribuées