Program

All lectures, coffee breaks and the young scientist conference take place at 3 Kantemirovskaya St., on the fourth floor.

	Mon, August 24		Tue, August 25		Wed, August 26		Thu, August 27		Fri, August 28
9:10–9:30	Opening 435
9:30-11:00	Pardalos 435		Pardalos 435		Karakitsiou 435	Sebastiani	Fortunato 435
11:30-13:00	Migdalas 435	Sebastiani 436	Karakitsiou 435	Sebastiani 436	Demartini 435	Kanoulas 436	Yandex 435		Fortunato 435
14:00-15:30	Migdalas 435	Sebastiani 436	Demartini 435	Kanoulas 436	Demartini 435	Kanoulas 436	Raigorodsky 435	Laptev 436	Mail.ru 435
16:00-17:30	Kamps, Clarke, Kiseleva, Yang 435		Demartini 435	Kanoulas 435	Kamps, Clarke, Kiseleva, Yang 435		Raigorodsky 435	Laptev 436	Raigorodsky 435	Laptev 436
18:00-19:30	Kamps, Clarke, Kiseleva, Yang 435		YSC 18.00-18.10 18.10-18.20 18.20-18.30 Poster session hall		YSC 18.00-18.10 18.10-18.20 18.20-18.30 Poster session hall		Kamps, Clarke, Kiseleva, Yang 435		Raigorodsky 435	Laptev 436
	Welcome party 20:00 Hall		Kamps, Clarke, Kiseleva, Yang (Hackathon)	Poster session hall			RuSSIR party - Boat trip 20:30 Makarova embankment 20		Closing 19:30–19:50 435

Summer School Venue Map

Data Science for Massive (Dynamic) Networks

Panos M. Pardalos (University of Florida)

Data science tools, such as data mining and optimization heuristics, have been used to analyze many large (and massive) datasets that can be represented as networks. In these networks, certain attributes are associated with vertices and edges. This analysis often provides useful information about the internal structure of the datasets the networks represent. We are going to discuss our work on several networks from telecommunications (call graph), finance (market graph), social networks, and neuroscience.

Community Detection in Networks

Santo Fortunato (Aalto University)

Detecting communities in networks is one of the most popular topics of network science. Communities are usually conceived as subgraphs of a network, with a high density of links within the subgraphs and a comparatively lower density between them. The existence of community structure indicates that the nodes of the network are not homogeneous but divided into classes, with a higher probability of connections between nodes of the same class than between nodes of different classes. This can have various reasons. In a social network, for instance, communities could be groups of people with common interests, or acquaintanceships; in protein interaction networks they might indicate functional modules, where proteins with the same function frequently interact in the cell, hence they share more links; in the web graph, they might be web pages dealing with similar topics, which therefore refer to each other.
Fortunato Lecture RuSSIR
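
As a rough illustration of the definition above, the following Python sketch compares the density of links inside communities with the density of links between them. The function name `link_densities`, the toy graph, and the community labels are made up for this example and are not taken from the lecture.

```python
from itertools import combinations


def link_densities(edges, community_of):
    """Return (internal_density, external_density) for an undirected graph.

    edges: iterable of (u, v) pairs.
    community_of: dict mapping each node to its community label.
    A density is the fraction of node pairs of that kind that are linked.
    """
    edge_set = {frozenset(edge) for edge in edges}
    internal_pairs = internal_links = 0
    external_pairs = external_links = 0
    for u, v in combinations(list(community_of), 2):
        linked = frozenset((u, v)) in edge_set
        if community_of[u] == community_of[v]:
            internal_pairs += 1
            internal_links += linked
        else:
            external_pairs += 1
            external_links += linked
    internal = internal_links / internal_pairs if internal_pairs else 0.0
    external = external_links / external_pairs if external_pairs else 0.0
    return internal, external


# Toy graph (invented): two triangles joined by a single bridge edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_communities = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
internal, external = link_densities(toy_edges, toy_communities)
print(f"internal density: {internal:.3f}, external density: {external:.3f}")
```

On this toy graph every within-community pair is linked, while only one of the nine cross-community pairs is, which is the pattern the abstract describes.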

Big Data driven Logistics

Athanasios Migdalas (Lulea University of Technology)

The purpose of this course is to give an overview of a quite recent, exciting and very active field, that of Big Data Analytics, typically referred to as Data Mining, and to emphasize its application to Logistics and Supply Chain Management.
The topics to be discussed during the course include:

A. General

What is Big Data?

What do we mean by Analytics and what is its Lifecycle?

What are the basic Data Analytic Methods?

What are the advanced methods?

Programming Languages, Technology and Tools

B. Logistics

How can Big Data be used to transform the Supply Chain into an intelligent one and improve its visibility?

Impact on the components of the Supply Chain

Demand forecast and matching supply and demand

Measures and Metrics

Difficulties and obstacles to implementation

Expected benefits

Research examples (VRP and others)

Big Data Analytics with R

Athanasia Karakitsiou (Lulea University of Technology)

This course is a follow-up and a complement to the course "Big Data Driven Logistics". It is a technical introductory guide to algorithms in big data analytics using the R language. R is open source and free. It is an extremely powerful and flexible tool for data analysis, equipped with numerous packages for time series analysis, data mining, predictive modeling and optimization. Both R and its various packages can be downloaded free of charge from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org

In the course we follow a hands-on approach and cover the following topics:

Introduction to R

Time Series Decomposition

Time Series Correlation

Forecasting Strategies

Linear Regression

Nonlinear Regression

Linear Classification

Nonlinear Classification

Clustering

In particular, methods useful in Data Mining such as ARMA, ARIMA, Support Vector Machines (SVM), Naive Bayes, Neural Networks, Discriminant Analysis, k-Center, and Hierarchical Clustering will be considered.
Karakitsiou Lecture Time Series
Karakitsiou Lecture 2

Text Quantification

Fabrizio Sebastiani (Qatar Computing Research Institute)

In recent years it has been pointed out that, in a number of applications involving text classification, the final goal is not determining which class (or classes) individual unlabelled documents belong to, but determining the prevalence (or “relative frequency”) of each class in the unlabelled data. The latter task is known as text quantification (or prevalence estimation, or class prior estimation).

The goal of this course is to introduce the audience to the problem of quantification, to the techniques that have been proposed for solving it, to the metrics used to evaluate them, to its applications in fields such as information retrieval, machine learning, and data mining, and to the problems that are still open in the area. The emphasis of the course will be on text quantification.

Sebastiani Quantification RUSSIR 2015
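
A minimal sketch of the task for the binary case, written in Python. "Classify and count" estimates the prevalence of the positive class as the fraction of documents a classifier labels positive; "adjusted classify and count" corrects that estimate using the classifier's true-positive and false-positive rates measured on held-out labelled data. Both are well-known baselines in the quantification literature, but whether and how the course presents them is an assumption, and all numbers below are invented.

```python
def classify_and_count(predictions):
    """Fraction of items predicted positive (predictions are 0/1 labels)."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate for classifier bias.

    tpr and fpr are the classifier's true-positive and false-positive
    rates, assumed to have been estimated on held-out labelled data.
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to apply")
    estimate = (classify_and_count(predictions) - fpr) / (tpr - fpr)
    # Clip to [0, 1], since the raw correction can leave that range.
    return min(1.0, max(0.0, estimate))


# Invented example: 100 unlabelled documents, 40 of them predicted positive.
toy_predictions = [1] * 40 + [0] * 60
cc_estimate = classify_and_count(toy_predictions)
acc_estimate = adjusted_classify_and_count(toy_predictions, tpr=0.8, fpr=0.2)
print(f"classify and count: {cc_estimate:.3f}")
print(f"adjusted classify and count: {acc_estimate:.3f}")
```

The point of the sketch is that the quantity of interest is a single prevalence estimate for the whole unlabelled set, not the label of any individual document.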

Leveraging Knowledge Graphs for Web Search

Gianluca Demartini (University of Sheffield)

Knowledge Graphs (KGs) contain structured information about entities such as persons, locations, and organizations. Modern Web search engines leverage such KGs to enable entity-oriented search by displaying, in search engine result pages, so-called entity cards that summarize the main facts about the queried entity.

In this course we will first introduce the main concepts around KGs and the "Web of data", and then discuss techniques for mining the Web for entities and using KGs to create an entity-centric user experience on the Web. The first lecture will discuss the Linked Open Data initiative, including popular KGs such as Freebase, DBpedia, and Wikidata. Next, we will introduce Named Entity Recognition and Linking techniques that can be used to identify entity mentions in textual content and to disambiguate and connect them to entities in a background KG. Then, entity search techniques will be presented, including how to index graph-shaped data to answer queries that look for specific entities. Finally, we will introduce micro-task crowdsourcing techniques and discuss how to apply them to improve data quality (e.g., correctness, completeness, freshness) in KGs.

Demartini KGs 1 Intro
Demartini KGs 2 Ner
Demartini KGs 3 Search
Demartini KGs 4 Crown

Online/Offline Evaluation of Search Engines

Evangelos Kanoulas (University of Amsterdam)

The growth of the Web and the consequent need to organize and search a vast amount of information has demonstrated the importance of Information Retrieval (IR), while the success of web search engines has proven that IR can provide eminently valuable tools to manage large amounts of data. Evaluation has played a critical role in the success of IR. There is an arsenal of methods at hand that researchers and practitioners use to evaluate an experimental search system and compare it to the production system; the course will focus on the two predominant paradigms: collection-based evaluation and in-situ evaluation.
Collection-based evaluation is performed offline, in a laboratory setting. A test collection, comprising benchmark documents, a sample of user queries, and human judgment labels of the relevance of each document to each query, together with an evaluation measure that summarizes the relevance of the ranked list of documents returned in response to a query, is used to assess the effectiveness of a retrieval system. On the other hand, in-situ evaluation is run online, by deploying an experimental system and running user queries against both the experimental and the production system. A/B testing and interleaving provide between-subject and within-subject experimental designs, respectively, for running such experiments.

Kanoulas On Off Evaluation RuSSIR Part I
Kanoulas On Off Evaluation RuSSIR Part II

The course will cover the latest advances in both paradigms of search engine evaluation. Topics will include click models and model-based measures, measures for complex retrieval scenarios, statistical inference frameworks that allow hypothesis testing in complex experimental designs, and state-of-the-art A/B testing and interleaving methods. Special focus will be given to recent work that attempts to bridge the gap between these two evaluation paradigms, e.g. methods to predict the results of an A/B or an interleaving test from offline historical data, and collection-based evaluation frameworks with the human in the loop.
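
To make the collection-based paradigm concrete, the Python sketch below scores two rankings of the same documents against a set of relevance judgments. It uses nDCG, one commonly used evaluation measure; the abstract does not name a specific measure, so that choice, the function names, and the toy judgments are all assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_doc_ids, judgments, k=10):
    """nDCG@k of a ranking, given a dict of doc id -> graded relevance.

    Documents without a judgment are treated as non-relevant (gain 0).
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented judgments for a single query (2 = highly relevant, 0 = not relevant).
toy_judgments = {"d1": 2, "d2": 0, "d3": 1}
production_ranking = ["d2", "d3", "d1"]
experimental_ranking = ["d1", "d3", "d2"]

production_score = ndcg(production_ranking, toy_judgments)
experimental_score = ndcg(experimental_ranking, toy_judgments)
print(f"production nDCG: {production_score:.3f}")
print(f"experimental nDCG: {experimental_score:.3f}")
```

In practice such scores would be averaged over the whole query sample and compared with a statistical test before drawing any conclusion about the two systems.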

Models of Random Graphs and their Applications to the Web-graph analysis

Andrey Raigorodsky (Moscow Institute of Physics and Technology, Moscow State University, Yandex)

In our lectures, we shall give a survey of various models of random graphs and their applications. Starting from the classical Erdős–Rényi model and its applications to network reliability, we shall proceed to the most recent models describing the topology and growth of the Internet, of social networks, of economic and biological networks, etc. Finally, we shall talk about several applications to search and crawling.
Raigorodsky Present 2
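
For readers unfamiliar with the classical model mentioned above, here is a small Python sketch of its G(n, p) variant, in which each of the n(n-1)/2 possible edges is included independently with probability p. The function name `erdos_renyi`, the parameter values, and the seed are illustrative choices, not material from the lectures.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Sample an undirected G(n, p) graph and return its edge list.

    Nodes are 0..n-1; every pair (u, v) with u < v becomes an edge
    independently with probability p.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


n, p = 200, 0.05
sample_edges = erdos_renyi(n, p, seed=0)
expected_edges = p * n * (n - 1) / 2  # mean number of edges under the model
print(f"sampled edges: {len(sample_edges)}, expected: {expected_edges:.0f}")
```

The sampled edge count should fall close to the expected value, with fluctuations that shrink in relative terms as n grows.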

Visual object recognition and localization

Ivan Laptev (INRIA Paris-Rocquencourt)

The goal of the course is to introduce state-of-the-art methods for large-scale image recognition and retrieval. Participants will learn about industry-strength techniques enabling efficient search for particular object instances among billions of images. Participants will also learn about the most recent advances in Deep Learning enabling close-to-human performance for such tasks as face recognition and object category recognition.

This course will introduce modern computer vision techniques for object recognition and will provide students with practical hands-on experience in state-of-the-art methods for object recognition in images. The course will contain lectures and practical sessions. The lectures will cover recent image representations for object recognition (HOG, SIFT, DPM, BOF, CNN) as well as modern machine learning techniques (SVM, CNN / Deep Learning). Besides lectures, the course will include guided practical sessions where students will implement basic techniques for object recognition.
Laptev RuSSIR 15 Part 01 Intro
Laptev RuSSIR 15 Part 02 Instance

Contextual Search and Exploration

Charles L. A. Clarke (University of Waterloo)
Jaap Kamps (University of Amsterdam)
Julia Kiseleva (Eindhoven University of Technology)
Grace Hui Yang (Georgetown University)

The ubiquitous availability of information on the web and personalized (mobile) devices has a revolutionary impact on modern information access, challenging both research and industrial practice. Searchers with a complex information need typically slice-and-dice their problem into several queries and subqueries, and laboriously combine the answers post hoc to solve their tasks. Rich context allows for far more powerful, personalized search, without the need for users to write long complex queries. Consider planning a social event on the last day of RuSSIR, in the unknown city of Saint Petersburg, factoring in distances, timing, and preferences on budget, cuisine, and entertainment. Rich context and profiles in combination with a curated set of web data allow us to solve complex tasks with just a simple query: entertain me. Rather than retrieving a "document" on the topic of a "query," rich contextual information allows for tailored search and recommendation, helping users solve their complex tasks by taking into account complex constraints, exploring options, and combining individual answers into a coherent whole.

This course discusses the challenges of contextual search and recommendation, with a concrete focus on the venue recommendation task run as part of TREC 2012-2015. It will consist of both lectures and hands-on sessions with data derived from the TREC task. It will enable students to understand the challenges and opportunities of contextualized search over entities, learn effective approaches for the concrete application to the venue recommendation domain, and get hands-on experience with developing and evaluating personalized search and recommendation approaches.
Kiseleva RuSSIR 2015

Efficient Online Experiments

Eugene Kharitonov (Yandex)

Online evaluation methods, such as A/B and interleaving experiments, are widely used for search engine evaluation in the industrial setting. Since they rely on noisy implicit user feedback, running each experiment takes a considerable time and a share of the query traffic. How to overcome this limitation is an important scientific and industrial problem.

In this talk, we will discuss two recently proposed approaches to reducing the duration of online evaluation experiments. Firstly, we will discuss how sequential statistical testing methods can be applied in the online evaluation scenario. Such sequential testing procedures allow an experiment to stop early, once the data collected is sufficient to draw a conclusion. Secondly, we will discuss a new modification of the widely used Team Draft interleaving algorithm, Generalised Team Draft. Generalised Team Draft achieves a faster convergence rate by performing a joint data-driven optimization of interleaving parameters, as well as by using a stratified estimate of the experiment outcome. Finally, Generalised Team Draft is formulated to be applicable in domains with a grid-based result representation, such as image search, thus offering a sensitive alternative to A/B testing in these domains.
Yandex Talk
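
For background, the Python sketch below shows the standard Team Draft interleaving procedure as it is commonly described, not the Generalised Team Draft variant presented in the talk. In each round the team with fewer picks (or a coin flip on ties) contributes its highest-ranked document that is not yet in the interleaved list, and clicks are later credited to the team that contributed the clicked document. Function names, rankings, and clicks are invented for illustration.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, seed=None):
    """Interleave two rankings; return (interleaved docs, team of each doc)."""
    rng = random.Random(seed)
    interleaved, teams = [], []
    count_a = count_b = 0
    while len(interleaved) < length:
        remaining_a = [d for d in ranking_a if d not in interleaved]
        remaining_b = [d for d in ranking_b if d not in interleaved]
        if not remaining_a and not remaining_b:
            break  # both rankings are exhausted
        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if (a_turn and remaining_a) or not remaining_b:
            interleaved.append(remaining_a[0])
            teams.append("A")
            count_a += 1
        else:
            interleaved.append(remaining_b[0])
            teams.append("B")
            count_b += 1
    return interleaved, teams


def credit_clicks(teams, clicked_positions):
    """Count clicks per team, given zero-based clicked positions."""
    clicks_a = sum(1 for pos in clicked_positions if teams[pos] == "A")
    clicks_b = sum(1 for pos in clicked_positions if teams[pos] == "B")
    return clicks_a, clicks_b


# Invented rankings from two systems for the same query.
shown, shown_teams = team_draft_interleave(
    ["a", "b", "c"], ["b", "d", "a"], length=4, seed=1
)
clicks_a, clicks_b = credit_clicks(shown_teams, clicked_positions=[0])
print(shown, shown_teams, clicks_a, clicks_b)
```

Aggregated over many impressions, the system whose team collects more clicks is preferred; because each user sees results from both systems at once, the comparison is within-subject.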

User behavior analysis in Mail.Ru Group

1. Does the size matter? Smart Data at OK.ru

Dmitry Bugaychenko (Odnoklassniki)

"Big data" is one of the top buzzwords in the IT world, but is it really about the size? We are going to discuss a set of cases from OK.ru where not-so-big data and a set of well-tuned algorithms were used to significantly improve our services. To begin with, we consider some classical behavioral analysis and collaborative filtering algorithms and show how domain knowledge might help to improve a recommender system. Then we switch to latent semantic mining algorithms and show how to generalize them not only to text mining, but also to mining semantics from users' behavior patterns. It is often the case that a good algorithm fails in production due to slow reaction to new data, thus extending those algorithms with real-time updates is crucial. We are going to describe the infrastructure and approach we use to achieve real-time speed for well-known "slow" algorithms.

2. Learning to rank using clickthrough data

Vladimir Gulin (Search@Mail.Ru)

Which page does a user want to retrieve when he types a query into a search engine?
There are millions of pages that contain words from the query, but only a small subset of these pages is interesting for the user.
If we knew the set of pages actually relevant to the user's query, we could use this as training data for optimizing the ranking formula.
In my short talk I will propose one approach to collecting these relevant pages from user feedback and show how we apply this approach in our search engine.
