School program

The main program includes six main courses, a short guest course, sponsors' tech talks, and a one-day pre-school course. A preliminary schedule follows.

Time         Aug 5, Sun      Aug 6, Mon      Aug 7, Tue      Aug 8, Wed      Aug 9, Thu      Aug 10, Fri
9:30-11:00   registration    CLIR            Mail.Ru         Yandex          Google          -
11:30-13:00  IIR             CLIR            WebSci / MSMT   WebSci / MSMT   WebSci / MSMT   WebSci / MSMT
14:00-15:30  -               WebSci / MSMT   DSIR / PGMIR    DSIR / PGMIR    DSIR / PGMIR    ABBYY/HP
16:00-17:30  -               DSIR / PGMIR    IRCQA / DynWeb  IRCQA / DynWeb  IRCQA / DynWeb  DSIR / PGMIR
18:00-19:30  -               IRCQA / DynWeb  YSC             YSC             YSC             departure
after 20:00  registration    welcome party   -               sports          RuSSIR party    -

Cross-Language Information Retrieval and Beyond (invited mini-course)
Jian-Yun Nie

Cross-language information retrieval (CLIR) aims to find relevant documents that are written in a different language than the query. When there is not enough relevant information in the language of the user, the user may be interested in using CLIR to find more information originally written in other languages.
In this talk, we will describe the problems of CLIR and how it differs from the general machine translation task. The traditional approaches will be described, namely the ones based on machine translation, bilingual dictionaries, and parallel and comparable texts.
Although reasonable effectiveness can be obtained using these approaches, problems remain. We will further analyze the remaining problems and describe the current efforts to solve them. Finally, we will show that the approaches developed for CLIR can be adequately used for other IR tasks such as monolingual IR. This talk will provide a general introduction to the area of CLIR and suggest some interesting research problems for the future.
Slides
Video

An Introduction to Web Science: Observing the Online World to Study the Offline World
Ingmar Weber

In the interdisciplinary area of Web Science, the Web is not only studied as a technological infrastructure but is also used as a telescope to study society at large. Research questions from fields such as sociology or psychology, which were traditionally answered in small, time-consuming studies, are more and more often answered by observing and mining the online behavior of thousands or even millions of users.

In this course we will look at current research in this area and learn about the data sets, the tools and the techniques used. As example applications we will see how this type of research can be applied to (i) study the spread of influenza epidemics, (ii) identify political issues and assign a leaning to them, and (iii) study the development of children and how it is linked to demographic factors. The course will feature a small, informal competition to come up with concrete project proposals addressing any odd, creative or simply funny research question pertaining to offline human behavior using publicly accessible data on the Web. The common thread throughout the course will be the question of how findings from the online world can be linked to and verified against findings in the offline world.
Slides: part 1 part 2 part 3 part 4 part 5
Video

Domain Specific Information Retrieval
Allan Hanbury & Mihai Lupu
Domain-specific search engines only index documents relevant to a specific domain, such as health information or intellectual property information. As prior knowledge is available about the domain of interest, such search engines can be adapted to take advantage of this knowledge for improving search results. Furthermore, the users of these search engines often have specific requirements as to how the search engines should function.

This course begins with a general introduction to domain-specific search and the methods used, such as adapting vector space weighting and using specialized vocabularies. This is followed by more detailed coverage of two domains. In web search for health information, the quality and trustworthiness of the information are extremely important, as is the readability level – will this information be understood by a layperson or only by a medical professional? In the intellectual property area, professional patent searchers are faced with the task of finding all patents related to a query (high recall), even though patent documents are sometimes not written to be found easily. In both of these domains, image search is also extremely important. Finally, after an overview of evaluation, we will discuss evaluation campaigns and their results in the areas of medical information search and intellectual property search.
Slides: part 1 part 2 part 3 part 4 part 5
Video

The concept and feasibility of modern statistical machine translation: theory, practice and industry applications
Marta Ruiz Costa-jussa & Maxim Khalilov

Our world is currently in an era of globalization, which implies increasing interaction and the intertwining of different language communities. Machine translation technology is one of the core components of an efficient multilingual information environment and should be seen as a strategic issue in the framework of the modern multilingual community. Nowadays, statistical machine translation (SMT) is one of the most popular paradigms of MT.

The aim of the course is to teach students the background, theory and implementation behind SMT systems. The course covers aspects of SMT technology from formal description to implementation. It is divided into three parts: theoretical background of SMT; introduction to the main SMT software; and SMT industry perspectives, along with open-source and commercial products. At the end of the course, students will be aware of the taxonomy of approaches to SMT, the main research problems in the field, and the industrial applications of MT, and they will be able to train and test real SMT systems using the open-source Moses toolkit.
Slides: part 1 part 2 part 3 part 4
Video

Bibliography

Student seminar documents

Probabilistic graphical models for Information Retrieval
Guillaume Obozinski

This course provides an introduction to probabilistic graphical modeling in the context of information retrieval. Starting with a review of basic concepts from statistics, including notions of conditional independence and the maximum likelihood principle, the course will introduce the concept of factorization of a probabilistic model on a graph, the properties of such models and the associated semantics. The course will then gradually introduce new concepts and algorithms to perform inference and learning with graphical models through examples that are directly relevant to IR, including the Naive Bayes model, probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, and time-varying models.
Slides: part 1 part 2 part 3 part 4
Video

IR in Community Question-Answering
Chirag Shah

This course will introduce students to various challenges and opportunities in the rapidly growing area of community-based question-answering (CQA), with emphasis on content quality and ranking, as well as usage and user satisfaction. Question answering helps one go beyond traditional keyword-based querying and retrieve information in a more precise form than that given by a document or a list of documents. Several CQA services have emerged in recent years, allowing information seekers to pose their information needs as questions and receive answers from their fellow users. This has created new challenges for IR researchers relating to content ranking and evaluation within and outside a CQA site. The social/community aspect of IR is unique and important to address in the swiftly changing landscape of the Web.

The course will cover (1) understanding the structures and functionalities of a CQA service from a developer's view, (2) criteria relating to motivations and satisfaction from a user's view, and (3) methodology for collecting and analyzing CQA data from a researcher's view. In particular, the course will teach students tools and techniques for obtaining data from various CQA sites, extracting textual and non-textual features from this content, and using them to rank the content and evaluate its usefulness and relevance.
Slides
Video

Dynamics of Web: Analysis and Implications from Search Perspective
Ismail Sengor Altingovde & Nattiya Kanhabua

The dynamics of the Web and their implications for various components of search systems have attracted considerable attention in the last decade. This course first aims to introduce students to the broad topic of Web evolution, and then to explore a number of issues related to temporal aspects of search and IR. We plan to start with an overview of seminal works that shed light on the evolution of the Web over time. Next, we will turn to the impact of this evolution on search, focusing on indexing of versioned document collections and on time-aware retrieval and ranking. We will discuss the evolution of search results and invalidation mechanisms for query result caches, and wrap up the course with a review of some recent approaches that aim to predict and search the future!
Slides: part 1 part 2
Video

Introduction to Information Retrieval (one-day pre-school course)
Andrew Aksyonoff

An introductory course that covers all the basic concepts one needs to know to build keyword search engines from scratch. We will go over all the engine pipeline stages (indexing, searching, ranking, etc.) in reasonable detail, aided by practical examples from the lecturer's experience as a search engine developer.

Andrew Aksyonoff is the founder and CEO of Sphinx Technologies Inc.
Video

Language identification of documents and queries (Yandex)
Nikolai Grigoriev

Language identification is a relatively simple and well-solved task. In the talk, I will give an overview of existing standard techniques, and discuss their application to two text types: crawled Web documents and user search queries. Both present specific challenges:
- for Web documents - multilinguality, genre variability;
- for queries - they are just too short for reliable attribution: hence the need for extra data (user context) to resolve potential ambiguity.
I will talk about Yandex's efforts to cope with all of that.
Video

Introduction to active learning (Mail.Ru)
Alexey Voropaev

When you develop a system based on machine learning algorithms, you must solve three problems: find factors that make the problem solvable, choose an appropriate ML algorithm, and build a training set. The first two problems are well studied for different applications, including the learning-to-rank problem. But the selection of training examples is usually not controlled.

Active learning algorithms automatically select the examples that will be judged and added to the training set. Applying such algorithms usually lets you reduce the size of the training set and, at the same time, increase the quality of the resulting ML models.

In my short lecture I will show you why it is important to build a correct training set and which algorithms can be used to solve this problem. Knowledge of machine learning basics is helpful but not required.
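
As a rough illustration of the idea of selecting examples for judging (not taken from the lecture itself), the sketch below shows uncertainty sampling, one common active learning strategy: from a pool of unlabeled examples, pick those whose predicted probability is closest to 0.5. The names `select_uncertain` and `toy_model`, as well as the one-dimensional toy data, are assumptions made purely for this example.

```python
import math
from typing import Callable, List, Sequence


def select_uncertain(
    pool: Sequence[float],
    predict_proba: Callable[[float], float],
    k: int,
) -> List[int]:
    """Return indices of the k pool items the model is least sure about.

    Uncertainty is measured as the distance of the predicted
    positive-class probability from 0.5 (smaller = more uncertain).
    """
    ranked = sorted(
        range(len(pool)),
        key=lambda i: abs(predict_proba(pool[i]) - 0.5),
    )
    return ranked[:k]


def toy_model(x: float) -> float:
    """Hypothetical binary classifier: a logistic function of a 1-D input."""
    return 1.0 / (1.0 + math.exp(-x))


# Unlabeled pool; values near 0 are the ones the toy model is unsure about.
pool = [-3.0, -0.2, 0.1, 2.5, 0.05]
to_label = select_uncertain(pool, toy_model, k=2)
print(to_label)  # indices of the examples to send to human judges
```

In a real setting the selected examples would be judged, added to the training set, and the model retrained before the next selection round.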
Slides

Machine Translation at Google (Google)
David Talbot

Google Translate is a research-driven product that bridges the language barrier and makes it possible to explore the multilingual web in 63 languages. This talk will highlight some of the key research efforts that have made this possible: large-scale language models, large-scale parallel data mining, syntactic reordering and targeted evaluation metrics.
Video

Using complete syntactic and semantic analysis in NLP tasks (ABBYY, short presentation)
Anatoly Starostin

The main objective of this presentation is to demonstrate how detailed descriptions and modelling of a large number of natural language phenomena can solve complex NLP problems which can hardly be solved with shallow linguistic or statistical methods. All principles will be demonstrated based on ABBYY Compreno, a technology on which ABBYY has been working for the past 15 years. Originally the technology was geared towards machine translation, and it is now being successfully applied to MT tasks (the technology is used to translate texts from English into Russian and from Russian into English).

Text detection in images (HP labs, short presentation)
Natalia Vassilieva

Text detection in images is an important component for a wide range of applications. The most obvious one is Optical Character Recognition (OCR). It can be used to digitize the content of paper documents, to automate the annotation and indexing of multimedia documents, to provide computerized aid for the visually impaired, and for many other purposes. Common OCR tools, which are designed to detect and recognize paragraphs of text, perform well on scanned documents but fail to recognize text in drawings, charts and images of natural scenes when applied to the whole image. Other algorithms for text detection are required for these types of images.
This talk will address key approaches to the problem of text detection for various types of images and showcase text detection for charts and screenshots. We will discuss the script dependency of the approaches under study: which of them are capable of performing equally well for Latin, Cyrillic or Chinese?
Video