4th Russian Summer School in Information Retrieval
September 13-18, 2010, Voronezh

Organizers

ROMIP: Russian Information Retrieval Evaluation Seminar

Voronezh State University

Golden sponsors

Yandex

Google

Silver sponsor

HP

Bronze sponsors

Microsoft Research

SKB Kontur

Rambler

Media partners

Videolectures.Net

Magazine PC World

Portal All Voronezh

Newspaper Poisk

IT events

IT Terra

HackDay

School Program

Time         Sep 12, Su    Sep 13, Mo      Sep 14, Tu  Sep 15, We  Sep 16, Th  Sep 17, Fr      Sep 18, Sa
9.00-10.30                 registration    XMLIR       DIR         DIR         DIR
10.30-11.00                                break       break       break       break           excursion
11.00-12.30                Google lecture  WDM         WDM         YSC         DIR
12.30-13.30                lunch           lunch       lunch       lunch       lunch
13.30-15.00                XMLIR           WDM         XMLIR       XMLIR       Yandex lecture
15.15-16.45                MMIR            MMIR        MMIR        MMIR        MMIR
16.45-17.15  registration  break           break       break       break       break
17.15-18.45                GMSS            GMSS        GMSS        GMSS        GMSS
after 19.00                welcome party                                       RuSSIR party    departure

Web Data Mining (WDM)

Ricardo Baeza-Yates, Yahoo! Research

The Web continues to grow and evolve very fast, changing our daily lives. This activity represents the collaborative work of the millions of institutions and people that contribute content to the Web as well as the one billion people that use it. In this ocean of hyperlinked data there is explicit and implicit information and knowledge.
Web Mining is the task of analyzing this data and extracting information and knowledge for many different purposes. The data comes in three main flavors: content (text, images, etc.), structure (hyperlinks) and usage (navigation, queries, etc.), implying different techniques such as text, graph or log mining. Each case reflects the wisdom of some group of people that can be used to make the Web better; one example is user-generated tags on Web 2.0 sites.
The tutorial covers (a) the main concepts behind Web mining, the different kinds of data found on the Web, and typical applications; (b) the mining process: data collection, data cleaning, data warehousing and data analysis, including crawling in the case of content mining, and privacy issues in the case of usage mining; (c) the main techniques used for the different data types; and (d) use cases of the three types: content, structure and usage mining, ranging from Web site design to search engines.

Slides, video.

Multimedia Information Retrieval (MMIR)

Stefan Rüger, The Open University

At its very core multimedia information retrieval means the process of searching for and finding multimedia documents; the corresponding research field is concerned with building the best possible multimedia search engines. The intriguing bit here is that the query itself can be a multimedia excerpt.
This course examines the full matrix of a variety of query modes versus document types. The course discusses techniques and common approaches to facilitate multimedia search engines: metadata driven retrieval; piggyback text retrieval where automated processes create text surrogates for multimedia; automated image annotation; content-based retrieval. The latter is studied in great depth looking at features and distances, and how to effectively combine them for efficient retrieval, to a point where the participants have the ingredients and recipe in their hands for building their own multimedia search engines. Supporting users in their resource discovery mission when hunting for multimedia material is not a technological indexing problem alone. We look at interactive ways of engaging with repositories through browsing and relevance feedback, roping in geographical context, and providing visual summaries for videos. The course emphasises state-of-the-art research in the area of multimedia information retrieval, which gives an indication of the research and development trends and, thereby, a glimpse of the future world.

Slides: part 1, part 2, part 3, part 4. Book excerpts. Video.

XML Information Retrieval (XMLIR)

Mounia Lalmas, University of Glasgow

Documents usually have content and structure. The content refers to the text of the document, whereas the structure refers to how a document is logically organized. An increasingly common way to encode the structure is through the use of a mark-up language. Nowadays, the most widely used mark-up language for representing structure is the eXtensible Mark-up Language (XML). XML can be used to provide focused access to documents, i.e. returning XML elements, such as sections and paragraphs, instead of whole documents in response to a query. Such focused strategies are of particular benefit for information repositories containing long documents, or documents covering a wide variety of topics, where users are directed to the most relevant content within a document. The increased adoption of XML to represent document structure requires the development of tools to effectively access documents marked up in XML. This course provides a detailed description of the query languages, indexing strategies, ranking algorithms, and presentation scenarios developed to access XML documents. Major advances in XML information retrieval were seen from 2002 as a result of INEX, the Initiative for Evaluation of XML Retrieval. INEX, also described in this course, provided test sets for evaluating XML information retrieval effectiveness. Many of the developments and results described in this course were investigated within INEX.

Slides, video.

Reference: M. Lalmas. XML Retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services, Vol. 1, No. 1, Pages 1-111, Morgan & Claypool Publishers, 2009. (a pre-production PDF version)

Graph-based Methods for Social Search (GMSS)

Alexander Troussov, IBM Ireland

Recent developments in Web 2.0 and Cyberinfrastructure technologies create massive computer-mediated networks, where the nodes might be people as well as “non-human agents” such as documents, datasets, analytic tools, and concepts. These networks are becoming more and more “multidimensional”. Search exploits these links and becomes personal, collaborative, and social. Network models are capable of aggregating heterogeneous information, and graph-based methods provide clear intuition and elegant mathematics for mining such models. The course will provide a review of modern graph-based methods, including methods of stochastic physics and clustering approaches needed to analyse the structure of complex networks exhibiting high clustering (such as networks of friendships between individuals). We will present applications of these methods to mining large volumes of heterogeneous information, and we will demonstrate how to make these methods aware of the dimensions of networks where people are involved, including the social, semantic, and activity-management dimensions.

Slides: part 1, part 2, part 3. Video.

Distributed Information Retrieval (DIR)

Fabio Crestani & Ilya Markov, University of Lugano

The research area of Distributed Information Retrieval (DIR) provides techniques that help to integrate multiple searchable resources into a single federated resource and provide direct access to them through a single system. A DIR system can access Deep Web resources through their search interfaces without crawling them. It also does not need to maintain a complete index of all the federated collections, while retrieval results are always consistent and up-to-date. Moreover, it leverages the strengths of dedicated search engines by forwarding user queries directly to them. The proposed course will give the background and motivation for DIR research. The main DIR architectures will be presented and discussed, namely, broker-based architecture, DIR over peer-to-peer networks and Open Archive Initiatives. However, the main focus of the tutorial will be the broker-based architecture and its main phases.

Slides, video.

Query expansion based on linguistic evidence (Yandex lecture)

Alexei Sokirko, Evgeniy Soloviev, Yandex

As search engine users get lazier, search queries become shorter and muddier. There is no sense in making users ask only accurate queries, so the problem of query expansion and reformulation is becoming more and more urgent. In principle, query reformulation without semantic losses is often impossible, which is why methods of finding near-synonyms that are sensitive to the query context should be employed. We show that very simple methods of synonym mining, applied to very large corpora such as the Russian Internet, are quite effective. The paper argues for the importance of query logs and anchor texts as comparatively new linguistic resources. We discuss the ways query expansions improve search engine relevance, or degrade it even when they seem to fit the query context.

Slides, video.

NLP at Google (Google lecture)

Katja Filippova, Google

Google's mission is to "organize the world's information and make it universally accessible and useful". In the first place this implies understanding and processing the vast amounts of natural language data available on the web -- news, (video-) blogs, books, forums -- all kinds of text and speech in many languages. This talk consists of two parts: The first part will be an overview of a variety of NLP problems solved at Google on a daily basis, such as machine translation, speech recognition, information extraction. In the second part I will consider the task of text summarization in more detail and will present a graph-based method of multi-document news summarization and a way of summarizing video content by looking at users' comments.

Slides, video.

Welcome party

Starts at 7 pm at Parnassus club (Karl Marx st. 67b).

RuSSIR party

Starts at 7 pm at 100 ruchiev (100 brooks) club (Kirov st. 5).

Contacts

Please send all inquiries to school[at]romip[dot]ru.