2nd Russian Summer School in Information Retrieval
September 1-5, 2008, Taganrog


ROMIP: Russian Information Retrieval Evaluation Seminar; Taganrog Technological Institute of Southern Federal University; Russian Foundation for Basic Research

Golden sponsors

Yandex, Rambler

Bronze sponsors

Google, HP Labs

School Program

The school program includes four main courses of five lectures each and one short course of two lectures. The preliminary schedule is as follows:


Days: Aug 31 (Su), Sep 1 (Mo), Sep 2 (Tu), Sep 3 (We), Sep 4 (Th), Sep 5 (Fr), Sep 6 (Sa)
9.00-10.30: TMIFE, TMIFE, TMIFE, TMIFE, TMIFE, city tour
10.30-11.00: break, break, break, break, break
12.30-13.30: lunch, lunch, lunch, lunch, lunch
13.30-15.00: IRSM, IRSM, Yandex
15.15-16.45: registration @ Taganrog Hotel
16.45-17.15: break, break, break, break, break
after 19.00: welcome

Data Structures in IR (DSIR)

Maxim Gubin (Ask.com)

The course presents an overview of theoretical and practical approaches to the implementation of information retrieval systems. It focuses mainly on classic large-scale search problems but also includes a brief description of structures applicable to other IR tasks. The course covers a wide range of questions, from a high-level theoretical view of data structure design to specific implementation issues. It also covers important practical topics that are poorly covered in the available educational literature, such as parallelization, lossy compression techniques, and relevant features of modern hardware.
The course includes a discussion of well-known open-source and commercial system implementations. Some of the examples considered are based on the lecturer's practical experience of IR system development projects.
The course will interest students who want to learn the details of IR system implementation or to tailor existing systems to a specific data scale or IR task. It was presented at internal seminars for employees at Ask.com in 2007 and 2008.

Slides: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.

Hands-on Natural Language Processing for Information Access Applications (NLPIAA)

Horacio Saggion (University of Sheffield)

This course focuses on the development of practical applications that use natural language technology. The course will introduce NLP concepts, which will be reinforced through the development, testing, and evaluation of technology in demonstration sessions. Applications to be studied in the course include Information Extraction, Question Answering, and Text Summarization. None of the applications will be studied in detail; the main objective of the course is to promote the use of NLP and to facilitate access to available technology that can be adapted to specific application domains, so that students go home motivated to develop their own tools and systems.
Detailed content:
Overview of Natural Language Processing technologies, including part-of-speech tagging, named entity recognition, parsing, semantic interpretation, and coreference resolution.
Natural Language Technology for Information Access: existing systems and projects combining advanced NLP will be presented (e.g. the Cubreporter project).
Information Extraction: named entity recognition, relation extraction, event extraction, rule-based and machine learning approaches, evaluation, MUC.
Question Answering: QA architecture, questions and answers, passage selection, answer identification, evaluation, TREC/QA.
Text Summarization: sentence extraction, superficial features for sentence extraction, feature combination, multi-document summarization, evaluation, Document Understanding Conference.
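
As a rough illustration of the sentence-extraction approach listed under Text Summarization, the sketch below scores each sentence with two superficial features (average word frequency and position in the text) and combines them with a weighted sum. It is not taken from the course materials: the function names, the choice of features, and the weights are all assumptions made for this example, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens only; no stop-word removal or stemming.
    return re.findall(r"[a-z]+", sentence.lower())


def summarize(text, n=1, w_freq=1.0, w_pos=0.5):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    scored = []
    for i, s in enumerate(sentences):
        toks = tokenize(s)
        # Feature 1: average document frequency of the sentence's words.
        f_freq = sum(freq[t] for t in toks) / len(toks) if toks else 0.0
        # Feature 2: position, with earlier sentences scoring higher.
        f_pos = 1.0 / (i + 1)
        # Feature combination: a simple weighted sum (weights are arbitrary).
        scored.append((w_freq * f_freq + w_pos * f_pos, i))
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:n]
    return [sentences[i] for _, i in sorted(top, key=lambda x: x[1])]


text = ("Search engines index documents. "
        "Search engines rank documents for queries. "
        "Cats sleep.")
summary = summarize(text, n=1)
```

A real system would typically add more features (cue phrases, title overlap, sentence length) and tune the weights on evaluation data rather than fixing them by hand.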

PPT: Introduction, Question Answering, Summarization.
PDF: Introduction, Question Answering, Summarization.
References, GATE.

Content Based Image Retrieval (CBIR)

Natalia Vassilieva (HP Labs)

This course will give an overview of the main tasks and methods in the field of content-based image retrieval (CBIR).
First, we will address how image retrieval methods are applied in real life: which real-world problems these methods can solve and where people can use them.
Then we will consider the traditional architecture of CBIR systems and the main problems that the developers of such a system have to solve. This includes a discussion of image preprocessing and feature extraction, multidimensional indexing, user interface design, and data visualization. Special attention will be given to low-level feature processing (color, texture, and shape) and the construction of feature vectors. A classification of known feature vectors will be presented, and the results of an experimental comparison will be discussed for some of them.
In the final part of the course we will briefly review some existing CBIR systems (both commercial and research ones) and discuss their advantages and disadvantages.
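
To make the idea of a low-level color feature vector concrete, here is a minimal sketch of one common choice: a quantized RGB color histogram, compared with histogram intersection. This is an illustrative assumption and not necessarily a method covered in the lectures; the function names, the number of bins, and the toy pixel data are hypothetical.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Build a normalized color histogram from (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    # Normalize so that images of different sizes are comparable.
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms; 1 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy "images" given as flat pixel lists (hypothetical data).
red = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue = [(0, 0, 255)] * 4

h_red = color_histogram(red)
h_mostly_red = color_histogram(mostly_red)
h_blue = color_histogram(blue)

sim_close = histogram_intersection(h_red, h_mostly_red)
sim_far = histogram_intersection(h_red, h_blue)
```

With these toy inputs the mostly red image is more similar to the red one than the blue image is. A histogram ignores where colors appear in the image, which is one reason CBIR systems usually combine it with texture and shape features.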

PPT: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
PDF: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
Web links

Text Mining, Information and Fact Extraction (TMIFE)

Marie-Francine Moens (Katholieke Universiteit Leuven)

Text mining and information extraction (IE) are increasingly in demand in various communities (medical informatics, security, blog and news analysis, business information analysis, legal informatics, etc.). Still, today it is a somewhat fragmented subfield of human language technologies and information retrieval, in which the themes of (often forgotten) old-style pattern-based IE and more recent machine learning techniques, as applied in medical informatics, opinion mining and blog extraction, are scattered across various conferences and sessions (computational linguistics, artificial intelligence, machine learning, Web technologies, semantic computing).
The aim of this tutorial is to explain important technologies, from handcrafted patterns to learning, and to focus especially on how they blend together to suit the needs of current information systems that retrieve or mine information, or that make decisions and solve problems based on the extracted information. This unified perspective also yields valuable insights into the role of traditional pipelined system architectures and more recent probabilistic inference techniques.
Probabilistic extraction, by which text is translated into a variety of semantic labels, integrates perfectly with probabilistic retrieval models that naturally combine surface text features and semantic labels in ranking computations, among which are the popular language retrieval models. Finally, information extraction alleviates the knowledge acquisition bottleneck in expert systems and question answering systems that operate in more restricted subject domains.
We conclude with some pointers to new challenges, among which is the recognition of complex semantic concepts (e.g., narrative scripts, or issues such as medical malpractice or competitiveness) in texts.
Because of the reconciling aspects of the many techniques and application domains, the tutorial will attract students and researchers with different backgrounds.

PPT: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
PDF: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.

IR in Social Media (IRSM)

(short course)

Matthew Hurst, Alek Kolcz, Alexey Maykov (Live Labs)

We define Social Media as user-generated content on the Web. Social Media includes, but is not limited to, blogs, Usenet, and forums. The first part of the tutorial is fairly technical and hands-on. We will show the specifics of data acquisition from blogs, microblogs, and Usenet. We will present our existing data sets and show how to use them. In the second part we will talk about the specifics of using the obtained data. We will cover keyword extraction and other data mining techniques.
Spam has become a major problem for Internet users and affects web search as well as most forms of communication, including email, IM, and discussion forums. The recent popularity of blogging has spurred a surge in blog spam, with many flavors including splogs, comment spam, trackback spam, and ping spam. In this talk we will discuss the differences and commonalities between combating spam in the blog medium and combating other types of spam. The exposition will be supported by results and examples based on real data.

Slides (PDF)

Topic of the Day at Yandex Blog Search: How it Works

(Yandex lecture)

Andrei Mishchenko, Anton Volnukhin (Yandex) (in Russian)

Slides: PPT, PDF.

"Higher Education and IT Industry: Mutual Expectations and Ways of Collaboration" (panel discussion)

The discussion will feature representatives of both universities and industry. (in Russian)


Please send all inquiries to school[at]romip[dot]ru.