II Russian Summer School in Information Retrieval
September 1-5, 2008, Taganrog

Organizers

ROMIP: Russian Information Retrieval Evaluation Seminar
Taganrog Institute of Technology, Southern Federal University
Russian Foundation for Basic Research

Gold sponsors

Yandex, Rambler

Bronze sponsors

Google, HP Labs

Program

The main program of the school consists of four courses (five lectures each) and one short course (two lectures). Preliminary schedule:

 

Time        | Aug 31, Su                    | Sep 1, Mo | Sep 2, Tu        | Sep 3, We      | Sep 4, Th    | Sep 5, Fr | Sep 6, Sa
9.00-10.30  | -                             | TMIFE     | TMIFE            | TMIFE          | TMIFE        | TMIFE     | city tour and departure
10.30-11.00 | -                             | break     | break            | break          | break        | break     | -
11.00-12.30 | -                             | CBIR      | CBIR             | CBIR           | CBIR         | CBIR      | -
12.30-13.30 | -                             | lunch     | lunch            | lunch          | lunch        | lunch     | -
13.30-15.00 | -                             | IRSM      | IRSM             | Yandex lecture | YSC          | YSC       | -
15.15-16.45 | registration @ Taganrog Hotel | DSIR      | DSIR             | DSIR           | DSIR         | NLPIAA    | -
16.45-17.15 | -                             | break     | break            | break          | break        | break     | -
17.15-18.45 | -                             | DSIR      | NLPIAA           | NLPIAA         | NLPIAA       | NLPIAA    | -
after 19.00 | welcome party                 | -         | panel discussion | -              | RuSSIR party | -         | -

Data Structures in IR (DSIR)

Maxim Gubin (Ask.com)

The course presents an overview of theoretical and practical approaches to implementing information retrieval systems. It focuses mainly on classic large-scale search problems but also briefly describes structures applicable to other IR tasks. The course covers a wide range of questions, from a high-level theoretical view of data structure design to specific implementation issues. It also addresses important practical problems that are poorly covered in the available educational literature, such as parallelization, lossy compression techniques, and relevant modern hardware features.
The course includes a discussion of well-known open-source and commercial system implementations. Some of the examples are based on the lecturer’s practical experience in IR system development projects.
The course may interest students who want to know the details of implementing an IR system or tailoring existing systems to a specific data scale or IR task. It was presented at internal seminars for Ask.com employees in 2007 and 2008.

Slides: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
Video

Hands-on Natural Language Processing for Information Access Applications (NLPIAA)

Horacio Saggion (University of Sheffield)

This course focuses on the development of practical applications that use natural language technology. The course will introduce NLP concepts, which will be reinforced by the development, testing, and evaluation of technology in demonstration sessions. Applications to be studied in the course include Information Extraction, Question Answering, and Text Summarization. None of the applications will be studied in detail; the main objective of the course is to promote the use of NLP and to facilitate access to available technology that can be adapted to specific application domains, so that students go home motivated to develop their own tools and systems.
Detailed content:
– Overview of Natural Language Processing technologies including part-of-speech tagging, named entity recognition, parsing, semantic interpretation and coreference resolution.
– Natural Language Technology for Information Access: existing systems and projects combining advanced NLP will be presented (e.g. Cubreporter project).
– Information Extraction: named entity recognition, relation extraction, event extraction, rule-based and machine learning approaches, evaluation, MUC.
– Question Answering: QA architecture, questions and answers, passage selection, answer identification, evaluation, TREC/QA.
– Text Summarization: sentence extraction, superficial features for sentence extraction, feature combination, multi-document summarization, evaluation, Document Understanding Conference.

Slides:
PPT: Introduction, Question Answering, Summarization.
PDF: Introduction, Question Answering, Summarization.
References, GATE.
Video

Content-Based Image Retrieval (CBIR)

Natalia Vassilieva (HP Labs)

The course introduces the main aspects of content-based image retrieval (CBIR) and gives a general picture of this multifaceted and rapidly developing field.
We will first address the applicability of image retrieval methods: which practical problems they can solve and where they can be useful to people.
Next we will look at the traditional architecture of image retrieval systems and the main problems their developers face. The course will discuss image preprocessing and parameterization (construction of feature vectors), multidimensional indexing, user interface design, and data visualization. Special attention will be paid to ways of describing low-level image characteristics (color, texture, and the shape of objects). We will give a classification of known feature vectors and discuss the results of experimental comparisons of some of them.
In the final part of the course we will review the best-known image retrieval systems to date (both commercial and those created in research projects) and discuss their strengths and weaknesses.

Slides:
PPT: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
PDF: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
Web links
Video

Text Mining, Information and Fact Extraction (TMIFE)

Marie-Francine Moens (Katholieke Universiteit Leuven)

Text mining and information extraction (IE) are increasingly requested technologies in various communities (medical informatics, security, blog and news analysis, business information analysis, legal informatics, etc.). Still, today IE is a somewhat fragmented subfield of human language technologies and information retrieval, where the themes of (often forgotten) old-style pattern-based IE and more recent machine learning techniques, as applied in medical informatics, opinion mining and blog extraction, are scattered across various conferences and sessions (computational linguistics, artificial intelligence, machine learning, Web technologies, semantic computing).
The aim of this tutorial is to explain important technologies, from handcrafted patterns to learning, with a particular focus on how they blend together to suit the needs of current information systems that retrieve or mine information, or that make decisions and solve problems based on the extracted information. This unified perspective also yields valuable insights into the role of traditional pipelined system architectures and more recent probabilistic inference techniques.
Probabilistic extraction, by which text is translated into a variety of semantic labels, integrates naturally with probabilistic retrieval models that combine surface text features and semantic labels in ranking computations, among which are the popular language retrieval models. Finally, information extraction alleviates the knowledge acquisition bottleneck in expert and question answering system technologies that operate in more restricted subject domains.
We conclude with some pointers to new challenges, among which is the recognition of complex semantic concepts (e.g., narrative scripts, or issues such as medical malpractice or competitiveness) in texts.
Because of the reconciling aspects of the many techniques and application domains, the tutorial will attract students and researchers with different backgrounds.

Slides:
PPT: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
PDF: Lecture 1, Lecture 2, Lecture 3, Lecture 4, Lecture 5.
Video

IR in Social Media (IRSM)

(short course)

Matthew Hurst, Alek Kolcz, Alexey Maykov (Live Labs)

We define social media as user-generated content on the Web. Social media includes, but is not limited to, blogs, Usenet, and forums. The first part of the tutorial is fairly technical and hands-on. We will show the specifics of data acquisition from blogs, microblogs, and Usenet. We will present our existing data sets and show how to use them. In the second part we will talk about the specifics of using the obtained data. We will cover keyword extraction and other data mining techniques.
Spam has become a major problem for Internet users, affecting web search as well as most forms of communication, including email, IM, and discussion forums. The recent popularity of blogging has spurred a surge in blog spam, with many flavors including splogs, comment spam, trackback spam and ping spam. In this talk we will discuss the differences and commonalities between combating spam in the blog medium and combating other types of spam. The exposition will be supported by results and examples based on real data.

Slides (PDF)
Video

Topics of the Day in Blog Search: How It Works

(Yandex lecture)

Andrey Mishchenko, Anton Volnukhin (Yandex)

About a year ago, Yandex Blog Search (http://blogs.yandex.ru/) began identifying the most discussed topics in the blogosphere and reporting them to users. Large data volumes, combined with requirements on processing speed, make this task both difficult and interesting. In this presentation we will describe the main technical details of the ideas and algorithms used and answer any questions.

Slides: PPT, PDF.
Video

"Higher Education and the IT Industry: Mutual Expectations and Forms of Cooperation" (panel discussion)

A discussion with representatives of universities and IT companies.

Contacts

For all questions related to the school, please write to school[at]romip[dot]ru.