Schedule

The registration will take place on Sunday, August 14, from 18:00 till 19:30 at the University hostel, Kapitanskaya str. 3, "Primorskaya" metro station, and on Monday, August 15, from 8:30 till 9:30 at the school venue, Mendeleevskaya line 5.

Days: Su, 14.08; Mo, 15.08; Tu, 16.08; We, 17.08; Th, 18.08; Fr, 19.08

08.30-09.30: Registration (Mo)
09.30-11.00: MMQI (B) and SocM (A) in parallel, Mo-Fr
11.30-13.00: Plenary (A); Yandex (A); ColIR (B) and T2TGen (A); ColIR (B) and T2TGen (A)
14.00-15.30: IREval (A) and SentA (B) in parallel, Mo-Fr
16.00-17.30: QLM (B) and TopK (A) in parallel, Mo-Fr
18.00-19.30: Registr. (Su); Conference; Mail.Ru; Conference
Evening: Reception, 20:00-23:00; Boat trip, 20:00-22:00; Sport party, 21:00-24:00; RuSSIR party, 21:00-24:00

A - auditorium.
B - room 89.
Please note that the schedule is preliminary and is subject to change.

Knowledge Harvesting from Web Sources (Plenary)

Gerhard Weikum, Plenary session

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and YAGO-NAGA, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data.

This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting. It also addresses issues of querying knowledge bases and ranking answers. The content of this tutorial is based on prior tutorials jointly given with Hady Lauw, Ralf Schenkel, Fabian Suchanek, and Martin Theobald.

Slides: part 1, part 2. Video: part 1, part 2

Advances in Information Retrieval Evaluation (IREval)

Ben Carterette, Evangelos Kanoulas, and Emine Yilmaz

We are proposing a course titled “Advances in Information Retrieval Evaluation”. The course will be presented by Ben Carterette (University of Delaware), Evangelos Kanoulas (University of Sheffield), and Emine Yilmaz (Microsoft Research). The goal of the course is to provide attendees with a comprehensive overview of the latest advances in information retrieval evaluation and to discuss the current challenges in the area. The course will focus on two main themes: low-cost evaluation and evaluation measures for complex retrieval scenarios. A number of topics will be covered, including background on the traditional evaluation paradigm, alternatives to pooling, statistical inference of evaluation metrics, inference of relevance judgments, techniques to test the reliability of the evaluation, on-line evaluation techniques, user-model-based metrics, and metrics for novelty, diversity and multi-query session evaluation.

The course should be of interest to a wide range of attendees. Graduate and PhD students will come away with a solid understanding of how low-cost evaluation methods can be applied to construct inexpensive test collections and how to evaluate new IR technology under complex retrieval scenarios, while those with intermediate knowledge will gain deeper insights and further understand the risks and gains of low-cost evaluation, and the open problem of extending the traditional evaluation paradigm to different setups. Attendees should have a basic knowledge of the traditional evaluation framework (Cranfield) and metrics (such as average precision), along with some basic knowledge of probability theory and statistics. More advanced concepts will be explained during the tutorial.
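For readers who want to check the assumed background, average precision can be sketched in a few lines. This is a minimal illustration rather than course material: the function name, the binary relevance labels, and the optional total_relevant parameter are assumptions made for the example.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision of one ranked list of binary relevance labels.

    ranked_relevance: iterable of 0/1 (or False/True), best-ranked first.
    total_relevant: number of relevant documents in the collection; if
    omitted, only the relevant documents present in the list are counted
    (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.833.
example_ap = average_precision([1, 0, 1, 0])
```

Passing total_relevant penalizes relevant documents that the ranking never retrieved, which matters when judgments are incomplete.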

Slides: part 1, part 2, part 3, part 4, part 5. Video: part 1, part 2

Top-K Processing for Search and Information Discovery in Social Applications (TopK)

Sihem Amer-Yahia and Julia Stoyanovich

The advent of the Social Web presents new opportunities for search and discovery of personalized relevant information. Social Web users build persistent online personas, making rich semantic information about themselves readily available; this information may be leveraged to improve the online experience, making users more productive, more creative, and better entertained online.

This course covers state-of-the-art techniques for search and information discovery in social applications. We start with an overview of top-K processing – methods residing at the intersection of DB and IR research and practice, and commonly used for ranked retrieval (1.5 hrs). We go on to discuss the application of top-K to social search, paying particular attention to the new semantic considerations and performance tradeoffs arising in collaborative tagging sites (3 hrs). Finally, we show how top-K processing can be adapted to novel information discovery tasks that go beyond search. We discuss applications such as expert finding, question routing, and building and ranking item bundles (3 hrs). Throughout the course, we will maintain a focus on both efficiency and effectiveness, evaluated with user studies and A/B testing, and will keep an eye on open problems and exciting new research directions.
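As an illustration of what top-K processing can look like, the sketch below follows the general idea of the well-known threshold algorithm for sorted score lists. The abstract does not name a specific method, so this choice is an assumption, as are the function name, the sum as aggregation function, and the convention that an item missing from a list scores 0 there (scores are assumed non-negative).

```python
import heapq


def threshold_top_k(sorted_lists, k):
    """Return the k items with the highest summed score as (score, item) pairs.

    sorted_lists: lists of (item, score) pairs, each sorted by descending
    score. Items absent from a list are treated as scoring 0 in it.
    """
    if k <= 0:
        return []
    lookups = [dict(lst) for lst in sorted_lists]  # random access per list
    seen = set()
    heap = []  # min-heap holding the best k (total, item) pairs so far
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(max_depth):
        last_scores = []
        for lst in sorted_lists:
            if depth >= len(lst):
                last_scores.append(0)  # exhausted list: unseen items score 0
                continue
            item, score = lst[depth]  # sorted access
            last_scores.append(score)
            if item in seen:
                continue
            seen.add(item)
            total = sum(lookup.get(item, 0) for lookup in lookups)
            if len(heap) < k:
                heapq.heappush(heap, (total, item))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, item))
        # No unseen item can score more than the sum of the last-seen scores.
        threshold = sum(last_scores)
        if len(heap) == k and heap[0][0] >= threshold:
            break
    return sorted(heap, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical per-source scores, for example tag match and network proximity.
by_tags = [("a", 9), ("b", 8), ("c", 1)]
by_network = [("b", 7), ("c", 6), ("a", 2)]
best = threshold_top_k([by_tags, by_network], 1)
```

The early stop is the point of the technique: in the example the loop ends after two rounds of sorted access, without reading the lists to the end.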

Slides: part 1, part 2, part 3, part 4, part 5. Video: part 1, part 2, part 3, part 4, part 5

Multimedia queries and indexing (MMQI)

Stefan Rueger

Multimedia databases are amongst the most difficult databases to query and index meaningfully. The main reason for this is that the objects cannot readily be analyzed as to their information content and semantic meaning. Hence, the state of the art for multimedia databases has been indexing and querying based on metadata.

This course looks at the area where the query itself can be a multimedia excerpt: For example, when you walk around in an unknown place and stumble across an interesting landmark, would it not be great if you could just take a picture with your mobile phone and send it to a service that finds a similar picture in a landmark image database and tells you more about the building, and about its significance for that matter?

The course discusses underlying techniques and common approaches to facilitate multimedia indexing and search: metadata; piggy-back text access where automated processes create text surrogates for multimedia; automated image annotation; content-based image queries and indexing. The latter is studied in great depth looking at features and distances, and how to effectively combine them for efficient database access, to a point where the participants have the ingredients and recipe in their hands for building their own multimedia search engines.
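To make "features and distances" concrete, here is a toy sketch of a content-based query: each image is reduced to a normalized intensity histogram (the feature), and the database image with the smallest L1 distance to the query is returned. The function names, the flat lists of 0-255 pixel values, and the choice of histogram and distance are all assumptions for illustration, not the methods taught in the course.

```python
def histogram(pixels, bins=4, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    if not pixels:
        raise ValueError("an image needs at least one pixel")
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [count / len(pixels) for count in counts]


def l1_distance(u, v):
    """Manhattan distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_image(query_pixels, database):
    """Name of the database image whose histogram is closest to the query."""
    query_feature = histogram(query_pixels)
    return min(
        sorted(database),  # sorted for a deterministic tie-break
        key=lambda name: l1_distance(query_feature, histogram(database[name])),
    )


# Hypothetical four-pixel "images".
images = {"dark": [0, 10, 20, 30], "bright": [200, 220, 240, 250]}
match = nearest_image([5, 15, 25, 35], images)  # "dark"
```

A real system would precompute and index the features instead of recomputing them per query; the linear scan here only shows the feature-plus-distance idea.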

Slides: part 1, part 2, part 3, part 4, part 5. Book excerpts. Video: part 1, part 2, part 3, part 4, part 5

Mining query logs to improve web search engines' operations (QLM)

Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Web Search Engines (WSEs) log detailed information regarding every action performed by users. Usage information represents a very important source of knowledge for the optimization of both efficiency and effectiveness of WSEs. The primary focus of this course is to introduce students to the discipline of query log mining by showing its foundations, and by analyzing the basic algorithms and techniques used to extract and to exploit useful knowledge from this rich information source.

An introductory part will present the main results regarding the statistical analysis of web usage activities stored in logs. The second part of the course will focus on how knowledge extracted from logs can be used to enhance several aspects of WSEs. The third part will review research works aimed at enhancing WSEs' efficiency. Finally, we will go through some of the most challenging future research issues related to query log analysis and its role in the Web of Data. Throughout the classes we will refer to the paper: Fabrizio Silvestri: Mining Query Logs: Turning Search Usage Data into Knowledge. Foundations and Trends in Information Retrieval 4(1-2): 1-174 (2010). We will also present results from recent research papers that will be distributed during classes.

Slides: part 1, part 2, part 3, part 4, part 5, part 6, part 7. Query Log Mining at Facebook. Video: part 1, part 2, part 3, part 4, part 5

Sentiment Strength Detection in the Social Web (SentA)

Mike Thelwall

The social web -- including social network sites, blogs and Twitter -- is full of text-based interactions between individual web users. Many of these texts contain expressions of emotion, from friendship to opinions about products, or relate to current events. This information is potentially of value for social researchers and for businesses, and this has led to the recent growth of automatic sentiment analysis methods and of companies that offer online market intelligence services.

This course will offer an introduction to automatic sentiment analysis in the social web and equip participants to apply this kind of method in appropriate contexts and to understand a range of different approaches for sentiment analysis. The course will also discuss the particular problems associated with sentiment analysis in the social web as well as strategies to take advantage of features like emoticons to improve the automatic prediction of online sentiment.

Slides: part 1, part 2, part 3, part 4, part 5. Video: part 1, part 2, part 3, part 4, part 5

An Introduction to Social Mining (SocM)

Vladimir Gorovoy and Yana Volkovich

There is an obvious research and commercial need to understand social media. Every day, millions of users produce an enormous amount of data that describes their lifestyle: their movements, their preferences, and their social choices. The research community has been challenged by the prospect of studying such data. There are various ways to mine this data. The popular and classic approaches are based on statistical analysis of various properties of graphs of friendships or interactions, and on contrasting the findings with different characteristics of the social participants. We will start the course with a presentation of these approaches.

Today, however, more and more studies are moving towards new social directions, which are based on interdisciplinary studies. Such interdisciplinarity involves not only different research fields (e.g., computer science, mathematics, physics, psychology, sociology, cultural anthropology, etc.), but also industry, government and academia. In this course we choose to focus on two new directions. The first one is social media engagement, the phenomenon of being captivated and motivated by social media. The second is social innovation, e.g. new strategies to strengthen civil society and to achieve greater proximity between administrations and citizens through social media. This part will be shaped by the outcomes of the First International Workshop on Social Media Engagement (SoME 2011) and the First International Workshop on Social Innovation and Social Media (SISoM 2011).

Slides: part 1, part 2, part 3, part 4, part 5. Practical Task for SocM course. Video: part 1, part 2, part 3

Monolingual Text-to-Text Generation (T2TGen)

Katja Filippova, Short course

Traditionally, natural language generation assumes a non-textual, symbolic representation of information as input, which needs to be converted into a coherent text through planning, aggregation and surface realization. However, recent years have seen a shift in interest to generation tasks where the input is plain text. This is not surprising because the ability to perform monolingual text-to-text generation is an important step in many Natural Language Processing problems. For example, when generating novel text at the sentence-level, abstractive summarization systems may need to compress sentences or fuse multiple sentences together; the evaluation of translation systems may require additional paraphrases to use as reference gold standards; and answers to questions may need to be generated automatically from extracted sentences.

Text-to-text generation, being an umbrella term, covers a variety of subtopics. Those covered in the present course are paraphrasing, sentence compression, sentence fusion and discourse-level generation. The course goals are to provide an accessible introduction to each of those and present the students with an overview of methods and approaches developed in the last decade.

Slides, video

Collaborative IR (ColIR)

Chirag Shah, Short course

The course will introduce the student to theories, methodologies, and tools that focus on IR in collaboration. The student will have an opportunity to learn about the social aspect of IR with a focus on collaborative IR situations, systems, and evaluation techniques.

Traditionally, IR is considered an individual pursuit, and not surprisingly, the majority of tools, techniques, and models developed for addressing information need, retrieval, and usage have focused on single users. The assumption that information seekers are independent and that the IR problem is an individual one has often been challenged in the recent past. This course will introduce such works to the students, with an emphasis on understanding models and systems that support collaborative search or browsing.

Specifically, the course will (1) outline the research and latest developments in the field of collaborative IR, (2) list the challenges for designing and evaluating collaborative IR systems, and (3) show how traditional single user IR models and systems could be mapped to those for collaborative IR. This will be achieved through introduction to appropriate literature, algorithms and interfaces that facilitate collaborative IR, and methodologies for studying and evaluating them. Thus, the course will offer a balance between theoretical and practical elements of collaborative IR.

Slides. Video: part 1, part 2

Entity-oriented Search Result Diversification (Yandex)

Andrey Plakhov

In my talk, I'm going to discuss the approach to search results diversification currently used at Yandex, which is named Spectrum. A large number of queries sent to Yandex are highly ambiguous and mention a specific entity or a class of entities. A query might refer to several objects of the same name (for example, [apple] might mean either a fruit or a consumer electronics company). More importantly, a query might represent an underlying intent from a large spectrum: e.g. someone searching for [pizza] might want either a restaurant offering delivery service, or a recipe, or even images of pizza.

Spectrum is based on analyzing click-through statistics. The system first identifies objects in queries. Each object is then classified into one or more categories, e.g. "cities", "humans", "cars", "medicines" etc. Based on the object's category, our mined knowledge about typical information needs related to the object, and the relevant pages available on the Web, Spectrum determines the share of users looking for this object in relation to each of the potential intents. The search engine then uses this information to rank its results for ambiguous queries using the probabilistic model of SERP perception. The target ranking is the one that maximizes the user's chance to find a relevant answer.
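The idea of ranking by the user's chance of finding a relevant answer can be illustrated with a generic greedy intent-aware heuristic. This is not Spectrum's actual model, which the abstract does not specify. The intent shares, the per-document satisfaction probabilities, the independence assumption between documents, and all names below are hypothetical.

```python
def greedy_diversify(intent_share, doc_relevance, k):
    """Greedily pick k documents to raise the chance that a user is satisfied.

    intent_share: dict intent -> share of users with that intent.
    doc_relevance: dict doc -> dict intent -> probability that the document
    satisfies a user with that intent (assumed independent across documents).
    """
    unsatisfied = dict(intent_share)  # user mass per intent still unserved
    remaining = set(doc_relevance)
    ranking = []
    while remaining and len(ranking) < k:
        # Marginal gain: extra share of users satisfied by adding the document.
        best = max(
            sorted(remaining),  # sorted for a deterministic tie-break
            key=lambda doc: sum(
                unsatisfied.get(intent, 0.0) * p
                for intent, p in doc_relevance[doc].items()
            ),
        )
        ranking.append(best)
        remaining.remove(best)
        for intent, p in doc_relevance[best].items():
            if intent in unsatisfied:
                unsatisfied[intent] *= 1 - p
    return ranking


# Hypothetical numbers for the [pizza] example.
shares = {"delivery": 0.6, "recipe": 0.3, "images": 0.1}
docs = {
    "delivery_site_1": {"delivery": 0.9},
    "delivery_site_2": {"delivery": 0.8},
    "recipe_page": {"recipe": 0.9},
    "image_gallery": {"images": 0.9},
}
top3 = greedy_diversify(shares, docs, 3)
# ["delivery_site_1", "recipe_page", "image_gallery"]
```

Once the first delivery site has served most delivery-seekers, the second one adds little, so the recipe page and the image gallery are ranked ahead of it.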

Slides, video

Fighting Web Spam: Example of the Real-world Document Classifier System (Mail.Ru)

Andrey L. Kalinin

The majority of textbooks provide a universal solution for any information retrieval task: get the tagged corpora, extract features, choose the evaluation metric and train the model with some fancy machine learning algorithm. It seems that it’s easy. But the devil's in the details. How to build corpora? Which features to extract?

The lecture will cover the step-by-step process of solving one of the hardest problems in web information retrieval: building an anti-spam system. We will discuss common misunderstandings about web spam and define the spam type. Also, we will talk about technical, linguistic and behavioral spam features available for modern web search engines.

The lecture will show you a practical approach rather than an academic one.

Slides, video