|Sep 12, Su||Sep 13, Mo||Sep 14, Tu||Sep 15, We||Sep 16, Th||Sep 17, Fr||Sep 18, Sa|
Ricardo Baeza-Yates, Yahoo! Research
The Web continues to grow and evolve very fast, changing our daily lives. This activity represents the collaborative work of the millions of institutions and people that contribute content to the Web as well as
the one billion people that use it. In this ocean of hyperlinked
data there is explicit and implicit information and knowledge.
Web Mining is the task of analyzing this data and extracting information and knowledge for many different purposes. The data comes in three main flavors: content (text, images, etc.), structure (hyperlinks) and usage (navigation, queries, etc.), which call for different techniques such as text, graph or log mining. Each case reflects the wisdom of some group of people that can be used to make the Web better; one example is user-generated tags on Web 2.0 sites.
The tutorial covers (a) the main concepts behind Web mining, the different data found on the Web and typical applications; (b) the mining process: data collection, data cleaning, data warehousing and data analysis, including crawling in the case of content mining, and privacy issues in the case of usage mining; (c) the main techniques used for the different data types; and (d) use cases of the three types: content, structure and usage mining, ranging from Web site design to search engines.
Stefan Rüger, The Open University
At its very core multimedia information retrieval means the process of searching for and finding
multimedia documents; the corresponding research field is concerned with building the best
possible multimedia search engines. The intriguing bit here is that the query itself can be a
This course examines the full matrix of a variety of query modes versus document types. The course discusses techniques and common approaches to facilitate multimedia search engines: metadata driven retrieval; piggyback text retrieval where automated processes create text surrogates for multimedia; automated image annotation; content-based retrieval. The latter is studied in great depth looking at features and distances, and how to effectively combine them for efficient retrieval, to a point where the participants have the ingredients and recipe in their hands for building their own multimedia search engines. Supporting users in their resource discovery mission when hunting for multimedia material is not a technological indexing problem alone. We look at interactive ways of engaging with repositories through browsing and relevance feedback, roping in geographical context, and providing visual summaries for videos. The course emphasises state-of-the-art research in the area of multimedia information retrieval, which gives an indication of the research and development trends and, thereby, a glimpse of the future world.
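As a rough illustration of the "features and distances" idea mentioned above, the following minimal sketch ranks a toy image collection by a weighted sum of per-feature Euclidean distances. The feature names, vectors, weights and file names are all made up for the example, and the weighted sum is only one of many possible ways to combine distances; it is not taken from the course material.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combined_distance(query, doc, weights):
    """Weighted sum of per-feature distances (an assumed combination rule)."""
    return sum(w * euclidean(query[name], doc[name]) for name, w in weights.items())

def rank(query, collection, weights):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(collection,
                  key=lambda doc_id: combined_distance(query, collection[doc_id], weights))

# Hypothetical toy features: a 3-bin colour histogram and a 2-value texture descriptor.
collection = {
    "sunset.jpg": {"colour": [0.9, 0.4, 0.1], "texture": [0.2, 0.1]},
    "forest.jpg": {"colour": [0.1, 0.8, 0.2], "texture": [0.7, 0.6]},
    "beach.jpg":  {"colour": [0.8, 0.7, 0.4], "texture": [0.3, 0.2]},
}
query = {"colour": [0.9, 0.5, 0.2], "texture": [0.2, 0.2]}
weights = {"colour": 0.7, "texture": 0.3}  # assumed relative importance

ranking = rank(query, collection, weights)
print(ranking)
```

In a query-by-example setting the query features would be extracted from an example image in the same way as the collection features; here they are simply written out by hand.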
Mounia Lalmas, University of Glasgow
Documents usually have content and structure. The content refers to the text of the document, whereas the structure refers to how a document is logically organized. An increasingly common way to encode the structure is through the use of a mark-up language. Nowadays, the most widely used mark-up language for representing structure is the eXtensible Mark-up Language (XML). XML can be used to provide focused access to documents, i.e. returning XML elements, such as sections and paragraphs, instead of whole documents in response to a query. Such focused strategies are of particular benefit for information repositories containing long documents, or documents covering a wide variety of topics, where users are directed to the most relevant content within a document. The increased adoption of XML to represent document structure requires the development of tools to effectively access documents marked up in XML. This course provides a detailed description of the query languages, indexing strategies, ranking algorithms and presentation scenarios developed to access XML documents. Major advances in XML information retrieval were seen from 2002 as a result of INEX, the Initiative for Evaluation of XML Retrieval. INEX, also described in this course, provided test sets for evaluating XML information retrieval effectiveness. Many of the developments and results described in this course were investigated within INEX.
Reference: M. Lalmas. XML Retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services, Vol. 1, No. 1, Pages 1-111, Morgan & Claypool Publishers, 2009. (a pre-production PDF version)
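The focused access described in the XML retrieval abstract above, returning elements rather than whole documents, can be sketched with Python's standard `xml.etree.ElementTree` module. The sample document, the choice of retrievable tags and the length-normalised term-count score are assumptions made for this illustration; they are not an INEX method or a ranking algorithm from the course.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<article>
  <title>Search engines</title>
  <section><title>Crawling</title>
    <p>A crawler downloads pages from the web.</p></section>
  <section><title>Ranking</title>
    <p>Ranking orders XML elements by relevance to the query.</p>
    <p>Link analysis can also help ranking.</p></section>
</article>"""

def element_text(elem):
    """All text inside an element, lower-cased and whitespace-normalised."""
    return " ".join(" ".join(elem.itertext()).lower().split())

def score(elem, query_terms):
    """Fraction of the element's tokens that are query terms (toy scoring)."""
    tokens = element_text(elem).replace(".", " ").split()
    if not tokens:
        return 0.0
    return sum(tokens.count(t) for t in query_terms) / len(tokens)

def focused_search(xml_string, query, tags=("section", "p")):
    """Rank individual elements, not whole documents, against the query."""
    root = ET.fromstring(xml_string)
    terms = query.lower().split()
    scored = [(score(e, terms), e) for e in root.iter() if e.tag in tags]
    scored = [(s, e) for s, e in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(e.tag, element_text(e), round(s, 3)) for s, e in scored]

results = focused_search(DOC, "link analysis")
for tag, text, s in results:
    print(tag, s, text)
```

Because the score is normalised by element length, the short paragraph that is entirely about the query outranks the longer section that contains it, which is the behaviour focused retrieval aims for.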
Alexander Troussov, IBM Ireland
Recent developments in Web 2.0 and Cyberinfrastructure technologies create massive computer-mediated networks, where the nodes might be people as well as “non-human agents” such as documents, datasets, analytic tools, and concepts. These networks are becoming more and more “multidimensional”. Search exploits these links and becomes personal, collaborative and social. Network models are able to aggregate heterogeneous information, and graph-based methods provide clear intuition and elegant mathematics for mining such models. The course will provide a review of modern graph-based methods, including methods of stochastic physics and the clustering approaches needed to analyse the structure of complex networks exhibiting high clustering (such as networks of friendships between individuals). We will present applications of these methods to the mining of large volumes of heterogeneous information, and we will demonstrate how to make these methods aware of the dimensions of networks where people are involved, including the social, semantic, and activity management dimensions.
The research area of Distributed Information Retrieval (DIR) provides techniques that help to integrate multiple searchable resources into a single federated resource and provide direct access to them through a single system. A DIR system can access Deep Web resources through their search interfaces without crawling them. It also does not need to maintain a complete index of all the federated collections, while retrieval results are always consistent and up-to-date. Moreover, it leverages the strengths of dedicated search engines by forwarding user queries directly to them. The proposed course will give the background and motivation for DIR research. The main DIR architectures will be presented and discussed, namely the broker-based architecture, DIR over peer-to-peer networks and the Open Archives Initiative. However, the main focus of the tutorial will be the broker-based architecture and its main phases.
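A broker of the kind described above can be sketched in a few lines: it picks promising resources, forwards the query to their search interfaces and merges the returned lists. Everything below is a simplified assumption for illustration: the resource descriptions are plain term sets, the engines are stand-in functions returning fixed `(document, score)` lists, and merging divides each score by the top score of its own list. It is not the method taught in the course.

```python
def select_resources(query, descriptions, k=2):
    """Pick up to k resources whose description shares terms with the query."""
    terms = set(query.lower().split())
    overlap = {name: len(terms & vocab) for name, vocab in descriptions.items()}
    ranked = sorted(overlap, key=lambda name: overlap[name], reverse=True)
    return [name for name in ranked if overlap[name] > 0][:k]

def merge_results(result_lists):
    """Make scores comparable by dividing by each list's top score, then merge."""
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        top = max(score for _, score in results) or 1.0
        merged.extend((doc, name, score / top) for doc, score in results)
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged

def broker_search(query, descriptions, engines, k=2):
    """Select resources, forward the query to them, and merge their answers."""
    selected = select_resources(query, descriptions, k)
    return merge_results({name: engines[name](query) for name in selected})

# Hypothetical resources: term-set descriptions and fake search interfaces.
descriptions = {
    "news": {"election", "football", "weather"},
    "library": {"retrieval", "xml", "indexing"},
    "patents": {"retrieval", "engine", "patent"},
}
engines = {
    "news": lambda q: [("n1", 3.0)],
    "library": lambda q: [("lib-7", 12.0), ("lib-2", 6.0)],
    "patents": lambda q: [("pat-4", 0.8), ("pat-9", 0.6)],
}

results = broker_search("xml retrieval", descriptions, engines)
for doc, source, score in results:
    print(doc, source, round(score, 2))
```

The broker never indexes the underlying collections itself; it only needs a compact description of each resource and access to its search interface, which is why results stay as fresh as the engines that produce them.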
Alexei Sokirko, Evgeniy Soloviev, Yandex
As search engine users get lazier, search queries become shorter and muddier. There is no sense in forcing users to ask only accurate queries, so the problem of query expansion and reformulation is becoming more and more urgent. In principle, query reformulation without semantic loss is often impossible, which is why methods of finding near-synonyms that are sensitive to the query context should be employed. We show that very simple methods of synonym mining, applied to very large corpora such as the Russian Internet, are quite effective. The paper argues for the importance of query logs and anchor texts as comparatively new linguistic resources. We discuss the ways in which query expansions improve search engine relevance, or degrade it even when they seem to fit the query context.
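To make the idea of context-sensitive expansion concrete, here is a minimal sketch in which a near-synonym is added only when the rest of the query contains a word that licenses it. The synonym table is hand-written and entirely hypothetical; the abstract does not say how the mined synonyms are stored or applied, so this is only one plausible shape for such a component.

```python
# Hypothetical mined data: term -> list of (near-synonym, licensing context words).
# An empty context set means the synonym is considered safe in any query.
SYNONYMS = {
    "jaguar": [
        ("panthera onca", {"animal", "habitat", "zoo"}),
        ("jaguar cars", {"price", "dealer", "engine"}),
    ],
    "cheap": [("inexpensive", set())],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return, for each query term, the term plus the synonyms its context allows."""
    terms = query.lower().split()
    expanded = []
    for term in terms:
        context = set(terms) - {term}  # the other words of the query
        alternatives = [term]
        for candidate, required in synonyms.get(term, []):
            if not required or required & context:
                alternatives.append(candidate)
        expanded.append(alternatives)
    return expanded

print(expand_query("cheap jaguar dealer"))
print(expand_query("jaguar habitat"))
```

Each inner list can be read as an OR-group in the rewritten query; a term with no licensed synonym is passed through unchanged.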
Katja Filippova, Google
Google's mission is to "organize the world's information and make it universally accessible and useful". In the first place this implies understanding and processing the vast amounts of natural language data available on the web -- news, (video-) blogs, books, forums -- all kinds of text and speech in many languages. This talk consists of two parts: The first part will be an overview of a variety of NLP problems solved at Google on a daily basis, such as machine translation, speech recognition, information extraction. In the second part I will consider the task of text summarization in more detail and will present a graph-based method of multi-document news summarization and a way of summarizing video content by looking at users' comments.
To be held at the Parnas entertainment complex (ul. Karla Marksa 67B). Starts at 19:00.
To be held at the 100 Ruchyev night club (ul. Kirova 5). Starts at 19:00.
For all questions related to the school, please write to school[at]romip[dot]ru.