Web information extraction and retrieval

Subject description

Content of the course:

This course will cover the following topics:

 

  • Information Retrieval and Web Search

  • Basic Concepts of Information Retrieval

  • Information Retrieval Models

  • Relevance Feedback

  • Evaluation Measures

  • Text and Web Page Pre-Processing

  • Inverted Index and Its Compression

  • Latent Semantic Indexing

  • Web Search

  • Meta-Search: Combining Multiple Rankings

 

  • Web Crawling

  • A Basic Crawler Algorithm

  • Implementation Issues

  • Universal Crawlers

  • Focused Crawlers

  • Topical Crawlers

 

  • Structured Data Extraction

  • Wrapper Induction

  • Instance-Based Wrapper Learning

  • Automatic Wrapper Generation

  • String Matching and Tree Matching

  • Multiple Alignment

  • Building DOM Trees

  • Extraction Based on a Single List Page or Multiple Pages

 

  • Information Integration

  • Schema-Level Matching

  • Domain and Instance-Level Matching

  • Combining Similarities

  • 1:m Match

  • Integration of Web Query Interfaces

  • Constructing a Unified Global Query Interface

     

  • Opinion Mining and Sentiment Analysis

  • Document Sentiment Classification

  • Sentence Subjectivity and Sentiment Classification

  • Opinion Lexicon Expansion

  • Aspect-Based Opinion Mining

  • Opinion Search and Retrieval
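
As a small illustration of two of the topics listed above (text pre-processing and the inverted index), the following is a minimal sketch in Python. It is not taken from the course materials: the function names `tokenize`, `build_inverted_index`, and `boolean_and`, as well as the toy document collection, are assumptions made for this example. It assumes very simple pre-processing (lowercasing and splitting on non-alphanumeric characters) and leaves out stemming, stop-word removal, and index compression.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection, invented for illustration only.
docs = {
    1: "Web crawling and web search",
    2: "Structured data extraction from web pages",
    3: "Opinion mining and sentiment analysis",
}
index = build_inverted_index(docs)
print(boolean_and(index, "web search"))  # ids of documents containing both terms
```

The index stores only document ids, which is enough for Boolean retrieval; ranked retrieval models would also need term frequencies in the postings.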

 

Objectives and competences

The main objective of this course is to teach students how to develop programs for web search (covering both the surface web and the deep web) and for extracting structured data from both static and dynamic web pages. Besides the basic concepts of web search and retrieval, students will learn the relevant techniques and approaches. After successfully completing the course, students will be able to develop programs for automatic web search and structured data extraction from web pages, including search and extraction from online social media.

Teaching and learning methods

Lectures, seminars, homework, oral presentations, and project work.

Expected study results

After successful completion of the module, students will be able to:

  • summarize the most important approaches and techniques for searching and extracting data from the web,

  • select the approaches and techniques most suitable for individual problems in web information extraction and retrieval,

  • develop applications for data acquisition and analysis,

  • construct new algorithms for web data search and extraction,

  • explain the behavior and time complexity of specific web search algorithms,

  • integrate and employ different open-source solutions from the field.

Basic sources and literature

  1. Bing Liu: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications), Springer, August 2013

  2. Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search, 2nd Edition, ACM Press Books, 2010

University of Ljubljana, Faculty of Electrical Engineering, Tržaška cesta 25, 1000 Ljubljana

E: dekanat@fe.uni-lj.si, T: 01 4768 411