Introduction to data science

Subject description

  1. Introduction to visual programming and data mining workflows. Data input, visualization, data selection and interactive data exploration. Scatterplot visualization, choice of projection.
  2. Classification. Classification trees. Confusion matrix. Scoring of classification models. Classification accuracy and AUC. Data sampling, training and test sets. Cross-validation. A glimpse into logistic regression, random forests, and SVM. Statistical comparison of classifiers.
  3. Regression. Linear and polynomial regression. Regularization. Effects of regularization on accuracy in training and test sets. Parameter search. Other regression techniques (random forests).
  4. Clustering. Hierarchical clustering. Exploratory data analysis with clustering and data projections. k-means clustering. DBSCAN clustering. Time and space complexity. Cluster scoring and selection of the number of clusters.
  5. Data projections. Principal component analysis. Multi-dimensional scaling. t-SNE.

  6. Analysis of unstructured data, such as images and sequences. Data embedding. Deep models.
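
As an illustration of the classification scoring topics listed above (confusion matrix, classification accuracy, AUC), the following is a minimal sketch in plain Python. It is not part of the course material, which uses Orange workflows; the function names, the 0.5 decision threshold, and the labels and scores are all made up for the example.

```python
def confusion_matrix(y_true, y_pred):
    """Count outcomes for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),  # true positives
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),  # true negatives
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),  # false positives
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),  # false negatives
    }


def classification_accuracy(y_true, y_pred):
    """Proportion of examples whose predicted label matches the true label."""
    cm = confusion_matrix(y_true, y_pred)
    return (cm["tp"] + cm["tn"]) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example is scored higher than a randomly chosen negative
    one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and classifier scores (illustrative only).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(y_true, y_pred))
print(round(classification_accuracy(y_true, y_pred), 3))
print(round(auc(y_true, scores), 3))
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positive examples against negative ones, which is why the two measures can disagree about which model is better.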

The subject is taught in programs

Objectives and competences

The course will familiarize graduate students with basic techniques in machine learning and data mining and will illustrate their utility on a range of case studies from biomedicine. Teaching will present data mining techniques at an intuitive level and will not venture into mathematical foundations. After completing the course, students should be able to gain insight into their own data, access and use key public bioinformatics databases, and collaborate creatively with statisticians and expert bioinformaticians on advanced data analysis projects.

Teaching and learning methods

This is a hands-on, workshop-style course. Students will learn about data mining procedures by designing data analysis workflows in the visual programming environment Orange (http://orange.biolab.si).

Expected study results

Knowledge and understanding: Understanding of basic data science methods and their utility in the analysis of biomedical data sets. Design of data mining workflows. Understanding of which type of data mining is appropriate for a specific data analysis problem.

Application: The course will be carried out as a hands-on tutorial; students will apply data mining procedures to real data sets. They will learn how to apply data analytics methods in their own research.

Reflection: Understanding of the basics of analytical thinking.

Transferable skills: Understanding and use of visual programming and data analysis workflows.

Basic sources and literature

Video tutorials for the Orange software package on YouTube (http://bit.ly/21E8Vt8).

Working lecture notes: Zupan B, Demšar J: Introduction to Data Science.

University of Ljubljana, Faculty of Electrical Engineering, Tržaška cesta 25, 1000 Ljubljana

E: dekanat@fe.uni-lj.si T: 01 4768 411