Advanced Social Data Science

Course content

The objective of this course is to teach students how to leverage the data science toolbox for use in social science. We emphasize the use of new data sources associated with communication, behavior, transactions, etc., which are increasingly available through the web and by collection from the various devices we use. These new sources of structured and unstructured data allow for testing and validation of existing theories in social science as well as development of new ones. Performing these analyses, however, requires an ability to understand and apply methods from the computational sciences. We build on the foundational course in social data science to teach these fundamental skills.

We introduce students to the essentials of data structures and data structuring, and teach state-of-the-art methods for applying data science and machine learning techniques. We do this through practical examples that give students hands-on experience. We build on the knowledge from the basic Social Data Science course.

The first canonical data structure we introduce is network and relational data. This data type is ubiquitous when analyzing data from social media, cell phone communication, or physical meetings. The second data type is spatial data, which includes data on the shape and structure of shops, buildings, administrative boundaries, etc., but also personal data from GPS on smartphones, cars, and much more. The final data type is text data, which is present everywhere as documents, online discussions, etc. For each of the three data types we will teach various tools for working with them in practice.
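As a rough illustration of the three data types described above, the following minimal sketch uses only the Python standard library. The names, the toy edge list, the square polygon, and the example sentence are all hypothetical and chosen for illustration; the course itself may rely on dedicated libraries for each data type.

```python
from collections import Counter

# Network data: an undirected edge list of (hypothetical) people who exchanged messages.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

def node_degrees(edge_list):
    """Count how many edges touch each node."""
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

# Spatial data: a polygon stored as an ordered list of (x, y) vertices.
# Coordinates are assumed to be planar (projected), not latitude/longitude.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Text data: a bag-of-words count for one short document.
def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(text.lower().split())

degrees = node_degrees(edges)
area = polygon_area(square)
counts = bag_of_words("Data about data is metadata")
```

In this toy example "cat" has the highest degree, the square has area 4.0 in the assumed planar units, and the token "data" appears twice.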

We teach students a high level of applied machine learning. We will provide an in-depth review of the advantages and disadvantages of standard machine learning techniques, i.e. supervised machine learning (regression, classification) and unsupervised learning. In addition, we will teach tools from the frontier of applied social data science that leverage machine learning for causal inference.

The teaching is built around empirical examples: the course aims at developing good practices in data analysis, including thorough exploratory analysis, reliable collection and cleaning of data, visualization skills and statistical sensitivity analysis.

The course will emphasize a complete approach to working with data: from data collection, through data structuring (i.e. parsing, cleaning, transformation, and merging), to exploratory analysis and, finally, reporting of the results.

Education

MSc programme in Economics – elective course.

 

Learning outcome

After completing the course, the student should be able to:

Knowledge:

  • Account for the structure of complex networks and understand modeling of social relations based on network statistics like node degree and centrality measures.

  • Understand fundamental concepts in machine learning: model generalization, overfitting, loss functions, the bias-variance trade-off and cross-validation.

  • Account for various learning strategies, algorithms as well as approaches: clustering and unsupervised learning, supervised learning, semi-supervised learning, transfer learning, multi-task learning.

  • Define spatial data using shapes including points, lines and polygons and account for the choice of coordinate system.

  • Understand the potential of different representations of text: structured and unstructured, graph-based, and latent representations.
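To make the cross-validation concept from the list above concrete, here is a minimal sketch using only the Python standard library. The function names, the toy target values, and the predict-the-training-mean baseline are assumptions made for illustration, not the course's own material.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mse(y, k=5, seed=0):
    """Average held-out squared error of a predict-the-training-mean baseline."""
    folds = k_fold_indices(len(y), k, seed)
    fold_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held_out_set]
        prediction = mean(train)  # "fit" on the training folds only
        fold_errors.append(mean([(y[i] - prediction) ** 2 for i in held_out]))
    return mean(fold_errors)

# Hypothetical toy targets; every observation is held out exactly once.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
folds = k_fold_indices(len(y), k=5)
score = cross_validated_mse(y, k=5)
```

Because the prediction for each fold is computed without the held-out observations, the resulting score estimates out-of-sample error rather than in-sample fit.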

Skills:

  • Gather, structure, and prepare data for analysis.

  • Select an appropriate modeling approach for analyzing a given dataset: apply model selection, hyperparameter search and robust model validation. Analyze the statistical power of model parameters and the limits of the current training sample and choice of representation.

  • Extract reliable information from text data using supervised learning and techniques from natural language processing.

  • Structure geodata for analysis by manipulating shapes, computing local network structures and spatially combining various sources.

  • Communicate results using comprehensive statistics and modern visualization methods.
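The hyperparameter search and model validation mentioned in the skills above can be sketched as follows. This is a toy illustration under stated assumptions: a one-dimensional k-nearest-neighbour regressor written from scratch, a hypothetical dataset where the target is the square of the input, and a single train/validation split instead of a more robust scheme.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(t for _, t in nearest) / k

def validation_mse(train, valid, k):
    """Mean squared error on the validation set for a given k."""
    return sum((t - knn_predict(train, x, k)) ** 2 for x, t in valid) / len(valid)

def grid_search_k(train, valid, candidates):
    """Evaluate every candidate k and return the one with the lowest validation error."""
    scores = {k: validation_mse(train, valid, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Hypothetical data: target = x squared, split into training and validation points.
train = [(x, x * x) for x in (0, 2, 4, 6, 8, 10)]
valid = [(x, x * x) for x in (1, 3, 5, 7, 9)]

best_k, scores = grid_search_k(train, valid, candidates=[1, 2, 6])
```

On this toy data the search selects k = 2: k = 1 follows single training points too closely, while k = 6 averages over the whole training set and ignores the input entirely.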

Competencies:

  • Integrate theoretical and applied knowledge within the field of Data Science and formulate powerful research questions given an interesting dataset.

  • Combine learned methods to address research questions involving large scale social data and machine learning.

  • Choose the appropriate tools to increase performance of computation.

  • Critically evaluate the implications of results, taking into account model limitations and biases, and systematic noise introduced by data collection and sampling methods.  

Lectures. The main work will be exercises, done individually and in groups, which focus on applying methods.

Barabási, Albert-László. Network science. Web book available free at http://barabasi.com/networksciencebook/. Cambridge University Press, 2016.

Gimond, Manuel. Intro to GIS and Spatial Analysis. Web book available free at https://mgimond.github.io/Spatial/index.html. Preprint, 2017.

Jurafsky, Dan, and James H. Martin. Speech and language processing. Vol. 3. London: Pearson, 2014.

Bender, Emily M. "Linguistic fundamentals for natural language processing: 100 essentials from morphology and syntax." Synthesis Lectures on Human Language Technologies 6.3 (2013)

Farzindar, Atefeh, and Diana Inkpen. "Natural language processing for social media." Synthesis Lectures on Human Language Technologies 8.2 (2015)

Søgaard, Anders. "Semi-supervised learning and domain adaptation in natural language processing." Synthesis Lectures on Human Language Technologies 6.2 (2013)

Friedman, J., Hastie T., and R. Tibshirani. Elements of statistical learning. Second edition, 12th printing. Web book available free at https://web.stanford.edu/~hastie/ElemStatLearn/. Springer, 2017.

Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.

Athey, Susan and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 2017.

It is strongly recommended either to have taken the foundational course Social Data Science or to have completed a similar course elsewhere or through self-study. Specifically, we expect students to be familiar with programming in Python, in particular with modern approaches (i.e. Pandas for data structuring, Seaborn for visualization) and with gathering data through scraping. Moreover, the student should have basic knowledge of supervised machine learning techniques for regression and classification (e.g. Lasso, Random Forest). A basic understanding of linear algebra is strongly recommended.

Schedule:
4 hours of lectures combined with exercises once a week from week 6 to 21 (except holidays).

The overall schedule for the Master's programme can be seen at
https://intranet.ku.dk/ECONOMICS_MA/COURSES/COURSECATALOGUE-F18/Pages/default.aspx


Timetable and venue:
To see the time and location of lectures and exercise classes, please click the link(s) under "Se skema" (See schedule) on the right side of this page (E means Autumn, F means Spring). The lectures are shown in each link.

You can find similar information, partly in English, at
https://skema.ku.dk/ku1718/uk/module.htm
-Select Department: “2200-Økonomisk Institut” (and wait for the response)
-Select Module: “2200-F18; [Name of course]”
-Select Report Type: “List – Weekdays”
-Select Period: “Forår/Spring – Week 5-30”
-Press: “View Timetable”

ECTS
7.5 ECTS

Type of assessment
Written assignment, 3 weeks
Project exam. Working in groups of 3 to 4 participants is allowed. The plagiarism rules must be complied with; please be aware of the rules for co-written assignments.
The project paper must be written in English.

Aid
All aids allowed

Marking scale
7-point grading scale

Censorship form
No external censorship
The course can be selected for external assessment.

Criteria for exam assessment

Students are assessed on the extent to which they master the learning outcome for the course.

To receive the top grade, the student must, with no or only a few minor weaknesses, be able to demonstrate an excellent performance, displaying a high level of command of all aspects of the relevant material, and be able to make use of the knowledge, skills and competencies listed in the learning outcomes.

In particular, the student should be able to independently analyze new data sets using the tools and theories covered in the course. This includes constructing a VAR model for the data and discussing and testing the underlying assumptions, determining the cointegration properties, formulating and testing relevant hypotheses on the cointegrating relations and the short-term adjustment, and analyzing models for data integrated of order two.

Single subject courses (day)

  Category            Hours
  Lectures            42
  Preparation         112
  Class Instruction   28
  Exam                24
  Total               206