Computer-based Content Analysis – Text HWS11
 

 

 


Title: Computer-based Content Analysis – Text
Type of course: Lecture and practical exercises
Level: Bachelor, Ph.D.
ECTS points: 6
SWS: 2
Language: English
Time and place of lectures: Thursdays, 13:45–15:15, Schloss Ehrenhof West - EW 163
Lecturers: Cäcilia Zirn, Johannes Knopp
Assisted computer pool times: Wednesdays, 15:00–17:00, PI-Pool


Attention: The room has changed. From now on, the lecture will take place in Schloss Ehrenhof West - EW 163.

New: Assisted Computer Pool time. See above.

First appointment:

    • 08.09.2011 (lecture)

Prerequisites:

    • Foundations of linear algebra and probability theory (high school level)
    • Computer skills sufficient to become familiar with complex applications quickly

Grading is based on:

    • Implementation of a project
    • Final presentation
    • Report (~ 15 pages)

Attendance Modalities (new!):

    • Lecture part: attendance *voluntary*
    • Project presentation part: attendance *mandatory*

For more details, attend the first lecture on Thursday, 08.09.2011.


Content of the Lecture

The course presents methods for the computer-assisted automatic analysis of digital documents as a basis for further quantitative content analyses in the social and cultural sciences.

We begin by presenting some of the analyses that computational linguistics can offer the social and cultural sciences, using the software GATE. This is followed by a short programming course in Python, which introduces a more flexible way of preprocessing texts and covers access to text data through web crawling and the conversion of different file formats. Before the break, more advanced methods for text classification and clustering are presented, along with further tools that can be used. In the second part of the course, participants present their own project work to each other.
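
As a rough illustration of the kind of text preprocessing and term weighting the Python and information retrieval sessions deal with, the following minimal sketch tokenizes a few documents with a regular expression and computes tf-idf weights using only the Python standard library. The function names, the sample sentences, and the particular tf-idf variant (relative term frequency times log(N / df)) are assumptions chosen for illustration; this is not the course's own code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common variant; other weighting schemes exist.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents, invented for this example.
docs = [
    "The parliament debated the budget.",
    "The budget was approved by the parliament.",
    "Students presented their project work.",
]
weights = tf_idf(docs)

# Print the three highest-weighted terms of the first document.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(weight, 3))
```

Terms that occur in only one document (such as "debated") receive a higher weight than terms shared across documents (such as "the"), which is what makes the weights useful as input for retrieval, classification, or clustering.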


Dates and Topics

Each entry gives the date and topic, followed by the linked material and exercises (PDF).

Introduction

    • 08.09.: Overview & Goals; Introduction to Named Entity Recognition & GATE
      Material and exercises: NER/Gate
    • 15.09.: Regular Expressions & JAPE
      Material and exercises: Regular Expressions; assignment-RegEx; CourtneyLove_Speech.txt; solution.txt (updated)

Programming with Python

    • 22.09.: Introduction to Python
      Material and exercises: Python Intro; Assignment
    • 29.09.: Introduction to Python II
      Material and exercises: Python Intro II; Assignment: Instructions and Data (zip); code template; solution (zip)
    • 06.10.: Text preprocessing with NLTK
      Material and exercises: NLTK; Assignment: Instructions; web page
    • 13.10.: Crawling Websites & Document Conversion
      Material and exercises: Crawling; crawler.py; docment_conversion; speech_pdf2txt_converter_sceleton.py

Diving into Theory & Tools

    • 20.10.: Information Retrieval
      Material and exercises: IR; Assignment: IR; tfidf.zip
    • 27.10.: Text Classification & Machine Learning
      Material and exercises: Rapidminer; Machine Learning; Rapid Miner Processes
    • 03.11.: Project Assignments
      Material and exercises: Project Proposals

Project Work

    • 10.11. & 17.11.: Project Time without Lectures
    • 24.11.: Presentations
      Felix Lorenz; Christopher Markert (Slides)
    • 01.12.: Presentations
      Linda Gierich & Judith Klingenstein; Seyhan Özkan; Markus Baumann
    • 08.12.: Presentations
      Simone Krug; Yipeng Liu; Matthias Haber


Reading recommendations

    • Social Media Mining of the Icelandic Blogosphere (application of NLP methods, NLTK)
    • Automated Discovery and Analysis of Social Networks from Threaded Discussions (application of NLP methods)


Exercises

We will hand out (non-mandatory) exercises that will help you understand the technology and methods presented. We strongly suggest that you take the time to work on them. In our experience, hands-on exercises make it much easier to follow a course like this.