Datenanalyse in Physik und Astronomie

Data Project

To obtain a graded course certificate, students must complete a data project and submit it as a term paper (Hausarbeit). The details will be discussed during the lecture and then listed here. The project is designed to let students demonstrate what they learned in class and apply their skills to a real-world data set.

Each student may choose any of the data sets described in the following sections. I only provide pointers to the data. I am always open to alternative ideas for a data project, so contact me! I do not pose a particular question or demand a particular analysis. Students are invited to investigate the data themselves and come up with an interesting question or analysis. For each data set I will make suggestions to spark your imagination.

Deadline for submission: 30 September 2024

General guidelines

Look at all the data sets. Try to understand what each one contains. Find and check out complementary information. Does it sound interesting to you? Only then pick the project.

After picking the project, download all the available data and all the documentation that comes with it. Start by looking at the data directly. Figure out the meaning of every element.

Now you can begin to 'play' with the data. Plot some of it. Try to visualize associations. Perhaps map it. Note anything that catches your attention. Use this phase to come up with an initial 'question' to address. Focus on this topic and analyze it. Also follow some subsequent paths and try to extend your results logically. Always document carefully what you are doing: keep a log of your analysis steps.
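The exploration phase can be sketched in a few lines of Python. The example below is only an illustration, not a prescribed workflow: it quantifies one association with the Pearson correlation coefficient and records the step in a simple analysis log. The helper names (`pearson_r`, `log_step`) and the temperature and ride numbers are invented for this sketch and do not come from any of the project data sets.

```python
import math
from datetime import datetime


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A minimal analysis log: one time-stamped line per analysis step.
analysis_log = []


def log_step(description):
    stamp = datetime.now().isoformat(timespec="seconds")
    analysis_log.append(f"{stamp}  {description}")


# Invented illustration values (NOT real measurements).
temperature = [2.0, 8.0, 15.0, 21.0, 27.0]   # daily mean temperature
rides = [110, 260, 480, 650, 800]            # rides on the same days

r = pearson_r(temperature, rides)
log_step(f"Pearson r between temperature and rides: {r:.3f}")
print("\n".join(analysis_log))
```

A strong correlation found this way is only a starting point for a question; it says nothing about causes, and the log entry makes it easy to retrace later how the number was obtained.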

If you use complementary data or resources, cite them properly and keep a reference list.

Required format

  • To complete the project you will have to
    • complete some mathematical/numerical work,
    • visualize the data and your analysis,
    • write a report.
  • These three parts don’t have to be equally weighted. You can focus on visualization or on quantitative analysis, but you cannot omit any of the three parts entirely.
  • Choose the right format. For example, if you would like to focus on visualization, I suggest you use interactive plotting libraries and publish/submit your project as a stand-alone website where users can interact with the plots and displays. If you decide to do this, then:
    • Create a webpage with your result and submit its URL OR send the complete contents of the page and I will upload it as a subpage to the class webpage.
    • ALSO submit a written summary detailing what you did and your results.
  • Submit publication-ready work! Your result/report should be in a form that you are comfortable presenting to a wider audience.
  • Publish in English (or German).

Grading Criteria

The project will be graded according to the following criteria:

  • The analysis is genuine, i.e. it is your own work and neither a collaborative effort nor plagiarized.
  • It follows formal standards, i.e. it applies proper techniques, uses correct assumptions, and contains no technical errors. Common scientific standards are followed.
  • The result is non-trivial, i.e. it does not provide an obvious answer to a trivial question.
  • Heavy workload is rewarded (to some degree).
  • It should be a stand-alone result.
  • It should address a wider audience but may also provide more technical information for the informed reader.

Additional Remarks

You may be planning to apply methods that we did not cover in class. This is OK. You are not limited to what was taught in class. The course should have prepared you to pick up other techniques fairly quickly. If you do so, summarize the applied technique and give references.

Project 1 - Hubway Data Visualization Challenge

URL: http://hubwaydatachallenge.org/

“HUBWAY” means the Hubway public bike share system, operating in Boston, Brookline, Cambridge, and Somerville, Massachusetts.

This problem focuses on data visualization rather than explicitly on prediction or machine learning (although no one stops you from applying those). The official data challenge is now closed, but it rewarded entries in the following categories:

  • Best Data Narrative: tells the most compelling story
  • Most Insightful (or Inciteful): shakes up the current thinking about Hubway and bike share; reveals and provides the most intelligent and surprising learnings
  • Best Data Exploration Tool: allows the user to interact with the data and uncover their own insights and results
  • Most Artistic: visualizes the data in the most creative, different, and innovative way

From the webpage:

“In 2012, after Hubway’s first year of operations, we held our first Data Visualization Challenge. The last 5 years have seen Hubway triple in size and much has changed in terms of access, availability, routes, and system usage. Hubway trip data is now released publicly each month.”

“Where do Hubway users ride? When do they ride? How far do they go? Which stations are most popular? On what days of the week are most rides taken? How do user patterns differ between members and casual riders? How does weather affect usage? These and many other questions can be answered by the ride data.”

The Data

http://hubwaydatachallenge.org/

You can also access current bike sharing data at https://www.thehubway.com/system-data

“Access the Hubway System Data to find metrics on all trips taken – including trip duration, start and stop time, station name, and user type (casual or member) – to inform your visualizations and other creative projects.” Also use the related data (census, neighborhoods, bike facilities, elevation, etc.), packaged as the "Hack Day Treat" (100 MB zip).
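As a starting point for questions such as "On what days of the week are most rides taken?" and "How do user patterns differ between members and casual riders?", the following sketch counts rides and averages trip durations per weekday and user type. It is only an illustration under stated assumptions: the column names (`starttime`, `stoptime`, `usertype`), the timestamp layout and the four sample rows are invented here, so check the actual files and their documentation for the real layout.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Invented sample rows (NOT real Hubway data); column names are assumptions.
SAMPLE_CSV = """starttime,stoptime,usertype
2017-06-05 08:02:00,2017-06-05 08:17:30,member
2017-06-05 17:40:00,2017-06-05 18:05:00,member
2017-06-10 13:15:00,2017-06-10 14:00:00,casual
2017-06-11 11:00:00,2017-06-11 11:50:00,casual
"""


def summarize_trips(csv_file):
    """Return {(weekday, usertype): (number_of_rides, mean_duration_minutes)}."""
    durations = defaultdict(list)
    for row in csv.DictReader(csv_file):
        start = datetime.strptime(row["starttime"], TIME_FORMAT)
        stop = datetime.strptime(row["stoptime"], TIME_FORMAT)
        minutes = (stop - start).total_seconds() / 60.0
        # Group by the weekday on which the trip started and by user type.
        durations[(WEEKDAYS[start.weekday()], row["usertype"])].append(minutes)
    return {key: (len(vals), sum(vals) / len(vals))
            for key, vals in durations.items()}


summary = summarize_trips(io.StringIO(SAMPLE_CSV))
for (weekday, usertype), (n_rides, mean_minutes) in sorted(summary.items()):
    print(f"{weekday:<9} {usertype:<7} rides={n_rides} mean={mean_minutes:.1f} min")
```

To run this on a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with an opened file object, e.g. `open(path, newline="")`, and adapt the column names.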

You can view the public entries that were submitted to the challenge here: http://hubwaydatachallenge.org/. Use the presented projects to get an idea of what other people did and what you could do.

You are free to use available complementary data in your project. For example:

Boston Housing Data

https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Project 2 - MovieLens Dataset

GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://grouplens.org/datasets/movielens/). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.

Small: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

Full: 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Includes tag genome data with 14 million relevance scores across 1,100 tags. Last updated 9/2018.

Permalink: https://grouplens.org/datasets/movielens/latest/

Carefully read the README files that explain the data.
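A first look at the ratings could be sketched as follows: compute the mean rating per movie, but only for movies with a minimum number of ratings, so that a single enthusiastic vote does not dominate the ranking. This is a minimal illustration, not a suggested analysis. The column names (`userId`, `movieId`, `rating`, `timestamp`) are assumed here and should be verified against the README; the sample rows and the threshold `min_ratings` are invented.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows (NOT real MovieLens ratings); verify the column
# names against the README of the data set you download.
SAMPLE_RATINGS = """userId,movieId,rating,timestamp
1,10,4.0,1000
2,10,5.0,1001
3,10,3.0,1002
1,20,2.5,1003
2,20,3.5,1004
3,30,5.0,1005
"""


def mean_ratings(csv_file, min_ratings=2):
    """Return {movieId: mean rating} for movies with at least min_ratings ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        movie_id = int(row["movieId"])
        totals[movie_id] += float(row["rating"])
        counts[movie_id] += 1
    return {movie_id: totals[movie_id] / counts[movie_id]
            for movie_id in counts
            if counts[movie_id] >= min_ratings}


means = mean_ratings(io.StringIO(SAMPLE_RATINGS), min_ratings=2)
# Print movies from highest to lowest mean rating.
for movie_id, mean in sorted(means.items(), key=lambda item: -item[1]):
    print(f"movie {movie_id}: mean rating {mean:.2f}")
```

The choice of threshold is itself an analysis decision worth documenting: a low value keeps obscure movies with noisy averages, a high value restricts the result to popular titles.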

Alternative projects:

Data from the Telecom Italia Big Data Challenge

At the beginning of 2014, Telecom Italia, in collaboration with several international partners, launched the Telecom Italia Big Data Challenge. The contest made available to developers, designers and scientists a large dataset of 30+ kinds of data (mobile, weather, energy, etc.)

http://theodi.fbk.eu/openbigdata/#portfolioModal18

Data: https://dandelion.eu/datamine/open-big-data/

A Dandelion account is needed to access the data.

The World Database on Protected Areas (WDPA)

The WDPA is the most comprehensive global database of marine and terrestrial protected areas. It is updated monthly and is one of the key global biodiversity data sets, widely used by scientists, businesses, governments, international secretariats and others to inform planning, policy decisions and management.

https://www.protectedplanet.net/c/world-database-on-protected-areas

http://datasets.wri.org/dataset/64b69c0fb0834351bd6c0ceb3744c5ad

Global Power Plant Database

The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one's own analysis. Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. As of June 2018, the database includes around 28,500 power plants from 164 countries. It will be continuously updated as data becomes available. The most recent release of the Global Power Plant Database 1.1 includes the addition of two countries (China and Fiji), over 3,000 power plants, and nearly 1300 gigawatts of power capacity.

http://datasets.wri.org/dataset/globalpowerplantdatabase

Kaggle

Choose a suitable dataset from kaggle.com. Browse the available datasets and select an interesting one. Make sure that the data provides sufficient opportunity for some non-trivial analysis. Also consider combining multiple datasets in your analysis. Contact me if you are not sure whether the data you would choose is suitable.

https://www.kaggle.com/datasets

Check out what others did (see the kernels), for example:

Berlin Airbnb Data: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

World Happiness Report: https://www.kaggle.com/unsdsn/world-happiness (probably in combination with additional data)

European Soccer Database: https://www.kaggle.com/hugomathien/soccer

120 years of Olympic history: athletes and results: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Avocado Prices: https://www.kaggle.com/neuromusic/avocado-prices