Datasets

Let’s play with some of the datasets the Unit for Data Science and Analytics is currently working with. Feel free to use these datasets as part of your research, coursework, or just for fun! 

Microplastics Image

This dataset is a collection of images of microplastics. Microplastics are small fragments of plastic (< 5 mm) that may harm human health and the environment.

Suggested dataset use: Image classification, image segmentation, or any other image processing tasks.

Microplastics Image dataset (zip)

 

HMIS Usage

This dataset is derived from the Homeless Management Information System (HMIS), a government-run database that collects client-level data on housing and services provided to homeless individuals and families. Each row is an individual, each column is a service or project available to them, and each cell counts how many times that individual attended that service or project.

Suggested dataset use: Unsupervised learning tasks.

HMIS Usage dataset (csv)
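
As a starting point for unsupervised learning, the row-per-individual, column-per-service layout described above can be clustered directly. The following is a minimal sketch using only the Python standard library. It assumes the first CSV column is a client identifier and the remaining columns are numeric counts; the column names and the tiny inline sample are made up for illustration and are not taken from the real file. The k-means routine is a plain Lloyd's algorithm chosen for simplicity, not a recommended method for this dataset.

```python
import csv
import io
import random


def load_counts(handle):
    """Read a count matrix from CSV.

    Assumed layout (not confirmed by the dataset): first column is a
    client ID, every other column is a per-service attendance count.
    """
    reader = csv.reader(handle)
    header = next(reader)
    services = header[1:]
    ids, rows = [], []
    for record in reader:
        ids.append(record[0])
        rows.append([float(value) for value in record[1:]])
    return ids, services, rows


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on lists of floats."""
    rng = random.Random(seed)
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sample in the assumed layout; replace with open("<your file>.csv").
SAMPLE = """client_id,shelter,outreach,meals
c1,10,0,1
c2,12,1,0
c3,0,8,9
c4,1,9,7
"""

ids, services, rows = load_counts(io.StringIO(SAMPLE))
labels, centroids = kmeans(rows, k=2)
for client, label in zip(ids, labels):
    print(client, "-> cluster", label)
```

On real data, consider normalizing each row (for example, to proportions of total visits) before clustering, so that heavy users do not dominate the distance calculation.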

 

Reddit

This dataset is an extract of the subreddit r/wallstreetbets from the website Reddit.com. It contains all of the non-deleted posts from January and February 2021.

Suggested dataset use: Natural language processing (NLP) tasks.

Reddit dataset
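
A common first NLP step is counting term frequencies across posts. The sketch below uses only the Python standard library and assumes the posts have already been loaded as a list of strings. The file format of the extract is not specified here, so the loading step is left out, and the three sample posts and the stopword list are invented placeholders rather than real data.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = frozenset({"the", "to", "a", "is", "and", "of"})


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


def top_terms(posts, n=3):
    """Return the n most frequent non-stopword tokens across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(t for t in tokenize(post) if t not in STOPWORDS)
    return counts.most_common(n)


# Placeholder posts for illustration only (not from the dataset).
posts = [
    "GME to the moon",
    "Holding GME and AMC",
    "Is AMC the next GME?",
]

print(top_terms(posts, n=2))
```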