Super bizarre data sets you might not know exist

August 17, 2018

Knowledge is power. That's been true since even before Sir Francis Bacon wrote it down (in Latin) back in 1597. In today's age of readily accessible information, data is a commodity used by everyone from scientists researching cancer cures to fantasy football fanatics looking for an edge in their league. The internet is not only built on data, it's filled with unique data sets that are available to anyone with a keyboard and that cover a remarkable range of topics.

To celebrate data in all its wonderful forms, we put together this list of super bizarre data sets you might not know existed. Obviously, "super bizarre" is in the eye of the beholder, but these data sets span the spectrum from pop culture to public health and everything in between. To be included in the list, a data set had to be free and available to researchers and journalists, which eliminates a wide swath of data sets that are only accessible via a subscription or one-time payment.

Read on to explore the wonderful wide world of incredibly specific data.

Wine Quality Data Set

Wine fans can use this data set to confidently converse about all things Portuguese wine. The data set includes 12 attributes, such as fixed acidity, pH, and alcohol content, distilled from samples of red and white Vinho Verde wine from northern Portugal. The University of Minho in Portugal puts everything but the bouquet underneath your nose with this data set.

Amazon product data

This data set takes 142.8 million Amazon reviews and parses them into searchable details. Everything is broken down into consumable data sets, whether by category or just by product name. The reviews and metadata span nearly 20 years, from 1996 to 2014, and were put together by Julian McAuley of the University of California, San Diego.

Every available Reddit comment

Reddit can be a bizarre place in and of itself, but what happens when someone aggregates every comment that's ever been made on the platform? That's what the creator of this data set aimed to find out by compiling a collection that tracks over 1.7 billion comments. The comments are categorized by author, comment, score, subreddit, and more using Reddit's application programming interface (API). With so many comments on such a vast number of topics, anything could be hiding in this data set.

Million Song Dataset

Many people know the feeling of having a song on the tip of the tongue while the name just won't come to mind. To assist in the self-Shazam, the Million Song Dataset is a collection of features drawn from a million contemporary popular music tracks. Even though it doesn't include audio, it does break things down by features of the songs, and it comes with a community of smaller data sets that analyze lyric data and cover songs.

Bob Ross Elements by Episode

Television painting master Bob Ross calmed millions with his easygoing attitude about painting on PBS, and this data set from FiveThirtyEight analyzes the types of paintings Ross taught in each episode. Broken down by elements like trees, mountains, and water, the data set can be used by Bob Ross aficionados to create an accurate picture of the art teacher's work or by a novice painter looking for inspiration.

Speed Dating Experiment

Speed daters can start looking for love in all the right places, as Professors Ray Fisman and Sheena Iyengar of Columbia Business School have put together this data set. Utilizing data collected from 2002 to 2004, they have broken down the information into categories such as dating habits and the beliefs people found valuable in a mate.

Are dog size and intelligence linked?

Find out if a 60-pound dog is more or less intelligent than a heavier canine with this data set, which pits weight against the IQ of dogs. The project sets out to explore the correlation between dog size and intelligence and is based on research by Stanley Coren, professor of canine psychology at the University of British Columbia.

UFO Reports

The search for intelligent life in the universe gets a big upgrade with this data set, which tracks over 80,000 sightings from the National UFO Reporting Center over the last century. The data includes geolocation and standardized times, allowing easy comparison between sightings for those studying extraterrestrial contact.

Last 20 Games Major League Baseball Standings

Every fantasy league manager wants to know who's hot and who's not. This data set lets a team owner view data from the last 20 games by individual categories like batting average or home runs. It also gives baseball fans the opportunity to break down data about the top 10 players or dive into the complete raw data about players on every team.

List of cats in movies

Cats in movies have been around since 1903, according to this data set, which was compiled on OpenDataSoft, a portal for over 13,000 public data sets. The list can be sorted by director, producer, and year, and can be used to find out which decade was the most feline-friendly in film.

Billions of web clicks of 100K users at Indiana University

Imagine how many clicks one user goes through in a day. Now, multiply that by 100,000 people and you have the billions of clicks registered by Google software engineer Mark Meiss and the Indiana University Center for Complex Networks and Systems Research. This information can be used in practical applications like accurately predicting web traffic trends or understanding online behavior.

List of all countries with names and ISO 3166-1 codes

Serbian researcher Saša Stamenković created this data set to provide programmers with the names and ISO 3166-1 codes for every country in the world. Even more useful, the data translates the country names into all languages and data formats, so it can be implemented across the globe from Mongolia to South Africa. Sometimes, even interesting data sets can bring the world together.

Crowd Violence \ Non-violence Database

"Violence is the last refuge of the incompetent," Isaac Asimov noted. This database examines these "incompetent" actions by collecting footage of crowd violence from YouTube and analyzing the data for the paper "Violent Flows: Real-Time Detection of Violent Crowd Behavior" by Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. Future researchers may be able to use the data to identify violent situations in real time if they're captured by surveillance cameras.

San Francisco Restaurant Health Grades

Before your next big date, check out this data set compiled by the San Francisco Department of Public Health (SFDPH) to find out which restaurants failed their health inspections. The SFDPH breaks down the results by zip code for easy reference.

Caltech Pedestrian Detection Benchmark

The Caltech Pedestrian Detection Benchmark is incredibly useful to traffic researchers and consists of 10 hours of video in which 2,300 unique pedestrians were identified. Technology trained on it has the ability to recognize pedestrians both in and out of crosswalks, potentially eliminating injuries and fatalities. This data could be invaluable in the future for saving lives.

Data from the New Yorker Caption Contest

It's nearly impossible to objectively say whether something is funny or not, but this data set aims to do just that. It contains 33 million ratings on over 440,000 New Yorker captions, which researchers can use to create an algorithm that could, hypothetically, write its own funny captions.

Brussels comic book mural route

This data set gathers data on the comic book murals on buildings in Brussels, Belgium, including the artist behind each mural and the characters featured. The group VisitBrussels even publishes maps showing where the murals can be found so travelers can put together a tour.

Registered meteorites that has impacted on Earth

In 2013, a meteorite fell in the Ural Mountains in Russia, injuring more than 1,000 people. Inspired by this, Ramon Martinez created this data set, which is based on recorded meteorite impacts. Those using the data can determine how often an area has been hit and the size of the meteorite that hit it, possibly foreseeing what could be coming if the data has predictive value.

Government Hospitality wine cellar and consumption data set

When members of the United Kingdom government need to entertain guests, wine is part of the hospitality. To keep track of all the wine U.K. ministers are serving, the British government publishes a data set that tracks consumption of wine by origin, vintage, and quantity to show specifically how hospitable government employees are being to their guests. For the record, only three bottles of Australian wine were consumed by the British government in July 2015.

A Complete Catalog of Every Time Someone Cursed or Bled Out in a Quentin Tarantino Movie

In Quentin Tarantino's "Pulp Fiction," Samuel L. Jackson's character famously says, "Check out the big brain on Brett," but the big brains here come from FiveThirtyEight, which created this data set. A movie buff can now impress friends by pinpointing the carnage in every Tarantino film with this list, which breaks down the act, minute, and movie where the bleeding or swearing occurred.

Abandoned Shopping Trolleys in Bristol Rivers

The Bristol City Council in England created this data set to identify the locations of abandoned shopping carts in the rivers of its fair town. While the relevance is limited to the citizens of Bristol, it certainly helps those wishing to round up the abandoned carts for an impromptu shopping trip.

Bigfoot Sightings

Bigfoot hunters can now take their hunt to specific geographic locations thanks to the Bigfoot Field Researchers Organization (BFRO), which has put together this data set. The BFRO has broken everything down into searchable categories, such as how many people have witnessed sightings, and offers suggestions for the most likely sightings based on geographic clusters of previous ones.

The Great British Toilet Map

Even bizarre data sets can be useful and, in this case, sponsored. The Great British Toilet Map, sponsored by a British cleaning product company, was created to promote the "Use Our Loos" public toilet initiative. The map, which charts the location of over 11,000 facilities, is the United Kingdom's "largest database of publicly accessible toilets," making it relevant to any Brit or visitor who's ever needed a loo.

Last words of Texas death row inmates

While it may seem morbid, it's also quite fascinating to examine this data set of the last words of inmates who were executed by the State of Texas. The data set includes information about the inmates and links to their last statements. While the data may not be as practical as some other sets on this list, the historical value of the information cannot be ignored.

Edible mushroom data set

Mushrooms can be delicious in a dish or potentially harmful. To separate the toxic from the tasty, researcher Jeff Schlimmer compiled records into a data set that organizes 23 species of gilled mushrooms into categories of "definitely edible," "definitely poisonous," and "unknown." The best advice is to stick to what's definitely edible.

Does it fart?

"Does it fart?" began as a Twitter hashtag that captivated fart-lovers everywhere. To answer the question, this odorous data set was created on OpenDataSoft with 80 different animal species along with the answer to the infamous question. The data set also includes specific notes like, "They do it often and have no shame," for orangutans.

The NORB Dataset, v1.0

For children, toys are things to play with, but they're something else entirely for researchers at the Courant Institute of New York University. The NORB data set is intended to be used for 3D recognition of objects based on shape. To create the data set, the researchers used images of 50 different toys from five categories: airplanes, trucks, four-legged animals, human figures, and cars.

Names from the dog population in Zurich

Dog lovers take a lot of time to find the perfect name for their pooch, and this data set provides insight into the choices of dog owners in the Swiss city of Zurich. The database contains over 7,000 names, including Akosambo's Black Massai Ulani, Zorro of Blue Diamond, and Windy Nights Nice Angel.

Montreal Bixi Trip History

Bike-sharing programs operate in major urban areas across the globe. For anyone curious about the duration of every single ride taken in Montreal through the BIXI system, BIXI provides trip history data separated by month. While the information is most useful to BIXI itself for determining pricing and usage rates, the data sets can also be beneficial to public health researchers looking to show that bicycling is making Montreal residents healthier.

SMS Spam Collection v. 1

For anyone who's ever been annoyed by random text messages advertising free money and crazy scams, there's the SMS Spam Collection data set. The set comprises over 5,000 messages compiled by the University of California, Irvine Machine Learning Repository. The information was collected from various sources, including a Singaporean university database. The messages in the data set are categorized as real messages, aka "ham," or illegitimate messages, aka "spam." The ultimate goal is to use the data set to train machines to recognize the difference between the two.
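For the curious, here is a minimal sketch of how the ham/spam labels might be read and tallied in Python. It assumes each line of the file holds a label, a tab, and then the message text; the function name parse_sms_lines and the three sample messages are made up for illustration and are not taken from the data set itself.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Split 'label<TAB>message' lines into (label, message) pairs.

    Assumption: one message per line, with the label ("ham" or "spam")
    separated from the text by a single tab character.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, message = line.partition("\t")
        pairs.append((label, message))
    return pairs


# Hypothetical sample lines, invented for this sketch.
sample = [
    "ham\tSee you at lunch?",
    "spam\tYou have won a free prize! Reply now",
    "ham\tRunning late, sorry",
]

pairs = parse_sms_lines(sample)
counts = Counter(label for label, _ in pairs)
print(counts["ham"], "ham and", counts["spam"], "spam messages")
```

With the real file, the same function could be fed the lines of an opened file in place of the sample list; the label counts give a quick sense of how lopsided the two categories are before any machine is trained on them.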

 
