If you are looking for a map to discover new datasets, you are in the right place.
“My philosophy is that worrying means you suffer twice.” Newt Scamander
It is fun to imagine datasets as mythical creatures with their own personality and traits. Don’t let their surly appearance fool you. Datasets are gentle creatures, willing to reveal their secrets when you spend enough time taming them. If by chance you encounter a wild specimen, do not despair: with a plan you can subdue the beast. In your first encounter, analyze its appearance, presentation and size, and whether it lives in a database or in several folders. When you feel more at ease, carry out a second, deeper inspection: study the nature of the data and understand the situation that it describes (this step is crucial). Finally, when you feel more confident, find the best approach to open up the wisdom inside that creature and create a code routine to process it. Now, with all this knowledge, we can go into more detail about the types of datasets and how to recognize them.
As an apprentice, you start by taming tabular datasets. You first learn to recognize the numerical values (integer, double, complex …), the other types of values (string, date, category, binary …) and the file format (.json, .xls, .csv), all of which are visible directly in the files. These datasets often live in files that are not big: less than 1 GB, no more than 1 million records, and small enough to be saved locally. Most of them are very easy to copy, manipulate and explore, and you can use Excel or Notepad to inspect them. Some notable specimens are the iris dataset, the titanic dataset and the Census dataset.
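A first inspection of a tabular specimen can be sketched in a few lines of Python. This is only a minimal illustration that uses the standard library: the sample rows are made up in the style of the iris dataset, and `guess_type` and `summarize` are hypothetical helper names, not part of any library.

```python
import csv
import io

# A tiny made-up sample in the style of the iris dataset (values are illustrative only).
SAMPLE_CSV = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

def guess_type(value):
    """Guess whether a raw CSV string is an integer, a float, or plain text."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"

def summarize(csv_text):
    """Return the record count and a guessed type per column, based on the first row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    column_types = {name: guess_type(value) for name, value in rows[0].items()}
    return len(rows), column_types

n_records, column_types = summarize(SAMPLE_CSV)
print(n_records, column_types)
```

For a real file you would pass an open file handle to `csv.DictReader` instead of the in-memory string, and you would probably check more than the first row before trusting the guessed types.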
Once you have mastered the principles of tabular data and want to tame bigger creatures, you will have to visit the databases. Structured datasets are queried with SQL or NoSQL statements, which allow you to read one section of the data at a time. These datasets are very big, from 1 million to billions of records, and some of them grow continually. Examples are the user databases of Netflix (sample dataset) or Amazon.
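Reading one section at a time could look like the sketch below. It assumes a small in-memory SQLite database as a stand-in for a much larger one; the `users` table and the `read_in_chunks` helper are hypothetical and only serve the illustration.

```python
import sqlite3

def read_in_chunks(connection, query, chunk_size):
    """Yield lists of rows so that only one chunk is held in memory at a time."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows

# In-memory database with a hypothetical "users" table standing in for a large one.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user_{i}") for i in range(1, 11)],
)

# Ten rows read four at a time arrive as three chunks.
query = "SELECT id, name FROM users ORDER BY id"
chunk_sizes = [len(chunk) for chunk in read_in_chunks(connection, query, 4)]
print(chunk_sizes)
```

The same pattern applies to other database drivers that follow the Python DB-API, although the connection setup will differ.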
If you are an experienced ML magician, you may want to cast more advanced spells on unstructured datasets. These are the most popular nowadays, and the category includes sound, image, and text. For example, if you want to classify plant images by species, each class usually gets its own folder named after the category, and all the class folders sit inside one bigger folder. Some popular datasets in this category are COCO (Common Objects in Context), ImageNet (an image database organized according to the WordNet hierarchy) and OpenImages.
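The folder-per-class layout can be turned into labels with a short routine. The sketch below is an assumption about how such a dataset might be organized: it builds a throwaway directory with two made-up classes (`fern` and `moss`) and empty placeholder files, and `list_labeled_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def list_labeled_files(root):
    """Return (file name, label) pairs, taking the label from each file's parent folder."""
    pairs = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.is_file():
                pairs.append((image_path.name, class_dir.name))
    return pairs

# Build a temporary folder tree with made-up class names and empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for label, names in {"fern": ["a.jpg", "b.jpg"], "moss": ["c.jpg"]}.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for name in names:
            (class_dir / name).touch()
    labeled = list_labeled_files(root)

print(labeled)
```

With a real dataset you would point `list_labeled_files` at the bigger folder that holds all the classes and then load each image with the library of your choice.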
You will find other types of datasets too, for example those with a time variable that orders the records, such as time series, videos and historical data. They are very similar to tabular data, but the essential difference is that the records are ordered, and that order changes how the data can be processed, so be careful to use techniques that keep a memory of earlier records to track the changes.
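One common consequence of that ordering is that the data should not be shuffled before it is split: a model evaluated on records that come before its training data gets an unfair peek at the future. The sketch below shows a chronological split under that assumption; the records are made up and `chronological_split` is a hypothetical helper.

```python
from datetime import date

def chronological_split(records, train_fraction):
    """Sort records by date and cut once, so no test record is earlier than a training record."""
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up (date, value) records, deliberately out of order.
records = [
    (date(2021, 1, 3), 12.0),
    (date(2021, 1, 1), 10.0),
    (date(2021, 1, 4), 13.0),
    (date(2021, 1, 2), 11.0),
]

train, test = chronological_split(records, 0.75)
print(train)
print(test)
```

Sorting first and cutting once keeps the past in the training set and the most recent records in the test set.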
Now that you have some tricks to approach new datasets, I want to share with you a list of sites where you can find thousands of datasets to hone your skills. Practice makes perfect.
Where to find them
- UCI Machine Learning repository: All types of datasets, sometimes with paper references.
- Kaggle: The most popular site to find benchmarks and examples.
- Google dataset: Google gallery featuring hundreds of datasets.
- FKI repository: Datasets for computer vision, text and character recognition.
- Biometrics Ideal Test: Biometric datasets (fingerprint, iris and face).
- COCO: Common objects in context.
- ImageNet: Images indexed with text content.
- OpenImages: Dataset for image segmentation and other object detection tasks.
I hope you find something useful in this post. Don’t forget to have a little fun playing with your datasets. See you soon.
“It matters not what someone is born, but what they grow to be.” Albus Dumbledore