
opendatasets

opendatasets is a Python library for downloading datasets from online sources like Kaggle and Google Drive using a simple Python command.

Installation

Install the library using pip:

pip install opendatasets --upgrade
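To confirm that the install worked, one option is to check that the package can be imported. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of opendatasets.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("opendatasets installed:", is_installed("opendatasets"))
```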

Usage - Downloading a dataset

Datasets can be downloaded within a Jupyter notebook or Python script using the opendatasets.download helper function. Here's some sample code for downloading the US Elections Dataset:

import opendatasets as od
dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
od.download(dataset_url)

dataset_url can also point to a public Google Drive link or a raw file URL.
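Once a download finishes, you will usually want to see which files arrived. The sketch below assumes that the files are saved in a folder named after the last segment of the dataset URL (for example `./us-elections-dataset`); this is an assumption about the library's behavior, so check your working directory if nothing is listed. The helper names `dataset_id_from_url` and `list_downloaded_files` are hypothetical and not part of opendatasets.

```python
import os
from urllib.parse import urlparse


def dataset_id_from_url(dataset_url):
    """Return the last path segment of a dataset URL."""
    path = urlparse(dataset_url).path
    return path.rstrip("/").split("/")[-1]


def list_downloaded_files(dataset_url, data_dir="."):
    """List files in the folder the download is assumed to create.

    Returns an empty list if the folder does not exist.
    """
    # Assumption: the download folder is named after the dataset ID.
    folder = os.path.join(data_dir, dataset_id_from_url(dataset_url))
    if not os.path.isdir(folder):
        return []
    return sorted(os.listdir(folder))


if __name__ == "__main__":
    url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
    print(list_downloaded_files(url))
```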

Kaggle Credentials

opendatasets uses the official Kaggle API for downloading datasets from Kaggle. Follow these steps to find your API credentials:

  1. Go to https://kaggle.com/me/account (sign in if required).

  2. Scroll down to the "API" section and click "Create New API Token". This will download a file kaggle.json with the following contents:

{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}
  3. When you run opendatasets.download, you will be asked to enter your username and Kaggle API key, which you can get from the file downloaded in step 2.

Note that you need to download the kaggle.json file only once. You can also place the kaggle.json file in the same directory as the Jupyter notebook, and the credentials will be read automatically.
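Before starting a download, it can help to check that kaggle.json is present and well formed. The following is a minimal sketch using only the standard library; it relies on the file format shown above (a JSON object with "username" and "key"). The helper name `read_kaggle_credentials` is hypothetical and is not how opendatasets itself reads the file.

```python
import json
import os


def read_kaggle_credentials(path="kaggle.json"):
    """Return (username, key) from a kaggle.json file, or None if it is missing."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        creds = json.load(f)
    # Both fields from the downloaded token file are required.
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return creds["username"], creds["key"]


if __name__ == "__main__":
    creds = read_kaggle_credentials()
    if creds is None:
        print("kaggle.json not found; you will be prompted for credentials.")
    else:
        print("Found credentials for user:", creds[0])
```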

IMPORTANT NOTE: If you're downloading a competition dataset, make sure to first accept the rules of the competition.

Some interesting datasets

You can find interesting datasets on Kaggle: https://www.kaggle.com/datasets

You can also create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable)

Other sources to look for datasets:

If you use an external source other than Kaggle, you can create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable using opendatasets)

Curated Datasets

opendatasets also provides some curated datasets that you can download by passing the Dataset ID to opendatasets.download. Here's an example:

import opendatasets
opendatasets.download('stackoverflow-developer-survey-2020')

The following datasets are available for download.

| Dataset ID | Description | Source |
| --- | --- | --- |
| stackoverflow-developer-survey-2020 | Stack Overflow Developer Survey 2020 | Stack Overflow |
| owid-covid-19-latest | Covid-19 Stats by Our World in Data | Our World in Data |
| state-of-javascript-2016 | State of Javascript Annual Survey 2016 | StateOfJS |
| state-of-javascript-2017 | State of Javascript Annual Survey 2017 | StateOfJS |
| state-of-javascript-2018 | State of Javascript Annual Survey 2018 | StateOfJS |
| state-of-javascript-2019 | State of Javascript Annual Survey 2019 | StateOfJS |
| countries-languages-spoken | Languages Spoken in Different Countries | Infoplease |

More datasets will be added soon.
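To catch a mistyped Dataset ID before any download starts, you could validate it against the IDs in the table above. This is a minimal sketch: the set `CURATED_DATASET_IDS` is a hand-copied snapshot of that table and the wrapper `download_curated` is hypothetical, not part of opendatasets. The optional `downloader` argument exists only so the wrapper can be tried without network access.

```python
# Snapshot of the Dataset IDs listed in the table; update it as new ones are added.
CURATED_DATASET_IDS = {
    "stackoverflow-developer-survey-2020",
    "owid-covid-19-latest",
    "state-of-javascript-2016",
    "state-of-javascript-2017",
    "state-of-javascript-2018",
    "state-of-javascript-2019",
    "countries-languages-spoken",
}


def download_curated(dataset_id, downloader=None):
    """Download a curated dataset after checking the ID against the known list."""
    if dataset_id not in CURATED_DATASET_IDS:
        raise ValueError(f"Unknown curated Dataset ID: {dataset_id!r}")
    if downloader is None:
        # Imported lazily so the ID check works even without the package installed.
        import opendatasets
        downloader = opendatasets.download
    return downloader(dataset_id)
```

Calling `download_curated('stackoverflow-developer-survey-2020')` then behaves like the example above, while a misspelled ID raises a ValueError immediately.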

Contributing

This is an open source project and we welcome contributions.

Local Development Setup

  1. Clone the repository:
git clone https://github.com/JovianML/opendatasets.git
  2. Set up the Python environment for development:
conda create -n opendatasets python=3.5
conda activate opendatasets
pip install -r requirements.txt
  3. Open the project in VS Code and make your changes. Make sure to install the Python extension for VS Code and select the opendatasets conda environment.

This package is developed and maintained by the Jovian team.
