Google Dataset Search: A search service for finding datasets from the sciences, government, and some news organizations.
Re3Data: 2,000 Data Repositories and Science Europe’s Framework for Discipline-specific Research Data Management
Open Data Inception: 2600+ Open Data Portals Around the World
Reddit Datasets: A place to share, find, and discuss datasets.
AWS Public Datasets: AWS hosts a variety of public datasets that anyone can access for free.
awesome-public-datasets #Project#: An awesome list of high-quality open datasets in public domains (on-going).
Yelp Academic Dataset: All the data and reviews of the 250 closest businesses for 30 universities, for students and academics to explore and research.
Data For Everyone: A selection of open datasets created on the Figure Eight platform, free for anyone to download.
20 Newsgroups: The text from 20,000 messages taken from 20 Usenet newsgroups for text analysis, classification, etc.
Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.
Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.
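chinese-xinhua: Chinese Xinhua Dictionary database and API. Contains 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).
chinese-poetry: The most complete database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. It covers nearly 14,000 poets from the Tang and Song dynasties and 1,500 ci writers from the Northern and Southern Song periods. The data was collected from the internet.
2019-ChineseGLUE #Project#: Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus, and leaderboard.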
fashion-mnist #Project#: Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples.
facets #Project#: The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.
Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173 MB total.
Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480 KB.
NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5 GB total.
One Million Songs: Audio features and metadata for a 10,000-song subset of the one million popular songs dataset, for recognition and classification. 1.8 GB.
Hate Speech Identification: A sampling of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66 MB.
Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138 KB; use the Flickr API to retrieve the images.
NSFW Data Scrapper #Project#: Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier.
im2latex-100k #Project#: A prebuilt dataset for OpenAI's image-2-latex task. Includes a total of ~100k formulas and images split into train, validation, and test sets.
Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger; can be used to identify key social contacts and influencers. Add the dataset to your cart to access it.
SNAP: Stanford Large Network Dataset Collection
MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.
Network Repository: Network repository is not only the first interactive repository, but also the largest network repository with thousands of donations.
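China 5-level administrative divisions MySQL database #Project#: Administrative division data crawled from the official website of the National Bureau of Statistics, covering five levels: province, city, county, town, and village.
china_regions #Project#: The most complete and up-to-date JSON and SQL data for China's provinces, cities, and districts.
qqzeng-ip #Project#: An up-to-date IP address database, with parsers in multiple languages and scripts for importing the data into a database.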
Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.
Tushare: Its trading data covers stock market quotes; simple API calls return the corresponding data in DataFrame format.
Football Strategy: Thousands of scenarios to make the best coaching decisions. 876 KB total.
Horses for Course: Horse-racing data for predicting race results. 19 MB total.
NBA & MLB Stats: Current and past season stats for teams and players for fantasy sports predictions.
National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2 GB total.
Prostate Cancer: Tumor and non-tumor samples, used to recognize prostate cancer. 4.8 MB total.
Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmurs, etc. 47.7 MB total.
Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343 KB total.
malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.
The home of the U.S. Government’s open data: Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
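Enterprise Registration Data of Chinese Mainland: Registration information for more than ten million businesses across 31 provinces of mainland China from 1978 to 2019, including company name, registered address, unified social credit code, region, registration date, business scope, legal representative, registered capital, enterprise type, and other details.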