On December 16-18, the High-Tech Park will host Belarus's first AI Hackathon, and you will need datasets to take part. This article points to resources where you can download datasets. Some of them may even give you ideas for projects.
A similar hackathon, KPI Vision Hack, took place in Kyiv on November 25-27. You can find the mentors' impressions and the participants' ideas (for inspiration ;) ) on the hackathon's Facebook page. We would like to thank the KPI Vision Hack organizers for their selection of useful links.
This material is based on the weekly newsletter by journalist and programmer Jeremy Singer-Vine (Data Editor @BuzzFeed, New York City). Subscribe to his newsletter, Data Is Plural, to receive emails with dataset sources.
- Classical music, annotated. MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. [h/t Lon Riesberg]
- Health habits. The CDC calls its Behavioral Risk Factor Surveillance System “the largest continuously conducted health survey system in the world.” Every year, the survey asks more than 400,000 American adults about a range of health-related topics, from tobacco to seatbelt use, from alcohol consumption to arthritis, from HIV testing to immunizations. Annual datasets from 1984–2015 are currently available. [h/t Ricardo Pietrobon]
- Global agriculture. EarthStat provides geographic data on harvest regions, yields, and fertilizer use for more than 100 crops. The website also publishes data on pasture land, water depletion, and climatological effects on crop yields.
- Millions of Amazon reviews. Julian McAuley, an assistant professor at UC San Diego, has collected a massive amount of user-generated data from Amazon.com, including 142.8 million reviews and 1.4 million answered Q&As. (As of mid-2014, Sophie la Girafe was the most-reviewed item in the baby category. Backstory here.) Much of the data can be downloaded directly, but the largest files require contacting McAuley for access. [h/t Reddit user samofny]
- Fake news on Facebook. Last month, colleagues at BuzzFeed News and I analyzed and fact-checked 1,000+ posts from hyperpartisan Facebook pages, and found a disturbingly high rate of fake news. Here’s the data. Facebook CEO Mark Zuckerberg has dismissed the possibility that fake news influenced the election, calling it a “pretty crazy idea”. Meanwhile, renegade Facebook employees have now formed an unofficial task force to battle fake news on the platform.
- The most important entries on Wikipedia. Germany-based researcher Andreas Thalhammer has applied PageRank — the algorithm at the heart of Google’s origin story — to the world of Wikipedia. The result: the DBpedia PageRank dataset, which estimates the importance of each page based on the other pages that link to it. You can download the data directly, or query it online. (According to the metric, Aristotle, Plato, and Karl Marx are history’s three most Wiki-central philosophers).
- School testing. The Department of Education’s EDFacts data tracks public grade schools’ participation and proficiency rates on standardized math and reading/language exams. The files provide data on all students who took the tests, broken down by race/ethnicity, sex, disability status, homelessness, and more. A related set of data files, available on the same page, tracks high-school graduation rates.
- Airborne. OpenFlights.org has collected data on more than 60,000 flight routes, including 915 itineraries departing Atlanta’s Hartsfield–Jackson International Airport. (That airport was recently named the world’s busiest, for the 18th year in a row). For each route, the dataset indicates the airline, the departing airport, the arriving airport, the number of stops, and what type of plane is typically used. The website also provides datasets on thousands of airports and airlines. Important caveat: “This data is not suitable for navigation.”
- R&D spending. The UNESCO Institute for Statistics’ data on national research and development budgets contains estimates of personnel and total spending by field, funding source, and more. You can also explore the data online through a series of interactive graphics. [h/t Rebecca Galloway]
- Music makers. The American Society of Composers, Authors and Publishers (ASCAP) boasts a membership of “more than 585,000 US composers, songwriters, lyricists and music publishers of every kind of music.” The organization also maintains a downloadable catalog of the writers and publishers behind nearly 9 million songs. (But the downloaded files lack key details, such as the date the song was published).
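For readers who want to experiment with the idea behind the Wikipedia entry in the list above (ranking a page by the pages that link to it), here is a minimal PageRank sketch using power iteration. It is only an illustration, not the method used to build the DBpedia PageRank dataset: the `pagerank` function, the damping factor of 0.85, the iteration count, and the four-page toy graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance by power iteration.

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Every other page splits its rank equally among its link targets.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank

    return rank


# Hypothetical toy graph: A and B link to each other, C links to both,
# and D has no outgoing links.
toy_links = {
    "A": ["B"],
    "B": ["A"],
    "C": ["A", "B"],
    "D": [],
}
ranks = pagerank(toy_links)
```

In this toy graph, A and B end up with the highest scores because they are the only pages that receive links. On a real link graph you would more likely rely on an existing library implementation than on a hand-written loop.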
We also recommend taking a look at:
- Xiaming Chen's GitHub account, Caesar0301, where he created the Awesome Public Datasets repository: an impressive list of links to open data that Xiaming Chen compiled with the help of internet users (note that not all of the datasets are free!);
- kaggle.com, a platform popular among data scientists: the datasets are public, and every registered user of the site can download them;
- the online resource Data Science Central: its list covers data on a variety of topics, with a detailed description of each dataset;
- dataset downloads at kdnuggets.com.
We are sure these resources will be useful as you prepare for the AI Hackathon.
See you at #aihackby!