Which of the following are usually good data sources? Select all that apply: social media sites, governmental agency data, academic papers, vetted public datasets.

Data is a critical component of decision making, helping businesses and organizations gain key insights and understand the implications of their decisions at a granular level. And visual analytics, in the form of interactive dashboards and visualizations, are essential tools for anyone—from students to CEOs—who needs to analyze data and tell stories with data. Public data sets are ideal resources to tap into to create data visualizations. With the information provided below, you can explore a number of free, accessible data sets and begin to create your own analyses.

Contents

COVID-19 Data Visualization
Free Health Data Sets
Free Social Impact Data Sets
Free Climate and Environment Data Sets
Tableau For Everyone
Free Government Data Sets
Free Education Data Sets
Other Cool Free Data Sets
Free Public Data Sets for Advanced Users
The Cancer Genome Atlas
Foldingathome COVID-19 Datasets
Therapeutically Applicable Research to Generate Effective Treatments (TARGET)
Common Crawl
Gabriella Miller Kids First Pediatric Research Program (Kids First)
NASA Prediction of Worldwide Energy Resources (POWER)
NEXRAD on AWS
NOAA Geostationary Operational Environmental Satellites (GOES) 16, 17 & 18
Genome Aggregation Database (gnomAD)
Cell Painting Gallery
Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery
Allen Cell Imaging Collections
International Neuroimaging Data-Sharing Initiative (INDI)
NOAA Operational Forecast System (OFS)
Digital Earth Africa Sentinel-2 Level-2A
Department of Energy's Open Energy Data Initiative (OEDI)
Open NeuroData
DOE's Water Power Technology Office's (WPTO) US Wave dataset
NREL Wind Integration National Dataset
Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)
USGS 3DEP LiDAR Point Clouds
World Bank - Light Every Night
Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)
Global Database of Events, Language and Tone (GDELT)
NOAA Joint Polar Satellite System (JPSS)
BossDB Open Neuroimagery Datasets
Low Altitude Disaster Imagery (LADI) Dataset
NOAA Rapid Refresh Forecast System (RRFS) [Prototype]
Open Bioinformatics Reference Data for Galaxy
Reference Elevation Model of Antarctica (REMA)
CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset
CoMMpass from the Multiple Myeloma Research Foundation
Community Earth System Model Large Ensemble (CESM LENS)
First Street Foundation (FSF) Flood Risk Summary Statistics
Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set
NOAA National Water Model CONUS Retrospective Dataset
The Human Connectome Project
Basic Local Alignment Sequences Tool (BLAST) Databases
Boreas Autonomous Driving Dataset
JMA Himawari-8
Maxar Open Data Program
Mouse Brain Anatomy: MouseLight Imagery
NAIP on AWS
NREL National Solar Radiation Database
OpenCell on AWS
Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1
Virginia Coastal Resilience Master Plan, Phase 1 - December 2021
Yale-CMU-Berkeley (YCB) Object and Model Set
Beat Acute Myeloid Leukemia (AML) 1.0
Cell Organelle Segmentation in Electron Microscopy (COSEM) on AWS
Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma
Finnish Meteorological Institute Weather Radar Data
Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)
MIMIC-III (‘Medical Information Mart for Intensive Care’)
Medical Segmentation Decathlon
Multiview Extended Video with Activities (MEVA)
OpenAlex dataset
The Human Microbiome Project
4D Nucleome (4DN)
Atmospheric Models from Météo-France
Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer
Copernicus Digital Elevation Model (DEM)
DNAStack COVID19 SRA Data
DigitalCorpora
Hecatomb Databases
NOAA Climate Forecast System (CFS)
NOAA Emergency Response Imagery
NOAA World Ocean Database (WOD)
Pancreatic Cancer Organoid Profiling
Protein Data Bank 3D Structural Biology Data
RAPID NRT Flood Maps
STOIC2021 Training
Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan
Terra Fusion Data Sampler
3DCoMPaT: Composition of Materials on Parts of 3D Things
A2D2: Audi Autonomous Driving Dataset
ARPA-E PERFORM Forecast data
Allen Brain Observatory - Visual Coding AWS Public Data Set
COVID-19 Genome Sequence Dataset
Cell Painting Image Collection
Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset
Daylight Map Distribution of OpenStreetMap
Ford Multi-AV Seasonal Dataset
Global Biodiversity Information Facility (GBIF) Species Occurrences
High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade
Human Cancer Models Initiative (HCMI) Cancer Model Development Center
Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)
NASA / USGS Europa Controlled Observations
NOAA Global Forecast System (GFS)
NOAA Global Surface Summary of Day
NOAA Integrated Surface Database (ISD)
NOAA National Digital Forecast Database (NDFD)
NOAA/PMEL Ocean Climate Stations Moorings
New Jersey Statewide Digital Aerial Imagery Catalog
New Jersey Statewide LiDAR
Ohio State Cardiac MRI Raw Data (OCMR)
Oxford Nanopore Technologies Benchmark Datasets
SILAM Air Quality
Sentinel-1 SLC dataset for Germany
Tabula Muris
Voices Obscured in Complex Environmental Settings (VOiCES)
2021 Amazon Last Mile Routing Research Challenge Dataset
A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018)
Australasian Genomes
CAFE60 reanalysis
CCAFS-Climate Data
COCO - Common Objects in Context - fast.ai datasets
COVID-19 Molecular Structure and Therapeutics Hub
Cloud to Street - Microsoft Flood and Clouds Dataset
District of Columbia - Classified Point Cloud LiDAR
Downscaled Climate Data for Alaska
Epoch of Reionization Dataset
Galaxy Evolution Explorer Satellite (GALEX)
Google Books Ngrams
HIRLAM Weather Model
High Resolution Downscaled Climate Data for Southeast Alaska
Homeland Security and Infrastructure US Cities
Image localization - fast.ai datasets
InRad COVID-19 X-Ray and CT Scans
K2 Mission Data
KITTI Vision Benchmark Suite
Kepler Mission Data
NLP - fast.ai datasets
NOAA Atmospheric Climate Data Records
NOAA Coastal Lidar Data
NOAA Continuously Operating Reference Stations (CORS) Network (NCN)
NOAA Fundamental Climate Data Records (FCDR)
NOAA Global Ensemble Forecast System (GEFS)
NOAA Global Extratropical Surge and Tide Operational Forecast System (Global ESTOFS)
NOAA Global Hydro Estimator (GHE)
NOAA Global Mosaic of Geostationary Satellite Imagery (GMGSI)
NOAA Global Real-Time Ocean Forecast System (Global RTOFS)
NOAA National Bathymetric Source Data
NOAA National Blend of Models (NBM)
NOAA National Water Model Short-Range Forecast
NOAA North American Mesoscale Forecast System (NAM)
NOAA Oceanic Climate Data Records
NOAA Rapid Refresh (RAP)
NOAA Real-Time Mesoscale Analysis (RTMA)
NOAA Severe Weather Data Inventory (SWDI)
NOAA Space Weather Forecast and Observation Data
NOAA Terrestrial Climate Data Records
NOAA U.S. Climate Gridded Dataset (NClimGrid)
NOAA Unified Forecast System (UFS) Marine Reanalysis: 1979-2019
NOAA Unified Forecast System Short-Range Weather (UFS SRW) Application
NOAA Unified Forecast System Subseasonal to Seasonal Prototypes
NOAA Unified Forecast System Weather Model (UFS-WM) Regression Tests
Nanopore Reference Human Genome
Natural Scenes Dataset
OpenFold Training Data
PROJ datum grids
Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)
SMN Hi-Res Weather Forecast over Argentina
SUCHO Ukrainian Cultural Heritage Web Archives
Smithsonian Open Access
Software Heritage Graph Dataset
Tabula Muris Senis
Tabula Sapiens
The Genome Modeling System
The Massively Multilingual Image Dataset (MMID)
UCSC Genome Browser Sequence and Annotations
University of British Columbia Sunflower Genome Dataset
iNaturalist Licensed Observation Images
stdpopsim species resources
AgricultureVision
ChEMBL - Data Lakehouse Ready
ClinVar - Data Lakehouse Ready
NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6)
YouTube 8 Million - Data Lakehouse Ready
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 - Data Lakehouse Ready
BodyM Dataset
Google Brain Genomics Sequencing Dataset for Benchmarking and Development
Humor patterns used for querying Alexa traffic
MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4
Orcasound - bioacoustic data for marine conservation
PersonPath22
Pre- and post-purchase product questions
The Multilingual Amazon Reviews Corpus
WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation
Wizard of Tasks
Airborne Object Tracking Dataset
Amazon Berkeley Objects Dataset
MWIS VR Instances
Registry of Open Data on AWS
The Klarna Product-Page Dataset
Which of the following are usually good data sources?
What are the main benefits of open data? Select all that apply.
Which of the following are types of data bias often encountered in data analytics? Select all that apply.
What is the process for arranging data into a meaningful order to make it easier to understand, analyze, and visualize?

The following COVID-19 data visualization is representative of the types of visualizations that can be created using free public data sets. Explore it and a catalogue of free data sets across numerous topics below.

COVID-19 Data Visualization

Free Health Data Sets

Health dashboards can be used to highlight key metrics including: changes in a population’s health over time, how people choose to receive healthcare, or urgent public health information, such as vaccination rates during a global pandemic.

Usage examples

Cancer Genomics Cloud by Seven Bridges
The Immune Landscape of Cancer by Vésteinn Thorsson, David L. Gibbs, et al.
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples by Han Chen, Chunyan Li, et al.
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas by Theo A. Knijnenburg, Linghua Wang, et al.
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context by Hua-Sheng Chiu, Sonal Somvanshi, et al.

See 29 usage examples →

Foldingathome COVID-19 Datasets

alchemical free energy calculations, biomolecular modeling, coronavirus, COVID-19, foldingathome, health, life sciences, molecular dynamics, protein, SARS-CoV-2, simulations, structural biology

Folding@home is a massively distributed computing project that uses biomolecular simulations to investigate the molecular origins of disease and accelerate the discovery of new therapies. Run by the Folding@home Consortium, a worldwide network of research laboratories focusing on a variety of different diseases, Folding@home seeks to address problems in human health on a scale that is infeasible by any other means, sharing the results of these large-scale studies with the research community through peer-reviewed publications and publicly shared datasets. During the COVID-19 epidemic, Folding@home focused its resources on understanding the vulnerabilities in SARS-CoV-2, the virus that causes COVID-19 disease, and working closely with a number of experimental collaborators to accelerate progress toward effective therapies for treating COVID-19 and ending the pandemic. In the process, it created the world's first exascale distributed computing resource, enabling it to generate valuable scientific datasets of unprecedented size. More information about Folding@home's COVID-19 research activities is available at the Folding@home COVID-19 page. In addition to working directly with experimental collaborators and rapidly sharing new research findings through preprint servers, Folding@home has joined other researchers in committing to rapidly share all COVID-19 research data, and has joined forces with AWS and the Molecular Sciences Software Institute (MolSSI) to share datasets of unprecedented size through the AWS Open Data Registry, indexing these massive datasets via the MolSSI COVID-19 Molecular Structure and Therapeutics Hub. The complete index of all Folding@home datasets can be found here. Th...

Usage examples

SARS-CoV-2 spike protein dataset: A 1.2 ms dataset of the SARS-CoV-2 spike protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis
SARS-CoV-2 RNA polymerase (nsp12, RdRP) dataset: A 3.4 ms dataset of the SARS-CoV-2 nsp12 protein in search of cryptic pockets by The Bowman lab at Washington University in St. Louis
SARS-CoV-2 spike RBD with P337L mutation bound to monoclonal antibody S309 (923.2 µs) by The Chodera lab at the Memorial Sloan Kettering Cancer Center
SARS-CoV-2 RBD antibodies that maximize breadth and resistance to escape by Tyler N. Starr, Nadine Czudnochowski, Zhuoming Liu, et al.
SARS-CoV-2 spike RBD bound to human ACE2 receptor (173.8 us): Wild-type and mutant simulations by The Chodera lab at the Memorial Sloan Kettering Cancer Center

See 24 usage examples →

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

cancer, genomic, life sciences, STRIDES, whole genome sequencing

Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen...

Usage examples

A Children's Oncology Group and TARGET initiative exploring the genetic landscape of Wilms tumor by Gadd S, Huff V, Walz AL, et al.
Genetic predisposition to neuroblastoma mediated by a LMO1 super-enhancer polymorphism by Oldridge DA, Wood AC, Weichert-Leahey N, Crimmins I, Sussman R, Winter C, McDaniel LD, Diamond M, Hart LS, Zhu S, Durbin AD, Abraham BJ, et al.
Recurrent DGCR8, DROSHA, and SIX homeodomain mutations in favorable histology Wilms tumors by Walz AL, Ooms A, Gadd S, et al.
Biomarker significance of plasma and tumor miR-21, miR-221, and miR-106a in osteosarcoma by Nakka M, Allen-Rhoades W, Li Y, et al.
CSF3R mutations have a high degree of overlap with CEBPA mutations in pediatric AML by Maxson JE, Ries RE, Wang YC, et al.

See 24 usage examples →

Common Crawl

encyclopedic, internet, natural language processing

A corpus of web crawl data composed of over 50 billion web pages.

Usage examples

On the impact of publicly available news and information transfer to financial markets by Metod Jazbec, Barna Pásztor, Felix Faltings, Nino Antulov-Fantulin, Petter N. Kolm
CCAligned: A Massive collection of cross-lingual web-document pairs by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn
Defending against neural fake news by Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, et al
N-gram counts and language models from the Common Crawl by Christian Buck, Kenneth Heafield, Bas van Ooyen
Analysing Petabytes of Websites by Mark Litwintschik

See 23 usage examples →

Gabriella Miller Kids First Pediatric Research Program (Kids First)

cancer, genetic, genomic, Homo sapiens, life sciences, pediatric, STRIDES, structural birth defect, whole genome sequencing

The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids Fi...

Usage examples

MAGEL2-Related Disorders: A study and case series. by Jameson Patak, James Gilfert, et al.
Development and Clinical Validation of a Large Fusion Gene Panel for Pediatric Cancers. by Fengqi Chang, Fumin Lin, et al.
Germline microsatellite genotypes differentiate children with medulloblastoma. by Samuel Rivero-Hinojosa, Nicholas Kinney, et al.
Kids First DRC Source Code by Kids First DRC
Deleterious de novo variants of X-linked ZC4H2 in females cause a variable phenotype with neurogenic arthrogryposis multiplex congenita. by Suzanna G M Frints, Friederike Hennig, et al.

See 19 usage examples →

NASA Prediction of Worldwide Energy Resources (POWER)

agriculture, air quality, analytics, archives, atmosphere, climate, climate model, data assimilation, deep learning, earth observation, energy, environmental, forecast, geoscience, geospatial, global, history, imaging, industry, machine learning, machine translation, metadata, meteorological, model, netcdf, opendap, radiation, satellite imagery, solar, statistics, sustainability, time series forecasting, water, weather, zarr

NASA's goal in Earth science is to observe, understand, and model the Earth system to discover how it is changing, to better predict change, and to understand the consequences for life on Earth. The Applied Sciences Program serves NASA and Society by expanding and accelerating the realization of societal and economic benefits from Earth science, information, and technology research and development.

The NASA Prediction Of Worldwide Energy Resources (POWER) Project, a NASA Applied Sciences program, improves the accessibility and usage of NASA Earth Observations (EO), supporting community research in three focus areas: 1) renewable energy development, 2) building energy efficiency, and 3) agroclimatology applications. POWER can help communities be resilient amid observed climate variability through easy access to solar and meteorological data via a variety of access methods.

The latest POWER version includes hourly-based source Analysis Ready Data (ARD), in addition to enhanced daily, monthly, annual, and climatology ARD. The daily time series spans 40 years, with meteorology available from 1981 and solar-based parameters starting in 1984. The hourly source data are from Clouds and the Earth's Radiant Energy System (CERES) and the Global Modeling and Assimilation Office (GMAO), spanning 20 years from 2001. The hourly data will provide users the ARD needed to model the energy performance of building systems, providing information directly amenable to decision support tools and introducing the industry-standard EPW (EnergyPlus Weather file) format.

POWER also provides parameters at daily, monthly, annual, and user-defined time periods, spanning from 1984 through to within a week of real time. Additionally, POWER provides user-defined analytic capabilities, including custom climatologies and climatological-based reports for parameter anomalies, ASHRAE® compatible climate design condition statistics, and building climate zones.

The ARD and climate analytics will be readily accessible through POWER's integrated services suite, including the Data Access Viewer (DAV). The DAV has recently been improved to incorporate updated parameter groupings, new analytical capabilities, and the new data formats. POWER also provides a complete API (Application Programming Interface) that allows uses...
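
As a rough illustration of how a query against such an API might be assembled, the Python sketch below builds a request URL for a daily point query without sending it. The base URL, the query parameter names (`parameters`, `community`, `longitude`, `latitude`, `start`, `end`, `format`), and the example values `T2M`, `ALLSKY_SFC_SW_DWN`, and `RE` are assumptions that should be checked against the official POWER API documentation; `build_power_daily_url` is a hypothetical helper, not part of any POWER library.

```python
from urllib.parse import urlencode

# Assumed endpoint for daily point queries; verify against the POWER API docs.
POWER_DAILY_POINT = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_daily_url(latitude, longitude, start, end,
                          parameters=("T2M",), community="RE"):
    """Return a daily point-query URL (hypothetical helper; sends nothing).

    start and end are dates formatted as YYYYMMDD strings.
    """
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    query = {
        "parameters": ",".join(parameters),  # assumed comma-separated list
        "community": community,              # assumed user-community code
        "longitude": longitude,
        "latitude": latitude,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    # Keep commas literal so the parameter list stays readable in the URL.
    return f"{POWER_DAILY_POINT}?{urlencode(query, safe=',')}"


url = build_power_daily_url(36.0, -76.0, "20200101", "20200131",
                            parameters=("T2M", "ALLSKY_SFC_SW_DWN"))
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect before any data is fetched.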

Usage examples

Enhancing Climate Resilience at NASA Centers: A Collaboration between Science and Stewardship by Rosenzweig, C., and Coauthors
The Contribution of Solar Brightening to the US Maize Yield Trend by Tollenaar, T., J. Fridgen, P. Tyagi, P. W. Stackhouse Jr., and S. Kumudini
Association between solar insolation and a history of suicide attempts in bipolar I disorder by Bauer M, et al., Stackhouse PW Jr., et al.
Evaluation of NASA satellite- and assimilation model-derived long-term daily temperature data over the continental US by White, J. W., G. Hoogenboom, P. W. Stackhouse, and J. M. Hoell
POWER Data Access Viewer (DAV) by The POWER Project

See 18 usage examples →

NEXRAD on AWS

agriculture, earth observation, meteorological, natural resource, sustainability, weather

Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.

Usage examples

Extreme Pyroconvective Updrafts During a Megafire by B. Rodriguez, N. P. Lareau, D. E. Kingsmill, & C. B. Clements
Declines in an abundant aquatic insect, the burrowing mayfly, across major North American waterways by Phillip M. Stepanian, Sally A. Entrekin, Charlotte E. Wainwright, Djordje Mirkovic, Jennifer L. Tank, & Jeffrey F. Kelly
Updated introduction to S3, Boto, and NOAA Nexrad in SageMaker Studio Lab (SMSL) by Chris Stoner
nexradaws on pypi.python.org - python module to query and download Nexrad data from Amazon S3 by Aaron Anderson
Into the eye of the storm: NEXRAD Level II open data by Jonni Walker

See 16 usage examples →

NOAA Geostationary Operational Environmental Satellites (GOES) 16, 17 & 18

agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, sustainability, weather

NEW GOES-18 Data!!! GOES-18 is now provisional and data has begun streaming. Data files will be available between Provisional and the Operational Declaration of the satellite; however, data will have the caveat GOES-18 Preliminary, Non-Operational Data. The exception is during the interleave period when ABI Radiances and Cloud and Moisture Imagery data will be shared operationally via the NOAA Open Data Dissemination Program.

GOES satellites (GOES-16, GOES-17, & GOES-18) provide continuous weather imagery and monitoring of meteorological and space environment data across North America. ...

Usage examples

GOES Quick Guides (Spanish) by Anthony Segura García
Imaging Considerations From a Geostationary Orbit Using the Short Wavelength Side of the Mid-Infrared Water Vapor Absorption Band by N.B. Miller, M.M. Gunshor, A.J. Merrelli, T.S. L'Ecuyer, T.J. Schmit, J.J. Gerth, N.J. Gordillo
NOAA GOES16 Julia Jupyter Notebook Example by Peter Schmiedeskamp
Billions of Birds Migrate. Where Do They Go? by National Geographic
Forecasting Hurricane Tracks with TensorFlow and data from AWS S3 by Kyle Archie

See 16 usage examples →

Genome Aggregation Database (gnomAD)

bioinformatics, genetic, genomic, life sciences, population, population genetics, short read sequencing, whole genome sequencing

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v2 data set (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals. The v3 data set (GRCh38) spans 71,702 genomes, selected as in v2. Sign up for the gnomAD mailing list here.

Usage examples

Hail on AWS Quick Start by Amazon Web Services and PrivoIT
A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020) by Collins, R. L., Brand, H., Karczewski, K. J., Zhao, X., Alföldi, J., Francioli, L. C., Khera, A. V., Lowther, C., Gauthier, L. D., Wang, H., Watts, N. A., Solomonson, M., O’Donnell-Luria, A., Baumann, A., Munshi, R., Walker, M., Whelan, C., Huang, Y., Brookings, T., ... Talkowski, M. E.
The effect of LRRK2 loss-of-function variants in humans. Nature Medicine (2020) by Whiffin, N., Armean, I. M., Kleinman, A., Marshall, J. L., Minikel, E. V., Goodrich, J. K., Quaife, N. M., Cole, J. B., Wang, Q., Karczewski, K. J., Cummings, B. B., Francioli, L., Laricchia, K., Guan, A., Alipanahi, B., Morrison, P., Baptista, M. A. S., Merchant, K. M., Genome Aggregation Database Production Team, ... MacArthur, D. G.
gnomAD quality control GitHub repository by gnomAD Production Team
Transcript expression-aware annotation improves rare variant interpretation. Nature 581, 452–458 (2020) by Cummings, B. B., Karczewski, K. J., Kosmicki, J. A., Seaby, E. G., Watts, N. A., Singer-Berk, M., Mudge, J. M., Karjalainen, J., Kyle Satterstrom, F., O’Donnell-Luria, A., Poterba, T., Seed, C., Solomonson, M., Alföldi, J., The Genome Aggregation Database Production Team, The Genome Aggregation Database Consortium, Daly, M. J., & MacArthur, D. G.

See 15 usage examples →

SpaceNet

computer vision, disaster response, earth observation, geospatial, machine learning, satellite imagery

SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with data sets from projects like IARPA’s Functional Map of the World (fMoW).

Usage examples

SpaceNet 6: Dataset Release by Jake Shermeyer
SpaceNet 8 - The Detection of Flooded Roads and Buildings by Ronny Hansch, Jacob Arndt, Dalton Lunga, Matthew Gibb, Tyler Pedelose, Arnold Boedihardjo, Desiree Petrie, Todd M. Bacastow
Accelerating Ukraine Intelligence Analysis with Computer Vision on Synthetic Aperture Radar Imagery by Ritwik Gupta, Colorado Reed, Anja Rohrbach, and Trevor Darrell
SpaceNet: Winning Implementations and New Imagery Release by Todd Stavish
Getting Started with SpaceNet Data by Adam Van Etten

See 15 usage examples →

Cell Painting Gallery

bioinformatics, biology, cancer, cell biology, cell imaging, cell painting, chemical biology, computer vision, csv, deep learning, fluorescence imaging, genetic, high-throughput imaging, image processing, imaging, machine learning, medicine, microscopy, organelle

The Cell Painting Gallery is a collection of image datasets created using the Cell Painting assay. The images of cells are captured by microscopy imaging, and reveal the response of various labeled cell components to whatever treatments are tested, which can include genetic perturbations, chemicals or drugs, or different cell types. The datasets can be used for diverse applications in basic biology and pharmaceutical research, such as identifying disease-associated phenotypes, understanding disease mechanisms, and predicting a drug’s activity, toxicity, or mechanism of action (Chandrasekaran et al 2020). This collection is maintained by the Carpenter–Singh lab and the Cimini lab at the Broad...

Usage examples

Image-based Profiling Recipe by Multiple Authors
Multiplex Cytological Profiling Assay to Measure Diverse Cellular States by Gustafsdottir SM, Ljosa V, Sokolnicki KL, Wilson JA, Walpita D, Kemp MM, Seiler KP, Carrel HA, Golub TR, Schreiber SL, Clemons PA, Carpenter AE, and Shamji AF
Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling by Wawer MJ, Li K, Gustafsdottir SM, Ljosa V, Bodycombe NE, Marton MA, Sokolnicki KL, Bray M-A, Kemp MM, Winchester E, Taylor B, Grant GB, Hon CSK, Duvall JR, Wilson JA, Bittker JA, Dancik V, Narayan R, Subramanian A, Winckler W, Golub TR, Carpenter AE, Shamji AF, Schreiber SL, & Clemons PA
Systematic morphological profiling of human gene and allele function via Cell Painting by Rohban MH, Singh S, Wu X, Berthet JB, Bray M-A, Shrestha Y, Varelas X, Boehm JS, & Carpenter AE
Cell Painting predicts impact of lung cancer variants by Caicedo JC, Arevalo J, Piccioni F, Bray MA, Hartland CL, Wu X, Brooks AN, Berger AH, Boehm JS, Carpenter AE, & Singh S

See 16 usage examples →

Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery

biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience

This data set, made available by Janelia's FlyLight project, consists of fluorescence images of Drosophila melanogaster driver lines, aligned to standard templates, and stored in formats suitable for rapid searching in the cloud. Additional data will be added as it is published.

Usage examples

Using Imagery on AWS S3 by Rob Svirskas
The neuronal architecture of the mushroom body provides a logic for associative learning by Yoshinori Aso, Daisuke Hattori, Yang Yu, Rebecca M Johnston, Nirmala A Iyer, Teri-TB Ngo, Heather Dionne, LF Abbott, Richard Axel, Hiromu Tanimoto, Gerald M Rubin
Scaling Neuroscience Research on AWS by Konrad Rokicki
An unbiased template of the Drosophila brain and ventral nerve cord by John A Bogovic, Hideo Otsuna, Larissa Heinrich, Masayoshi Ito, Jennifer Jeter, Geoffrey Meissner, Aljoscha Nern, Jennifer Colonell, Oz Malkesman, Kei Ito, Stephan Saalfeld
NeuronBridge by Jody Clements, Rob Svirskas, Hideo Otsuna, Cristian Goina, Konrad Rokicki

See 13 usage examples →

Allen Cell Imaging Collections

biology, cell biology, cell imaging, Homo sapiens, image processing, life sciences, machine learning, microscopy

This bucket contains multiple datasets (as Quilt packages) created by the Allen Institute for Cell Science (AICS). The imaging data in this bucket contains either of the following:

field of view images from glass plates
cell membrane, DNA, and structure segmentations
cell membrane, DNA and structure contours
machine learning imaging predictions of the previously listed modalities.

In addition, many of the datasets include CSVs that contain feature sets related to that data.

Usage examples

Allen Cell Feature Explorer by Allen Institute for Cell Science
AICS Volume Viewer by Dan Toloudis
Pytorch 3D Integrated Cell by Gregory R. Johnson, Rory M. Donovan-Maiye, Mary M. Maleckar
Visual Guide to Human Cells by Allen Institute for Cell Science
Allen Cell Structure Segmenter by Jianxu Chen, Liya Ding, Matheus P. Viana, Melissa C. Hendershott, Ruian Yang, Irina A. Mueller, Susanne M. Rafelski

See 11 usage examples →

International Neuroimaging Data-Sharing Initiative (INDI)

Homo sapiens, imaging, life sciences, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets that are part of the International Neuroimaging Data-Sharing Initiative. Raw human and non-human primate neuroimaging data include 1) Structural MRI; 2) Functional MRI; 3) Diffusion Tensor Imaging; 4) Electroencephalogram (EEG). In addition to the raw data, preprocessed data is also included for some datasets. A complete list of the available datasets can be seen in the documentation link provided below.

Usage examples

Configurable Pipeline for the Analysis of Connectomes (C-PAC) by [INDI C-PAC Team](https://fcp-indi.github.io/)
Making data sharing work: The FCP/INDI experience by M. Mennes, B.B. Biswal, F.X. Castellanos, M.P. Milham
Accelerating the Evolution of Nonhuman Primate Neuroimaging by M.P. Milham, C. Petkov
Assessment of the impact of shared brain imaging data on the scientific literature by M.P. Milham, R.C. Craddock, ..., A. Klein
Enhancing studies of the connectome in autism using the autism brain imaging data exchange II. by A. Di Martino, D. O'Connor, M.P. Milham

See 11 usage examples →

NOAA Operational Forecast System (OFS)

climate, coastal, disaster response, environmental, meteorological, oceans, sustainability, water, weather

ANNOUNCEMENTS: [NOS OFS Version Updates and Implementation of Upgraded Oceanographic Forecast Modeling Systems for Lakes Superior and Ontario; Effective October 25, 2022](https://www.weather.gov/media/notification/pdf2/scn22-91_nos_loofs_lsofs_v3.pdf)

For decades, mariners in the United States have depended on NOAA's Tide Tables for the best estimate of expected water levels. These tables provide accurate predictions of the astronomical tide (i.e., the change in water level due to the gravitational effects of the moon and sun and the rotation of the Earth); however, they cannot predict water-level changes due to wind, atmospheric pressure, and river flow, which are often significant.
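
The gap between a tide-table prediction and the actual water level can be pictured with a toy calculation. The Python sketch below is an illustration only: it models the astronomical tide with a single cosine term at the period of the principal lunar semidiurnal constituent (M2, about 12.42 hours) and adds a non-tidal residual standing in for wind, pressure, and river-flow effects. The amplitude, phase, and residual values are hypothetical, and real tide predictions combine many constituents fitted per station.

```python
import math

M2_PERIOD_HOURS = 12.4206  # principal lunar semidiurnal constituent


def astronomical_tide(t_hours, amplitude_m=1.0, phase_rad=0.0, mean_level_m=0.0):
    """Single-constituent tide height in metres (illustrative values only)."""
    omega = 2 * math.pi / M2_PERIOD_HOURS  # angular speed in rad/hour
    return mean_level_m + amplitude_m * math.cos(omega * t_hours - phase_rad)


def total_water_level(t_hours, residual_m):
    """Tide-table prediction plus a non-tidal residual (wind, pressure, river flow)."""
    return astronomical_tide(t_hours) + residual_m


predicted = astronomical_tide(0.0)
observed = total_water_level(0.0, residual_m=0.4)  # hypothetical surge of 0.4 m
print(f"tide table: {predicted:.2f} m, with residual: {observed:.2f} m")
```

The residual term is exactly the part a tide table cannot supply, which is the motivation for model-based forecast guidance.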

The National Ocean Service (NOS) has the mission and mandate to provide guidance and information to support navigation and coastal needs. To support this mission, NOS has been developing and implementing hydrodynamic model-based Operational Forecast Systems.

This forecast guidance provides oceanographic information that helps mariners safely navigate their local waters. This national network of hydrodynamic models provides users with operational nowcast and forecast guidance (out to 48 – 120 hours) on parameters such as water levels, water temperature, salinity, and currents. These forecast systems are implemented in critical ports, harbors, estuaries, Great Lakes and coastal waters of the United States, and form a national backbone of real-time data, tidal predictions, data management and operational modeling.

Nowcasts and forecasts are scientific predictions about the present and future states of water levels (and possibly currents and other relevant oceanographic variables, such as salinity and temperature) in a coastal area. These predictions rely on either observed data or forecasts from a numerical model. A nowcast incorporates recent (and often near real-time) observed meteorological, oceanographic, and/or river flow rate data. A nowcast covers the period from the recent past (e.g., the past few days) to the present, and it can make predictions for locations where observational data are not available. A forecast incorporates meteorological, oceanographic, and/or river flow rate forecasts and makes predictions for times where observational data will not be available. A forecast is usually initiated by the results of a nowcast.

OFS generally runs four times per day (every 6 hours) on NOAA's Weather and Climate Operational Supercomputing Systems (WCOSS) in a standard Coastal Ocean Modeling Framework (COMF) developed by the Center for Operational Oceanographic Products and Services (CO-OPS). COMF is a set...

Details →

Usage examples

OFS Data Aggregation and Sub-Setting by NOAA
Tampa Bay OFS Flyer by NOAA
Delaware Bay and River OFS Flyer by NOAA
Technical Implementation Notice for Delaware River and Bay OFS by NOAA
Technical Implementation Notice for Chesapeake Bay OFS by NOAA

See 11 usage examples →

Digital Earth Africa Sentinel-2 Level-2A

agriculture, cog, deafrica, disaster response, earth observation, geospatial, natural resource, satellite imagery, stac, sustainability

The Sentinel-2 mission is part of the European Union Copernicus programme for Earth observations. Sentinel-2 consists of twin satellites, Sentinel-2A (launched 23 June 2015) and Sentinel-2B (launched 7 March 2017). The two satellites have the same orbit, but 180° apart for optimal coverage and data delivery. Their combined data is used in the Digital Earth Africa Sentinel-2 product. Together, they cover all Earth’s land surfaces, large islands, inland and coastal waters every 3-5 days. Sentinel-2 data is tiered by level of pre-processing. Level-0, Level-1A and Level-1B data contain raw data fr...

Details →

Usage examples

Digital Earth Africa Training by Digital Earth Africa Contributors
Introduction to DE Africa by Dr Fang Yuan
Digital Earth Africa Map by Digital Earth Africa Contributors
Use Sentinel-2 data in the Open Data Cube by Alex Leith
Digital Earth Africa Geoportal by Digital Earth Africa Contributors

See 10 usage examples →

Department of Energy's Open Energy Data Initiative (OEDI)

energy, environmental, geospatial, lidar, model, solar, sustainability

Data released under the Department of Energy's (DOE) Open Energy Data Initiative (OEDI). OEDI aims to improve and automate access to high-value energy data sets across the U.S. Department of Energy’s programs, offices, and national laboratories. OEDI aims to make data actionable and discoverable by researchers and industry to accelerate analysis and advance innovation.

Details →

Usage examples

Rooftop Solar Technical Potential for Low-to-Moderate Income Households in the United States by Benjamin Sigrin and Meghan Mooney
On the Use of Coupled Wind, Wave, and Current Fields in the Simulation of Loads on Bottom-Supported Offshore Wind Turbines during Hurricanes by E. Kim, L. Manuel, M. Curcic, S. S. Chen, C. Phillips, P. Veers
Rooftop Solar Photovoltaic Technical Potential in the United States: A Detailed Assessment by Pieter Gagnon, Robert Margolis, Jennifer Melius, Caleb Phillips, and Ryan Elmore
Tracking the Sun Tool by Lawrence Berkeley National Laboratory (LBNL)
NSRDB Viewer by National Renewable Energy Laboratory (NREL)

See 9 usage examples →

Open NeuroData

array tomography, biology, electron microscopy, image processing, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience

This bucket contains multiple neuroimaging datasets (as Neuroglancer Precomputed Volumes) across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include segmentations and meshes.

Details →

Usage examples

CloudVolume by William Silversmith
Download by Benjamin Falk
Visualization using Neuroglancer by Benjamin Falk
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience by R. Burns, W. G. Roncal, D. Kleissas, K. Lillaney, P. Manavalan, E. Perlman, D. R. Berger, D. D. Bock, K. Chung, L. Grosenick, N. Kasthuri, N. C. Weiler, K. Deisseroth, M. Kazhdan, J. Lichtman, R. C. Reid, S. J. Smith, A. S. Szalay, J. T. Vogelstein, and R. J. Vogelstein.

See 9 usage examples →

DOE's Water Power Technology Office's (WPTO) US Wave dataset

earth observation, energy, geospatial, meteorological, sustainability, water

Released to the public as part of the Department of Energy's Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that – when complete – will cover the entire U.S. Exclusive Economic Zone (EEZ).

Details →

Usage examples

Nearshore wave energy resource characterization along the East Coast of the United States by Ahn, S., V.S. Neary, Allahdadi, N. and R. He
HSDS Examples by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer, Michael Rossol
High-resolution hindcasts for U.S. wave energy resource characterization by Yang, Z. and V.S. Neary
Development and validation of a high-resolution regional wave hindcast model for U.S. West Coast wave resource characterization by Wu, Wei-Cheng; Wang, Taiping; Yang, Zhaoqing; Garcia Medina, Gabriel
High-Resolution Regional Wave Hindcast for the U.S. West Coast by Yang, Zhaoqing; Wu, Wei-Cheng; Wang, Taiping; Castrucci, Luca

See 8 usage examples →

NREL Wind Integration National Dataset

environmental, geospatial, meteorological, sustainability

Released to the public as part of the Department of Energy's Open Energy Data Initiative, the Wind Integration National Dataset (WIND) is an update and expansion of the Eastern Wind Integration Data Set and Western Wind Integration Data Set. It supports the next generation of wind integration studies.

Details →

Usage examples

Validation of Power Output for the WIND Toolkit by J. King, Andrew Clifton, Bri-Mathias Hodge
A Twenty-Year Analysis of Winds in California for Offshore Wind Energy Production Using WRF v4.1.2 by Alex Rybchuk, Mike Optis, Julie K. Lundquist, Michael Rossol, Walt Musial
The Wind Integration National Dataset (WIND) Toolkit by Caroline Draxl, Andrew Clifton, Bri-Mathias Hodge, Jim McCaa
Wind Visualization by Jordan Perr-Sauer
Power from wind: Open data on AWS by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer

See 8 usage examples →

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

bioinformatics, biology, environmental, epigenomics, genetic, genomic, life sciences

The TaRGET (Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription) Program is a research consortium funded by the National Institute of Environmental Health Sciences (NIEHS). The goal of the collaboration is to address the role of environmental exposures in disease pathogenesis as a function of epigenome perturbation, including understanding the environmental control of epigenetic mechanisms and assessing the utility of surrogate tissue analysis in mouse models of disease-relevant environmental exposures.

Details →

Usage examples

Metabolic effects of air pollution exposure and reversibility by Rajagopalan S, Park B, Palanivel R, et al.
Environmental Determinants of cardiovascular disease: lessons learned from air pollution by Al-Kindi SG, Brook RD, Biswal S, Rajagopalan S.
Visualize TaRGET II data with WashU Epigenome Browser by WashU Epigenome Browser
Epigenetic biomarkers and preterm birth by Park B, Khanam R, Vinayachandran V, et al.
Finding and Downloading TaRGET II Data files by TaRGET-DCC

See 8 usage examples →

USGS 3DEP LiDAR Point Clouds

agriculture, disaster response, elevation, geospatial, lidar, stac, sustainability

The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period. This dataset provides two realizations of the 3DEP point cloud data. The first resource is a public access organization provided in Entwine Point Tiles format, which is a lossless, full-density, streamable octree based on LASzip (LAZ) encoding. The second resource is a Requester Pays bucket of the original, Raw LAZ (Compressed LAS) 1.4 3DEP format, and more co...

Details →

Usage examples

USGS 3DEP Lidar Point Cloud Now Available as Amazon Public Dataset by Department of the Interior, U.S. Geological Survey
Extracting buildings and roads from AWS Open Data using Amazon SageMaker by Yunzhi Shi, Tianyu Zhang, and Xin Chen
Statewide USGS 3DEP Lidar Topographic Differencing Applied to Indiana, USA by Chelsea Phipps Scott, Matthew Beckley, Minh Phan, Emily Zawacki, Christopher Crosby, Viswanath Nandigam, and Ramon Arrowsmith
WebGL Visualization of USGS 3DEP Lidar Point Clouds with Potree and Plasio.js by Connor Manning
OpenTopography access to 3DEP lidar point cloud data by OpenTopography

See 8 usage examples →

World Bank - Light Every Night

cog, disaster response, earth observation, satellite imagery, stac

Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is ...

Details →

Usage examples

High Resolution Electricity Access Indicators (HREA) - Settlement-level measures of electricity access, reliability, and usage. by Brian Min, Zachary O'Keeffe
Mapping city lights with nighttime data from the DMSP Operational Linescan System. Photogrammetric Engineering and Remote Sensing, 63(6)727-734. by Elvidge, C.D., Baugh, K.E., Kihn, E.A., Kroehl, H.W. and Davis, E.R.
Detection of Rural Electrification in Africa using DMSP-OLS Night Lights Imagery. International Journal of Remote Sensing by Brian Min, Kwawu Mensan Gaba, Ousmane Fall Sarr, Alassane Agalassou.
Twenty Years of India Lights by Kwawu Mensan Gaba, Brian Min, Anand Thakker, Christopher Elvidge
Mainstreaming Disruptive Technologies in Energy. World Bank Report. 2019 by Kwawu Mensan Gaba, Brian Min, Olaf Veerman, Kimberly Baugh

See 8 usage examples →

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

cancer, genomic, life sciences, STRIDES, transcriptomics

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.

Details →

Usage examples

CPTAC Data Portal by National Cancer Institute
Proteomic analysis of colon and rectal carcinoma using standard and customized databases by Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC
Proteogenomic Analysis of Human Colon Cancer Reveals New Therapeutic Opportunities by Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A. Petyuk, Sara R. Savage, Bo Wen, Yongchao Dou, Yun Zhang, Zhiao Shi, Osama A. Arshad, Marina A. Gritsenko, Lisa J. Zimmerman, Jason E. McDermott, Therese R. Clauss, Ronald J. Moore, Rui Zhao, Matthew E. Monroe, Yi-Ting Wang, Matthew C. Chambers, Robbert J.C. Slebos, Ken S. Lau, Qianxing Mo, Li Ding, Matthew Ellis, Mathangi Thiagarajan, Christopher R. Kinsinger, Henry Rodriguez, Richard D. Smith, Karin D. Rodland, Daniel C. Liebler, Tao Liu, Bing Zhang, Clinical Proteomic Tumor Analysis Consortium
Cancer Genomics Cloud by Seven Bridges
Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer by Hui Zhang, Tao Liu, Zhen Zhang, Samuel H. Payne, Bai Zhang, Jason E. McDermott, Jian-Ying Zhou, Vladislav A. Petyuk, Li Chen, Debjit Ray, Shisheng Sun, Feng Yang, Lijun Chen, Jing Wang, Punit Shah, Seong Won Cha, Paul Aiyetan, Sunghee Woo, Yuan Tian, Marina A. Gritsenko, Therese R. Clauss, Caitlin Choi, Matthew E. Monroe, Stefani Thomas, Song Nie, Chaochao Wu, Ronald J. Moore, Kun-Hsing Yu, David L. Tabb, David Fenyö, Vineet Bafna, Yue Wang, Henry Rodriguez, Emily S. Boja, Tara Hiltke, Robert C. Rivers, Lori Sokoll, Heng Zhu, Ie-Ming Shih, Leslie Cope, Akhilesh Pandey, Bing Zhang, Michael P. Snyder, Douglas A. Levine, Richard D. Smith, Daniel W. Chan, Karin D. Rodland, the CPTAC Investigators

See 7 usage examples →

Global Database of Events, Language and Tone (GDELT)

disaster response, events

This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.

Details →

Usage examples

Analysing Brexit Coverage In The Media Over Time by Mark Chopping
Running R on Amazon Athena by Gopal Wunnava
How to partition your geospatial data lake for analysis with Amazon Redshift by Jeff DeMuth, Luke Wells, and Nemanja Boric
Creating PySpark DataFrame from CSV in AWS S3 in EMR by Jake Chen
Exploring GDELT with Athena by Julien Simon

See 7 usage examples →

NOAA Joint Polar Satellite System (JPSS)

agriculture, climate, meteorological, sustainability, weather

Satellites in the JPSS constellation gather global measurements of atmospheric, terrestrial and oceanic conditions, including sea and land surface temperatures, vegetation, clouds, rainfall, snow and ice cover, fire locations and smoke plumes, atmospheric temperature, water vapor and ozone. JPSS delivers key observations for the Nation's essential products and services, including forecasting severe weather like hurricanes, tornadoes and blizzards days in advance, and assessing environmental hazards such as droughts, forest fires, poor air quality and harmful coastal waters. Further, JPSS w...

Details →

Usage examples

JPSS Science Seminar Annual Digest 2020 by NOAA
VIIRS Active Fire Quick Guide by NOAA
JPSS Satellites (COMET) by UCAR
JPSS Short Course from the 2018 Annual Meeting of American Meteorological Society by Colorado State University
JPSS Training Resources by NOAA

See 7 usage examples →

ArcticDEM

cog, earth observation, elevation, geospatial, mapping, open source software, satellite imagery, stac

ArcticDEM - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2007 to the present. The ArcticDEM project seeks to fill the need for high-resolution time-series elevation data in the Arctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. ArcticDEM data is constructed from in-track and cross-track high-...
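The change detection mentioned above can be pictured as subtracting an earlier elevation grid from a later one. The following is only a minimal sketch under stated assumptions: the two grids are taken to be already co-registered and the same shape, the `NODATA` value of -9999 and the function name `elevation_change` are hypothetical choices for illustration, and real strip DEMs would be read with a raster library rather than typed in as lists.

```python
# Hypothetical nodata marker for cells without a valid elevation (an assumption).
NODATA = -9999.0


def elevation_change(earlier, later, nodata=NODATA):
    """Return later - earlier per cell, or None where either cell is nodata.

    Both inputs are lists of rows (lists of floats) with identical shapes.
    """
    same_rows = len(earlier) == len(later)
    if not same_rows or any(len(a) != len(b) for a, b in zip(earlier, later)):
        raise ValueError("grids must have the same shape")
    return [
        [None if nodata in (a, b) else b - a for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(earlier, later)
    ]


# Two tiny made-up grids standing in for strip DEMs from different years.
dem_year_1 = [[100.0, 101.5], [NODATA, 99.0]]
dem_year_2 = [[98.0, 101.5], [97.0, 100.25]]
change = elevation_change(dem_year_1, dem_year_2)
```

Negative values in `change` indicate surface lowering between the two acquisitions, and `None` marks cells where no comparison is possible.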

Details →

Usage examples

Dynamic ice loss from the Greenland Ice Sheet driven by sustained glacier retreat by Michalea D. King, Ian M. Howat, Salvatore G. Candela, Myoung J. Noh, Seongsu Jeong, Brice P. Y. Noël, Michiel R. van den Broeke, Bert Wouters, Adelaide Negrete
ArcticDEM Explorer by Polar Geospatial Center & ESRI
Future Evolution of Greenland's Marine-Terminating Outlet Glaciers by Ginny A. Catania, Leigh A. Stearns, Twila A. Moon, Ellen M. Enderlin, R. H. Jackson
Automated stereo-photogrammetric DEM generation at high latitudes: Surface Extraction with TIN-based Search-space Minimization (SETSM) validation and demonstration over glaciated regions by Myoung-Jong Noh, Ian M. Howat
The surface extraction from TIN based search-space minimization (SETSM) algorithm by Myoung-Jong Noh, Ian M. Howat

See 6 usage examples →

BossDB Open Neuroimagery Datasets

calcium imaging, electron microscopy, imaging, life sciences, light-sheet microscopy, magnetic resonance imaging, neuroimaging, neuroscience, volumetric imaging, x-ray, x-ray microtomography, x-ray tomography

This data ecosystem, Brain Observatory Storage Service & Database (BossDB), contains several neuro-imaging datasets across multiple modalities and scales, ranging from nanoscale (electron microscopy), to microscale (cleared lightsheet microscopy and array tomography), and mesoscale (structural and functional magnetic resonance imaging). Additionally, many of the datasets include dense segmentation and meshes.

Details →

Usage examples

CloudVolume by Seung Lab
intern: Integrated Toolkit for Extensible and Reproducible Neuroscience by Jordan K Matelsky, Luis Rodriguez, Daniel Xenes, Timothy Gion, Robert Hider Jr., Brock Wester, William Gray-Roncal
Data access and download by Jordan Matelsky
A Community-Developed Open-Source Computational Ecosystem for Big Neuro Data by J. T. Vogelstein, E. Perlman, B. Falk, A. Baden, W. Gray Roncal, V. Chandrashekhar, F. Collman, S. Seshamani, J. L. Patsolic, K. Lillaney, M. Kazhdan, R. Hider, D. Pryor, J. Matelsky, T. Gion, P. Manavalan, B. Wester, M. Chevillet, E. T. Trautman, K. Khairy, E. Bridgeford, D. M. Kleissas, D. J. Tward, A. K. Crow, B. Hsueh, M. A. Wright, M. I. Miller, S. J. Smith, R. J. Vogelstein, K. Deisseroth, and R. Burns
bossDB by bossDB Team

See 6 usage examples →

Low Altitude Disaster Imagery (LADI) Dataset

aerial imagery, coastal, computer vision, disaster response, earth observation, earthquakes, geospatial, image processing, imaging, infrastructure, land, machine learning, mapping, natural resource, seismology, transportation, urban, water

The Low Altitude Disaster Imagery (LADI) Dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2019. The initial release of LADI focuses on the Atlantic hurricane seasons and coastal states along the Atlantic Ocean and Gulf of Mexico. Annotations are included for the major hurricanes Harvey, Maria, and Florence. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets.

Details →

Usage examples

Remote Sensing for Disaster Response Course by Beaver Works Summer Institute
Large Scale Organization and Inference of an Imagery Dataset for Public Safety by Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert
Video Testing at the FirstNet Innovation and Test Lab Using a Public Safety Dataset by Chris Budny, Jeffrey Liu, Andrew Weinert
LADI Tutorials by Andrew Weinert, Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren, Ryan Earley, Nadia Dimitrova
NIST TRECVID 2020 - Disaster Scene Description and Indexing (DSDI) by TREC Video Retrieval Evaluation (TRECVID)

See 6 usage examples →

NOAA Rapid Refresh Forecast System (RRFS) [Prototype]

agriculture, climate, meteorological, sustainability, weather

The Rapid Refresh Forecast System (RRFS) is the National Oceanic and Atmospheric Administration’s (NOAA) next generation convection-allowing, rapidly-updated ensemble prediction system, currently scheduled for operational implementation in 2024. The operational configuration will feature a 3 km grid covering North America and include deterministic forecasts every hour out to 18 hours, with deterministic and ensemble forecasts to 60 hours four times per day at 00, 06, 12, and 18 UTC. The RRFS will provide guidance to support forecast interests including, but not limited to, aviation, severe convective weather, renewable energy, heavy precipitation, and winter weather on timescales where rapidly-updated guidance is particularly useful.

The RRFS is underpinned by the Unified Forecast System (UFS), a community-based Earth modeling initiative, and benefits from collaborative development efforts across NOAA, academia, and research institutions.

This bucket provides access to real time, experimental RRFS prototype output as of October 2022. This bucket also holds output from past experimental RRFS prototypes that were evaluated as a part of NOAA testbed projects. The immediate section describes the data for the real time system. The section that follows thereafter describes outputs from three past NOAA Testbed experiments.

Real time, experimental RRFS Prototype output

The real-time RRFS prototype is experimental and evolving. It is not under 24x7 monitoring and is not operational. Output may be delayed or missing. Outputs will change. When significant changes to output take place, this description will be updated.

We currently provide hourly deterministic forecasts at 3 km grid spacing over the CONUS out to 60 hours at 00 and 12 UTC, and out to 18 hours at other times. Future enhancements will include an ensemble forecast component and expansion to the planned North American domain. All forecasts are initialized from a hybrid 3DEnVar data assimilation system with hourly updates. Output is available on the S3 bucket for every third cycle, and is organized by cycle day and time of day. For example, rrfs_a/rrfs_a.20221012/00/ contains the forecast initialized at 00 UTC on 12 October 2022. Users will find two types of output in GRIB2 format. The first is:

rrfs.t00z.natlev.f018.conus_3km.grib2

Meaning that this is the RRFS_A initialized at 00 UTC, covers the CONUS domain, and is the native level post-processed gridded data at hour 18. This output is on a Lambert Conic Conformal domain at 3 km grid spacing.

The second output file in grib2 format is:

rrfs.t00z.prslev.f018.conus_3km.grib2

Meaning that this is the pressure level post-processed gridded data.
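The naming convention described above can be expressed as a small helper that builds the object key for a given cycle, forecast hour, and level type. This is a sketch, not an official client: the function names `max_forecast_hour` and `rrfs_key` are hypothetical, the three-digit zero padding of the forecast hour is inferred from the `f018` example, and it is assumed that the files sit directly under the cycle prefix. The bucket name is not stated in this description, so only the key is constructed.

```python
from datetime import datetime


def max_forecast_hour(cycle_hour: int) -> int:
    """Forecast length per the description: 60 h at 00 and 12 UTC, else 18 h."""
    return 60 if cycle_hour in (0, 12) else 18


def rrfs_key(cycle: datetime, forecast_hour: int, level: str = "natlev") -> str:
    """Build the key of a real-time RRFS prototype GRIB2 file (assumed layout).

    level is "natlev" (native levels) or "prslev" (pressure levels).
    """
    if level not in ("natlev", "prslev"):
        raise ValueError("level must be 'natlev' or 'prslev'")
    if not 0 <= forecast_hour <= max_forecast_hour(cycle.hour):
        raise ValueError("forecast hour outside the range stated for this cycle")
    day = cycle.strftime("%Y%m%d")  # cycle day, e.g. 20221012
    hh = cycle.strftime("%H")       # cycle hour, e.g. 00
    name = f"rrfs.t{hh}z.{level}.f{forecast_hour:03d}.conus_3km.grib2"
    return f"rrfs_a/rrfs_a.{day}/{hh}/{name}"


# Key for the native-level file at hour 18 of the 00 UTC cycle on 12 October 2022.
example_key = rrfs_key(datetime(2022, 10, 12, 0), 18)
```

Because the prototype is experimental and its outputs may change, a key built this way should be checked against an actual bucket listing before it is relied on.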

Past output from NOAA Testbed Experiments

This bucket also provides datasets from three of the 2021 NOAA Testbed Experiments. During each of these experiments, a prototype version of RRFS under development was run. The following is a high-level overview of the dates and RRFS configurations for each of the Testbed Experiments.

2021 Hazardous Weather Testbed (HWT) Spring Forecast Experiment (May 3 through June 4 2021) and 2021 Hydrometeorological Testbed Annual Flash Flood and Intense Rainfall Experiment (FFaIR) (June 21 through July 23 2021, excluding the week of July 4). A 9-member multi-physics ensemble with stochastic perturbations run once per day at 3 km grid spacing covering North America out to 60 hours. Initial conditions and lateral boundary conditions are taken from the GFS and GEFS.

2021-2022 Hydrometeorological Testbed Winter Weather Experiment (WWE) (mid-November through mid-March). Select cases only. Deterministic forecasts were run once per day at 00 UTC at 3 km grid spacing covering the CONUS out to 60 hours. A 36-member, 3 km ensemble Kalman filter data assimilation approach is implemented through hourly cycling starting at 18 UTC on the previous day.

For each cycle of the HWT and FFaIR experiments, the dataset is organized by cycle day, time of day, and member. For example, rrfs.20210504/00/mem01/ contains the forecast from ensemble member 1 initialized at 00 UTC on 04 May 2021. Users will find two types of output in GRIB2 format. The first is:

rrfs.t00z.mem01.naf024.grib2

Meaning that this is RRFS ensemble member 1 initialized at 00 UTC, covers the North American domain, and is the post-processed gridded data at hour 24. This output is on a rotated latitude-longitude domain at 3 km grid spacing. These are large files and users may wish to subset or re-project the grid after downloading. We recommend using the WGRIB2 application for such purposes.

The second output file in grib2 format is as follows:

rrfs.t00z.mem01.testbed.conusf020.grib2

These grids have been subset from the much larger North American domain to a CONUS domain on a Lambert Conic Conformal projection and also contain significantly fewer fields, resulting in smaller files.
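Going the other way, the two HWT and FFaIR file names shown above can be decoded with a regular expression. The sketch below is inferred only from those two examples, so the two-digit member number, the three-digit forecast hour, and the function name `parse_testbed_name` are all assumptions; names that do not follow this pattern simply return `None`.

```python
import re

# Pattern inferred from the two example names; not an official specification.
_TESTBED_NAME = re.compile(
    r"^rrfs\.t(?P<cycle>\d{2})z\.mem(?P<member>\d{2})\."
    r"(?:(?P<testbed>testbed)\.conus|na)f(?P<fhour>\d{3})\.grib2$"
)


def parse_testbed_name(name: str):
    """Split a testbed file name into its parts, or return None if it does not match."""
    match = _TESTBED_NAME.match(name)
    if match is None:
        return None
    return {
        "cycle_hour": int(match["cycle"]),
        "member": int(match["member"]),
        # "na" is the full North American grid, "conus" the smaller subset.
        "domain": "conus" if match["testbed"] else "na",
        "forecast_hour": int(match["fhour"]),
    }


full_grid = parse_testbed_name("rrfs.t00z.mem01.naf024.grib2")
subset = parse_testbed_name("rrfs.t00z.mem01.testbed.conusf020.grib2")
```

A parser like this makes it straightforward to group a directory listing by member or to pick only the smaller CONUS subset files.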

Graphics for select runs are also included in a plots/ directory under each experiment day for quick, yet simple visualization.

For each cycle of the WWE, the dataset is organized by cycle day and time of day. For example, rrfs.20220306/00/ contains data for the forecast initialized at 00 UTC on 06 March 2022. The initial conditions for the 36 ensemble members are located in the ens_ics/mem??? subdirectories. Users will find two types of output in GRIB2 format in the post subdirectories. The first is:

BGDAWP.GrbF12

Meaning that this is the forecast initialized at 00 UTC, covers the CONUS domain, and is the pressure level post-processed gridded data at forecast hour 18. This output is on a Lambert Conic Conformal grid at 3 km grid spacing.

The second output file in grib2 format is as follows:

testbed.conusf030.grib2

These grids contain significantly fewer fields, resulting in smaller files.

This work is supported by the Unified Forecast System Research to Operation (UFS R2O) Project, which is jointly funded by NOAA’s Office of Science and Technology Integration (OSTI) of the National Weather Service (NWS) and the Weather Program Office (WPO), Joint Technology Transfer Initiative (JTTI), of the Office of Oceanic and Atmospheric Research (OAR).

DISCLAIMER The o...

Details →

Usage examples

Prototype UFS-Based Rapid Refresh Forecast System (RRFS) on the Cloud by Holt, C., D. Abdi, J. A. Abeles, J. R. Carley, C. W. Harrop, R. Panda, S. Trahan, and C. R. Alexander
Assessment of the data assimilation framework for the Rapid Refresh Forecast System v0.1 and impacts on forecasts of a convective storm case study by Banos, I. H., W. D. Mayfield, G. Ge, L. F. Sapucci, J. R. Carley, and L. Nance
Community modeling framework underpinning the RRFS - The UFS Short Range Weather Application by UFS Community
A Limited Area Modeling Capability for the Finite-Volume Cubed-Sphere (FV3) Dynamical Core and Comparison With a Global Two-Way Nest by Black, T. L., J. A. Abeles, B. T. Blake, D. Jovic, E. Rogers, X. Zhang, E. A. Aligo, L. C. Dawson, Y. Lin, E. Strobach, P. C. Shafran, and J. R. Carley
Highlights from a Year of Continued Development of the Rapid Refresh Forecast System (RRFS) by Carley J. R. and C. R. Alexander

See 6 usage examples →

Open Bioinformatics Reference Data for Galaxy

bioinformatics, biology, genetic, genomic, life sciences, reference index

This dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easily incorporable with a local Galaxy server without additional data preparation. Additionally, Bioconductor's AnnotationHub and ExperimentHub data are provided for use via R packag...

Details →

Usage examples

Using Open Bio Ref Data with Galaxy and Bioconductor by Enis Afgan, Alexandru Mahmoud, Nuwan Goonasekera
Galaxy by Galaxy Project
Accessible, curated metagenomic data through ExperimentHub by Edoardo Pasolli, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, Faizan Malik, Marcel Ramos, Jennifer B Dowd, Curtis Huttenhower, Martin Morgan, Nicola Segata, and Levi Waldron
Wrangling Galaxy's reference data by Daniel Blankenberg, James E. Johnson, The Galaxy Team, James Taylor, Anton Nekrutenko
Bioconductor by Bioconductor Project

See 6 usage examples →

PoroTomo

geospatial, geothermal, image processing, seismology

Released to the public as part of the Department of Energy's Open Energy Data Initiative, these data represent vertical and horizontal distributed acoustic sensing (DAS) data collected as part of the Poroelastic Tomography (PoroTomo) project funded in part by the Office of Energy Efficiency and Renewable Energy (EERE), U.S. Department of Energy.

Details →

Usage examples

DAS and DTS at Brady Hot Springs: Observations about Coupling and Coupled Interpretations by Douglas E. Miller, Thomas Coleman, Xiangfang Zeng, Jeremy R. Patterson, Elena C. Reinnisch, Michael A. Cardiff, Herbert F. Wang, Dante Fratta, Whitney Trainor-Guitton, Clifford H. Thurber, Michelle Robertson, Kurt Feigl, and The PoroTomo Team
PoroTomo DAS Data Processing Tutorial for SEG-Y Files by Nicole Taverna and Ross Ring-Jarvi
PoroTomo Final Technical Report: Poroelastic Tomography by Adjoint Inverse Modeling of Data from Seismology, Geodesy, and Hydrology by Kurt L. Feigl, Lesley M. Parker, and the PoroTomo Team
PoroTomo DAS Data Processing Tutorial for hdf5 Files by Nicole Taverna and Michael Rossol
PoroTomo DAS Data Processing Tutorial for hdf5 Files via HSDS and h5pyd by Michael Rossol and Nicole Taverna

See 6 usage examples →

Reference Elevation Model of Antarctica (REMA)

cog, earth observation, elevation, geospatial, mapping, open source software, satellite imagery, stac

The Reference Elevation Model of Antarctica - 2m GSD Digital Elevation Models (DEMs) and mosaics from 2009 to the present. The REMA project seeks to fill the need for high-resolution time-series elevation data in the Antarctic. The time-dependent nature of the strip DEM files allows users to perform change detection analysis and to compare observations of topography data acquired in different seasons or years. The mosaic DEM tiles are assembled from multiple strip DEMs with the intention of providing a more consistent and comprehensive product over large areas. REMA data is constructed from in...

Details →

Usage examples

Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet by Morlighem, M., Rignot, E., Binder, T. et al.
The Reference Elevation Model of Antarctica by Ian M. Howat, Claire Porter, Benjamin E. Smith, Myoung-Jong Noh, Paul Morin
Automatic relative RPC image model bias compensation through hierarchical image matching for improving DEM quality by Myoung-Jong Noh, Ian M. Howat
The surface extraction from TIN based search-space minimization (SETSM) algorithm by Myoung-Jong Noh, Ian M. Howat
Automated stereo-photogrammetric DEM generation at high latitudes: Surface Extraction with TIN-based Search-space Minimization (SETSM) validation and demonstration over glaciated regions by Myoung-Jong Noh, Ian M. Howat


CAM6 Data Assimilation Research Testbed (DART) Reanalysis: Cloud-Optimized Dataset

atmosphere, climate, climate model, data assimilation, forecast, geoscience, geospatial, land, meteorological, weather, zarr

This is a cloud-hosted subset of the CAM6+DART (Community Atmosphere Model version 6 Data Assimilation Research Testbed) Reanalysis dataset. These data products are designed to facilitate a broad variety of research using the NCAR CESM 2.1 (National Center for Atmospheric Research's Community Earth System Model version 2.1), including model evaluation, ensemble hindcasting, data assimilation experiments, and sensitivity studies. They come from an 80-member ensemble reanalysis of the global troposphere and stratosphere using DART and CAM6. The data products represent states of the atmospher...


Usage examples

Intake-ESM Catalog by Brian Bonnlander, NCAR
Rendered (static) version of Jupyter Notebook by Brian Bonnlander, NCAR
Jupyter Notebook and other documentation and tools for DART Reanalysis on AWS by NCAR Science at Scale team
A new CAM6 + DART reanalysis with surface forcing from CAM6 to other CESM models by Raeder, K., Hoar, T.J., El Gharamti, M. et al (2021)
Analyzing large climate model ensembles in the cloud by Joe Hamman, NCAR


CoMMpass from the Multiple Myeloma Research Foundation

cancer, genetic, genomic, STRIDES, whole genome sequencing

The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observation study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, gene...


Usage examples

Genomic Data Commons by National Cancer Institute
"Interim Analysis of the Mmrf Commpass Trial: Identification of Novel Rearrangements Potentially Associated with Disease Initiation and Progression" by Sagar Lonial, MD, Venkata D Yellapantula, Winnie Liang, PhD, Ahmet Kurdoglu, BS, Jessica Aldrich, MSc, Christophe M. Legendre, MD, Kristi Stephenson, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Megan Russell, Austin Christofferson, Lori Cuyugan, Dan Rohrer, Alex Blanski, Meghan Hodges, Mmrf CoMMpass Network, Mary Derome, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, David Craig, PhD, John Carpten, PhD, Jonathan J. Keats, PhD
"Interim Analysis Of The MMRF CoMMpass Trial: a Longitudinal Study In Multiple Myeloma Relating Clinical Outcomes To Genomic and Immunophenotypic Profiles" by Keats JJ, Craig DW, Liang W, Venkata Y, Kurdoglu A, Aldrich J, Auclair D, Allen K, Harrison B, Jewell S, Kidd PG, Correll M, Jagannath S, Siegel DS, Vij R, Orloff G, Zimmerman TM, MMRF CoMMpass Network, Capone W, Carpten J, Lonial S.
"Identification of Initiating Trunk Mutations and Distinct Molecular Subtypes: An Interim Analysis of the Mmrf Commpass Study" by Jonathan J Keats, PhD, Gil Speyer, Legendre Christophe, Christofferson Austin, Kristi Stephenson, BS, Ahmet Kurdoglu, Megan Russell, Aldrich Jessica, Cuyugan Lori, Jonathan Adkins, Jackie McDonald, Adrienne Helland, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, David Siegel, MD PhD, Ravi Vij, MD MBA, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD PhD, Robert M. Rifkin, Norma C Gutierrez, The MMRF CoMMpass Network, Jen Toups, Mary Derome, MS, Winnie Liang, PhD, Seunchan Kim, Daniel Auclair, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Sagar Lonial, MD
"Molecular Predictors of Outcome and Drug Response in Multiple Myeloma: An Interim Analysis of the Mmrf CoMMpass Study" by Jonathan J Keats, PhD, Gil Speyer, Austin Christofferson, Christophe Legendre, PhD, Jessica Aldrich, Megan Russell, Lori Cuyugan, Jonathan Adkins, Alex Blanski, Meghan Hodges, Dan Rohrer, Sundar Jagannath, MD, Ravi Vij, MD, Gregory Orloff, MD, Todd Zimmerman, MD, Ruben Niesvizky, MD, Darla Liles, MD, Joseph W. Fay, Jeffrey L. Wolf, MD, Robert M Rifkin, Norma C Gutierrez, MD PhD, Mmrf CoMMpass Network, Jennifer Yesil, MS, Mary Derome, MS, Seungchan Kim, PhD, Winnie Liang, PhD, Pamela G. Kidd, MD, Scott Jewell, PhD, John David Carpten, PhD, Daniel Auclair, PhD, Sagar Lonial, MD FACP


Community Earth System Model Large Ensemble (CESM LENS)

atmosphere, climate, climate model, geospatial, ice, land, model, oceans, sustainability, zarr

The Community Earth System Model (CESM) Large Ensemble Numerical Simulation (LENS) dataset includes a 40-member ensemble of climate simulations for the period 1920-2100 using historical data (1920-2005) or assuming the RCP8.5 greenhouse gas concentration scenario (2006-2100), as well as longer control runs based on pre-industrial conditions. The data comprise both surface (2D) and volumetric (3D) variables in the atmosphere, ocean, land, and ice domains. The total data volume of the original dataset is ~500TB, which has traditionally been stored as ~150,000 individual CF/NetCDF files on disk o...


Usage examples

Urban Climate Explorer by Zhonghua Zheng
Rendered (static) version of Jupyter Notebook by Anderson Banihirwe, NCAR
The Community Earth System Model (CESM) Large Ensemble Project: A Community Resource for Studying Climate Change in the Presence of Internal Climate Variability by Kay et al. (2015), Bull. AMS, 96, 1333-1349
Jupyter Notebook and other documentation and tools for CESM LENS on AWS by NCAR Science at Scale team
Analyzing large climate model ensembles in the cloud by Joe Hamman, NCAR


First Street Foundation (FSF) Flood Risk Summary Statistics

agriculture, climate, model, statistics, sustainability, water, weather

CSV files of flood statistics for the 48 contiguous states at the congressional district, county, and zip code level. The CSV for each of these geographical extents includes statistics on the number of properties at risk according to FEMA, the number of properties at risk according to First Street Foundation, and the difference between the two.
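As a rough illustration of the three statistics described above, the difference column can be recomputed from the two counts. This is only a sketch: the column names (`county`, `fema_at_risk`, `fsf_at_risk`), the sample rows, and the sign convention (First Street count minus FEMA count) are assumptions for illustration and may not match the published files.

```python
import csv
import io

# Hypothetical sample; the real files' column names and values may differ.
SAMPLE_CSV = """county,fema_at_risk,fsf_at_risk
Example County A,1200,1850
Example County B,300,280
"""


def risk_differences(csv_text: str) -> dict:
    """Map each area to (First Street count - FEMA count)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["county"]: int(row["fsf_at_risk"]) - int(row["fema_at_risk"])
        for row in reader
    }


print(risk_differences(SAMPLE_CSV))
```

A positive value means the First Street model flags more properties than FEMA for that area; a negative value means fewer.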


Usage examples

Do You Know Your Home’s Flood Risk? by Edward Kearns, Jeremy Porter, Michael Amodeo
Estimating Recent Local Impacts of Sea-Level Rise on Current Real-Estate Losses: A Housing Market Case Study in Miami-Dade, Florida by Steven A. McAlpine, Jeremy R. Porter
Communicating a national flood risk assessment using AWS by Ed Kearns, Mike Amodeo
First Street Foundation Flood Lab by First Street Foundation
Validation of a 30 m resolution flood hazard model of the conterminous United States by Oliver E. J. Wing, Paul D. Bates, Christopher C. Sampson, Andrew M. Smith, Kris A. Johnson, Tyler A. Erickson


Global Seasonal Sentinel-1 Interferometric Coherence and Backscatter Data Set

agriculture, cog, earth observation, earthquakes, ecosystems, environmental, geology, geophysics, geospatial, global, infrastructure, mapping, natural resource, satellite imagery, synthetic aperture radar, urban

This data set is the first-of-its-kind spatial representation of multi-seasonal, global SAR repeat-pass interferometric coherence and backscatter signatures. Global coverage comprises all land masses and ice sheets from 82 degrees northern to 79 degrees southern latitude. The data set is derived from high-resolution multi-temporal repeat-pass interferometric processing of about 205,000 Sentinel-1 Single-Look-Complex data acquired in Interferometric Wide-Swath mode (Sentinel-1 IW mode) from 1-Dec-2019 to 30-Nov-2020. The data set was developed by Earth Big Data LLC and Gamma Remote Sensing AG, under contract for NASA's Jet Propulsion Laboratory. ...


Usage examples

Jupyter Notebook to access and visualize sub regions of the global data set by Josef Kellndorfer
Webinar: The new era of SAR Time Series Analysis and Visualization: Cloud meets Big SAR Data. IEEE GRSS Bay Area Chapter (Dec. 3rd 2021) by Josef Kellndorfer
Generating Global Temporal Coherence Maps from one year of Sentinel-1 C-band data, ESA Fringe 2021 Poster (Youtube) by Oliver Cartus, Josef Kellndorfer, Shadi Oveisgharan, Batu Osmanoglu, Paul Rosen, Urs Wegmüller
Global seasonal Sentinel-1 interferometric coherence and backscatter data set by Josef Kellndorfer, Oliver Cartus, Marco Lavalle, Christophe Magnard, Pietro Milillo, Shadi Oveisgharan, Batu Osmanoglu, Paul A. Rosen, Urs Wegmüller
Jupyter Notebook to access and visualize global mosaics of the global data set by Josef Kellndorfer


NOAA National Water Model CONUS Retrospective Dataset

agriculture, climate, disaster response, environmental, sustainability, transportation, weather

The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model.

One application of this dataset is to provide historical context to current near real-time streamflow, soil moisture and snowpack conditions. The retrospective data can be used to infer flow frequencies and perform temporal analyses with hourly streamflow output and 3-hourly land surface output. This dataset can also be used in the development of end user applications which require a long baseline of data for system training or verification purposes.

Currently there are three versions of the NWM retrospective dataset:

A 42-year (February 1979 through December 2020) retrospective simulation using version 2.1 of the National Water Model.
A 26-year (January 1993 through December 2018) retrospective simulation using version 2.0 of the National Water Model.
A 25-year (January 1993 through December 2017) retrospective simulation using version 1.2 of the National Water Model.
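The three simulation periods listed above can be captured in a small lookup, for example to check which model versions cover a month of interest. This is a minimal sketch: the dates come from the description, while the `RETROSPECTIVE_RUNS` table and both helper functions are hypothetical names introduced here for illustration.

```python
from datetime import date

# (model version, first month, last month) as listed in the description
RETROSPECTIVE_RUNS = [
    ("2.1", date(1979, 2, 1), date(2020, 12, 1)),
    ("2.0", date(1993, 1, 1), date(2018, 12, 1)),
    ("1.2", date(1993, 1, 1), date(2017, 12, 1)),
]


def months_covered(start: date, end: date) -> int:
    """Count calendar months from start to end, inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


def versions_covering(year: int, month: int) -> list:
    """Return the model versions whose simulation includes the given month."""
    target = date(year, month, 1)
    return [v for v, start, end in RETROSPECTIVE_RUNS if start <= target <= end]


for version, start, end in RETROSPECTIVE_RUNS:
    print(version, months_covered(start, end), "months")
print(versions_covering(2019, 6))
```

Counting months shows that the "42-year" run is one month short of 42 full years, because it starts in February 1979 rather than January.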

Version 2.1 uses forcings from the Office of Water Prediction Analysis of Record for Calibration (AORC) dataset, while versions 2.0 and 1.2 use meteorological forcing from the North American Land Data Assimilation System (NLDAS) dataset. Note that no streamflow or other data assimilation is performed within any of the NWM retrospective simulations.

NWM Retrospective data is available in two formats, NetCDF and Zarr. The NetCDF files contain the full set of NWM output data, while the Zarr files contain a subset of NWM output fields that vary with model version.

NWM V2.1: All model output and forcing input fields are available in the NetCDF format. All model output fields along with the precipitation forcing field are available in the Zarr format.
NWM V2.0: All model output fields are available in NetCDF format. Model channel output including streamflow and related fields are available in Zarr format.
NWM V1.2: All model output fields are available in NetCDF format.

A table listing the data available within each NetCDF and Zarr file is located in the 'documentation page'. This data includes meteorologic...


Usage examples

Explore the National Water Model V2.1 Retrospective Dataset in Zarr by James McCreight, Ishita Srivastava, Rich Signell
Simulating storm surge and compound flooding events with a creek-to-ocean model: Importance of baroclinic effects by Fei Ye, et al.
Explore Repository of Tutorials on National Water Model V2.1 Retrospective Dataset in Zarr by James McCreight
On Strictly Enforced Mass Conservation Constraints for Modeling the Rainfall-Runoff Process by Jonathan M. Frame, Frederik Kratzert, Hoshin V. Gupta, Paul Ullrich and Grey S. Nearing
Explore the National Water Model V2.0 Retrospective in Zarr by Rich Signell


The Human Connectome Project

biology, imaging, life sciences, neurobiology, neuroimaging, neuroscience

The Human Connectome Project (HCP Young Adult, HCP-YA) is mapping the healthy human connectome by collecting and freely distributing neuroimaging and behavioral data on 1,200 normal young adults, aged 22-35.


Usage examples

Exploring the Human Connectome by The Human Connectome Project
The WU-Minn Human Connectome Project: an overview. by Van Essen DC, Smith SM, Barch DM, Behrens TEJ, Yacoub E, Ugurbil, K, and the WU-Minn HCP Consortium.
The minimal preprocessing pipelines for the Human Connectome Project by Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, Xu J, Jbabdi S, et al.
The Human Connectome Project: A retrospective by Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JL, Burgess GC, Curtiss SW, et al.
The Human Connectome Workbench by The Human Connectome Project


Basic Local Alignment Search Tool (BLAST) Databases

bioinformatics, biology, genetic, genomic, health, life sciences, protein, reference index, transcriptomics

A centralized repository of pre-formatted BLAST databases created by the National Center for Biotechnology Information (NCBI).


Usage examples

BLAST+ Docker by NCBI BLAST
BLAST+: Architecture and Applications by Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
BLAST on the Cloud with NCBI’s ElasticBLAST by Sixing Huang
Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs by S F Altschul, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, D J Lipman


Boreas Autonomous Driving Dataset

autonomous vehicles, computer vision, lidar, robotics

This autonomous driving dataset includes data from a 128-beam Velodyne Alpha-Prime lidar, a 5MP Blackfly camera, a 360-degree Navtech radar, and post-processed Applanix POS LV GNSS data. This dataset was collected in various weather conditions (sun, rain, snow) over the course of a year. The intended purpose of this dataset is to enable benchmarking of long-term all-weather odometry and metric localization across various sensor types. In the future, we hope to also support an object detection benchmark.


Usage examples

Radar odometry combining probabilistic estimation and unsupervised feature learning by K. Burnett, D. J. Yoon, A. P. Schoellig, T. D. Barfoot
Do we need to compensate for motion distortion and doppler effects in spinning radar navigation? by K. Burnett, A. P. Schoellig, T. D. Barfoot
Introduction to Visualizing Sensor Types (Jupyter notebook) by Keenan Burnett
Project Lidar onto Camera Frames (Jupyter notebook) by Keenan Burnett


JMA Himawari-8

agriculture, disaster response, earth observation, geospatial, meteorological, satellite imagery, sustainability, weather

Himawari-8, stationed at 140E, owned and operated by the Japan Meteorological Agency (JMA), is a geostationary meteorological satellite, with Himawari-9 as on-orbit back-up, that provides constant and uniform coverage of east Asia, and the west and central Pacific regions from around 35,800 km above the equator with an orbit corresponding to the period of the earth’s rotation. This allows JMA weather offices to perform uninterrupted observation of environmental phenomena such as typhoons, volcanoes, and general weather systems. Archive data back to July 2015 is available for Full Disk (AHI-L1...


Usage examples

Introduction of Himawari-8/9 (pdf file) by JMA
Himawari-8 Advanced Himawari Imager Data on AWS (pdf file) by NOAA NESDIS
Himawari-8 on AWS (pdf file) by ASDI
Himawari-8: Enabling access to key weather data by Manan Dalal, Jena Kent


Maxar Open Data Program

cog, disaster response, earth observation, geospatial, satellite imagery, stac, sustainability

Pre- and post-event high-resolution satellite imagery in support of emergency planning, risk assessment, monitoring of staging areas and emergency response, damage assessment, and recovery. These images are generated using the Maxar ARD pipeline, tiled on an organized grid in analysis-ready cloud-optimized formats.


Usage examples

Using Data from Earth Observation to Support Sustainable Development Indicators: An Analysis of the Literature and Challenges for the Future by Ana Andries, Stephen Morse, Richard J. Murphy, Jim Lynch, and Emma R. Woolliams
Disaster, Infrastructure and Participatory Knowledge: The Planetary Response Network by Brooke Simmons, Chris Lintott, Steven Reece, et al.
Data Access (SDK tutorial) by Maxar Open Data
ARD and Command Line Tools by Maxar Open Data
Seeing a Better World from Space by Carly Sakumura


Mouse Brain Anatomy: MouseLight Imagery

biology, fluorescence imaging, image processing, imaging, life sciences, microscopy, neurobiology, neuroimaging, neuroscience

This data set, made available by Janelia's MouseLight project, consists of images and neuron annotations of the Mus musculus brain, stored in formats suitable for viewing and annotation using the HortaCloud cloud-based annotation system.


Usage examples

MouseLight Project Website by Tiago A. Ferreira, Jayaram Chandrashekar
MouseLight NeuronBrowser by Tiago A. Ferreira, Jayaram Chandrashekar
Reconstruction of 1,000 Projection Neurons Reveals New Cell Types and Organization of Long-Range Connectivity in the Mouse Brain by Johan Winnubst, Erhan Bas, Tiago A. Ferreira, Zhuhao Wu, Michael N. Economo, Patrick Edson, Ben J. Arthur, Christopher Bruns, Konrad Rokicki, David Schauder, Donald J. Olbris, Sean D. Murphy, David G. Ackerman, Cameron Arshadi, Perry Baldwin, Regina Blake, Ahmad Elsayed, Mashtura Hasan, Daniel Ramirez, Bruno Dos Santos, Monet Weldon, Amina Zafar, Joshua T. Dudman, Charles R. Gerfen, Adam W. Hantman, Wyatt Korff, Scott M. Sternson, Nelson Spruston, Karel Svoboda, Jayaram Chandrashekar
HortaCloud by David Schauder, Donald J. Olbris, Jody Clements, Cristian Goina, Robert R. Svirskas, Konrad Rokicki


NAIP on AWS

aerial imagery, agriculture, cog, earth observation, geospatial, natural resource, regulatory, sustainability

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This "leaf-on" imagery typically ranges from 60 centimeters to 100 centimeters in resolution. It is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed raw GeoTIFF format, and from the naip-visualization Amazon S3 bucket as 3-band (RGB) imagery in Cloud Optimized GeoTIFF format. NAIP data is delivered at the state level; every year, a number of states receive updates, with ...


Usage examples

EOS Land Viewer by Earth Observing System
VoyagerSearch showing off Batch + NAIP by Voyager
Individual Tree Detection in Large-Scale Urban Environments using High-Resolution Multispectral Imagery by Jonathan Ventura, Milo Honsberger, Cameron Gonsalves, Julian Rice, Camille Pawlak, Natalie L.R. Love, Skyler Han, Viet Nguyen, Keilana Sugano, Jacqueline Doremus, G. Andrew Fricker, Jenn Yost, Matt Ritter
Urban Tree Detection by Jonathan Ventura


NREL National Solar Radiation Database

earth observation, energy, geospatial, meteorological, solar, sustainability

Released to the public as part of the Department of Energy's Open Energy Data Initiative, the National Solar Radiation Database (NSRDB) is a serially complete collection of hourly and half-hourly values of the three most common measurements of solar radiation (global horizontal, direct normal, and diffuse horizontal irradiance) and meteorological data. These data have been collected at a sufficient number of locations and temporal and spatial scales to accurately represent regional solar radiation climates.


Usage examples

NSRDB Viewer by Manajit Sengupta, Yu Xie, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby, Paul Edwards
The National Solar Radiation Data Base (NSRDB) by Manajit Sengupta, Yu Xie, Anthony Lopez, Aron Habte, Galen Maclaurin, James Shelby
Physics-guided machine learning for improved accuracy of the National Solar Radiation Database by Grant Buster, Mike Bannister, Aron Habte, Dylan Hettinger, Galen Maclaurin, Michael Rossol, Manajit Sengupta, Yu Xie
HSDS Examples by Caleb Phillips, Caroline Draxl, John Readey, Jordan Perr-Sauer, Michael Rossol


OpenCell on AWS

biology, cell biology, cell imaging, computer vision, fluorescence imaging, imaging, life sciences, machine learning, microscopy

The OpenCell project is a proteome-scale effort to measure the localization and interactions of human proteins using high-throughput genome engineering to endogenously tag thousands of proteins in the human proteome. This dataset consists of the raw confocal fluorescence microscopy images for all tagged cell lines in the OpenCell library. These images can be interpreted both individually, to determine the localization of particular proteins of interest, and in aggregate, by training machine learning models to classify or quantify subcellular localization patterns.


Usage examples

Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer
cytoself (an unsupervised ML model to quantify localization patterns) by Hirofumi Kobayashi, Keith C. Cheveralls, Manuel D. Leonetti, Loic A. Royer
OpenCell web portal by OpenCell team
OpenCell: proteome-scale endogenous tagging enables the cartography of human cellular organization by Nathan H. Cho, Keith C. Cheveralls, Andreas-David Brunner, Kibeom Kim, André C. Michaelis, Preethi Raghavan, et al.


Sea Surface Temperature Daily Analysis: European Space Agency Climate Change Initiative product version 2.1

climate, earth observation, environmental, geospatial, global, oceans

Global daily-mean sea surface temperatures, presented on a 0.05° latitude-longitude grid, with gaps between available daily observations filled by statistical means, spanning late 1981 to recent time. Suitable for large-scale oceanographic, meteorological, and climatological applications, such as evaluating or constraining environmental models or case studies of marine heat wave events. Includes temperature uncertainty information and auxiliary information about land-sea fraction and sea-ice coverage. For reference and citation see: www.nature.com/articles/s41597-019-0236-x.
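To give a sense of what a 0.05° global grid implies, the sketch below works out the grid dimensions and maps a latitude/longitude pair to a cell. Only the 0.05° spacing comes from the description; the orientation (row 0 at 90°S, column 0 at 180°W) and the `cell_index` helper are assumptions for illustration, so the product's own coordinate variables should be consulted for the actual layout.

```python
import math

CELLS_PER_DEGREE = 20  # 1 / 0.05 degrees, kept as an integer to avoid rounding drift

N_LAT = 180 * CELLS_PER_DEGREE  # 3600 rows
N_LON = 360 * CELLS_PER_DEGREE  # 7200 columns


def cell_index(lat: float, lon: float) -> tuple:
    """Return (row, col) of the grid cell containing lat/lon.

    Assumes row 0 starts at 90 degrees south and column 0 at 180 degrees west.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude/longitude out of range")
    # Clamp so the upper edges (90N, 180E) fall in the last cell.
    row = min(math.floor((lat + 90.0) * CELLS_PER_DEGREE), N_LAT - 1)
    col = min(math.floor((lon + 180.0) * CELLS_PER_DEGREE), N_LON - 1)
    return row, col


print(N_LAT, N_LON, N_LAT * N_LON)
print(cell_index(0.0, 0.0))
```

Each daily field on such a grid holds roughly 26 million cells, which is one reason subsetting by region or time is usually preferable to loading whole fields.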


Usage examples

Satellite-based time-series of sea-surface temperature since 1981 for climate applications (2019). by Merchant, C.J., Embury, O., Bulgin, C.E., Block, T., Corlett, G.K., Fiedler, E., Good, S.A., Mittaz, J., Rayner, N.A., Berry, D., Eastwood, S., Taylor, M., Tsushima, Y., Waterfall, A., Wilson, R. and Donlon, C.
Working with surftemp-sst data - Tutorial 2 - Analysing Marine Heatwaves by Niall McCarroll
Adjusting for desert-dust-related biases in a climate data record of sea surface temperature (2020). by Merchant, C.J. and Embury, O.
Working with surftemp-sst data - Tutorial 1 - Getting started by Niall McCarroll


Virginia Coastal Resilience Master Plan, Phase 1 - December 2021

coastal, floods

The Virginia Coastal Resilience Master Plan builds on the 2020 Virginia Coastal Resilience Master Planning Framework, which outlined the goals and principles of the Commonwealth’s statewide coastal resilience strategy. Recognizing the urgent challenge flooding already poses, the Commonwealth developed Phase One of the Master Plan on an accelerated timeline and focused this first assessment on the impacts of tidal and storm surge coastal flooding on coastal Virginia. The Master Plan leveraged the combined efforts of more than two thousand stakeholders, subject matter experts, and government personnel. We centered the development of this plan around three core components:

A Technical Study compiled essential data, research, processes, products, and resilience efforts in the Coastal Resilience Database, which forms much of the basis of this plan and the Coastal Resilience Web Explorer;

A Technical Advisory Committee supported coordination across key stakeholders and ensured the incorporation of the best available subject matter knowledge, data, and methods into this plan; and

Stakeholder Engagement captured diverse resilience perspectives from residents, local and regional officials, and other stakeholders across Virginia’s coastal communities to drive regionally specific resilience priorities.

Data products used and generated for the Virginia Coastal Resilience Master Plan.

This dataset represents the data that was developed for the technical study. Appendix F - Data Product List provides a list of available data. Other Appendix documents provide the inpu...


Usage examples

ArcGIS REST Services Directory by Virginia Department of Conservation and Recreation
Appendix F Data Product List by Virginia Department of Conservation and Recreation
Virginia Coastal Resilience Web Explorer by Virginia Department of Conservation and Recreation
Virginia Coastal Resilience Master Plan, Phase One December 2021 by Virginia Department of Conservation and Recreation


Yale-CMU-Berkeley (YCB) Object and Model Set

robotics

This project primarily aims to facilitate performance benchmarking in robotics research. The dataset provides mesh models, RGB, RGB-D and point cloud images of over 80 objects. The physical objects are also available via the YCB benchmarking project. The data are collected by two state-of-the-art systems: UC Berkeley's scanning rig and the Google scanner. The UC Berkeley scanning rig data provide meshes generated with Poisson reconstruction, meshes generated with volumetric range image integration, textured versions of both meshes, Kinbody files for using the meshes with OpenRAVE, 600 ...


Usage examples

The Closure Signature: A Functional Approach to Model Underactuated Compliant Robotic Hands by Maria Pozzi, Gionata Salvietti, João Bimbo, Monica Malvezzi, Domenico Prattichizzo
Pre-touch sensing for sequential manipulation by Boling Yang, Patrick Lancaster, Joshua R. Smith
Label Fusion: A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes by Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake
Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set by Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, Aaron M Dollar


iSDAsoil

agriculture, analytics, biodiversity, conservation, deep learning, food security, geospatial, machine learning, satellite imagery

iSDAsoil is a resource containing soil property predictions for the entire African continent, generated using machine learning. Maps for over 20 different soil properties have been created at 2 different depths (0-20 cm and 20-50 cm). Soil property predictions were made using machine learning coupled with remote sensing data and a training set of over 100,000 analyzed soil samples. Included in this dataset are images of predicted soil properties, model error and satellite covariates used in the mapping process.


Usage examples

iSDAsoil Python tutorial by Matt Miller
iSDAsoil homepage - view soil property maps online by iSDA
iSDAsoil liming demo app on Observable by Jamie Collinson
African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning by Tomislav Hengl, Matthew A. E. Miller, Josip Križan, Keith D. Shepherd, Andrew Sila, Milan Kilibarda, Ognjen Antonijević, Luka Glušica, Achim Dobermann, Stephan M. Haefele, Steve P. McGrath, Gifty E. Acquah, Jamie Collinson, Leandro Parente, Mohammadreza Sheykhmousa, Kazuki Saito, Jean-Martial Johnson, Jordan Chamberlin, Francis B. T. Silatsa, Martin Yemefack, John Wendt, Robert A. MacMillan, Ichsani Wheeler & Jonathan Crouch


Beat Acute Myeloid Leukemia (AML) 1.0

cancer, genetic, genomic, Homo sapiens, life sciences, STRIDES

Beat AML 1.0 is a collaborative research program involving 11 academic medical centers that worked collectively to better understand drugs and drug combinations that should be prioritized for further development within clinical and/or molecular subsets of acute myeloid leukemia (AML) patients. Beat AML 1.0 provides the largest-to-date dataset on primary acute myeloid leukemia samples, offering genomic, clinical, and drug response data. This dataset contains open Clinical Supplement and RNA-Seq Gene Expression Quantification data. This dataset also contains controlled Whole Exome Sequencing (WXS) and R...


Usage examples

Functional Genomic Landscape of Acute Myeloid Leukemia by Jeffrey W. Tyner, Cristina E. Tognon, Dan Bottomly et al.
Genomic Data Commons by National Cancer Institute
Clinical resistance to crenolanib in acute myeloid leukemia due to diverse molecular mechanisms by Zhang H, Savage S, Schultz AR, Bottomly D, White L, Segerdell E, et al.


Cell Organelle Segmentation in Electron Microscopy (COSEM) on AWS

cell biology, computer vision, electron microscopy, imaging, life sciences, organelle

High resolution images of subcellular structures.


Usage examples

Enhanced FIB-SEM systems for large-volume 3D imaging by C. Shan Xu, Kenneth J. Hayworth, Zhiyuan Lu, Patricia Grob, Ahmed M. Hassan, José G. García-Cerdán, Krishna K. Niyogi, Eva Nogales, Richard J. Weinberg, Harald F. Hess.
Whole-cell organelle segmentation in volume electron microscopy by Lisa Heinrich, Davis Bennett, David Ackerman, Woohyun Park, Jon Bogovic, Nils Eckstein, et al.
Correlative three-dimensional super-resolution and block-face electron microscopy of whole vitreously frozen cells. by David P. Hoffman, Gleb Shtengel, C. Shan Xu, Kirby R. Campbell, Melanie Freeman, Lei Wang, Daniel E. Milkie, H. Amalia Pasolli, Nirmala Iyer, John A. Bogovic, Daniel R. Stabley, Abbas Shirinifard, Song Pang, David Peale, Kathy Schaefer, Wim Pomp, Chi-Lun Chang, Jennifer Lippincott-Schwartz, Tom Kirchhausen, David J. Solecki, Eric Betzig, Harald F. Hess


Clinical Trial Sequencing Project - Diffuse Large B-Cell Lymphoma

cancer, genomic, life sciences, STRIDES, transcriptomics, whole genome sequencing

The goal of the project is to identify recurrent genetic alterations (mutations, deletions, amplifications, rearrangements) and/or gene expression signatures. The National Cancer Institute (NCI) utilized whole genome sequencing and/or whole exome sequencing in conjunction with transcriptome sequencing. The samples were processed and submitted for genomic characterization using pipelines and procedures established within The Cancer Genome Atlas (TCGA) project.


Usage examples

Genetics and Pathogenesis of Diffuse Large B Cell Lymphoma by Roland Schmitz, Ph.D., George W. Wright, Ph.D., Da Wei Huang, M.D., Calvin A. Johnson, Ph.D., James D. Phelan, Ph.D., James Q. Wang, Ph.D., Sandrine Roulland, Ph.D., Monica Kasbekar, Ph.D., Ryan M. Young, Ph.D., Arthur L. Shaffer, Ph.D., Daniel J. Hodson, M.D., Ph.D., Wenming Xiao, Ph.D., et al.
A multiprotein supercomplex controlling oncogenic signalling in lymphoma by Phelan JD, Young RM, Webster DE, Roulland S, Wright GW, Kasbekar M, Shaffer AL 3rd, Ceribelli M, Wang JQ, Schmitz R, Nakagawa M, Bachy E, Huang DW, Ji Y, Chen L, Yang Y, Zhao H, Yu X, Xu W, Palisoc MM, Valadez RR, Davies-Hill T, Wilson WH, Chan WC, Jaffe ES, Gascoyne RD, Campo E, Rosenwald A, Ott G, Delabie J, Rimsza LM, Rodriguez FJ, Estephan F, Holdhoff M, Kruhlak MJ, Hewitt SM, Thomas CJ, Pittaluga S, Oellerich T, Staudt LM
Genomic Data Commons by National Cancer Institute

See 3 usage examples →

Finnish Meteorological Institute Weather Radar Data

agriculture, earth observation, meteorological, sustainability, weather

Up-to-date weather radar data from the FMI radar network is available as Open Data. The data contain both single radar data and composites over Finland in GeoTIFF and HDF5 formats. Available composite parameters consist of radar reflectivity (DBZ), rainfall intensity (RR), and precipitation accumulation of 1, 12, and 24 hours. Single radar parameters consist of radar reflectivity (DBZ), radial velocity (VRAD), rain classification (HCLASS), and cloud top height (ETOP 20). Raw volume data from single radars are also provided in HDF5 format with ODIM 2.3 conventions. Radar data becomes avail...

Details →

Usage examples

Handling data with QGIS by Markus Peura
Processing GeoTIFF data with python by Roope Tervo
Processing HDF5 data with python by Roope Tervo

See 3 usage examples →

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

cancer, genomic

The Foundation Medicine Adult Cancer Clinical Dataset (FM-AD) is a study conducted by Foundation Medicine Inc (FMI). Genomic profiling data for approximately 18,000 adult patients with a diverse array of cancers was generated using FoundationOne, FMI's commercially available, comprehensive genomic profiling assay. This dataset contains open Clinical and Biospecimen data.

Details →

Usage examples

High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis by Ryan J. Hartmaier, Lee A. Albacker, Juliann Chmielecki, Mark Bailey, Jie He, Michael E. Goldberg, Shakti Ramkissoon, James Suh, Julia A. Elvin, Samuel Chiacchia, Garrett M. Frampton, Jeffrey S. Ross, Vincent Miller, Philip J. Stephens and Doron Lipson
Targeted next-generation sequencing of advanced prostate cancer identifies potential therapeutic targets and disease heterogeneity. by Beltran H, Yelensky R, Frampton GM, Park K, Downing SR, MacDonald TY, Jarosz M, Lipson D, Tagawa ST, Nanus DM, Stephens PJ, Mosquera JM, Cronin MT, Rubin MA
Genomic Data Commons by National Cancer Institute

See 3 usage examples →

MIMIC-III (‘Medical Information Mart for Intensive Care’)

bioinformatics, health, life sciences, natural language processing, us

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework. The MIMIC-I...

Details →

Usage examples

Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena by James Wiggins, Alistair Johnson
MIMIC-code GitHub repository by Alistair Johnson
Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data by Ujjwal Ratan, Nihir Chadderwala, and Parminder Bhatia

See 3 usage examples →

Medical Segmentation Decathlon

computed tomography, health, imaging, life sciences, magnetic resonance imaging, medicine, nifti, segmentation

With recent advances in machine learning, semantic segmentation algorithms are becoming increasingly general purpose and translatable to unseen tasks. Many key algorithmic advances in the field of medical imaging are commonly validated on a small number of tasks, limiting our understanding of the generalisability of the proposed contributions. A model which works out-of-the-box on many tasks, in the spirit of AutoML, would have a tremendous impact on healthcare. The field of medical imaging is also missing a fully open source and comprehensive benchmark for general purpose algorithmic validati...

Details →

Usage examples

A large annotated medical image dataset for the development and evaluation of segmentation algorithms by Simpson A. L., Antonelli M., Bakas S., Bilello M., Farahani K., van Ginneken B., et al.
Pytorch-Integrated MSD Data Loader by MONAI Development Team
MONAI: Getting Started by MONAI Development Team

See 3 usage examples →

Multiview Extended Video with Activities (MEVA)

computer vision, urban, us, video

The Multiview Extended Video with Activities (MEVA) dataset consists of video data of human activity, both scripted and unscripted, collected with roughly 100 actors over several weeks. The data was collected with 29 cameras with overlapping and non-overlapping fields of view. The current release consists of about 328 hours (516GB, 4259 clips) of video data, as well as 4.6 hours (26GB) of UAV data. Other data includes GPS tracks of actors, camera models, and a site map. We have also released annotations for roughly 184 hours of data. Further updates are planned.

Details →

Usage examples

TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos by Praveen Tirupattur, Aayush J Rana, Tushar Sangam, Shruti Vyas, Yogesh S Rawat, Mubarak Shah
ActEV: Activities in Extended Video by National Institute of Standards and Technology (NIST)
MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection by Kellie Corona, Katie Osterdahl, Roderic Collins, Anthony Hoogs

See 3 usage examples →

OpenAlex dataset

graph, json, metadata, scholarly communication

An open, comprehensive index of scholarly papers, citations, authors, institutions, and journals.

Details →

Usage examples

Download snapshot by OurResearch
OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts by Jason Priem, Heather Piwowar, Richard Orr
Getting citation data from OpenAlex by DOI (Jupyter notebook) by Jens Peter Anderson

See 3 usage examples →

The Human Microbiome Project

amino acid, fasta, fastq, genetic, genomic, life sciences, metagenomics, microbiome

The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital tract, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performe...

Details →

Usage examples

New microbe genomic variants in patients fecal community following surgical disruption of the upper human gastrointestinal tract by Ranjit Kumar, Jayleen Grams, Daniel I. Chu, David K. Crossman, Richard Stahl, Peter Eipers, et al.
The Human Microbiome Project by Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight & Jeffrey I. Gordon
Strains, functions and dynamics in the expanded Human Microbiome Project by Jason Lloyd-Price, Anup Mahurkar, Gholamali Rahnavard, Jonathan Crabtree, Joshua Orvis, A. Brantley Hall, et al.

See 3 usage examples →

4D Nucleome (4DN)

bioinformatics, biology, genetic, genomic, imaging, life sciences

The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health...

Details →

Usage examples

Using jupyterhub on the 4DN data portal by 4DN-DCIC
Finding and Downloading 4DN Data files by 4DN-DCIC

See 2 usage examples →

Atmospheric Models from Météo-France

agriculture, climate, disaster response, earth observation, environmental, meteorological, model, weather

Global and high-resolution regional atmospheric models from Météo-France.

ARPEGE World covers the entire world at a base horizontal resolution of 0.5° (~55km) between grid points and forecasts weather up to 114 hours ahead.
ARPEGE Europe covers Europe and North Africa at a base horizontal resolution of 0.1° (~11km) between grid points and forecasts weather up to 114 hours ahead.
AROME France covers France at a base horizontal resolution of 0.025° (~2.5km) between grid points and forecasts weather up to 42 hours ahead.
AROME France HD covers France and neighboring areas at a base horizontal resolution of 0.01° (~1.5km) between grid points and forecasts weather up to 42 hours ahead.

Dozens of atmospheric variables are avail...

Details →

Usage examples

Windguru.cz by Windguru
Windy.com by Windy

See 2 usage examples →

Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer

cancer, genomic, life sciences, STRIDES, transcriptomics

The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantificati...

Details →

Usage examples

Genomic Data Commons by National Cancer Institute
Genome-wide discovery of somatic coding and noncoding mutations in pediatric endemic and sporadic Burkitt lymphoma by Grande B. M., Gerhard D. S., Jiang A., Griner N. B., Abramson J. S., Alexander T. B., et al.

See 2 usage examples →

Copernicus Digital Elevation Model (DEM)

agriculture, cog, disaster response, earth observation, elevation, geospatial, satellite imagery, sustainability

The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. We provide two instances of Copernicus DEM named GLO-30 Public and GLO-90. GLO-90 provides worldwide coverage at 90 meters. GLO-30 Public provides limited worldwide coverage at 30 meters because a small subset of tiles covering specific countries are not yet released to the public by the Copernicus Programme. Note that in both cases ocean areas do not have tiles; height values there can be assumed to be zero. Data is provided as Cloud Optimized Ge...

Details →

Usage examples

Sentinel Hub WMS/WMTS/WCS Service and Process API by Sinergise
EO Browser by Sinergise

See 2 usage examples →

DNAStack COVID19 SRA Data

bam, bioinformatics, coronavirus, COVID-19, fasta, fastq, genetic, genomic, global, health, life sciences, long read sequencing, SARS-CoV-2, vcf, virus, whole genome sequencing

The Sequence Read Archive (SRA) is the primary archive of high-throughput sequencing data, hosted by the National Institutes of Health (NIH). The SRA represents the largest publicly available repository of SARS-CoV-2 sequencing data. This dataset was created by DNAstack using SARS-CoV-2 sequencing data sourced from the SRA. Where possible, raw sequence data were processed by DNAstack through a unified bioinformatics pipeline to produce genome assemblies and variant calls. The use of a standardized workflow to produce this harmonized dataset allows public data generated using different methodol...

Details →

Usage examples

Viral lineage assignment by Heather Ward
Viral AI by DNAstack

See 2 usage examples →

DigitalCorpora

computer forensics, computer security, CSI, cyber security, digital forensics, image processing, imaging, information retrieval, internet, intrusion detection, machine learning, machine translation, text analysis

Disk images, memory dumps, network packet captures, and files for use in digital forensics research and education. All of this information is accessible through the digitalcorpora.org website, and made available at s3://digitalcorpora/. Some of these datasets implement scenarios that were performed by students, faculty, and others acting in persona. As such, the information is synthetic and may be used without prior authorization or IRB approval. Details of these datasets can be found on the digitalcorpora.org website.

Details →

Usage examples

Bringing Science to Digital Forensics with Standardized Forensic Corpora by Garfinkel, Farrell, Roussev and Dinolt
Creating Realistic Corpora for Forensic and Security Education by Woods, K., Christopher Lee, Simson Garfinkel, David Dittrich, Adam Russel, Kris Kearton

See 2 usage examples →

Hecatomb Databases

bioinformatics, genetic, genomic, life sciences, metagenomics, virus, whole genome sequencing

Preprocessed databases for use with the Hecatomb pipeline for viral and phage sequence annotation.

Details →

Usage examples

The Hecatomb Tutorial by Michael Roach
No Evidence Known Viruses Play a Role in the Pathogenesis of Onchocerciasis-Associated Epilepsy. An Explorative Metagenomic Case-Control Study by Michael Roach, Adrian Cantu, Melissa Krizia Vieri, Matthew Cotten, Paul Kellam, My Phan, Lia van der Hoek, Michel Mandro, Floribert Tepage, Germain Mambandu, Gisele Musinya, Anne Laudisoit, Robert Colebunders, Robert Edwards, John L. Mokili

See 2 usage examples →

NOAA Climate Forecast System (CFS)

agriculture, climate, meteorological, sustainability, weather

The Climate Forecast System (CFS) is a model representing the global interaction between Earth's oceans, land, and atmosphere. Produced by several dozen scientists under guidance from the National Centers for Environmental Prediction (NCEP), this model offers hourly data with a horizontal resolution down to one-half of a degree (approximately 56 km) around Earth for many variables. CFS uses the latest scientific approaches for taking in, or assimilating, observations from data sources including surface observations, upper air balloon observations, aircraft observations, and satellite obser...

Details →

Usage examples

The NCEP Climate Forecast System Reanalysis by Saha, Suranjana, and Coauthors
The NCEP Climate Forecast System Version 2 by Saha, Suranjana, and Coauthors

See 2 usage examples →

NOAA Emergency Response Imagery

aerial imagery, climate, cog, disaster response, sustainability, weather

In order to support NOAA's homeland security and emergency response requirements, the National Geodetic Survey Remote Sensing Division (NGS/RSD) has the capability to acquire and rapidly disseminate a variety of spatially-referenced datasets to federal, state, and local government agencies, as well as the general public. Remote sensing technologies used for these projects have included lidar, high-resolution digital cameras, a film-based RC-30 aerial camera system, and hyperspectral imagers. Examples of rapid response initiatives include acquiring high resolution images with the Emerge/App...

Details →

Usage examples

Open data helps recovery in the aftermath of devastating weather events by Jena Kent
Using Emergency and Pre-Event Imagery by Jon Sellars

See 2 usage examples →

NOAA World Ocean Database (WOD)

climate, oceans, sustainability

The World Ocean Database (WOD) is the largest uniformly formatted, quality-controlled, publicly available historical subsurface ocean profile database. From Captain Cook's second voyage in 1772 to today's automated Argo floats, global aggregation of ocean variable information including temperature, salinity, oxygen, nutrients, and others vs. depth allows for study and understanding of the changing physical, chemical, and to some extent biological state of the world's oceans. Browse the bucket via the AWS S3 explorer: https://noaa-wod-pds.s3.amazonaws.com/index.html
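
Because the bucket is reachable over HTTPS, a listing request can be composed with the standard library alone. The sketch below is an illustration, not NOAA's documented access method: it assumes the bucket name `noaa-wod-pds` (read off the explorer URL above) permits anonymous listing through the S3 ListObjectsV2 REST call, and the helper name `list_url` and any prefix passed to it are hypothetical.

```python
from urllib.parse import urlencode

BUCKET = "noaa-wod-pds"  # bucket name read off the explorer URL in the text


def list_url(bucket: str, prefix: str = "", max_keys: int = 100) -> str:
    """Build an S3 ListObjectsV2 URL (virtual-hosted style) for a public bucket."""
    query = {"list-type": "2", "max-keys": str(max_keys)}
    if prefix:
        query["prefix"] = prefix  # hypothetical: real prefixes depend on the bucket layout
    return f"https://{bucket}.s3.amazonaws.com/?{urlencode(query)}"


if __name__ == "__main__":
    # Only prints the URL. Fetching it (for example with urllib.request.urlopen)
    # should return an XML listing if the bucket allows anonymous access.
    print(list_url(BUCKET, max_keys=10))
```

Keeping URL construction separate from the network call makes the helper easy to check offline and to reuse for other public buckets.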

Details →

Usage examples

The World Ocean Database Introduction by Tim P. Boyer, Olga K. Baranova, Carla Coleman, Hernan E. Garcia, Alexandra Grodsky, Ricardo A. Locarnini, Alexey V. Mishonov, Christopher R. Paver, James R. Reagan, Dan Seidov, Igor V. Smolyar, Katharine W. Weathers, Melissa M. Zweng
The World Ocean Database User's Manual by Hernan E. Garcia, Tim P. Boyer, Ricardo A. Locarnini, Olga K. Baranova, Melissa M. Zweng

See 2 usage examples →

Pancreatic Cancer Organoid Profiling

cancer, genetic, genomic, STRIDES, transcriptomics, whole genome sequencing

This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS), and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.

Details →

Usage examples

Genomic Data Commons by National Cancer Institute
Organoid Profiling Identifies Common Responders to Chemotherapy in Pancreatic Cancer by Tiriac H, Belleau P, Engle DD, Plenker D, Deschênes A, Somerville TD, et al.

See 2 usage examples →

Protein Data Bank 3D Structural Biology Data

amino acid, archives, bioinformatics, biomolecular modeling, cell biology, chemical biology, COVID-19, electron microscopy, electron tomography, enzyme, life sciences, molecule, nuclear magnetic resonance, pharmaceutical, protein, protein template, SARS-CoV-2, structural biology, x-ray crystallography

The "Protein Data Bank (PDB) archive" was established in 1971 as the first open-access digital data archive in biology. It is a collection of three-dimensional (3D) atomic-level structures of biological macromolecules (i.e., proteins, DNA, and RNA) and their complexes with one another and various small-molecule ligands (e.g., US FDA approved drugs, enzyme co-factors). For each PDB entry (unique identifier: 1abc or PDB_0000001abc) multiple data files contain information about the 3D atomic coordinates, sequences of biological macromolecules, information about any small molecules/ligan...

Details →

Usage examples

Announcing the worldwide Protein Data Bank by Berman, H., Henrick, K. & Nakamura, H.
Protein Data Bank: the single global archive for 3D macromolecular structure data by wwPDB consortium

See 2 usage examples →

RAPID NRT Flood Maps

agriculture, disaster response, earth observation, environmental, water

Near real-time and archival high-resolution (10 m) flood inundation data over the contiguous United States, derived from the Sentinel-1 SAR imagery archive (2016-current) using the automated Radar Produced Inundation Diary (RAPID) algorithm.

Details →

Usage examples

Near Real-Time Nonobstructed Flood Inundation Mapping by Synthetic Aperture Radar by Xinyi Shen, Emmanouil N. Anagnostou, George H. Allen, G. Robert Brakenridge, Albert J. Kettner
Inundation Extent Mapping by Synthetic Aperture Radar: A Review by Xinyi Shen, Dacheng Wang, Kebiao Mao, Emmanouil Anagnostou, and Yang Hong

See 2 usage examples →

STOIC2021 Training

computed tomography, computer vision, coronavirus, COVID-19, grand-challenge.org, imaging, life sciences, SARS-CoV-2

The STOIC project collected Computed Tomography (CT) images of 10,735 individuals suspected of being infected with SARS-CoV-2 during the first wave of the pandemic in France, from March to April 2020. For each patient in the training set, the dataset contains binary labels for COVID-19 presence, based on RT-PCR test results, and COVID-19 severity, defined as intubation or death within one month from the acquisition of the CT scan. This S3 bucket contains the training sample of the STOIC dataset as used in the STOIC2021 challenge on grand-challenge.org.

Details →

Usage examples

STOIC2021 Challenge by Diagnostic Image Analysis Group, Radboudumc, Nijmegen
Study of Thoracic CT in COVID-19: The STOIC Project by Revel, Marie-Pierre, et al.

See 2 usage examples →

Sentinel-1 SLC dataset for South and Southeast Asia, Taiwan, Korea and Japan

disaster response, earth observation, environmental, geospatial, satellite imagery, sustainability, synthetic aperture radar

The S1 Single Look Complex (SLC) dataset contains Synthetic Aperture Radar (SAR) data in the C-Band wavelength. The SAR sensors are installed on a two-satellite (Sentinel-1A and Sentinel-1B) constellation orbiting the Earth with a combined revisit time of six days, operated by the European Space Agency. The S1 SLC data are a Level-1 product that collects radar amplitude and phase information in all-weather, day or night conditions, which is ideal for studying natural hazards and emergency response, land applications, oil spill monitoring, sea-ice conditions, and associated climate change effec...

Details →

Usage examples

Rapid flood and damage mapping using synthetic aperture radar in response to Typhoon Hagibis, Japan by Cheryl W. J. Tay, Sang-Ho Yun, Shi Tong Chin, Alok Bhardwaj, Jungkyo Jung & Emma M. Hill
Sentinel-1 Opendataset Wiki and Tutorials by Earth Observatory of Singapore

See 2 usage examples →

Terra Fusion Data Sampler

geospatial, satellite imagery, sustainability

The Terra Basic Fusion dataset is a fused dataset of the original Level 1 radiances from the five Terra instruments. The files have been fully validated to contain the original Terra instrument Level 1 data. Each Level 1 Terra Basic Fusion file contains one full Terra orbit of data and is typically 15 – 40 GB in size, depending on how much data was collected for that orbit. It contains instrument radiance in physical units; radiance quality indicator; geolocation for each IFOV at its native resolution; sun-view geometry; observation time; and other attributes/metadata. It is stored in HDF5, conforms to CF conventions, and is accessible through the netCDF-4 enhanced model. Its file naming convention is: TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5. A concise description of the dataset, along with links to complete documentation and available software tools, can be found on the Terra Fusion project page: https://terrafusion.web.illinois.edu.
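
The naming convention quoted above lends itself to a small parser. The following is a minimal sketch under stated assumptions, not the project's own tooling: it assumes `OXXXX` is a numeric orbit number, that `YYYYMMDDHHMMSS` is a 14-digit timestamp, and that the `F` and `V` fields are three digits each. The class name, function name, and the sample file name in the usage note are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed reading of TERRA_BF_L1B_OXXXX_YYYYMMDDHHMMSS_F000_V000.h5:
# O = orbit number (digits), 14-digit timestamp, F and V = three-digit fields.
PATTERN = re.compile(
    r"^TERRA_BF_L1B_O(?P<orbit>\d+)_(?P<stamp>\d{14})_F(?P<f>\d{3})_V(?P<v>\d{3})\.h5$"
)


class TerraFusionName(NamedTuple):
    orbit: int
    timestamp: datetime
    f_field: str
    v_field: str


def parse_name(filename: str) -> TerraFusionName:
    """Split a Terra Basic Fusion file name into its (assumed) components."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a Terra Basic Fusion file name: {filename!r}")
    stamp = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    return TerraFusionName(int(match["orbit"]), stamp, match["f"], match["v"])
```

For instance, `parse_name("TERRA_BF_L1B_O12345_20020218220232_F000_V001.h5")` (a made-up name) yields the orbit as an integer and the timestamp as a `datetime`, which is convenient for filtering orbits by date before downloading files of this size.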

Terra is the flagship satellite of NASA’s Earth Observing System (EOS). It was launched into orbit on December 18, 1999 and carries five instruments. These are the Moderate-resolution Imaging Spectroradiometer (MODIS), the Multi-angle Imaging SpectroRadiometer (MISR), the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), the Clouds and Earth’s Radiant Energy System (CERES), and the Measurements of Pollution in the Troposphere (MOPITT).

The Terra Basic Fusion dataset is an easy-to-access record of the Level 1 radiances for instruments on...

Details →

Usage examples

Basic Terra fusion product algorithm theoretical basis and data specifications by Zhao, Guangu; Yang, Muqun; Clipp, Landon; Gao, Yizhao; Lee, Joe H.
TerraFusion GitHub by University of Illinois

See 2 usage examples →

3DCoMPaT: Composition of Materials on Parts of 3D Things

computer vision, machine learning

3D CoMPaT is a richly annotated large-scale dataset of rendered compositions of Materials on Parts of thousands of unique 3D Models. This dataset primarily focuses on stylizing 3D shapes at part-level with compatible materials. Each object with the applied part-material compositions is rendered from four equally spaced views as well as four randomized views. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. We present two variations of this task and adapt state-of-the-art 2D/3D deep learning met...

Details →

Usage examples

3DCoMPaT: Composition of Materials on Parts of 3D Things by Yuchen Li, Ujjwal Upadhyay, Habib Slim, Ahmed Abdelreheem, Arpit Prajapati, Suhail Pothigara, Peter Wonka & Mohamed Elhoseiny

See 1 usage example →

A2D2: Audi Autonomous Driving Dataset

autonomous vehicles, computer vision, deep learning, lidar, machine learning, mapping, robotics

An open multi-sensor dataset for autonomous driving research. This dataset comprises semantically segmented images, semantic point clouds, and 3D bounding boxes. In addition, it contains unlabelled 360 degree camera images, lidar, and bus data for three sequences. We hope this dataset will further facilitate active research and development in AI, computer vision, and robotics for autonomous driving.

Details →

Usage examples

Data Service for ADAS and ADS Development by Ajay Vohra

See 1 usage example →

ARPA-E PERFORM Forecast data

energy, environmental, geospatial, model, solar, sustainability

The ARPA-E PERFORM Program is an ARPA-E funded program that aims to use time-coincident power and load data and seeks to develop innovative management systems that represent the relative delivery risk of each asset and balance the collective risk of all assets across the grid. A risk-driven paradigm allows operators to: (i) fully understand the true likelihood of maintaining a supply-demand balance and system reliability, (ii) optimally manage the system, and (iii) assess the true value of essential reliability services. This paradigm shift is critical for all power systems and is essential for grids wi...

Details →

Usage examples

ARPA-E PERFORM by ARPA-E

See 1 usage example →

Allen Brain Observatory - Visual Coding AWS Public Data Set

electrophysiology, image processing, imaging, life sciences, Mus musculus, neurobiology, neuroimaging, signal processing

The Allen Brain Observatory – Visual Coding is a large-scale, standardized survey of physiological activity across the mouse visual cortex, hippocampus, and thalamus. It includes datasets collected with both two-photon imaging and Neuropixels probes, two complementary techniques for measuring the activity of neurons in vivo. The two-photon imaging dataset features visually evoked calcium responses from GCaMP6-expressing neurons in a range of cortical layers, visual areas, and Cre lines. The Neuropixels dataset features spiking activity from distributed cortical and subcortical brain regions, c...

Details →

Usage examples

Use the Allen Brain Observatory – Visual Coding on AWS by Nika Keller, David Feng

See 1 usage example →

COVID-19 Genome Sequence Dataset

bam, bioinformatics, biology, coronavirus, COVID-19, cram, fastq, genetic, genomic, health, life sciences, MERS, SARS, STRIDES, transcriptomics, virus, whole genome sequencing

A centralized sequence repository for all records containing sequence associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator as well as SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter-provided metadata included in associated BioSample and BioProject records is available alongside NCBI calculated data, such as k-mer based taxonomy analysis results, contiguous assemblies (contigs) a...

Details →

Usage examples

Download SRA sequence data using Amazon Web Services (AWS) by NCBI SRA

See 1 usage example →

Cell Painting Image Collection

biology, cell imaging, cell painting, fluorescence imaging, high-throughput imaging, imaging, life sciences, microscopy

The Cell Painting Image Collection is a collection of freely downloadable microscopy image sets. Cell Painting is an unbiased high throughput imaging assay used to analyze perturbations in cell models. In addition to the images themselves, each set includes a description of the biological application and some type of "ground truth" (expected results). Researchers are encouraged to use these image sets as reference points when developing, testing, and publishing new image analysis algorithms for the life sciences. We hope that this data set will lead to a better understanding of w...

Details →

Usage examples

Example submission for the 2018 CytoData Hackathon (in R and Python) by Juan Caicedo, Tim Becker

See 1 usage example →

Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset

climate, coastal, disaster response, environmental, meteorological, oceans, sustainability, water, weather

The University of Wisconsin Probabilistic Downscaling (UWPD) is a statistically downscaled dataset based on the Coupled Model Intercomparison Project Phase 5 (CMIP5) climate models. UWPD consists of three variables: daily precipitation, maximum temperature, and minimum temperature. The spatial resolution is 0.1° x 0.1° for the United States and southern Canada east of the Rocky Mountains.

The downscaling methodology is not deterministic. Instead, to properly capture unexplained variability and extreme events, the methodology predicts a spatially and temporally varying Probability Density Function (PDF) for each variable. Statistics such as the mean, mean PDF and annual maximum statistics can be calculated directly from the daily PDF and these statistics are included in the dataset. In addition, “standard”, “raw” data is created by randomly sampling from the PDFs to create a “realization” of the local scale given the large-scale from the climate model. There are 3 realizations for temperature and 14 realizations for precipitation. ...
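
The sampling idea described above can be illustrated with a toy sketch. This is not the UWPD methodology: it assumes each day's PDF is a plain normal distribution with a made-up mean and standard deviation, and it draws days independently, ignoring the spatial and temporal structure that the real method models. All names and values below are hypothetical.

```python
import random

# Hypothetical daily PDFs for maximum temperature: (mean, std dev) in deg C.
daily_pdfs = [(21.0, 2.5), (23.5, 3.0), (19.0, 2.0), (25.0, 3.5)]


def draw_realization(pdfs, rng):
    """Draw one value per day from that day's (assumed normal) PDF."""
    return [rng.gauss(mu, sigma) for mu, sigma in pdfs]


def pdf_mean_series(pdfs):
    """Statistics such as the mean come straight from the PDFs, with no sampling."""
    return [mu for mu, _ in pdfs]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Three realizations, mirroring the count the text gives for temperature.
realizations = [draw_realization(daily_pdfs, rng) for _ in range(3)]
# Maximum over the sampled days, one value per realization.
sampled_max = [max(series) for series in realizations]
```

Each realization is one plausible daily sequence, so statistics of extremes differ between realizations while the PDF-derived mean series stays fixed.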

Details →

Usage examples

Assessment Report: Analysis of Impact of Nonstationary Climate on NOAA Atlas 14 Estimates by NOAA

See 1 usage example →

CoversBR

copyright monitoring, cover song identification, live song identification, music, music features dataset, music information retrieval, music recognition

CoversBR is the first large audio database with predominantly Brazilian music for the tasks of Cover Song Identification (CSI) and Live Song Identification (LSI). Due to copyright restrictions, audio of the songs cannot be made available; however, metadata and feature files are publicly accessible. Audio streams captured from radio and TV channels for the live song identification task will be made public. CoversBR is composed of metadata and features extracted from 102298 songs, distributed in 26366 groups of covers/versions, with an average of 3.88 versions per group. The entire collecti...

Details →

Usage examples

Using the (CoversBR) dataset by Dirceu Silva, Atila Xavier, Edgard Moraes, Marco Grivet and Fernando Perdigão

See 1 usage example →

Daylight Map Distribution of OpenStreetMap

disaster response, geospatial, mapping, osm, sustainability

Daylight is a complete distribution of global, open map data that’s freely available with support from community and professional mapmakers. Meta combines the work of global contributors to projects like OpenStreetMap with quality and consistency checks from Daylight mapping partners to create a free, stable, and easy-to-use street-scale global map. The Daylight Map Distribution contains a validated subset of the OpenStreetMap database. In addition to the standard OpenStreetMap PBF format, Daylight is available in two parquet formats that are optimized for AWS Athena including geometries (Poin...

Details →

Usage examples

Loading the Daylight Map Distribution OpenStreetMap Features into AWS Athena by Jennings Anderson

See 1 usage example →

Ford Multi-AV Seasonal Dataset

autonomous vehicles, computer vision, lidar, mapping, robotics, transportation, urban, weather

This research presents a challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles at different days and times during 2017-18. The vehicles were manually driven on an average route of 66 km in Michigan that included a mix of driving scenarios like the Detroit Airport, freeways, city-centres, university campus and suburban neighbourhood, etc. Each vehicle used in this data collection is a Ford Fusion outfitted with an Applanix POS-LV inertial measurement unit (IMU), four HDL-32E Velodyne 3D-lidar scanners, 6 Point Grey 1.3 MP Cameras arranged on the...

Details →

Usage examples

Ford AV Dataset Tutorial by Ford Motor Company

See 1 usage example →

Global Biodiversity Information Facility (GBIF) Species Occurrences

biodiversity, bioinformatics, conservation, earth observation, life sciences

The Global Biodiversity Information Facility (GBIF) is an international network and data infrastructure funded by the world's governments providing global data that document the occurrence of species. GBIF currently integrates datasets documenting over 1.6 billion species occurrences, growing daily. The GBIF occurrence dataset combines data from a wide array of sources including specimen-related data from natural history museums, observations from citizen science networks and environment recording schemes. While these data are constantly changing at GBIF.org, periodic snapshots are taken a...

Details →

Usage examples

GBIF and Apache-Spark on AWS tutorial by John Waller

See 1 usage example →

High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade

computational fluid dynamics, green aviation, low-pressure turbine, turbulence

The archive comprises snapshot, point-probe, and time-average data produced via a high-fidelity computational simulation of turbulent air flow over a low pressure turbine blade, which is an important component in a jet engine. The simulation was undertaken using the open source PyFR flow solver on over 5000 Nvidia K20X GPUs of the Titan supercomputer at Oak Ridge National Laboratory under an INCITE award from the US DOE. The data can be used to develop an enhanced understanding of the complex three-dimensional unsteady air flow patterns over turbine blades in jet engines. This could in turn le...

Details →

Usage examples

High-Order Accurate Direct Numerical Simulation of Flow over a MTU-T161 Low Pressure Turbine Blade by A. S. Iyer, Y. Abe, B. C. Vermeire, P. Bechlars, R. D. Baier, A. Jameson, F. D. Witherden, and P. E. Vincent

See 1 usage example →

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

cancer, genomic, life sciences, STRIDES, whole genome sequencing

The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack ca...

Details →

Usage examples

Genomic Data Commons by National Cancer Institute

See 1 usage example →

Legal Entity Identifier (LEI) and Legal Entity Reference Data (LE-RD)

analytics, blockchain, climate, commerce, copyright monitoring, csv, financial markets, governance, government spending, json, market data, socioeconomic, statistics, transparency, xml

The Legal Entity Identifier (LEI) is a 20-character, alphanumeric code based on the ISO 17442 standard developed by the International Organization for Standardization (ISO). It connects to key reference information that enables clear and unique identification of legal entities participating in financial transactions. Each LEI contains information about an entity’s ownership structure and thus answers the questions of ‘who is who’ and ‘who owns whom’. Simply put, the publicly available LEI data pool can be regarded as a global directory, which greatly enhances transparency in the global ma...
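
As a rough illustration of the 20-character alphanumeric format described above, the following Python sketch checks only the shape of a candidate code. The function name looks_like_lei and the uppercase-only assumption are illustrative and do not come from the dataset documentation. The check does not validate a code against ISO 17442 beyond its length and character set, nor does it confirm that the code is registered.

```python
import re

# Assumption for this sketch: LEIs are written with uppercase letters and digits.
LEI_SHAPE = re.compile(r"[A-Z0-9]{20}")


def looks_like_lei(code: str) -> bool:
    """Return True if `code` has the 20-character alphanumeric shape of an LEI."""
    return LEI_SHAPE.fullmatch(code.strip()) is not None
```

A shape check like this is only useful as a first filter before looking a code up in the LEI reference data.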

Details →

Usage examples

AWS hosts new open dataset to help businesses identify climate finance risks and investments by AWS Public Sector Blog Team

See 1 usage example →

NASA / USGS Europa Controlled Observations

cog, planetary, satellite imagery, stac

The Solid State Imager (SSI) on NASA's Galileo spacecraft acquired more than 500 images of Jupiter's moon, Europa. These images vary from relatively low-resolution hemispherical imaging, to high-resolution targeted images that cover a small portion of the surface. Here we provide a set of 481 minimally processed, projected Galileo images with photogrammetrically improved locations on Europa's surface. These individual images were subsequently used as input into a set of 92 observation mosaics.

These images provide users with nearly the entire Galileo Europa imaging dataset at its native resolution and with improved relative image locations. The Solid State Imager on NASA's Galileo spacecraft provided the only moderate- to high-resolution images of Jupiter's moon, Europa. Unfortunately, uncertainty in the position and pointing of the spacecraft, as well as the position and orientation of Europa, when the images were acquired resulted in significant errors in image locations on the surface. The result of these errors is that images acquired during different Galileo orbits, or even at different times during the same orbit, are significantly misaligned (errors of up to 100 km on the surface).

The dataset provides a set of individual images that can be used for scientific analysis...

Details →

Usage examples

Querying for Data in an ROI and Loading it into QGIS by J. Laura
PySTAC Client by PySTAC-Client Contributors
Discovering and Downloading Data with Python by J. Laura
Discovering and Downloading Data via the Command Line by J. Laura

See 4 usage examples →

NOAA Global Forecast System (GFS)

agriculture, climate, disaster response, environmental, meteorological, sustainability, weather

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The entire globe is covered by the GFS at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which is used by the operational forecasters who predict weather out to 16 days in the future. Horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one week and two weeks.

The NOAA Global Forecast System (GFS) Warm Start Initial Conditions are produced by the National Centers for Environmental Prediction (NCEP) to run operational deterministic medium-range numerical weather predictions.
The GFS is built with the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3) and the Grid-Point Statistical Interpolation (GSI) data assimilation system.
Please visit the links below in the Documentation section to find more details about the model and the data assimilation systems. The current operational GFS is run at 64 layers in the vertical extending from th...

Details →

Usage examples

GFS Warm Restart Files Additional Information by Fanglin Yang

See 1 usage example →

NOAA Global Surface Summary of Day

agriculture, climate, environmental, natural resource, regulatory, sustainability, weather

Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are at the time of this writing at the Version 8 software level. Over 9000 stations' data are typically available. The daily elements included in the dataset (as available from each station) are:
Mean temperature (.1 Fahrenheit)
Mean dew point (.1 Fahrenheit)
Mean sea level pressure (.1 mb)
Mean station pressure (.1 mb)
Mean visibility (.1 miles)
Mean wind speed (.1 knots)
Maximum sustained wind speed (.1 knots)
Maximum wind gust (.1 knots)
Maximum temperature (.1 Fahrenheit)
Minimum temperature (.1 Fahrenheit)
Precipitation amount (.01 inches)
Snow depth (.1 inches)
Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel Cloud.

G...

Details →

Usage examples

ML Demo: Predicting Air Quality w/ ASDI NOAA + OpenAQ Datasets in SageMaker Studio Lab (SMSL) by Aaron Soto

See 1 usage example →

NOAA Integrated Surface Database (ISD)

agriculture, climate, meteorological, sustainability, weather

The Integrated Surface Database (ISD) consists of global hourly and synoptic observations compiled from numerous sources into a gzipped fixed width format. ISD was developed as a joint activity within Asheville's Federal Climate Complex. The database includes over 35,000 stations worldwide, with some having data as far back as 1901, though the data show a substantial increase in volume in the 1940s and again in the early 1970s. Currently, there are over 14,000 "active" stations updated daily in the database. The total uncompressed data volume is around 600 gigabytes; however, it ...

Details →

Usage examples

NOAA Integrated Surface Database (ISD) Example Notebook by Zac Flamig

See 1 usage example →

NOAA National Digital Forecast Database (NDFD)

agriculture, climate, meteorological, sustainability, weather

The National Digital Forecast Database (NDFD) is a suite of gridded forecasts of sensible weather elements (e.g., cloud cover, maximum temperature). Forecasts prepared by NWS field offices working in collaboration with the National Centers for Environmental Prediction (NCEP) are combined in the NDFD to create a seamless mosaic of digital forecasts from which operational NWS products are generated. The most recent data is under the opnl and expr prefixes. A copy is also placed under the wmo prefix. The wmo prefix is structured like so: wmo/<parameter>/<year>/<month>/<day&g...

Details →

Usage examples

NDFD Product Spreadsheet (excel file) by NOAA MDL

See 1 usage example →

NOAA/PMEL Ocean Climate Stations Moorings

climate, environmental, oceans, sustainability, weather

The mission of the Ocean Climate Stations (OCS) Project is to make meteorological and oceanic measurements from autonomous platforms. Calibrated, quality-controlled, and well-documented climatological measurements are available on the OCS webpage and the OceanSITES Global Data Assembly Centers (GDACs), with near-realtime data available prior to release of the complete, downloaded datasets.

OCS measurements served through the Big Data Program come from OCS high-latitude moored buoys located in the Kuroshio Extension (32°N 145°E) and the Gulf of Alaska (50°N 145°W). Initiated in 2004 and 20...

Details →

Usage examples

OCS publications - All OCS-relevant publications are updated at the URL below. by PMEL

See 1 usage example →

New Jersey Statewide Digital Aerial Imagery Catalog

aerial imagery, cog, earth observation, geospatial, imaging, mapping

The New Jersey Office of GIS, NJ Office of Information Technology manages a series of 11 digital orthophotography and scanned aerial photo maps collected in various years ranging from 1930 to 2017. Each year’s imagery is available as Cloud Optimized GeoTIFF (COG) files, and some years are also available as compressed MrSID and/or JP2 files. Additionally, each year of imagery is organized into a tile grid scheme covering the entire geography of New Jersey. Many years share the same tiling grid, while others have unique grids as defined by the project at the time.

Details →

Usage examples

Visualize Imagery Changes by

See 1 usage example →

New Jersey Statewide LiDAR

elevation, geospatial, lidar, mapping

Elevation datasets in New Jersey have been collected over several years as several discrete projects. Each project covers a geographic area, which is a subsection of the entire state, and has differing specifications based on the available technology at the time and project budget. The geographic extent of one project may overlap that of a neighboring project. Each of the 18 projects contains deliverable products such as LAS (Lidar point cloud) files, unclassified/classified, tiled to cover project area; relevant metadata records or documents, most adhering to the Federal Geographic Data Com...

Details →

Usage examples

3D Visualization by

See 1 usage example →

Ohio State Cardiac MRI Raw Data (OCMR)

Homo sapiens, image processing, imaging, life sciences, magnetic resonance imaging, signal processing

OCMR is an open-access repository that provides multi-coil k-space data for cardiac cine. The fully sampled MRI datasets are intended for quantitative comparison and evaluation of image reconstruction methods. The free-breathing, prospectively undersampled datasets are intended to evaluate their performance and generalizability qualitatively.

Details →

Usage examples

OCMR Tutorial by Chong Chen

See 1 usage example →

Oxford Nanopore Technologies Benchmark Datasets

bioinformatics, biology, fast5, fastq, genomic, Homo sapiens, life sciences, whole genome sequencing

The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; and 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. GM24385 as reference human). Raw data are provided with metadata and scripts to describe sample and data provenance.

Details →

Usage examples

ONT Dataset Tutorials by EPI2MELabs

See 1 usage example →

SILAM Air Quality

air quality, climate, earth observation, meteorological, sustainability, weather

Air Quality is a global SILAM atmospheric composition and air quality forecast performed on a daily basis for > 100 species and covering the troposphere and the stratosphere. The output produces 3D concentration fields and aerosol optical thickness. The data are unique: 20km resolution for global AQ models is unseen worldwide.

Details →

Usage examples

Simple examples by Roope Tervo

See 1 usage example →

Sentinel-1 SLC dataset for Germany

disaster response, earth observation, environmental, geospatial, satellite imagery, sustainability, synthetic aperture radar

The Sentinel-1 Single Look Complex (SLC) unzipped dataset contains Synthetic Aperture Radar (SAR) data from the European Space Agency’s Sentinel-1 mission. Unlike the zipped data provided by ESA, this dataset allows direct access to the individual swaths required for a given study area, thus drastically reducing the storage and download time requirements of a project. Since the data is stored on S3, users can use the boto3 library and the S3 get_object method to read the entire content of an object into memory for processing, without having to download it to disk. The Sentinel-1 ...
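
The in-memory read mentioned above can be sketched in Python roughly as follows. This is a minimal illustration, not the dataset's documented access method: the helper name read_object_bytes is hypothetical, and the bucket and key in the usage comment are placeholders because the listing does not give them. The helper takes an already-created S3 client so that it can be exercised without network access.

```python
def read_object_bytes(s3_client, bucket: str, key: str) -> bytes:
    """Read a whole S3 object into memory without writing it to disk.

    `s3_client` is expected to behave like a boto3 S3 client, whose
    get_object call returns a dict with a streaming "Body".
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Usage sketch (requires boto3 and suitable AWS access; the bucket and key
# below are placeholders, not the real dataset location):
#   import boto3
#   client = boto3.client("s3")
#   data = read_object_bytes(client, "example-bucket", "path/to/swath-file")
```

Reading the whole object this way is convenient for a single swath, but memory use grows with object size, so very large files may be better read in ranges or streamed.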

Details →

Usage examples

Interferometric Synthetic Aperture Radar Tutorial by LiveEO

See 1 usage example →

Tabula Muris

biology, encyclopedic, genomic, health, life sciences, medicine

Tabula Muris is a compendium of single cell transcriptomic data from the model organism Mus musculus comprising more than 100,000 cells from 20 organs and tissues. These data represent a new resource for cell biology, reveal gene expression in poorly characterized cell populations, and allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as T-lymphocytes and endothelial cells from different anatomical locations. Two distinct technical approaches were used for most organs: one approach, microfluidic droplet-based 3’-end counting, enabled the s...

Details →

Usage examples

Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. by Tabula Muris Consortium (2019)

See 1 usage example →

Voices Obscured in Complex Environmental Settings (VOiCES)

automatic speech recognition, denoising, machine learning, speaker identification, speech processing

VOiCES is a speech corpus recorded in acoustically challenging settings, using distant microphone recording. Speech was recorded in real rooms with various acoustic features (reverb, echo, HVAC systems, outside noise, etc.). Adversarial noise, either television, music, or babble, was concurrently played with clean speech. Data was recorded using multiple microphones strategically placed throughout the room. The corpus includes audio recordings, orthographic transcriptions, and speaker labels.

Details →

Usage examples

Getting started with VOiCES data by M.A. Barrios

See 1 usage example →

2021 Amazon Last Mile Routing Research Challenge Dataset

amazon.science, analytics, deep learning, geospatial, last mile, logistics, machine learning, optimization, routing, transportation, urban

The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in r...

Details →

Usage examples

Code repository used for the 2021 Amazon Routing Research Challenge (this repository is included for reference and documentation purposes only, you do not need to install it to access the data) by CAVE Lab, MIT Center for Transportation and Logistics
2021 Amazon Last Mile Routing Research Challenge: Data Set by Daniel Merchán, Jatin Arora, Julian Pachon, Karthik Konduri, Matthias Winkenbach, Steven Parks, Joseph Noszek
AWS Last Mile Route Sequence Optimization by Chen Wu, Yin Song, Verdi March, Eden Duthi

The Real-Time Mesoscale Analysis (RTMA) is a NOAA National Centers for Environmental Prediction (NCEP) high spatial and temporal resolution analysis/assimilation system for near-surface weather conditions. Its main component is the NCEP/EMC Gridpoint Statistical Interpolation (GSI) system applied in two-dimensional variational mode to assimilate conventional and satellite-derived observations.

The RTMA was developed to support NDFD operations and provide field forecasters with high quality analyses for nowcasting, situational awareness, and forecast verification purposes. The system produces ...

Details →

NOAA Severe Weather Data Inventory (SWDI)

agriculture, climate, meteorological, sustainability, weather

The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event's location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. It contains data documenting: The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce. Rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the S...

Details →

NOAA Space Weather Forecast and Observation Data

climate, meteorological, solar, sustainability, weather

Space weather forecast and observation data is collected and disseminated by NOAA’s Space Weather Prediction Center (SWPC) in Boulder, CO. SWPC produces forecasts for multiple space weather phenomenon types and the resulting impacts to Earth and human activities. A variety of products are available that provide these forecast expectations, and their respective measurements, in formats that range from detailed technical forecast discussions to NOAA Scale values to simple bulletins that give information in laymen's terms. Forecasting is the prediction of future events, based on analysis and...

Details →

agriculture, climate, meteorological, sustainability, weather

The "Unified Forecast System (UFS)" is a community-based, coupled, comprehensive Earth Modeling System. It supports "multiple applications" with different forecast durations and spatial domains. The UFS Short-Range Weather (SRW) Application is one of these applications. It targets predictions of atmospheric behavior on a limited spatial domain and on time scales from minutes to several days. The SRW Application includes a prognostic atmospheric model, pre-processor, post-processor, and community workflow for running the system end-to-end. The "SRW Application User's Guide" includes information on these components and provides detailed instructions on how to build and run the SRW Application. Users can access additional technical support via the "UFS Community Forum".

This data registry contains the data required to run the “out-of-the-box” SRW Application case. The SRW App requires numerous input files to run, including static datasets (fix files containing climatological information, terrain and land use data), initial condition data files, lateral boundary condition data files, and model configuration files (such as namelists). The SRW App experiment generation system also contains a set of workflow end-to-end (WE2E) tests that exercise various configurations of the system (e.g., different grids, physics suites). Data for running a subset of these WE2E tests are also included within this registry.

Users can generate forecasts for dates not included in this data registry by downloading and manually adding raw model files for the desired dates. Many of these model files are publicly available and can be accessed via links on the "Developmental Testbed Center&...

Details →

NOAA Unified Forecast System Subseasonal to Seasonal Prototypes

agriculture, climate, disaster response, environmental, meteorological, oceans, sustainability, weather

The Unified Forecast System Subseasonal to Seasonal prototypes consist of reforecast data from the UFS atmosphere-ocean coupled model experimental prototype versions 5, 6, 7, and 8, produced by the Medium Range and Subseasonal to Seasonal Application team of the UFS-R2O project. The UFS prototypes are the first dataset released to the broader weather community for analysis and feedback as part of the development of the next-generation operational numerical weather prediction system from NWS. The datasets include all the major weather variables for atmosphere, land, ocean, sea ice, and ocean wav...

Details →

NOAA Unified Forecast System Weather Model (UFS-WM) Regression Tests

agriculture, climate, meteorological, sustainability, weather

earth observation, meteorological, natural resource, sustainability, weather

The Servicio Meteorológico Nacional de Argentina (SMN-Arg), the National Meteorological Service of Argentina, shares its deterministic forecasts generated with WRF 4.0 (Weather and Research Forecasting) initialized at 00 and 12 UTC every day.

Sample Queries on the 1000 Genomes, gnomAD and ClinVar data Lake by Sujaya Srinivasan

See 1 usage example →

BodyM Dataset

computer vision, deep learning

The first large public body measurement dataset including 8978 frontal and lateral silhouettes for 2505 real subjects, paired with height, weight and 14 body measurements. The following artifacts are made available for each subject.

Subject Height
Subject Weight
Subject Gender
Two black-and-white silhouette images of subject standing in frontal and side pose respectively with full body in view.
14 body measurements in cm - {ankle girth, arm-length, bicep girth, calf girth, chest girth, forearm girth, height, hip girth, leg-length, shoulder-breadth, shoulder-to-crotch length, thigh girth, waist girth, wrist girth}

The data is split into 3 sets - Training, Test Set A, Test Set B. For the training and Test-A sets, subjects are photographed and 3D-scanned in a lab by technicians. For the Test-B set, subjects ...

Details →

Usage examples

Human Body Measurement Estimation with Adversarial Augmentation by Nataniel Ruiz, Miriam Bellver, Timo Bolkart, Ambuj Arora, Ming C. Lin, Javier Romero and Raja Bala

See 1 usage example →

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

bioinformatics, fastq, genetic, genomic, life sciences, long read sequencing, short read sequencing, whole exome sequencing, whole genome sequencing

To facilitate benchmarking and development, the Google Brain group has sequenced 9 human samples covering the Genome in a Bottle truth sets on different sequencing instruments, sequencing modalities (Illumina short read and Pacific Biosciences long read), sample preparation protocols, and for whole genome and whole exome capture. The original source of these data is gs://google-brain-genomics-public.

Details →

Usage examples

An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development by Baid G., Nattestad M., Kolesnikov A., Goel S., Yang H., Chang P., and Carroll A (2020)

See 1 usage example →

Humor patterns used for querying Alexa traffic

amazon.science, dialog, machine learning, natural language processing

Humor patterns used for querying Alexa traffic when creating the taxonomy described in the paper "“Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents" by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D. (CHI LBW 2022). These patterns correspond to the researchers' hypotheses about which humor types are likely to appear in Alexa traffic, and they were used to query that traffic in order to evaluate these hypotheses.

Details →

Usage examples

“Alexa, Do You Want to Build a Snowman?” Characterizing Playful Requests to Conversational Agents by Shani C., Libov A., Tolmach S., Lewin-Eytan L., Maarek Y., and Shahaf D.

See 1 usage example →

MODIS MYD13A1, MOD13A1, MYD11A1, MOD11A1, MCD43A4

agriculture, disaster response, geospatial, natural resource, satellite imagery, sustainability

Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), managed by the U.S. Geological Survey and NASA. Five products are included: MCD43A4 (MODIS/Terra and Aqua Nadir BRDF-Adjusted Reflectance Daily L3 Global 500 m SIN Grid), MOD11A1 (MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MYD11A1 (MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 1 km SIN Grid), MOD13A1 (MODIS/Terra Vegetation Indices 16-Day L3 Global 500 m SIN Grid), and MYD13A1 (MODIS/Aqua Vegetation Indices 16-Day L3 Global 500 m SIN Grid). MCD43A4 has global coverage, all...

Details →

Usage examples

Astraea Earth OnDemand by Astraea, Inc.

See 1 usage example →

Orcasound - bioacoustic data for marine conservation

biodiversity, biology, coastal, conservation, deep learning, ecosystems, environmental, geospatial, labeled, machine learning, mapping, oceans, open source software, signal processing

Live-streamed and archived audio data (~2018-present) from underwater microphones (hydrophones) containing marine biological signals as well as ambient ocean noise. Hydrophone placement and passive acoustic monitoring effort prioritizes detection of orca sounds (calls, clicks, whistles) and potentially harmful noise. Geographic focus is on the US/Canada critical habitat of Southern Resident killer whales (northern CA to central BC) with initial focus on inland waters of WA. In addition to the raw lossy or lossless compressed data, we provide a growing archive of annotated bioacoustic bouts.

Details →

Usage examples

Github for our open source projects by Orcasound open source community

See 1 usage example →

PersonPath22

computer vision

PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.

Details →

Usage examples

Large scale Real-world Multi-Person Tracking by Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, Joseph Tighe

See 1 usage example →

Pre- and post-purchase product questions

amazon.science, machine learning, natural language processing

This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its id and title.

Details →

Usage examples

"Did you buy it already?", Detecting Users Purchase-State From Their Product-Related Questions by Lital Kuchy, David Carmel, Thomas Huet & Elad Kravi

See 1 usage example →

The Multilingual Amazon Reviews Corpus

machine learning, natural language processing

We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances', etc.)

Details →

Usage examples

The Multilingual Amazon Reviews Corpus by Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith

See 1 usage example →

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

amazon.science, machine learning, natural language processing

This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip, and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval...

Details →

Usage examples

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation by Nachshon Cohen, Oren Kalinsky, Yftah Ziser & Alessandro Moschitti

See 1 usage example →

Wizard of Tasks

amazon.science, conversation data, dialog, machine learning, natural language processing

Wizard of Tasks (WoT) is a dataset containing conversations for Conversational Task Assistants (CTAs). A CTA is a conversational agent whose goal is to help humans to perform real-world tasks. A CTA can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. WoT contains about 550 conversations with ~18,000 utterances in two domains, i.e., Cooking and Home Improvement.

Details →

Usage examples

Wizard of Tasks: A Novel Conversational Dataset for Solving Real-World Tasks in Conversational Settings by Jason Ingyu Choi, Saar Kuzi, Nikhita Vedula, Jie Zhao, Giuseppe Castellucci, Marcus Collins, Shervin Malmasi, Oleg Rokhlenko and Eugene Agichtein

Details →

Which of the following are usually good data sources? Select all that apply.

Vetted public datasets, academic papers, and governmental agency data are usually good data sources. Social media sites, the remaining option, usually are not.

What are the main benefits of open data? Select all that apply.

The answer options include "Open data restricts data access to certain groups of people," "Open data increases the amount of data available for purchase," and "Open data makes good data more widely available." Of these, only the last describes a benefit: open data makes good data more widely available.

Which of the following are types of data bias often encountered in data analytics? Select all that apply.

Observer bias, interpretation bias, and confirmation bias are types of bias often encountered in data analytics.

What is the process for arranging data into a meaningful order to make it easier to understand, analyze, and visualize?

Data sorting is any process that involves arranging data into some meaningful order to make it easier to understand, analyze, or visualize. When working with data, sorting is a common method used for visualizing data in a form that makes it easier to digest the story you want to tell with the data.
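
As a small illustration of sorting, the Python sketch below orders a handful of records by a chosen field. The station names and temperature values are made up for the example, and the helper name sort_records is hypothetical.

```python
# Made-up sample records, for illustration only.
readings = [
    {"station": "Station B", "mean_temp_f": 71.3},
    {"station": "Station A", "mean_temp_f": 58.9},
    {"station": "Station C", "mean_temp_f": 64.0},
]


def sort_records(records, field, descending=False):
    """Return a new list of records ordered by the given field."""
    return sorted(records, key=lambda record: record[field], reverse=descending)


# Coolest to warmest, then warmest to coolest.
coolest_first = sort_records(readings, "mean_temp_f")
warmest_first = sort_records(readings, "mean_temp_f", descending=True)
```

Because sorted returns a new list, the original order of readings is preserved, which makes it easy to present the same data in several orders side by side.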

COVID-19 Data Visualization

Free Health Data Sets

Free Social Impact Data Sets

Free Climate and Environment Data Sets

Tableau For Everyone

Free Government Data Sets

Free Education Data Sets

Other Cool Free Data Sets

Free Public Data Sets for Advanced Users

The Cancer Genome Atlas

Foldingathome COVID-19 Datasets

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

Common Crawl

Gabriella Miller Kids First Pediatric Research Program (Kids First)

NASA Prediction of Worldwide Energy Resources (POWER)

NEXRAD on AWS

NOAA Geostationary Operational Environmental Satellites (GOES) 16, 17 & 18

Genome Aggregation Database (gnomAD)

SpaceNet

Cell Painting Gallery

Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery

Allen Cell Imaging Collections

International Neuroimaging Data-Sharing Initiative (INDI)

NOAA Operational Forecast System (OFS)

Digital Earth Africa Sentinel-2 Level-2A

Department of Energy's Open Energy Data Initiative (OEDI)

Open NeuroData

DOE's Water Power Technology Office's (WPTO) US Wave dataset

NREL Wind Integration National Dataset

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

USGS 3DEP LiDAR Point Clouds

World Bank - Light Every Night

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

Global Database of Events, Language and Tone (GDELT)

NOAA Joint Polar Satellite System (JPSS)

ArcticDEM

BossDB Open Neuroimagery Datasets

Low Altitude Disaster Imagery (LADI) Dataset

NOAA Rapid Refresh Forecast System (RRFS) [Prototype]

Open Bioinformatics Reference Data for Galaxy

PoroTomo