Open datasets
A collection of open datasets from international public sources
9 results
Cost
Free
A CSV package converted from the 2024 ONS boys and girls baby-name workbooks, preserving every worksheet.
Cost
Free
A normalized CSV package for historical Census surname data plus 1990 first-name and last-name frequency files.
Cost
Free
A CSV package converted from the Census 2020 first-name workbooks, preserving each source worksheet as a CSV table.
Cost
Free
A word frequency data package for text processing and linguistic analysis, useful for building language models, text scoring, and related tasks. The package contains 207 files totaling 60,828,912 bytes in formats such as gz, py, md, txt, ini, and toml, along with metadata, sample files, and attribution notices. Source: rspeer/
Cost
Free
CMUdict is a US English pronouncing dictionary, useful for linguistic processing tasks such as phonetic transcription and pronunciation modeling. The package contains 16 files totaling 3,664,918 bytes, including .dict, .phones, and .symbols files and .py scripts, with sample files that illustrate structure and usage. Source: cmudict repository (cmusphinx/cmudict).
Cost
Free
Stopwords-json provides stopword lists for 50 languages in JSON and TXT formats, sourced from the stopwords-json repository. Useful for text processing tasks such as tokenization, filtering, and language-aware preprocessing.
Cost
Free
Wordnik Wordlist is an open-source English wordlist provided by Wordnik. It is useful for game development and other word-based projects.
Cost
Free
A collection of common stop words across languages, useful for text processing tasks such as filtering out common words during analysis.
Cost
Free
A CSV package converted from the Census 2020 last-name workbooks, preserving each source worksheet as a CSV table.