Research resources

This page contains the official released datasets for Rioplatense Spanish (SWOW-RP), English (SWOW-EN) and Dutch (SWOW-NL). In addition, we release preliminary data for Mandarin Chinese (SWOW-ZH). Please check the release version when replicating published results. For consistency, refer to a release by its dataset acronym (SWOW-EN, SWOW-NL, SWOW-RP, SWOW-ZH) and use the release date as the version. Don't hesitate to get in touch if you have any queries or suggestions.


Acknowledgements and Fair Use

Many hours of work have been put into this project and we gratefully acknowledge all the volunteers who have dedicated their time to contribute. If you find these data useful, please share the link to the study:

https://smallworldofwords.org

Your support to keep this project going and up-to-date is greatly appreciated!


Datafiles

Raw and processed data. For each of these languages we release the raw data and a balanced data file in which each cue has the same number of associative responses.

Associative strength. Sometimes it's convenient to know how many participants gave a specific response to a cue. In this case, you should download the associative frequency files. Associative strength is the conditional probability of a response given a cue; note that the number of responses may differ between cue words. The first file contains statistics based on the first response a participant gave (R1), and the second file is based on all three responses (R123).
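To make the definition concrete, here is a minimal Python sketch that turns cue-response pairs into conditional probabilities. The function name and the toy pairs are hypothetical and are not taken from the released files; the actual files may use a different layout.

```python
from collections import Counter, defaultdict


def associative_strength(pairs):
    """Return {cue: {response: P(response | cue)}} from (cue, response) pairs."""
    counts = defaultdict(Counter)
    for cue, response in pairs:
        counts[cue][response] += 1

    strength = {}
    for cue, responses in counts.items():
        # The total differs per cue, so each cue is normalized separately.
        total = sum(responses.values())
        strength[cue] = {r: n / total for r, n in responses.items()}
    return strength


# Toy data (hypothetical, for illustration only).
toy_pairs = [
    ("dog", "cat"), ("dog", "cat"), ("dog", "bone"), ("dog", "bark"),
    ("sun", "moon"), ("sun", "hot"),
]
strength = associative_strength(toy_pairs)
print(strength["dog"]["cat"])  # 2 of the 4 responses to "dog" are "cat"
```

Because every cue is normalized by its own total, the strengths for one cue sum to 1 even when cues have different numbers of responses.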

Cue and response statistics. Cue statistics provide information about which words were known and how many responses for each cue were missing. Two files are available: one based on the first response a participant gave (R1), and one based on all three responses given by a participant (R123). Response statistics include response counts for tokens and types.

Mandarin Chinese Data (SWOW-ZH23)

Updated 24 September 2024

Warning: the following data are part of a manuscript currently submitted and under review. This release is therefore likely subject to change. Please check back if you decide to use these data in your work.

Three-response associations were collected from 2016 to 2023 for 10,000 cue words, each answered by about 76 participants on average. Participants were native speakers of Mandarin Chinese or other Chinese dialects. We provide a dataset from before word and participant cleaning, in which 85 taboo words in the responses were masked and 19 taboo cue words were deleted. After preprocessing, 55 participants were retained for each cue word. We also provide the preprocessed dataset, which contains responses for 10,024 cue words contributed by 30,504 participants, for a total of 551,320 trials. Each row in the datasets is a single trial, consisting of the demographic information of one participant and their three responses to one cue word. MATLAB scripts for preprocessing can be found on the SWOW-ZH GitHub page https://github.com/lib314a/SWOWZH. The raw and preprocessed data, cue and response statistics, and centrality measures currently under review can be found below:

 SWOW-ZH23 [49Mb]

We also provide relatedness measures for word pairs, based on cosine similarity between associative strength distributions, positive pointwise mutual information (PPMI) weighted distributions, random walk vectors (RW) and compressed random walk vectors. Note that the files contain one pairwise entry per row and are xz-compressed to reduce file size.
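As an illustration of the first two measures, the following Python sketch weights associative strength distributions by positive PMI and compares two cues with cosine similarity. It is a simplified sketch under stated assumptions, not the pipeline used to produce the released files: the marginal response probability assumes all cues are equally likely, and the function names and toy distributions are hypothetical.

```python
import math


def ppmi_vectors(strength):
    """Map {cue: {response: P(response | cue)}} to PPMI-weighted sparse vectors."""
    n_cues = len(strength)
    # Marginal P(response), assuming (for this sketch) equally likely cues.
    p_response = {}
    for dist in strength.values():
        for response, p in dist.items():
            p_response[response] = p_response.get(response, 0.0) + p / n_cues

    weighted = {}
    for cue, dist in strength.items():
        vec = {}
        for response, p in dist.items():
            pmi = math.log2(p / p_response[response])
            if pmi > 0:  # keep only positive PMI values
                vec[response] = pmi
        weighted[cue] = vec
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy distributions (hypothetical, for illustration only).
toy_strength = {
    "dog": {"cat": 0.5, "bone": 0.25, "bark": 0.25},
    "puppy": {"cat": 0.25, "dog": 0.5, "bark": 0.25},
    "sun": {"moon": 0.5, "hot": 0.5},
}
vectors = ppmi_vectors(toy_strength)
print(cosine(vectors["dog"], vectors["puppy"]))  # positive: both cues elicit "bark"
print(cosine(vectors["dog"], vectors["sun"]))    # 0.0: no shared responses
```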

 SWOW-ZH23 Relatedness measures (R1) [321Mb]
 SWOW-ZH23 Relatedness measures (R123) [395Mb]
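Since the relatedness files are xz-compressed with one pair per row, they can be streamed without decompressing them to disk first. The sketch below uses only the Python standard library. The tab delimiter, the column names and the demo file name are assumptions made for illustration, so check the actual files before relying on them.

```python
import csv
import lzma
import os
import tempfile


def iter_pairs(path, delimiter="\t"):
    """Yield one row (a list of strings) per line of an xz-compressed text file."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quotation marks in words as ordinary characters.
        yield from csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)


# Build a tiny demo file (hypothetical layout) so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_pairs.tsv.xz")
with lzma.open(demo_path, mode="wt", encoding="utf-8", newline="") as handle:
    handle.write("word_a\tword_b\tsimilarity\n")
    handle.write("dog\tcat\t0.42\n")

rows = list(iter_pairs(demo_path))
print(rows)
```

Iterating row by row keeps memory use low, which matters for files of several hundred megabytes.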

Citation: Li, B., Ding, Z., De Deyne, S., & Cai, Q. (2024). A large-scale database of Mandarin Chinese word associations from the Small World of Words project. Under review.

Rioplatense Spanish Data (SWOW-RP22)

Updated 17 September 2022

Rioplatense is a Spanish variant spoken in South America, primarily in Uruguay and Argentina. The following dataset is currently under review and is likely to be subject to minor updates (e.g., spellchecks).
Here we release the full raw data of the SWOW-RP project as well as a balanced preprocessed dataset in which each cue is judged by exactly 70 participants and responses are normalized and spellchecked. Scripts for preprocessing and evaluation can be found at https://github.com/almadana/SWOW-RP. Raw and processed data, together with cue and response statistics, can be found below.

 SWOW-RP22 [79Mb]

Citation: Cabana, A., Zugarramurdi, C., Valle-Lisboa, J.C., & De Deyne, S. (2024). The “Small World of Words” free association norms for Rioplatense Spanish. Behavior Research Methods, 56 (2), 968-985.

English Data (SWOW-EN18)

Updated 18 October 2018

Word association and participant data for 100 primary, secondary and tertiary responses to 12,292 cues. The data published in Behavior Research Methods were collected between 2011 and 2018. In the preprocessed data, cues and responses are normalized by spell-checking, correcting capitalization and Americanizing spelling. In addition, the preprocessed file contains data in which each cue is judged by exactly 100 participants (see the GitHub repository for details).

Scripts with a processing pipeline to analyse these data in R can be obtained from the SWOWEN-2018 GitHub repository. Note to R users: use the following command to deal with quotation marks, otherwise the entire file might not be read in correctly.

    X = read_delim('strength.SWOW-EN.R123.csv', delim = '\t', quote = '', escape_backslash = F, escape_double = F)

Raw and processed data, together with cue and response statistics, can be found below.
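The same quoting issue can arise outside R. As a hedged sketch for Python users, the standard csv module can be told to treat quotation marks as ordinary characters with csv.QUOTE_NONE. The column names and sample rows below are hypothetical and only illustrate the problem; they are not copied from the released file.

```python
import csv
import io


def read_tsv_without_quoting(handle):
    """Read tab-delimited rows, treating quotation marks as ordinary characters."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample in which responses contain a bare quotation mark.
sample = io.StringIO(
    'cue\tresponse\tR123\n'
    'inch\t"\t2\n'
    'quote\t"unquote\t1\n'
)
rows = read_tsv_without_quoting(sample)
print(len(rows))
print(rows[0]["response"])
```

With the default quoting behaviour, an unmatched quotation mark at the start of a field opens a quoted field that can swallow the following lines, which is the kind of failure the R note warns about.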

 SWOW-EN18 [80Mb]

Citation: De Deyne, S., Navarro, D.J., Perfors, A. et al. (2019). The “Small World of Words” English word association norms for over 12,000 cue words. Behavior Research Methods, 51, 987–1006. https://doi.org/10.3758/s13428-018-1115-7

Dutch Data (SWOW-NL13)

Dutch word association data (SWOW-NL). Word association and participant data for 100 primary, secondary and tertiary responses to 12,571 cues as reported in De Deyne, Navarro and Storms (2013).

 Associative strength [22Mb]
  Matlab .mat file [10Mb]

Citation: De Deyne, S., Navarro, D.J. & Storms, G. (2013). Better explanations of lexical and semantic cognition using networks derived from continued rather than single-word associations. Behavior Research Methods, 45, 480–498. https://doi.org/10.3758/s13428-012-0260-7

Data in other languages

Contact [email protected] to discuss work-in-progress files in other languages.

Current data

Note that current data can be accessed on the project page as well, but there are some caveats. These data correspond to a work-in-progress snapshot with limited preprocessing. The purpose of the explore and visualization options in the project interface is to make the data accessible to the general public in a concise form, which means only the strongest responses are shown.

License

The data are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. They cannot be redistributed or used for commercial purposes.

