Data Collection
Most of our studies are conducted using survey experiments, where we randomize a block of text, an image, or a task and then measure an outcome of interest via survey question. Each study typically takes a few minutes to complete, depending on the tasks, and participants are paid approximately $0.15 per minute. We strive to recruit at least 200 participants per experimental condition (400 for the typical two-condition experiment), which gives us roughly 80% power to detect an effect of about 0.28 standard deviations (a bit more than a quarter of a standard deviation) at the conventional 5% significance level.
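For readers who want to check this sample-size reasoning, the calculation below is a minimal sketch using Stata's built-in power command; it assumes a two-sided test at alpha = 0.05 and an outcome standardized so that one unit equals one standard deviation.

    * Minimum detectable difference for two groups of 200 at 80% power,
    * alpha = 0.05, with the outcome's standard deviation normalized to 1
    power twomeans 0, n1(200) n2(200) sd(1) power(0.8)

With these inputs, Stata should report a detectable difference of approximately 0.28, consistent with the figure above.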
We recruit participants for our online surveys and experiments through two primary platforms: Amazon Mechanical Turk (MTurk) and Prolific. MTurk is a well-known platform where everyday people perform simple Human Intelligence Tasks (HITs) in return for payment. It is commonly used by university professors for academic research, as well as by market researchers and others with short, straightforward online tasks (e.g., image tagging). The MTurk population has historically been fairly representative of the U.S. population (Paolacci et al., 2010) and reliable for establishing internal validity (Berinsky et al., 2012). Similarly, Prolific is an online research platform specifically focused on providing participants for rigorous academic and practitioner research. Prolific also offers additional features such as demographic targeting and balanced samples (e.g., 50% male, 50% female).
Once the allotted number of participants has completed the study, we download the data from our survey platform as an Excel file and clean each variable so the data are ready for analysis.
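As a rough illustration of that import-and-clean step, the Stata sketch below shows one plausible workflow; the file name and variable names (study_data.xlsx, outcome, condition) are placeholders rather than our actual study files.

    * Import the downloaded Excel file, treating the first row as variable names
    import excel using "study_data.xlsx", firstrow clear

    * Drop responses that are missing the outcome measure
    drop if missing(outcome)

    * Convert the text condition column into a labeled numeric variable
    encode condition, generate(condition_num)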
Analysis & Reporting
We use Stata statistical software to analyze our data. Each analysis is chosen based on the specific research design and question being asked. Most of our questions are straightforward comparisons of two conditions in a between-subjects experiment. In these cases, we use independent (i.e., two-sample) t-tests or ordinary least squares (OLS) regression analysis. For interactions (e.g., does the effect of the experimental condition depend on another variable, such as gender?), we use multiple regression analysis with an interaction term.
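The commands below sketch what these between-subjects analyses typically look like in Stata; the variable names (outcome, condition_num, gender) are illustrative placeholders.

    * Two-condition comparison: independent-samples t-test
    ttest outcome, by(condition_num)

    * The same comparison as an OLS regression with condition as a factor variable
    regress outcome i.condition_num

    * Interaction: does the condition effect differ by gender?
    regress outcome i.condition_num##i.gender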
In the event we use a within-subjects experimental design (i.e., when each participant sees both experimental conditions, but the order is randomized), we use a paired-samples t-test. For interactions between a within-subjects variable and a between-subjects variable, we use mixed-effects regression models.
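Again as a hedged sketch, the within-subjects versions might look like the following in Stata, assuming the paired outcomes are stored as two variables (outcome_a, outcome_b) for the t-test and as long-format data with one row per participant per condition for the mixed model; all names here are placeholders.

    * Paired-samples t-test comparing the two conditions within each participant
    ttest outcome_a == outcome_b

    * Mixed-effects model for a within- by between-subjects interaction,
    * with a random intercept for each participant (long-format data)
    mixed outcome i.within_condition##i.between_condition || participant_id: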
The primary results of interest, which we report for all of our studies, are the outcome averages for each experimental condition and the "p-value," which tells us the probability of observing a difference between conditions at least as extreme as the one in our data if, in reality, there were no true difference. Lower p-values mean we can be more confident that the results are not just due to chance in who happened to be sampled. We use a 95% confidence level for our statistical tests and treat p-values below 0.05 as statistically significant, though we caveat results with small effect sizes and p-values between 0.01 and 0.05.
To present the results graphically, we use Stata's graphing commands to produce a bar chart with one bar for each experimental condition, along with standard error bars to show the level of uncertainty, as is the norm in psychology research.
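One plausible way to build such a chart in Stata is sketched below: collapse the data to condition means and standard errors, then overlay error bars on the bars. The variable names are placeholders and the styling options are illustrative.

    * Collapse to one row per condition with the mean and standard error of the outcome
    preserve
    collapse (mean) mean_out = outcome (sem) se_out = outcome, by(condition_num)

    * Upper and lower bounds for the standard-error bars
    generate hi = mean_out + se_out
    generate lo = mean_out - se_out

    * Bar chart of condition means with error bars overlaid
    twoway (bar mean_out condition_num, barwidth(0.6)) ///
           (rcap hi lo condition_num), ///
           ytitle("Average outcome") xtitle("Condition") legend(off)
    restore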
Finally, in light of the reproducibility crisis in academia (Bohannon, 2015; Camerer et al., 2018), we have adopted a set of research best practices. We report the results of every study, state each study's limitations clearly, and make our materials and data available upon request (requests can be submitted via our Contact Us page). We also plan to periodically replicate a random sample of our studies and report any differences between the old and new results.
References
Berinsky, A. J., Huber, G. A., & Lenz, G. S. 2012. Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk. Political Analysis, 20(3), 351-368.
Bohannon, J. 2015. Many psychology papers fail replication test. Science, 349(6251), 910-911.
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., ... Wu, H. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644.
Paolacci, G., Chandler, J., & Ipeirotis, P. G. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.