I’ve often wondered how many raters I need to sample to get reliable stimulus ratings.
This will obviously depend on the stimuli and what they’re being rated for. If raters disagree a lot with each other (high inter-rater variation) or the stimuli differ very little from one another (low inter-stimulus variation), you will need more raters before the mean ratings become reliable.
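To make that trade-off concrete, here is a minimal sketch under a deliberately simple assumed model: each rating is the stimulus's true score plus independent rater noise, with no systematic rater biases. Under that assumption, the reliability of a mean over `k` raters is the true-score variance divided by itself plus the noise variance shrunk by `k` (the Spearman-Brown logic). The function names (`expected_reliability`, `simulated_reliability`, `raters_needed`) and the parameters `stim_sd` and `rater_sd` are hypothetical illustrations, not anything from an established package. The simulation checks the formula by correlating the mean ratings from two independent panels of raters.

```python
import random
import statistics


def expected_reliability(stim_sd: float, rater_sd: float, n_raters: int) -> float:
    """Reliability of a mean of n_raters ratings, ASSUMING
    rating = true stimulus score + independent rater noise."""
    true_var = stim_sd ** 2
    error_var = rater_sd ** 2 / n_raters  # averaging shrinks noise variance by k
    return true_var / (true_var + error_var)


def raters_needed(stim_sd: float, rater_sd: float, target: float) -> int:
    """Smallest number of raters whose mean reaches the target reliability."""
    k = 1
    while expected_reliability(stim_sd, rater_sd, k) < target:
        k += 1
    return k


def _pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulated_reliability(stim_sd, rater_sd, n_raters, n_stimuli=200, seed=1):
    """Correlate mean ratings from two independent panels of n_raters each,
    rating the same simulated stimuli (same hypothetical model as above)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, stim_sd) for _ in range(n_stimuli)]

    def panel_means():
        # Each stimulus gets n_raters noisy ratings; keep their mean.
        return [
            statistics.fmean([t + rng.gauss(0, rater_sd) for _ in range(n_raters)])
            for t in true_scores
        ]

    return _pearson(panel_means(), panel_means())


if __name__ == "__main__":
    # Raters disagree twice as much as the stimuli differ (rater_sd = 2 * stim_sd).
    for k in (1, 5, 10, 20):
        print(k, round(expected_reliability(1, 2, k), 3),
              round(simulated_reliability(1, 2, k), 3))
    print("raters for 0.8:", raters_needed(1, 2, 0.8))
```

With `rater_sd` twice `stim_sd`, a single rater gives a reliability of only 0.2 under this model, and you need 16 raters to reach 0.8. Doubling `stim_sd` instead cuts that to 4. Real ratings usually also include rater main effects (some people rate everything higher), which this sketch ignores.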