Lately we’ve been occupying our gray cells with trying to predict how closely related two people are. So we got a bunch of data from Mechanical Turk and found that some techniques correlate reasonably well with the mean human judgment, while others don’t correlate at all.
Well, before we start to fret too much, we should first try to understand how much agreement there is among the people themselves. I tried computing variances, means, etc. to understand the responses but still couldn’t get a good feel for them. So I thought, why not emulate the test we want to run: hold out one rater and see how well their ratings correlate (Spearman and Pearson) with the mean of the remaining raters. Repeating the process with each rater gives us leave-one-out correlation statistics. Then we can see if our automatic “rater” passes this funky Turing test.
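The leave-one-out procedure above can be sketched in a few lines. This is a minimal illustration, not the original code: it assumes the ratings sit in a matrix with one row per rater and one column per item, and the helper names are my own.

```python
# Leave-one-out rater agreement: correlate each held-out rater's scores
# with the mean of the remaining raters (Pearson and Spearman).
# Pure-stdlib sketch; ratings is a list of rows, one row per rater.
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def ranks(xs):
    """Ranks with ties averaged, so tied ratings are handled sensibly."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation is just Pearson on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def leave_one_out(ratings):
    """For each rater, correlate their row against the mean of the rest."""
    results = []
    for held in range(len(ratings)):
        rest = [row for i, row in enumerate(ratings) if i != held]
        rest_mean = [mean(col) for col in zip(*rest)]
        results.append((pearson(ratings[held], rest_mean),
                        spearman(ratings[held], rest_mean)))
    return results
```

An automatic technique then "passes" if its correlation with the mean of all raters falls inside the range that `leave_one_out` produces for the humans.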
I did this for the small dataset and the large dataset. And lest we think that the results are due to different raters having different calibrations, I also ran the procedure with centering/standardization.
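The centering/standardization variant amounts to z-scoring each rater's column of scores before running the same leave-one-out pass, so a harsh rater and a generous rater who agree on the ordering end up on the same scale. A small sketch, again with hypothetical helper names:

```python
# Per-rater standardization: map one rater's scores to zero mean and
# unit variance before computing leave-one-out correlations, so that
# differences in rater calibration drop out of the comparison.
from statistics import mean, pstdev

def standardize(scores):
    """Z-score a single rater's list of scores."""
    m, s = mean(scores), pstdev(scores)
    if s == 0:  # a rater who gave the same score to everything
        return [0.0] * len(scores)
    return [(x - m) / s for x in scores]
```

Note that z-scoring a rater doesn't change their own Spearman rank order, but it does change the mean of the other raters that each held-out rater is compared against.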
Results for the small dataset:
Results for the large dataset:
The results are fairly similar across both datasets and both correlation tests. Most of the leave-one-out correlations hover around 0.5–0.7, which is a lot lower than I would’ve thought. Can our techniques measure up to this bar?