Making ranking fun: rating trumps voting

by Jaysen Marais on January 21, 2009 in Development

Now that the two initial batches of challenges have completed (10 challenges in all), we can start to take stock of how the challenge arbitration process is working. I don't think there's any disagreement that 'better' images are bubbling to the top; however, there's been keen interest (in the forums and via feedback emails) in our method for separating 2nd from 3rd, 18th from 19th, etc.

Is this thing better than that thing?
Early in the planning phase of the challenges project we scoured our own forums and the wider web to gain a feel for different flavours of online content competitions. When it comes to democratic judging models there are 'star rating' systems (e.g. amazon), +/- systems (e.g. digg), +1 systems (e.g. electoral), point packets (e.g. pentax forum with 5,3,1), point-spreads (n points between <=n images, e.g. sony talk forum) and more. These models fall into two broad groups: rating models (inherent merit independent of competitors) and voting models (merit relative to the merit of ALL competitors). We knew we had to support many judging models eventually (and our system is extensible enough that we can); however, we had to pick a model to start with. We went with a rating model (stars) and here's why.

Challenges have to be fun
Meaningless constraints, menial tasks and feelings of obligation are all decidedly un-fun, so we had to steer clear of them as much as possible. Consider also that aesthetic sensibility varies from user to user, as do enthusiasm, attention span, available time and a host of other factors. With these in mind, let's consider voting models versus rating models.

29th placed entry 'cube' ('geometry' challenge)

In a voting model: unplaced

Our rating model: 29th (top 10%)
Overall rating: 3.208 stars
[Chart: vote distribution for the 'cube' entry in the geometry challenge, from low to high ratings]

Pure voting systems: everyone votes, most opinions ignored
Consider a hypothetical voting system in which a user is required to vote for their favourite 3 images (and rank them). Our hypothetical user starts with the first image they see, comparing it to more images (one at a time), keeping their favourite until they've seen them all at which point the selected image is 'the best'. The process of picking places 2 and 3 then becomes an exercise of searching through the gallery to find 'those other good ones'. To have any hope of a fair decision, several passes are required by each voter (i.e. a menial task) which is both boring and impractical for people taking a coffee break. Furthermore, this model also only yields rankings for the cream of the crop leaving the aspiring photographers (i.e. most entrants) with no feedback on their work (again, not fun). In algorithm parlance this voting method turns voters into bubble sorting machines (when determining 1st place, some optimizations for subsequent places). Sound fun?
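To put a rough number on that menial task, here is a minimal sketch of the repeated-pass selection described above: one full pass through the gallery per place, keeping the favourite seen so far. The function name `pick_top_k`, the `prefers` callback and the 500-entry gallery of plain numbers are all hypothetical, chosen only to count the pairwise comparisons a single voter would have to make.

```python
def pick_top_k(entries, prefers, k=3):
    """Pick k favourites by repeated full passes, counting pairwise comparisons."""
    remaining = list(entries)
    ranked = []
    comparisons = 0
    for _ in range(min(k, len(remaining))):
        # Start with the first image seen and keep the favourite so far.
        best = remaining[0]
        for candidate in remaining[1:]:
            comparisons += 1
            if prefers(candidate, best):
                best = candidate
        ranked.append(best)
        remaining.remove(best)
    return ranked, comparisons


# Hypothetical gallery: 500 entries whose appeal to this voter is just a number.
gallery = list(range(500))
top3, effort = pick_top_k(gallery, prefers=lambda a, b: a > b, k=3)
print(top3, effort)  # [499, 498, 497] 1494
```

Under these assumptions one voter makes 499 + 498 + 497 = 1494 comparisons to rank just three images, and the other 497 entries get no feedback at all.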

Even less enjoyable is the realisation that in all voting systems (especially electoral systems) votes for quirky images will have essentially no impact on the outcome. This saps people's will to express their opinions if their own tastes are a little left-field. Not cool!

Our rating system: no-one's opinion wasted
Instead consider a hypothetical rating system in which users rate an image not compared to ALL other images in the challenge but according to its own merits (factoring in a voter's taste and their interpretation of the rules). If enough users rate an arbitrary number of images each, considering each image in isolation, an amazing thing happens: consensus. True, we may never know whether randomUser99 liked the 29th placed entry in the geometry challenge if they didn't rate it, but we can say that people generally thought it was one of the better images in the challenge (in the context of the challenge theme). The image owner (and anyone else) also has access to 40 people's opinions of their image, 7 of whom thought it worthy of 5 stars versus only 1 person deeming it 0.5 stars. Every entrant also has an overall crowd-sourced rating (3.208 in the case above) to improve upon in subsequent challenges, making the results interesting for everyone, not just a handful of winners.

Voting results are interesting for winners, rating results are interesting for all participants

Is it really that simple?
So rating systems are great. Well, it turns out they have a few vulnerabilities, a few pitfalls for the unwary developer. Firstly, you need a sufficient number of ratings on each entry, with the number (not necessarily the magnitude) of ratings distributed as evenly as possible across entries. Secondly, you need a robust technique to derive an overall rating from many user ratings in such a way that overall ratings are comparable (to facilitate ranking). Each of these topics is a blog post in itself (this post's already a bit lengthy) but suffice to say that we feel we've come up with a neat solution to both.
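This post doesn't describe how ratings are actually spread across entries, so the following is only a sketch of one simple way to attack the first pitfall: show each voter an entry they haven't rated yet, favouring whichever entries currently have the fewest ratings. The function name `next_entry_to_show` and the dictionary of rating counts are hypothetical.

```python
import random


def next_entry_to_show(rating_counts, already_rated, rng=random):
    """Return an entry id the voter has not rated, favouring the fewest ratings.

    rating_counts maps entry id -> number of ratings received so far.
    Returns None when the voter has already rated every entry.
    """
    candidates = [e for e in rating_counts if e not in already_rated]
    if not candidates:
        return None
    fewest = min(rating_counts[e] for e in candidates)
    # Break ties at random so no single entry is always served first.
    return rng.choice([e for e in candidates if rating_counts[e] == fewest])


# Hypothetical state of a small challenge.
counts = {"cube": 40, "spiral": 12, "lattice": 27}
print(next_entry_to_show(counts, already_rated=set()))       # spiral
print(next_entry_to_show(counts, already_rated={"spiral"}))  # lattice
```

A scheme like this evens out the number of ratings per entry without telling anyone how many stars to give.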

If it's rating, why call it voting?
Even though we (I) feel that rating systems are preferable to voting, given the current system configuration (a few large concurrent challenges run by dpreview staff), we plan to allow user-created challenges within the next few months and our citizen challenge hosts are likely to have their own ideas. That's why we've planned for a variety of judging models (including various voting models) for them to choose from. Even though we plan to support both rating and voting judging models at some point, we had to pick a label for UI gee-gaws (buttons, titles etc.) and stick with it. Why 'vote'? Well, given the two terms 'rate' and 'vote', our 30-second co-worker survey revealed that 'vote' is a more compelling call to action. So there you have it.

Stay tuned
By this point in the post, the anxiety of needing to know the exact algorithm for calculating overall image ratings may be causing some readers to gnaw their own fingers off. Well, luckily humans have plenty of fingers because this isn't that post. Fear not, the draft of 'the algorithm post' is gestating nicely and should emerge soon enough. Until then, keep voting (rating) and curb that gnawing.

With votes pouring in, the weighting game begins

by Jaysen Marais on January 07, 2009 in Development

Most of you will know that as part of dpreview's 10th anniversary celebrations we recently launched the new Challenges feature. The reception has been fantastic and the 5 initial challenges launched (with many more to come) saw incredible participation (most reaching the initial 500-entry limit inside 48 hours). Old news, you say? Well, now that voting is underway I thought I'd talk a little about how that's progressing. In a word ... fantastic.

The really juicy numbers are locked away in our private (for now) stats pages (manna from heaven for visualization and stats geeks) but even our publicly available numbers are interesting.

  • The least popular challenge already has 3969 votes with the most popular receiving several times that (with several days voting to go).
  • The challenges feature is seeing about 5% of overall site traffic
  • A vote is placed every 6 seconds (average)
  • 32687 votes total (as of writing)
  • 4674 entries across all challenges (~10% disqualified)

Even taking into account the traffic profile for new features (launch spike, trough, slow build) these are impressive numbers (we're certainly happy). We're really excited to see what happens when we throw open the doors and enable user-created challenges. 

It's written in the stars
We've currently implemented only a single voting scheme: stars (½ to 5 stars in ½ star increments) but have plans to introduce others eventually. The success of the challenges system hinges on votes reflecting image quality, but with the subjective nature of image evaluation and the diversity of voters we weren't quite sure what the distribution of votes would be. Some staffers expected a normal distribution (most votes clustering around the mean). Others expected more of a 'hot or not' style distribution (i.e. mostly 'extreme' votes). Of course the reality ended up being more complicated.

[Charts: vote distribution per challenge, with average rating]

  • Indoor portrait ...: average 2.77
  • Backyard Safari: average 2.63
  • Shadows: average 2.62
  • Wrapped Up: average 2.04
  • Compact/Abstract: average 2.41

There's a lot to be gleaned from the above. Not least:

  • Significant variation between challenge averages (2.04 on the low end, 2.77 on the other).
  • Distribution resembles classic normal distribution, but not in all cases (see Wrapped Up).
  • Voters are generally less likely to award ½ star ratings than full star ratings.
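For readers who want to produce this kind of breakdown themselves, here is a minimal sketch of tallying star votes into the ten ½-star buckets and computing a challenge average. The helper names and the sample votes are made up for illustration; they are not our real data.

```python
from collections import Counter

# The ten allowed ratings: 0.5, 1.0, ..., 5.0 (half-star increments).
STAR_VALUES = [s / 2 for s in range(1, 11)]


def vote_distribution(votes):
    """Count how many votes fall on each allowed star value."""
    counts = Counter(votes)
    unknown = set(counts) - set(STAR_VALUES)
    if unknown:
        raise ValueError(f"not a valid star rating: {sorted(unknown)}")
    return {s: counts[s] for s in STAR_VALUES}


def half_star_share(votes):
    """Fraction of votes that landed on a half-star value (0.5, 1.5, ...)."""
    return sum(1 for v in votes if v % 1 == 0.5) / len(votes)


# Hypothetical votes for one challenge.
sample_votes = [3.0, 2.5, 3.0, 1.0, 5.0, 3.0, 2.0]
distribution = vote_distribution(sample_votes)
average = sum(sample_votes) / len(sample_votes)
print(distribution[3.0], round(average, 2), round(half_star_share(sample_votes), 2))
```

Comparing `half_star_share` across challenges is one quick way to check the last observation in the list above.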

Just to satisfy the curious, here's the overall distribution of votes (across all 5 open challenges) as of writing.

[Chart: overall distribution of votes across all 5 open challenges]

Leave it to Bayes
There's been a little concern in the forums that determining winners will be impossible due to our use of a star-rating system combined with the effects of outliers and variable votes per image. Fret not, our algorithm does take into account both score and 'confidence' using some fairly established Bayesian techniques. There will be winners announced in a few days and the voting/ranking combo is definitely sorting the wheat from the chaff (it's fun to watch winners emerge). If there's enough interest I might explore this issue in a further post. Rest assured that we're on top of it.
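The exact algorithm isn't covered in this post, so the snippet below is not it. It's a sketch of one widely used Bayesian-style approach, the 'Bayesian average', which pulls an entry's score towards a prior mean until enough votes have arrived to trust its own average. The function name and the `prior_mean` and `prior_weight` values are assumptions picked purely for illustration.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Shrink an entry's mean rating towards prior_mean.

    prior_weight acts like a number of 'phantom' votes at prior_mean:
    entries with few real votes stay close to the prior, while entries
    with many votes end up close to their own average.
    """
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)


# Hypothetical entries: A has two perfect votes, B has forty votes of 4 stars.
entry_a = [5.0, 5.0]
entry_b = [4.0] * 40

score_a = bayesian_average(entry_a, prior_mean=2.5, prior_weight=10)
score_b = bayesian_average(entry_b, prior_mean=2.5, prior_weight=10)
print(round(score_a, 3), round(score_b, 3))  # 2.917 3.7
```

With these made-up numbers the entry with 40 consistent votes outranks the one with two perfect votes, which is the sort of 'score plus confidence' behaviour described above.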

Vote early and often
Whilst you can only vote for a single entry once, nothing prevents (or obliges) you voting for every image (in fact someone's already tried). Voting makes the whole thing work so make your vote(s) count now.

You've got to be in it to win it
Several challenges have begun since the initial wave with new challenges opening virtually every day. Make sure to get your photos in the mix and good luck.

Copyright 1998-2008 Digital Photography Review, dpreview.com Ltd.