Ratings
Category: Ratings (4 posts)
Oct 23 2011
New Job at Redbrick Health
I am now gainfully employed at Redbrick Health. I started a few weeks ago, and most of that time has been spent getting familiar with their main software product. For those who are not familiar with Redbrick, it is a company that tries to reduce the health costs of large "self-insured" companies by improving the health behavior of their employees. A "self-insured" company is one that pays for the health costs of its employees as they occur and uses a large health company (such as Blue Cross Blue Shield) only for administration and accounting purposes. Because they pay for the costs directly, these companies have an incentive to reduce the health costs of their employees.
Redbrick helps reduce those costs with software and coaching programs that encourage employees to do things like eat better, exercise more, take their medications, and see the doctor for needed checkups. What is novel about Redbrick is how the software works. It takes its cue from social gaming software (Farmville comes to mind) and tries to make improving one's health a bit of a game, where you can earn points and achieve goals over shorter time intervals than is typical for many health programs.
The development environment is Grails and Groovy, which makes for quite a different development approach than I have seen before. Java opened the door to using "reflection" to automatically bind parts of an application together. With Grails and Groovy, reflection is how even the simplest numeric operation is performed, and calling a method through reflection is as simple as making the method call or "pseudo-accessing" an attribute. In Grails, a lot of the application's implementation happens through Groovy's implicit reflection "magic". The other big thing in Groovy is "closures", which I have found radically change the design of most of the software I write.
Because of my new job, my work on ratings is currently on hold. There is quite a bit of new code in it and it is much more configurable than before. For those who want to see this code in its current raw state please download the zip file.
Aug 30 2011
Evaluating Rating Systems
In my prior blog post I suggested that head-to-head comparisons should be used to create numeric “chess-like” ratings, using the Elo Rating Model as the theoretical basis for computing ratings for the various goods and services being evaluated. But this brings up a question: how do you determine whether such a rating system is truly performing well?
In the multiplayer game simulation I wrote, it is quite easy to determine how well the rating solution is performing. The software knows the “true skill” of each player and can tell you precisely how well the differences in numeric ratings of two opponents reflect their true skill differences.
But in a real game, you do not know the actual skill of the players involved, so you need some other way to determine how well the rating system is working. If this were a simple problem of sampling a random variable, one would typically calculate the variance of a computed estimate against the observed values by taking the difference between each estimated value and the corresponding observed value, squaring it, and then taking the mean of these squares.
If we were to do this for a rating system, we would take the rating difference between the two players (or goods or services being compared) and use it to predict the percentage chance of the first player winning. That prediction would be the estimated value for the result of the match. The match itself would have a result of 1.0 if the first player won, 0.0 if the second player won, and 0.5 if there was a draw. We would then subtract the actual result (1.0, 0.0, or 0.5) from the estimated result, square the difference, and average all the squares computed this way.
But this turns out not to work well. The rating difference between the players is only an estimate; their true difference could only be known if the true skill values of the players were known. The estimated rating difference does behave like a standard statistical random variable, and standard variance computations could be used to judge its accuracy if we had a way of sampling the actual rating differences. But the estimated winning percentage is a complicated function of the rating difference, which means the random variable for the winning percentage, viewed as a function of the rating difference, does not follow a standard probability distribution. How do we fix this?
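For reference, here is a minimal Java sketch of the standard Elo expected-score curve, which is the "complicated function of the rating difference" referred to above. The class name and the conventional 400-point scale are my own illustrative choices, not taken from my simulation code.

public class EloExpectation {

    /** Expected winning percentage for the first player, given the rating difference. */
    static double expectedScore(double ratingDiff) {
        // The Elo model's logistic curve: a 400-point edge corresponds to roughly 10-to-1 odds.
        return 1.0 / (1.0 + Math.pow(10.0, -ratingDiff / 400.0));
    }

    public static void main(String[] args) {
        System.out.println(expectedScore(0));    // 0.50 -- equally rated opponents
        System.out.println(expectedScore(100));  // ~0.64
        System.out.println(expectedScore(400));  // ~0.91
    }
}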
It turns out that Jeff Sonas has already done this work, and you can find the formula at http://www.kaggle.com/c/ChessRatings2/Details/Evaluation. The formula for the variance term coming from a single resolved match (the equivalent of one of the squares in the standard variance computation) is:
-[Y*LOG10(E) + (1-Y)*LOG10(1-E)]
where LOG10 is the logarithm base 10, E is the estimated winning percentage, and Y is 1.0, 0.0, or 0.5 depending on who actually won the match. The actual variance value is the mean of the variance terms computed for all the matches played.
I go a little further in my use of this variance computation. I compute the best possible variance by assuming that, for each match, the result is the one that minimizes the variance term. So if the first player actually won the match, but the variance term would be smaller had the opponent won, then the computation for best possible variance uses that smaller value instead. To compute an absolute variance value, I take the variance on actual results as proposed by Jeff Sonas and subtract this best possible value. In the simulations I have performed, a value of 0.1 or less indicates a well-functioning rating system.
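Here is a minimal sketch of how this evaluation might be coded, assuming a hypothetical Match class that records the predicted winning percentage E and the actual result Y for each match; the class and field names are mine, not taken from the downloadable code.

import java.util.Arrays;
import java.util.List;

public class RatingEvaluator {

    /** Hypothetical record of one match: e = predicted winning percentage, y = actual result. */
    static class Match {
        final double e;   // estimated winning percentage for the first player
        final double y;   // 1.0 first player won, 0.0 lost, 0.5 draw
        Match(double e, double y) { this.e = e; this.y = y; }
    }

    /** Variance term for one resolved match: -[Y*log10(E) + (1-Y)*log10(1-E)]. */
    static double varianceTerm(double e, double y) {
        return -(y * Math.log10(e) + (1 - y) * Math.log10(1 - e));
    }

    /** Mean of the variance terms over all matches (the evaluation proposed by Jeff Sonas). */
    static double actualVariance(List<Match> matches) {
        double sum = 0;
        for (Match m : matches) sum += varianceTerm(m.e, m.y);
        return sum / matches.size();
    }

    /** Same mean, but each match is scored with whichever result minimizes its term. */
    static double bestPossibleVariance(List<Match> matches) {
        double sum = 0;
        for (Match m : matches) {
            // The term is linear in y, so the minimum always occurs at a decisive result.
            sum += Math.min(varianceTerm(m.e, 1.0), varianceTerm(m.e, 0.0));
        }
        return sum / matches.size();
    }

    public static void main(String[] args) {
        List<Match> matches = Arrays.asList(new Match(0.64, 1.0), new Match(0.75, 0.0));
        double absolute = actualVariance(matches) - bestPossibleVariance(matches);
        System.out.println("absolute variance: " + absolute);
    }
}

If the printed absolute variance stays at 0.1 or less over a large number of matches, the rating system would count as well-functioning by the criterion above.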
So how do we use this formula for rating systems other than games? Take the example of ranking the best 100 films of the last decade, with a group of experts consulted as the basis for creating the ranking. You evaluate the ranking by randomly selecting two films from the top 100 and asking one of the experts to judge which is better, repeating this many times so that every expert gets an equal number of head-to-head pairings to judge. The current ranking predicts which film should be judged better, and the percentage chance that the expert will confirm this prediction can be estimated from how far apart the two films sit on the top-100 list (perhaps each step in rank is declared to be 10 Elo rating points; this scale can be varied until the variance computation is minimized). The chess evaluation variance described above can then be used to rate the quality of the top-100 list.
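A hypothetical sketch of that film-ranking evaluation: each step in rank is treated as 10 Elo points (a tunable assumption), the implied winning percentage comes from the Elo curve, and each expert verdict is scored with the same variance term as above.

public class RankedListEvaluator {

    static double expectedScore(double ratingDiff) {
        return 1.0 / (1.0 + Math.pow(10.0, -ratingDiff / 400.0));
    }

    static double varianceTerm(double e, double y) {
        return -(y * Math.log10(e) + (1 - y) * Math.log10(1 - e));
    }

    /**
     * rankA and rankB are positions on the top-100 list (1 = best); expertPreferredA is the
     * expert's head-to-head verdict. Returns the variance term for this single judgement.
     */
    static double judge(int rankA, int rankB, boolean expertPreferredA, double pointsPerRank) {
        double ratingDiff = (rankB - rankA) * pointsPerRank;   // lower rank number = higher rating
        double e = expectedScore(ratingDiff);
        return varianceTerm(e, expertPreferredA ? 1.0 : 0.0);
    }

    public static void main(String[] args) {
        // Film ranked #5 vs film ranked #40; the expert agrees with the list.
        System.out.println(judge(5, 40, true, 10.0));
        // The same pairing where the expert disagrees contributes a much larger term.
        System.out.println(judge(5, 40, false, 10.0));
    }
}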
For those who don't like an idea unless it can make money, I believe there is some untapped potential here as well. The same approach could be used to evaluate stock pickers (predict which of two stocks will do better), rating agencies for bonds (predict which of two bonds has a higher chance of default), and formulas for evaluating derivatives (which of two derivatives is more accurately priced). Having a single metric that judges the quality of contributions of various financial experts could have great monetary value.
Aug 23 2011
Wonders of Ratings
Please read the earlier blog article for more on ratings.
I have been talking about my work on simulating competitive games and rating systems to anybody who might be interested. From these conversations, I have developed a growing conviction that ratings, as a general mechanism for evaluating skill in competitive environments, have much greater potential than most people realize.
For example, I have now heard a number of anecdotal stories about internet games that lost their user base because their competitors had rating systems and they did not. It is clear that rating systems give a game a competitive advantage in attracting players. Ratings are also used in more subtle ways. For example, Slashdot lets you rate stories and even lets you rate how other people rated stories. Google and Wikipedia both now have mechanisms for rating the quality of the content they show to the user.
Ratings are everywhere. Employees are rated by their coworkers and bosses. Teachers use tests to give ratings (grades) to their students. You can get ratings for colleges, cities, police departments, hotels, restaurants, and practically any activity performed by humans where there are competitors. Ratings are also used in finance. Bonds are rated by rating agencies, companies have market capitalizations, currencies have their current exchange rates, and so on.
This brings me to the main point I wish to make: most rating systems out there are far from optimal. One of the big problems is that they ask an authority to give absolute ratings. This is problematic for two reasons. The first is that there is a limited supply of authorities, and authorities can have their own biases. The second, and larger, issue is that it is hard to assign an absolute rating without reference to the competitors. Criteria and tests can help, but they can still fail to adequately discriminate between two different competitors.
The movie “The Social Network” portrays Mark Zuckerberg creating a web site where students rate which of two girls they think is better looking. In the movie, the actor portraying Mark argues that asking students for absolute ratings will fail because it is so difficult for students to come up with consistently applied rating scales. Instead, Mark uses an Elo chess-based rating solution: two girls are shown side by side and the viewing student is asked to choose which is better looking. It appears that this was a highly successful approach.
I believe that this idea has untapped potential. For example, when choosing which film should get the Oscar for best movie, a “head-to-head” approach could be used: members of the Actors' Guild could each be asked to judge roughly 50 pairings of randomly selected films, deciding which film in each pair is better. When making the judgement, a voter could also choose between “slightly better”, “clearly better”, and “far superior”, and that choice could determine the K-factor used when applying rating adjustments. I believe that this would produce a more accurate consensus pick for best film than the current approach.
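A hedged sketch of the Elo-style update this suggests, where the strength of the verdict picks the K-factor; the specific K values (8, 16, 32) and the 400-point scale are illustrative assumptions, not a worked-out proposal.

public class HeadToHeadUpdate {

    /** Verdict strength, mapped to an illustrative K-factor. */
    enum Verdict {
        SLIGHTLY_BETTER(8), CLEARLY_BETTER(16), FAR_SUPERIOR(32);
        final double k;
        Verdict(double k) { this.k = k; }
    }

    static double expectedScore(double ratingDiff) {
        return 1.0 / (1.0 + Math.pow(10.0, -ratingDiff / 400.0));
    }

    /** Returns the new ratings {winner, loser} after one head-to-head judgement. */
    static double[] update(double winnerRating, double loserRating, Verdict verdict) {
        double eWinner = expectedScore(winnerRating - loserRating);
        double delta = verdict.k * (1.0 - eWinner);   // winner scored 1.0, was expected to score eWinner
        return new double[] { winnerRating + delta, loserRating - delta };
    }

    public static void main(String[] args) {
        double[] updated = update(1500, 1500, Verdict.CLEARLY_BETTER);
        System.out.println(updated[0] + " / " + updated[1]);   // 1508.0 / 1492.0
    }
}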
As another example, suppose Slashdot replaced its current moderation system with the following. Instead of the normal “absolute” approach, Slashdot would present you with two randomly selected user comments and ask you which comment is better, how much better, and what makes it better. A similar thing could be done for Wikipedia content and for practically any other web site that offers user-generated ratings. You could be asked which restaurant is better, which hotel is better, which plumber is better, and so on. In many of these cases, the potential selections shown for judgement would have to be limited to selections with which the judger was familiar.
Of course, if the judgers tend to be familiar with only one or two of the potential selections, then this approach will not work and a more traditional approach has to be used. Even in that case, this approach can be used to evaluate the quality of the evaluations made by the judgers: if the judgers have to give written justifications for their judgements, those justifications can themselves be judged in “head-to-head” competitive fashion.
An interesting thing occurs if you use a “head-to-head” competitive solution for producing your ratings: you get numeric ratings for all the content being judged, not just a relative ranking. For example, if this were done for hotels, it might turn out that the top three hotels are very close in rating, while the next tier shows a large drop-off. If I were looking to book a good hotel, I might judge that the top three are close enough in rating that other criteria, such as price, location, and convenience, become more important.
Jun 22 2011
Rated Multiplayer Competitions
In my idle time, I have been playing various computer games. Some of these games (such as World of Warcraft) feature randomly assembled teams that compete against each other. After playing these team player-vs.-player (PVP) games for a while, I became frustrated by the total randomness in the quality and skill of the players on a team. Elite players dominated, and bad players could profit by doing nothing and still end up on a winning team.
This frustration brought back memories of other team sports I have participated in that had similar issues. Years ago I dreamed that such issues might be solved by creating a rating system similar to the one used for chess and applying it to members of team competitions. It would be great to go to a pick-up game of touch football or Ultimate Frisbee and know that your teammates were of similar skill and aptitude. But I knew such dreams were fantasies, because even if a rating system were possible, there was no way it would ever be applied to real sports: the overhead of maintaining information on all the players (and making players use it) would be quite cumbersome, and most players would not play enough games for a proper rating to be computed.
However, online computer games do not have these problems. It is easy to track the results of players, and it is even possible to track performance data to determine which players contributed more to the team effort. Dedicated players of computer games also tend to play far more team matches, often thousands a year. In this case, a real multiplayer rating system (using the Elo system from chess as its base) might be quite possible. How could I demonstrate this? One way would be to create a simulation of players and team matches and then show that a rating system would correctly determine the relative skill of the players to some degree of accuracy. This turned out to be a large effort, and the results produced many interesting facts, many of them unrelated to the number of players on a team. Because of this, I decided to write an extensive document about what I had done.
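As an illustration of the kind of simulation described above (reduced here to two players rather than teams, which is my own simplification): each player has a hidden "true skill", match outcomes are sampled with the probability implied by the true-skill difference, and the tracked Elo ratings are updated after every match. The K-factor of 16 and the match count are illustrative choices.

import java.util.Random;

public class TwoPlayerSimulation {

    static double expectedScore(double diff) {
        return 1.0 / (1.0 + Math.pow(10.0, -diff / 400.0));
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        double trueSkillA = 1700, trueSkillB = 1500;   // known to the simulation, hidden from the rating system
        double ratingA = 1500, ratingB = 1500;         // what the rating system tracks
        double k = 16;

        for (int match = 0; match < 2000; match++) {
            // Sample the winner using the true-skill difference.
            double pA = expectedScore(trueSkillA - trueSkillB);
            double scoreA = random.nextDouble() < pA ? 1.0 : 0.0;

            // Standard Elo update based only on the tracked ratings.
            double eA = expectedScore(ratingA - ratingB);
            ratingA += k * (scoreA - eA);
            ratingB -= k * (scoreA - eA);
        }
        // The tracked rating difference should settle near the true 200-point gap.
        System.out.println("estimated difference: " + (ratingA - ratingB));
    }
}

Because the simulation knows the true skills, it can report directly how close the estimated rating difference comes to the real one, which is exactly the measurement that is impossible with real players.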