< Earlier Kibitzing · PAGE 12 OF 12 ·
|Feb-18-16|| ||tuttifrutty: There's no rating inflation.|
|Feb-18-16|| ||alfamikewhiskey: Completely non-chessic info: Élo is a pet form of Éliás (Eliyahu / Elijah).|
|Feb-19-16|| ||frogbert: <AylerKupp> I recently posted an explanation of how I calculate systemic inflation, but not being a premium member, I can't effectively search my posts. ;) Would you be so kind as to search for this post? It would probably appear on the first or second page of search hits if you search *my* kibitzing for the words "systemic inflation".|
|Feb-19-16|| ||SugarDom: Which i think is dratz for CG not to give a celebrity kibitzer like <frogbert> a free premium.|
|Feb-19-16|| ||Pulo y Gata: Chessgems is avoiding frogbert inflation if they recognize the guy as celebrity and give him free premium membership. Frogs are known to puff up.|
|Feb-19-16|| ||chancho: <Feb-13-15
frogbert: Hi, shams.
With <systemic inflation> I'm simply talking about the average influx of rating points in the system.
This can easily be calculated for the entire pool (or some subset of the pool based on criteria like age, activity (games played) or rating range) using the complete rating lists. There are only a few points to observe:
1) players who aren't active (don't play any rated games) don't impact the systemic inflation
2) if a player's rating doesn't change between two lists, this player doesn't contribute to systemic inflation - but if the player has had games rated, he/she should be counted for averaging purposes.
3) new players in the pool in a specific list don't have any immediate impact on systemic inflation - their initial ratings are based on the rating performance in their rating norms using a conservative approach for scores above 50%. (Also, statistically new players are just as likely to increase their ratings after their entry on the list as they are to lose rating.) They will be included in the calculations the first time they have games rated as a rated player. (For a rated player, games against unrated players are <not rated> - even if they may count towards performance rating for a title norm.)
4) to calculate the systemic inflation between two adjacent official rating lists one simply adds up the rating changes for each player that has had his/her rating adjusted between the two lists. To calculate the average systemic inflation one divides this sum by the number of players that have had games rated.
Various people have claimed over the years that "inflation is most visible at the top", i.e. for the highest rated players. But like I've written here before, my calculation of <systemic inflation> shows that 2600+ players as a group lose rating points on average - and have basically always done so. The same goes for 2700+ players.
That phenomenon shouldn't be surprising, really. In a big population like the current rating pool, I think it's simply an example of regression toward the mean.>
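The four points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not frogbert's actual tooling: the data layout (a dict of ratings for the earlier list, and a dict of `(rating, games_rated)` pairs for the later list), the function name and the sample numbers are all hypothetical.

```python
def systemic_inflation(prev_list, next_list):
    """Average rating change per player with rated games between two lists.

    prev_list: {player: rating} for the earlier list (hypothetical layout).
    next_list: {player: (rating, games_rated)} for the later list.
    """
    total_change = 0
    counted = 0
    for player, (new_rating, games_rated) in next_list.items():
        if player not in prev_list:
            continue  # point 3: new players have no immediate impact
        if games_rated == 0:
            continue  # point 1: inactive players are ignored
        total_change += new_rating - prev_list[player]
        counted += 1  # point 2: counted even when the change is zero
    # point 4: sum of changes divided by the number of players with rated games
    return total_change / counted if counted else 0.0


# Made-up sample data, for illustration only.
prev = {"A": 2500, "B": 2400, "C": 2300, "D": 2200}
nxt = {
    "A": (2510, 9),  # gained 10
    "B": (2400, 4),  # played, rating unchanged
    "C": (2300, 0),  # inactive
    "D": (2194, 6),  # lost 6
    "E": (2250, 9),  # new on this list
}
print(systemic_inflation(prev, nxt))
```

With these made-up numbers only A, B and D are counted, so the result is (10 + 0 - 6) / 3.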
|Feb-19-16|| ||offramp: I don't think there is much inflation.
Players get better, just like sprinters and marathon runners and swimmers get faster.
Other sports, such as cricket, also use a system based on the Elo system. Does anyone know if these have inflation?
|Feb-19-16|| ||AylerKupp: <frogbert> Glad to do it. Here are the links to your posts containing the word "systemic" and addressing the definition of systemic inflation, most recent ones first.|
Tata Steel (2016) (kibitz #1053)
Shams chessforum (kibitz #788)
Hans Arild Runde (kibitz #6134)
Magnus Carlsen (kibitz #70671)
Hans Arild Runde (kibitz #4317)
Hans Arild Runde (kibitz #3423)
There is a fairly small number of posts that I found where you actually define what you mean by "systemic inflation". This might be the result of my too rigid filtering of the posts; most of them address the <results> of your analysis and not the methodology, provide examples, or (of course) contain some back-and-forth banter with other posters.
If you don't find what you are looking for in these posts, I'll gladly post the entire set of links and let you look for the one(s) that you think are the most applicable. And please don't hesitate to ask me to do that; I'm retired and have time on my hands. Besides, there are only about 7 pages of posts so it's not a lot of work / time involved.
|Feb-19-16|| ||frogbert: There's a longer sequence of posts about rating inflation in my player page, starting with Hans Arild Runde (kibitz #5808) where several relevant points and observations are expressed. The linked post by <cro777> quoted a 2013 interview with Kasparov which sparked some rounds of debate:|
"Fischer's rating was 2785 in 1972, but it certainly has much more weight than Carlsen’s higher rating. And this is also comparative to my 2851 rating in 1999. An evolutionary factor is in play here. That’s why, despite the mathematical soundness of ratings, I wouldn’t give them such a historical significance.
Fischer regularly achieved “+6” when he was moving up, I often attained “+6-7”, while Carlsen gets “+3-4”. And that’s enough, because the pyramid has grown and today’s super tournaments already have ratings beyond 2750."
I didn't treat Kasparov's take on the workings of the rating system very humbly in my follow-up post. Except that I totally agree with him that FIDE ratings shouldn't be used to compare players from different eras, of course. It's a couple of Kasparov's other implied assumptions/claims I disagree with.
|Feb-19-16|| ||frogbert: <AylerKupp> Thanks, you found the one I was thinking of - the most recent in your list of hits. |
[There has been] systemic inflation of 1-1.5 points a year the past 25 years, but it's a moot point anyway since ratings don't measure skills [directly] - only relative success against your peers.
Systemic inflation is "my term" for total gain TG minus total loss TL of rating points averaged over the number of players NP who had games rated between two rating lists. Hence, to calculate systemic inflation since say 1990, for each rating list Rn (n = the first list you consider) do this:
Look at list Rn and Rn+1 and calculate
An = (TG - TL)/NP between Rn and Rn+1
Then do the same, calculating An+1, for lists Rn+1 and Rn+2, and so on, for all lists up to the current list. Then calculate the total average of all An ... An+N for the N consecutive rating periods you considered.
I refer to this total average as the systemic inflation [per rating period] over these N rating periods.
However, I'm not totally convinced that this definition is entirely sound in all detail. For instance:
1) Is it meaningful to average per player when players' activity varies so much?
2) There's a slight difference between doing calculations of influx/outflux (TG minus TL) month by month (between adjacent lists) and comparing, say, the February list of 2015 with the February list of 2016 directly. Does it matter?
And so on. Still, since I for the most part have details about age, activity (#games played) and (more recently) each player's K (although it can be deduced for older data too - which I might do and add to my own database), it's possible to show a lot of correlations in this data set, related to systemic inflation, rating gain/loss and so on.
Correlation and cause are two different beasts, but I still think insights can be gleaned here.
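As a rough sketch of the definition above, here is An = (TG - TL)/NP for each period and the plain average of those values over N periods. The input format (one list of rating changes per period, covering only players who had games rated) and the numbers are hypothetical; the real calculation would draw them from the official rating lists.

```python
def period_average(changes):
    """A_n = (TG - TL) / NP for one pair of adjacent rating lists.

    changes: rating changes of the players who had games rated in the
    period (zero changes included, since those players still count in NP).
    """
    total_gain = sum(c for c in changes if c > 0)   # TG
    total_loss = -sum(c for c in changes if c < 0)  # TL, as a positive number
    return (total_gain - total_loss) / len(changes)


def systemic_inflation_per_period(periods):
    """Plain average of A_n over N consecutive rating periods."""
    averages = [period_average(changes) for changes in periods if changes]
    return sum(averages) / len(averages)


# Three made-up rating periods, for illustration only.
periods = [
    [+10, -4, 0],      # A = 6 / 3
    [+3, -3],          # A = 0 / 2
    [+8, -2, -1, +3],  # A = 8 / 4
]
print(systemic_inflation_per_period(periods))
```

Every period gets the same weight here regardless of how many players were active in it, which is one of the choices the definition leaves open.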
|Feb-22-16|| ||pinoy king: All one has to do is look at the horrendous quality of Carlsen's games today to see that he does not deserve to be rated over 100 points more than previous world champions.|
|Feb-22-16|| ||nimh: http://www.chessanalysis.ee/rating%...
This paper offers a brief demonstration of the relationship between the quality of play and ratings in both human and computer rating systems, using Komodo 8. In a few months I'll upload a longer overview where I'll look at further differences between the way engines and humans play chess. Instead of the more usual centipawns, upon urging by users Kai Laskos and Larry Kaufman from talkchess.com, I've transformed Komodo 8 evaluations into expected scores using the logistic function.
a=1.1 normalization factor
ExpectedScore = (1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a]))/2
This table compares the two methods:
cp exp score
As you can see, using expected scores has two advantages over centipawns:
1) it eliminates the need for artificial thresholds or cut-offs. There's virtually no difference whether the evaluation swings from 0.12 to 2.96, or to 9.05; it is a lost position in either case. But it may have an unwanted and distorting effect on results in relatively small datasets.
2) high evaluation, like the difficulty of positions, affects the accuracy of play. There are two different ways: a) the scaling effect - higher evaluations are accompanied by larger eval gaps between move choices; b) human players' tendency to make desperate moves when behind in eval, and to seek easier and riskless paths to mate the opponent when ahead in eval. Using expected scores eliminates the former - the scaling effect - since it is independent of evaluation numbers.
This is not the first time I've attempted to compare engines and humans. Previously I've been subject to criticism by users who have played against low-rated engines and concluded that these are actually weaker than indicated by papers. However, the graphs are only intended to demonstrate the relative playing strength of humans and engines under the assumption that humans play against engines without employing anti-computer strategy. It is impossible to take into account a hypothetical increase in strength due to anti-computer strategy, as we do not know yet what factors ultimately determine its efficiency.
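A small sketch of the conversion, assuming the formula is meant as the logistic curve (1 + tanh(p/a))/2, which stays between 0 and 1, and assuming p is the evaluation in pawns (the examples 0.12, 2.96 and 9.05 suggest so). The function name is hypothetical; a = 1.1 is the normalization factor given above.

```python
import math


def expected_score(p, a=1.1):
    """Map an engine evaluation p to an expected score in [0, 1].

    Assumes p is in pawns and that the curve is (1 + tanh(p / a)) / 2,
    i.e. a logistic function of the evaluation.
    """
    return (1.0 + math.tanh(p / a)) / 2.0


# The three evaluations mentioned in point 1.
for p in (0.12, 2.96, 9.05):
    print(f"eval {p:5.2f} -> expected score {expected_score(p):.4f}")
```

On this scale the swing from 0.12 to 2.96 and the swing from 0.12 to 9.05 end up almost the same size, which is the point about not needing a cut-off.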
It should be stressed that the 'error' on both graphs on the Y axis represents the average expected error, i. e. the estimated hypothetical accuracy of play in case all entities involved had the same difficulty of positions. If the average difficulty of moves by an entity is lower than on average, then
What can we conclude from the results? Some people will certainly be surprised, because doesn't it seem most logical to assume that engines and humans share the same accuracy-strength relationship? To me it indeed seemed so, and when I first undertook such a comparison and saw the final results a couple of years ago, it greatly surprised me. It turns out that while humans experience diminishing rating gains per equivalent accuracy increase, engines are the other way round: improving the accuracy of play leads to increasing rating gains (!). Not that it is increasingly easier to make progress in computer chess, of course; that is another matter entirely.
In retrospect, that should not come as a complete surprise, because we know that engines and humans play chess very differently. The differences are the following:
1) engines do not play pragmatically, they always strive for objectivity; but humans do.
2) engines have a broader search tree than humans. They include a wide array of move choices in their calculations; even absurd-looking ones get calculated a few plies deep. Humans, on the contrary, first look at the position to find potentially good moves, pick a few candidates and then start calculating.
3) the most obvious difference lies in the fact that engines, unlike humans, rely almost purely on calculation; the relative importance of the evaluation function keeps diminishing with advances in hardware and search. This implies that changes in the difficulty of positions affect humans more and engines less.
I think we can dismiss the first one for now, as it is more about deliberate and conscious choices than the fundamental nature of the move-selection process.
Unfortunately, the remaining two still don't explain why the relationships are this way round and not reversed, i.e. humans with increasing gains and engines with diminishing gains. And what about other skill-based board games? Will go, checkers, arimaa etc. display exactly the same phenomenon? These are intriguing questions, and I think they are worth further research.
|Feb-22-16|| ||nimh: Here are tables making it easy to convert CCRL into FIDE and vice versa.|
FIDE CCRL CCRL FIDE
3100 3375 3600 3130
3000 2915 3500 3118
2900 2646 3400 3104
2800 2461 3300 3088
2700 2324 3200 3069
2600 2216 3100 3048
2500 2127 3000 3024
2400 2053 2900 2996
2300 1989 2800 2963
2200 1934 2700 2924
2100 1885 2600 2878
2000 1841 2500 2824
1900 1802 2400 2759
1800 1767 2300 2680
1700 1734 2200 2584
1600 1704 2100 2466
1500 1677 2000 2318
1400 1651 1900 2132
1300 1628 1800 1894
1200 1605 1700 1584
1100 1585 1600 1175
1000 1565 1500 624
The hardware that is used in creating CCRL lists is outdated by today's standards, and time controls are ca 3x shorter. How well would Stockfish 7 (3341 CCRL) perform in terms of FIDE 2014, if we had the best hardware possible and standard time controls? A direct comparison of data shows that 3341 CCRL corresponds to 3095 FIDE. According to the PassMark website, Athlon 64 X2 4600+ (2.4 GHz) has an Average CPU Mark of 1365, whereas the strongest one - Intel Xeon E5-2698 v3 @ 2.30GHz - has 22309, a ratio of about 16.3. Together with the ca 3x longer time controls, that amounts to a ca 49x advantage in the search quantity. LOG2 of 49 is 5.6 doublings. At that level each doubling is actually worth less than the conventionally used estimate of 50 ELO; user Kai Laskos from talkchess.com has done research into this and found that at TCEC level (faster than the example given previously) the gain per doubling is below 40 ELO. Hence, using 40 ELO per doubling, the end result turns out to be 5.6 x 40 + 3341 = 3565, which equals 3126 FIDE.
So the final conclusion I draw is that given humans do not use anti-computer strategy, Stockfish 7 on top hardware would perform 3100-3150 against humans.
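The arithmetic behind that estimate can be retraced as follows. The benchmark scores, the 3x time-control factor, the 40 points per doubling and the table rows are taken from the posts above; the linear interpolation between table rows is an assumption made here, since the posts don't say how in-between values were obtained.

```python
import math

# Top rows of the CCRL -> FIDE conversion table posted above.
CCRL_TO_FIDE = [(3200, 3069), (3300, 3088), (3400, 3104), (3500, 3118), (3600, 3130)]


def ccrl_to_fide(ccrl):
    """Linear interpolation between adjacent table rows (an assumption)."""
    for (c0, f0), (c1, f1) in zip(CCRL_TO_FIDE, CCRL_TO_FIDE[1:]):
        if c0 <= ccrl <= c1:
            return f0 + (ccrl - c0) * (f1 - f0) / (c1 - c0)
    raise ValueError("rating outside the copied table rows")


hardware_ratio = 22309 / 1365          # PassMark: Xeon E5-2698 v3 vs Athlon 64 X2 4600+
time_ratio = 3                         # standard time controls are ca 3x longer
speedup = hardware_ratio * time_ratio  # ca 49x
doublings = math.log2(speedup)         # ca 5.6
projected_ccrl = 3341 + 40 * doublings  # Stockfish 7, 40 points per doubling

print(f"speedup {speedup:.1f}x, doublings {doublings:.2f}")
print(f"3341 CCRL -> {ccrl_to_fide(3341):.0f} FIDE")
print(f"{projected_ccrl:.0f} CCRL -> {ccrl_to_fide(projected_ccrl):.0f} FIDE")
```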
|Mar-23-16|| ||Tiggler: Reposted from <Tiggler> chessforum:|
I think I found the origin of the mysterious differences between the FIDE tables for ratings based expected scores and the cumulative normal distribution with sd = 400.
The wiki article on the ELO system states:
" FIDE continues to use the rating difference table as proposed by Elo. The table is calculated with expectation 0, and standard deviation 2000 / 7."
If so, then it appears that Elo used the approximation 1/sqrt(2) = 0.7 .
For a difference in scores the corresponding distribution has sd multiplied by sqrt(2), so instead of getting sd = 400, as I had previously assumed, we get 404.061 .
So now the expected score (per game) is given by
This formula does match the tables in section 8.1 of the FIDE handbook.
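A minimal sketch of one way to turn the quoted wiki statement into numbers. It assumes the expected score is the cumulative normal distribution with mean 0 and standard deviation 2000/7, applied to the rating difference; whether this is exactly the formula intended in the post is not certain. Under this reading 404.061 shows up as the scale inside the error function, because Phi(x) = (1 + erf(x / sqrt(2))) / 2. The function name is hypothetical.

```python
import math

SD = 2000 / 7  # standard deviation quoted from the wiki article


def expected_score(diff):
    """Cumulative normal with mean 0 and sd 2000/7, written via erf."""
    return 0.5 * (1.0 + math.erf(diff / (SD * math.sqrt(2))))


print(round(SD * math.sqrt(2), 3))  # the scale that appears inside erf
for diff in (0, 100, 200, 400):
    print(diff, round(expected_score(diff), 2))
```

The printed values can be compared by hand with the table in section 8.1 of the FIDE handbook.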
|Mar-23-16|| ||Tiggler: Reposted from <Tiggler> chessforum:|
In an interesting post on the WC Candidates forum, <AylerKupp> mentioned that Arpad Elo suggested the use of a t-distribution:
World Championship Candidates (2016)
The t-distribution (Student's t) is used to find the distribution of the differences between pairs of values drawn INDEPENDENTLY from the same normal distribution (my emphasis).
Elo's underlying assumption is that the performance of a player in a single game is distributed normally about his expected value, and that the standard deviation of the distribution is the same for all players.
So when two players come to the board the difference in their performance is based on their two independent random samples from their individual distributions. Hence the t-distribution.
This seems to me to be extremely contrived, though of course Dr. Elo can make whatever ad hoc assumptions he chooses in his system.
I prefer the following argument, however. When two players come to the board, the distribution of the differences in their performance is the fundamental one, and the most parsimonious (in the Occam sense) description of this is the normal distribution.
We cannot say that in a single game the deviation of player A's performance from his expectation is independent of the deviation of player B's performance from his expectation. On the face of it that is absurd.
|Mar-24-16|| ||offramp: A moot is a debate, especially an ecclesiastical one. A moot point is any subject up for discussion at a moot.|
Moot does not mean irrelevant. It means the opposite of irrelevant.
Since the difference in meaning is so large it is a good idea to get it right.
|Mar-24-16|| ||Gregor Samsa Mendel: <offramp>--Apparently we Yanks have mootated the meaning of mootness:|
|Mar-24-16|| ||offramp: <Gregor Samsa Mendel: <offramp>--Apparently we Yanks have mootated the meaning of mootness:
That is bizarre and disturbing. It means that British and American people reading the same text will have opposite views on what has been written. I will think it means "open to debate" and a yank will think it means "pointless to debate".
Perhaps it is clearer if one uses a word such as "irrelevant", although that is less pretentious.
|Mar-24-16|| ||zanzibar: I think we should table this dangerous discussion.|
|Mar-25-16|| ||AylerKupp: <Tiggler> I wouldn't be too hard on Dr. Elo. After all, he was working at a time when there wasn't easy access to computers and whatever access there was, was expensive. So it's natural for Elo to make many assumptions and simplifications in order to make the calculations easier. Still, using 1/√2 = 0.7 instead of 0.707 seems excessive, as that would make the SD = 404 (exactly) instead of 400. And 404 should not be that much more difficult to use in the calculations than 400.|
For another view on the accuracy of the Elo tables, see http://recherche.enac.fr/~alliot/el....
Your comment about Elo's assumption that the standard deviation of the distribution is the same for all players gave me pause for some thought. Clearly that was a necessary simplification for Elo but it would seem possible today to calculate a performance distribution for each rated player (or at least the top ones), and use each player's distribution in calculating their tailored t-distribution (perhaps another good use of the letter "t"!). I don't consider this concept absurd at all. For example, based on the current Candidates Tournament, I would assume that the SD in Giri's performance (all draws) distribution would be much different than Nakamura's or Anand's (5 decisive games each). Of course, I don't know if it would make a significant difference in the results.
The reason I've been considering all of this is that I'm trying to develop a predictor for game results in the Candidates Tournament for User: golden executive's contest. I had been doing reasonably well (my goal was a 75% correct prediction) until the last round (70%), when I used my "hunch" instead of some of the model's predictions and I was wrong while the model was right. My enthusiasm is greatly tempered by the realization that if I had simply predicted that each game would end in a draw I would have been correct 71.1% of the time, even with the Nakamura - Anand result included.
|Mar-25-16|| ||offramp: It's all moot, isn't it?|
|Mar-26-16|| ||Tiggler: <AylerKupp> Sorry to be pedantic (though you would not be the one who would complain of this), but you cannot have tailored t-distributions for each player. The t-distribution is for the difference of two samples from the same normal distribution.|
<offramp> Yes indeed, quite moot: worthy of debate.
|Mar-27-16|| ||luftforlife: In American usage, the adjective "moot" enjoys three denotations: first, "open to question; subject to discussion; debatable; unsettled"; also, "subjected to discussion; controversial, disputed"; second, "deprived of practical significance; made abstract or purely academic"; third, "concerned with a hypothetical situation." Webster's Third New International Dictionary (Springfield, Mass.: Merriam-Webster Inc. 1993), 1468. |
The second denotation does not connote, and is neither equivalent with, nor tantamount to, irrelevance per se (for such a moot point retains its academic relevance, its fitness for abstract consideration, or both), but rather connotes a change in status that can, in the legal context at least, lead to a change in treatment -- to unfitness for further consideration, thwarting and thereby pretermitting practical, concrete, specific, and final resolution, disposition, or decision, of a case turning on, and fatally infected by, such a moot point -- due to limitations of power.
|Mar-27-16|| ||perfidious: <luftforlife> Used as an adjective, you are correct; however, that is not the full story.|
While as a noun, the word is comparatively uncommonly used, as a verb that is not the case, though of course Over Here 'debate' is much more often employed.
|Mar-27-16|| ||luftforlife: <perfidious>: Thanks for your comment, and I take your point. I focussed on the American adjectival form and usage chiefly to point up (and to contrast with irrelevance per se) the American denotation "deprived of practical significance" -- a necessary and sensible accretion to meaning as it has arisen and as it has been applied as a term of art by our Supreme Court in its construction of our Constitution and its limitations on the federal judicial power, but one that has, on our shores, overspilled the narrow confines of that usage, and that has, in more general American usage, come to acquire connotations that dull, obscure, and even subvert not only the other American adjectival denotations, but also the essential and vital British origins, meanings, and past and present uses of the word in all its forms. I appreciate <offramp's> incisive comments and reminders in this regard. Your comment and the others above my own are illuminating and edifying. Kind regards.|