|
< Earlier Kibitzing · PAGE 1 OF 7 ·
Later Kibitzing> |
Jan-03-09
 | | agb2002: Hello all,
As suggested by David Zechiel, my forum is now open.
Please feel free to leave any comments you want. |
 |
Jan-03-09
 | | agb2002: <MostlyAverageJoe: Hi, Antonio.> Hi, Joe! <Of course, your method would work fine if the assumptions you listed were met.> That's why I'm trying to collect several opinions! <The assumption that kibitzer Elo distribution is similar to the FIDE distribution is most likely not true.> Yes, I also suspect that. <I doubt that the median ELO of kibitzers is close to the ~2075 that the FIDE table implies (or we'd have many more <dzechiels> here :-).> I think the same. <I also suspect that the kibitzer distribution might not be gaussian.> At first sight it looks Gaussian, but it is not: a quick chi-square test clearly rejects that hypothesis. However, this is not really important once you have the actual distribution. <Perhaps CG has some idea about the ELO of registered users (I think they did a poll on that topic a while ago, but maybe I am imagining it, as I cannot find the evidence of such a poll).> We would need the help of CG to apply this method. <Note also that not everyone spends the same amount of time on a puzzle, and they don't always bother posting the solutions.> Solution time is such a complex variable, because it depends on so many factors, that I was considering ignoring it initially and just evaluating the results. <Also, note that typical puzzle brings maybe 70 comments total, and these comments may be made by 30-40 users on a good day (today, we had about 20 users). Is this enough to make meaningful determination of the puzzle level?> The answer is probably no, because the standard error of a proportion is inversely proportional to the square root of the sample size, which is typically small (and smaller still after removing the 'outliers'), so we would have a relatively large error.
Another problem is that the percentage-Elo function outlined in the table above is poorly conditioned in the tails (the very easy and the very difficult puzzles). In other words, small errors in the input data (the percentage of those who really solved the problem) could be amplified in the answer (the estimated puzzle Elo). In any case, it would be very interesting to compare both estimates (engine and raw statistics). |
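The two points above (the standard error of a proportion shrinking only with the square root of the sample size, and the amplification of input errors in the tails) can be illustrated with a rough sketch. It assumes a normal rating population with the ~2075 median mentioned above and a made-up spread of 200 points; the thread itself notes that the real distribution is not Gaussian, so the numbers are purely illustrative. The names `proportion_se`, `puzzle_elo` and `elo_shift` are hypothetical helpers, not part of any described method.

```python
import math
from statistics import NormalDist


def proportion_se(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n posts."""
    return math.sqrt(p * (1.0 - p) / n)


# ASSUMPTION: normal population, median 2075 (quoted in the thread) and a
# hypothetical standard deviation of 200. The real table is not Gaussian.
HYPOTHETICAL_POPULATION = NormalDist(mu=2075, sigma=200)


def puzzle_elo(solved_fraction: float) -> float:
    """Rating above which `solved_fraction` of the assumed population lies."""
    return HYPOTHETICAL_POPULATION.inv_cdf(1.0 - solved_fraction)


def elo_shift(solved_fraction: float, error: float = 0.02) -> float:
    """Change in the Elo estimate when the solved fraction is off by `error`."""
    return abs(puzzle_elo(solved_fraction - error) - puzzle_elo(solved_fraction))


if __name__ == "__main__":
    # Quadrupling the sample only halves the standard error.
    for n in (20, 40, 80):
        print("n =", n, "SE =", round(proportion_se(0.5, n), 3))
    # The same 2% input error moves the estimate more as we approach the tail.
    for p in (0.50, 0.10, 0.03):
        print("solved =", p, "Elo shift =", round(elo_shift(p), 1))
```

Under these assumptions the same two-point error in the solved fraction moves the estimate several times further near the tail than near the median, which is the conditioning problem in numbers.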
 |
Jan-03-09
 | | dzechiel: Hi, Antonio,
Looking over your methodology, I think it's sound. But I have to agree with <MAJ>, I don't see a way for you to get a consistent, reliable, statistically meaningful source of data for evaluation. Some players may solve the puzzle, but decline to post their results, while others may be unable to solve the position of the day, but decide, for appearances, to claim success. I think if you decide to pursue this approach, that you should request "volunteers" for your project who will attempt to solve each day, and who can be counted upon to give you reliable and honest results. You may be able to get statistically significant results with as few as five dependable subjects, but, of course, the more the better. Please let me know if I can be of help to your project. |
 |
Jan-04-09
 | | johnlspouge: Hi, Antonio.
Rating schemes remind me of psychiatry, where there are few objective findings. Objective findings are so few, in fact, that some people (“anti-psychiatrists”, notably Thomas Szasz) have suggested that psychiatric diagnoses are completely subjective. (Szasz has a web site http://www.szasz.com/, with the very nice opening quote: “If you talk to God, you are praying; if God talks to you, you have schizophrenia.”) With this background in mind, I have some observations about the suggested methodology. Please keep in mind the observations are intended to suggest possible avenues for improvement, not to discourage. (1) I have not thought a lot about how to construct a rating scheme, but clearly it has to be constructed from statistical samples from the outcomes of “games” (here, attempts to solve a puzzle). Because a rating for each fixed puzzle is desired, the parent population you are sampling must represent, presumably, outcomes from this puzzle against all possible puzzle solvers. The proportion of successful solvers in the parent population seems a reasonable ordinal for the puzzles, i.e., the more solvers who can solve a puzzle, the easier it is. Your use of a cumulative distribution function is therefore quite reasonable, and it represents the cumulative distribution function in some parent population. (2) In any such scheme, one has to be aware of temporal bias: the parent population or the sampling mechanism might not be constant over time. Historically, e.g., chess players have improved as technique became better. In my opinion, puzzle kibitzers on chessgames.com have become noticeably stronger over the past year. (This might or might not coincide with your starting to kibitz :) Your methodology cannot separate a trend in puzzle difficulty from a trend in kibitzer strength. In contrast, <MostlyAverageJoe>’s methodology can distinguish the two trends, because he uses objective findings (computer performance) to evaluate the puzzles. 
(3) As usual in statistics, one has the problem of biased samples. The problem of bias is real in biology, e.g., and where the data do not represent a controlled experiment (as in rating), it is usually ignored or treated quite crudely. The relation of kibitzers’ ELO rating to the cumulative ELO distribution function seems extremely problematic to me. Moreover, most Monday kibitzers do not continue to post through the week, so temporal bias enters again. In addition, I (e.g.) apportion much more time for a Sunday problem than a Monday, so presumably, the parental quantity you are estimating also reflects the variable of how willing people are to spend time on a given puzzle. An ELO rating does not reflect this willingness. I hope these comments are helpful.
All the best,
John |
 |
Jan-05-09
 | | agb2002: <dzechiel: Hi, Antonio,>
Hi, David
<Looking over your methodology, I think it's sound. But I have to agree with <MAJ>, I don't see a way for you to get a consistent, reliable, statistically meaningful source of data for evaluation.> That's why I told MAJ that we would need the help of CG. <Some players may solve the puzzle, but decline to post their results, while others may be unable to solve the position of the day, but decide, for appearances, to claim success.> It reminds me of the surveys on voting intentions. I'm considering applying some statistical tools like discriminant analysis (http://en.wikipedia.org/wiki/Linear...) to estimate the reliability of posts, etc., but they require input variables I don't have, unless CG provides some of them and/or I manage to collect them over time. <I think if you decide to pursue this approach, that you should request "volunteers" for your project who will attempt to solve each day, and who can be counted upon to give you reliable and honest results.> I already counted on the growing influence of your 'chess school' at CG! <You may be able to get statistically significant results with as few as five dependable subjects, but, of course, the more the better.> See my answer to MAJ's last question. <Please let me know if I can be of help to your project.> Thank you! |
 |
Jan-05-09
 | | agb2002: <johnlspouge: Hi, Antonio.> Hi, John. <Rating schemes remind me of psychiatry, where there are few objective findings.> I'm quite used to providing statistical support to psychologists (although I don't understand their language) and in my experience, in spite of the complexity of human behaviour, the limitations of questionnaires, the reliability of patient answers, etc., they typically find the results quite valid. <“If you talk to God, you are praying; if God talks to you, you have schizophrenia.”> LOL. I'm afraid that my mental disorders are not very spectacular at the moment: perhaps the unavoidable mathematician's perfectionism, but somewhat damped by the years dealing with engineers and their '... but for practical purposes...' <With this background in mind, I have some observations about the suggested methodology.> Thank you! <Please keep in mind the observations are intended to suggest possible avenues for improvement, not to discourage.> My elders reached Alaska's coasts sailing in nutshells some five centuries ago and I became an independent math consultant in Spain, a country which hates maths (I often work for foreign companies). Paraphrasing Pink Floyd, I don't give in without a fight (Hey You, The Wall). <(2) In my opinion, puzzle kibitzers on chessgames.com have become noticeably stronger over the past year.> Yes, we would have some kind of upper bound of the Elo estimate. <Your methodology cannot separate a trend in puzzle difficulty from a trend in kibitzer strength. In contrast, <MostlyAverageJoe>’s methodology can distinguish the two trends, because he uses objective findings (computer performance) to evaluate the puzzles.> Computers (software & hardware) also become stronger. This was another reason to have a 'second opinion'. <(3) As usual in statistics, one has the problem of biased samples. ... 
The relation of kibitzers’ ELO rating to the cumulative ELO distribution function seems extremely problematic to me.> Yes, we would need a decent sample of volunteers, as suggested by David Zechiel. <Moreover, most Monday kibitzers do not continue to post through the week, so temporal bias enters again. In addition, I (e.g.) apportion much more time for a Sunday problem than a Monday, so presumably, the parental quantity you are estimating also reflects the variable of how willing people are to spend time on a given puzzle. An ELO rating does not reflect this willingness.> You need some willingness to score the point, regardless of the problem: actual game play or puzzle solving. In any case, these difficulties only make the problem more interesting :-) <I hope these comments are helpful.> Yes, they are. Thanks a lot! <All the best,
John>
Take care,
Antonio
|
 |
Jan-05-09
 | | MostlyAverageJoe: Well, it seems you got quite a bit of feedback. How about a second draft of the proposed methodology, addressing some of the issues. The ELO drift seems to be a major spoiler here, especially if you get responses from some of the younger kibitzers, who can move 150 points up or down in a single tournament. Or some older kibitzers claiming to have 2200 ELO while their last recorded ratings, quite easy to find on the web, show about 1750. <In any case, it would be very interesting to compare both estimates (engine and raw statistics).> Indeed, that would be interesting. If you try to go ahead with your project, I probably would restart doing daily evaluations (I did them for about a year, and then only occasionally -- too busy with real life :-). <Computers (software & hardware) also become stronger> I handled this by using the same software and hardware for the evals, despite having several new versions of Hiarcs and a way more powerful CPU. In any case, I am rethinking the methodology to use the analysis depth and the number of evaluated nodes to be a measure of difficulty. Mondays, for example, usually get solved at 5 plies or maybe 6. This measure (rather than ELO setting for the program that lets it find the puzzle solution in 10 seconds) would remove the dependency on the hardware. And if I find two spare days or so, I might finish my UCI driver to enable me to re-run the entire set of puzzles whenever a software upgrade becomes available. |
 |
Jan-07-09
 | | agb2002: <MostlyAverageJoe: Well, it seems you got quite a bit of feedback.> Yep, and it confirms some difficulties. <How about a second draft of the proposed methodology, addressing some of the issues.> Why not? <The ELO drift seems to be a major spoiler here, especially if you get responses from some of the younger kibitzers, who can move 150 points up or down in a single tournament. Or some older kibitzers claiming to have 2200 ELO while their last recorded ratings, quite easy to find on the web, show about 1750.> Sometimes errors cancel themselves and the result is quite accurate but I don’t expect such happy scenarios here. We will have to settle for interval estimates. <<In any case, it would be very interesting to compare both estimates (engine and raw statistics).> Indeed, that would be interesting. If you try to go ahead with your project, I probably would restart doing daily evaluations> I was considering calculating the estimates about one or two days later, once the kibitzers finish with the puzzle and move to the next one. <(I did them for about a year, and then only occasionally -- too busy with real life :-).> I also belong to the busy pros club. <<Computers (software & hardware) also become stronger> I handled this by using the same software and hardware for the evals, despite having several new versions of Hiarcs and a way more powerful CPU.> I suspected that. <In any case, I am rethinking the methodology to use the analysis depth and the number of evaluated nodes to be a measure of difficulty. Mondays, for example, usually get solved at 5 plies or maybe 6. This measure (rather than ELO setting for the program that lets it find the puzzle solution in 10 seconds) would remove the dependency on the hardware. 
And if I find two spare days or so, I might finish my UCI driver to enable me to re-run the entire set of puzzles whenever a software upgrade becomes available.> This is a very interesting idea and actually introduces a third approach: we can measure as many relevant variables as necessary for each puzzle and try to use (nonlinear) least squares algorithms to fit a mathematical model for your Elo estimates. These variables might also include (the number of):
• tactical motifs (pin, hanging piece, fork, etc.),
• plies (depth) at which patterns become recognizable,
• candidate moves at each ply (n plies make n variables),
• feasible lines,
• lines leading to a decisive advantage or dead draw, etc.
We would have some redundancy, but we can identify the irrelevant variables by inspecting the (asymptotic) standard error of the regression coefficients. Once a suitable model has been properly fitted, a most interesting experiment would be to try it on the toughest puzzles and see how it behaves with extrapolated estimates. |
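As a rough illustration of screening variables by the standard error of their coefficients, here is a minimal sketch. It swaps in the simplest possible case, an ordinary (linear) least squares fit with one predictor at a time, rather than the nonlinear multi-variable fit proposed above. All data are invented for illustration, and `fit_line` is a hypothetical helper.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    se_b = math.sqrt(residual_ss / (n - 2) / sxx)
    return a, b, se_b


# INVENTED DATA: eight puzzles with an engine-based rating, the solution
# depth in plies, and a count of tactical motifs.
ratings = [1200, 1350, 1500, 1700, 1800, 2000, 2150, 2300]
depth_plies = [3, 5, 5, 7, 9, 11, 11, 13]
motif_count = [2, 1, 3, 2, 1, 3, 2, 2]

_, b_depth, se_depth = fit_line(depth_plies, ratings)
_, b_motif, se_motif = fit_line(motif_count, ratings)

# A coefficient that is small relative to its standard error (|t| well
# below 2, say) is a candidate for removal from the model.
t_depth = b_depth / se_depth
t_motif = b_motif / se_motif

if __name__ == "__main__":
    print("depth : slope", round(b_depth, 1), "t", round(t_depth, 1))
    print("motifs: slope", round(b_motif, 1), "t", round(t_motif, 1))
```

In this made-up data set the depth coefficient is many standard errors away from zero while the motif count is not, so only the former would be kept. With several correlated predictors the screening would have to be done on a joint fit.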
 |
Jan-09-09
 | | agb2002: I've been trying to apply the FIDE Elo distribution method to some puzzles and I have found that in the best case there are no more than 20 reasonably valid posts. This means that the Elo range we can obtain oscillates between about 1400 (100% or 20/20) and 2350 (5% or 1/20), far below the complexity of at least 30% of puzzles. Therefore, this method is, at the moment, unfeasible. |
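To see how wide an interval estimate is with only 20 posts, here is a minimal sketch using the Wilson score interval for a proportion. The choice of the Wilson interval and of a 95% level (z = 1.96) are assumptions made for illustration; the thread does not say which interval would be used.

```python
import math


def wilson_interval(solved: int, posts: int, z: float = 1.96):
    """Wilson score interval for the fraction solved/posts (z = 1.96 ~ 95%)."""
    p = solved / posts
    denom = 1.0 + z * z / posts
    centre = (p + z * z / (2 * posts)) / denom
    half = z * math.sqrt(p * (1 - p) / posts + z * z / (4 * posts * posts)) / denom
    # Clamp to [0, 1] to guard against rounding just outside the range.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for solved in (1, 10, 20):
        low, high = wilson_interval(solved, 20)
        print(solved, "of 20:", round(low, 3), "to", round(high, 3))
```

With 1 solver out of 20 the interval runs from under 1% to over 20%, so the corresponding Elo estimate at the difficult end would be very loosely determined.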
 |
Jan-20-09
 | | MostlyAverageJoe: < agb2002: Hello Joe, actually, the number of occurrences of the word "demoIition" is two > Yeah, I tried to use capital "I" which looks like lowercase "l" in my entry font, to avoid matching my own posts. < But seriously, I have started collecting some statistics from the puzzles you have rated since March 26, 2007, specifically:
a) Number of posts.
b) Number of posts missing the solution.
c) Solution depth in plies.
d) Your rating.
The first three variables are more or less easy to work out from the posts. The conjecture I'm trying to confirm or reject is that these variables can predict ratings to a reasonable accuracy (to be defined later). What do you think? Can you figure out any other relevant predictor?> Your idea is interesting. Another thing to include in the predictions might be the number of posts claiming "too easy" or "too difficult". I had an impression that my evals were in a reasonably good agreement, but I never took systematic records. I got too busy with the day job to keep up with that stuff, and never even finished my plans for having the evaluations completely automated... |
 |
Jan-29-09
 | | MostlyAverageJoe: The Excel data looks good. I have a few more ratings to add there. Too bad that Google does not seem to support hyperlinks in their Docs (spreadsheet kind). |
 |
Feb-11-09
 | | johnlspouge: < <agb2002> wrote: <chessgames.com> I tried to use the figurine notation [snip]
It would be useful to be able to use that notation with more or less long posts because they become much clearer to read IMHO. > Hi, <agb2002>. The figurine notation appears elegant, but I do not use it, because someone else cannot copy your post into another. (Try.) Despite the kindness of chessgames.com in making it available to us, I recommend against its use. |
 |
Mar-08-09
 | | johnlspouge: < <agb2002> wrote: <johnlspouge: [snip] <Some time ago, <JG27Pyth> and I tried to formulate a list of general defensive tactics. (Actually, I think I tried to formulate the list and he then criticized me mercilessly :)> I don't understand why you were criticized for trying to put some order into chaos....> > Actually, <JG27Pyth> had some cogent criticisms. I enjoy <JG27Pyth>'s humor so much, I could not resist a gentle tip of the hat in his direction. My comment was nothing serious... <We always return to statistics: it seems that defense-by-pinning is on the tail of the distribution because this type of defence requires more elements than the others (four: attacked piece, attacker, pinning piece, valuable piece) and arranged in a particular layout (to allow the pin). > You make an interesting point about rarity. Thanks, <Antonio>. |
 |
Mar-25-09
 | | johnlspouge: Hi, Antonio.
I was careful not to use the word "self-flagellation", because I was certain you could find "flagellate" in an English-Spanish dictionary, but possibly not "self-flagellation". I see now I need not have worried: presumably there are linguistic advantages to living in a Catholic country ;>) All the best,
John |
 |
Mar-26-09
 | | johnlspouge: Likewise, I much prefer being flagellated to doing press-ups ;>) |
 |
Apr-18-09
 | | johnlspouge: < <agb2002> wrote: <How often does a R reach the back rank by creeping in the back door?> Getting scatological at MY lunch time, John? > Hi, Antonio. (I presume we are on a first name basis now :) First, "scatalogical": Henceforth, I will assume that <every> English word is in your active vocabulary ;>) Second, "lunch time": I have a friend with whom I have had lunch weekly for 23 years. In some periods, I regarded myself fortunate when not subjected to "scatalogical" conversations before I finish the soup. (His conversation has its compensations.) While I consider myself innocent of conscious intent, I now grow very concerned about the effect of your lunch on my subconscious mind. |
 |
Apr-20-09
 | | johnlspouge: Thanks for the "scatological" correction. (I have Latin but not Greek in my education.) I spent time on both sides of the Atlantic, leaving me uncertain about my spelling sometimes. I happened to check the incorrect spelling on the web and found it in several "reasonably good" sources. I shall disbelieve them in future. |
 |
Apr-20-09
 | | johnlspouge: Hi, Antonio. Just to follow up: I checked Liddell and Scott's Greek lexicon and found among the dung "scatophageo" (in obvious transliteration, to save me the trouble of going to UTF-8). The biggest guns of all agree with you. No surprise there. Enjoy your lunch :) |
 |
Apr-24-09
 | | johnlspouge: < <agb2002> wrote: [snip] Anyway, long ago I adopted as my own that old maxim of our dear old enemy, the British Navy: "The merely difficult, we do immediately. The impossible will take slightly longer". [snip] > My mother's father was in the Royal Navy in World War I. I spent many hours as a child playing cards with him, and I attribute much of my skill in calculation to him. Let's just say I have heard the maxim before ;>)
Thanks for the reminder... |
 |
May-04-09
 | | agb2002: The week's theme was attack against the king (except Monday). Apr 27, Monday: Timman vs Yusupov, 1987. Black's passed pawn made this puzzle considerably more difficult than expected. Apr 28, Tuesday: J Dueball vs G Jacoby, 1976. Just interesting. Apr 29, Wednesday: A Burjan vs A Kornelia, 1992. Another good example showing that being compulsive-obsessive is not that bad... Apr 30, Thursday: Z Kozul vs Bologan, 2005. Attack and defense at the same time. May 1, Friday: Bologan vs Movsesian, 2005. The number of times (and ways) I coincide with John Spouge is becoming a very interesting statistic. May 2, Saturday: Z Varga vs Z Gyimesi, 2005. I was unable to find the time to complete a decent analysis. Sorry. May 3, Sunday: O Gavrilov vs S Solovjov, 2005. Another example of why looking for forcing moves (32... Rg7+ instead of 32... Rxh3) can save a lot of effort. |
 |
May-11-09
 | | agb2002: Except for Wednesday and Thursday, the week's theme was again attack against the king. May 4, Monday: DeFirmian vs R Byrne, 1994. Very basic. May 5, Tuesday: D Riley vs F Parr, 1949. The nice thing is that I have a brother-in-law whose older brother's name is also Frank Parr! May 6, Wednesday: Samisch vs F Krautheim, 1946. A dose of reality. May 7, Thursday: A Yurgis vs Botvinnik, 1931. For those who complain that endgames are tacticless, technical chores. May 8, Friday: F Handke vs H Hernandez, 2003. This and the previous puzzle show very embarrassingly how superficial one becomes after finding the solution (presumably): the way other candidate moves are dispatched is a monument to negligence. May 9, Saturday: L Milman vs J Fang, 2005. Too mechanical, disappointingly easy. But very nice. May 10, Sunday: J Ragan vs Benko, 1974. I thought that 23... h2+ was just too obvious so I decided to use some extra information (the puzzle is rated insane). Bad idea, as usual. Wednesday and Sunday puzzles suggest that I should reconsider my move selection algorithm. |
 |
May-14-09
 | | CHESSTTCAMPS: I did take a look at both the Saturday and Sunday puzzles while I was visiting Vermont, but I did not post my solutions, both of which were correct. <May 9, Saturday: L Milman vs J Fang, 2005. Too mechanical, disappointingly easy. But very nice.> Yes, this was a quick solve for me also. It does get beauty points though - the queen sac on g6 is reminiscent of the famous Marshall queen sac on b3 (Marshall playing black), because the queen can be taken in 3 ways. <May 10, Sunday: J Ragan vs Benko, 1974. I thought that 23... h2+ was just too obvious so I decided to use some extra information (the puzzle is rated insane). Bad idea, as usual.>
I got this one too, but finding the key move Qf6 took me longer than Saturday's puzzle. It was an easier week than usual. |
 |
May-14-09
 | | CHESSTTCAMPS: I'll partially revise my previous post.
To correct my previous post, the famous Marshall queen sac was on g3. Also, after reviewing the Tactical Archive, I realize that I did in fact attempt and solve several of the puzzles without posting. However, Wednesday's was the one problem where I didn't come up with a satisfactory answer and gave up after about 20 minutes. I did notice that 16.Qf6 Nf5 17.Qxg7+ Nxg7 18.Nf6+ was an interesting way to win a pawn and even now I don't see anything better. But that doesn't seem a likely solution for a Wednesday puzzle. (Yes, I know that is using puzzle rating in a bad way.) On 16.Nxg7 Kxg7 17.Qf6+ Kg8 18.Bh6 Nf5 seems to hold. Time to peek...
Well, I was sort of on track, but it seems way out of line in difficulty for a Wednesday. I think they scrambled Wednesday/Saturday/Sunday. In short, I think your summary is good and a nice resource. Thanks,
Phil |
 |
May-18-09
 | | agb2002: A week quite biased to the endgame :)
May 11, Monday: T Kosintseva vs E Zaiatz, 2009. Forks, overburdened pieces and other very elementary stuff. May 12, Tuesday: Reggio vs Chigorin, 1901. Loose pieces and basic mates. May 13, Wednesday: Averbakh vs Korchnoi, 1965. A nice example of obstruction... May 14, Thursday: J Esser vs Breyer, 1916. ... and a nice example of an obstructed, loose piece. May 15, Friday: Timman vs A Sokolov, 1987. The puzzle of the week, imho. I have the impression that 24.Qc2 wins but not as clearly as 24.Rc4, a quite subtle move. Worth studying. May 16, Saturday: H Ree vs Ftacnik, 1978. This puzzle deserves to appear in endgame textbooks. May 17, Sunday: P Nikolic vs Huebner, 1987. I agree with <CHESSTTCAMPS>'s conclusion: "No need for insanity - good technique seems to do the trick." (see his post to P Nikolic vs Huebner, 1987). |
 |
May-19-09
 | | CHESSTTCAMPS: <agb2002: A week quite biased to the endgame :) >
I see that this pleased you, too! In general, I think that mathematician types like interesting endgames. Excellent summary for all 7, in particular: <May 15, Friday:>
Interestingly, each of us came up with good (but different) and possibly winning alternatives to the game continuation, but the game continuation was undoubtedly best. This wasn't the deepest of puzzles, but it did offer the broadest range of reasonable candidates that I've seen in a while. <May 16, Saturday:> I loved this puzzle! So much so that I posted a solution very late in the day. The actual game continuation (that I posted as a side line, not thinking that it would be the game continuation) actually requires foresight of about 40 ply. However, that line was not particularly difficult because everything is forced, as another poster noted! <May 17, Sunday>
There were some chances for white to go wrong, but basically there was a lot more margin for error than the Saturday puzzle <..good technique..> Interestingly, I thought GM Nikolic got the first move right, but after that he did not use the best technique and a number of the posters did better. But it's so much easier without a ticking clock... |
 |
 |