
🏆 Stockfish Nakamura Match (2014)

Chessgames.com Chess Event Description
Played in Burlingame, California, USA, on 23 August 2014. See e.g. ... [more]

Player: Nakamura / Rybka

 page 1 of 1; 2 games  PGN Download

Game                                Result  Moves  Year  Event/Locale                Opening
1. Nakamura / Rybka vs Stockfish    ½-½      56    2014  Stockfish Nakamura Match    C07 French, Tarrasch
2. Stockfish vs Nakamura / Rybka    1-0     146    2014  Stockfish Nakamura Match    E77 King's Indian

Kibitzer's Corner
Aug-27-14  latvalatvian: I can't believe a computer would defeat a human being. Nakamura must have thrown the match.
Aug-27-14  zanzibar: <Kinghunt> Good points.

And speaking of points - don't forget to deduct a few from the CCRL rating due to the match conditions on Stockfish - no opening book or endgame tables.

A few rating points should be shaved off - rounding down to 3200 is reasonable (bearing in mind the points you raised).

I'll try to find that article, they listed the hardware used (some quad-core mac iirc).

Aug-27-14  zanzibar: Ah, memory is a fragile thing - it was an 8-core Mac.

<According to match conditions, Stockfish was not allowed to access either an opening book or an endgame tablebase. What it did have was brute power -- match co-organizer Jesse Levinson said it was "the latest development build compiled for OS X and running on a 3ghz 8-core Mac Pro."

In comparison, Nakamura had the assistance of an older version of Rybka (about 200 points less than Stockfish's 3200+ rating), and it ran on a 2008 MacBook. Of course, he also had his 20-plus years of chess knowledge in play.

"The inspiration for this match was me opening my mouth too much," said co-organizer Tyson Mao. "I was wondering out loud how my [2008] MacBook could compete against today's chess engines.

"The main question is, 'Do humans add any value to chess engines today?' It's a very polarizing question. That's why we're having the match.">

http://www.chess.com/news/stockfish...

Aug-27-14  Landman: We may someday reach the point that any assistance given by a human would weaken an engine's play.
Aug-27-14  zanzibar: <landman> Does that include oiling the gears and the like?!
Aug-27-14  ketchuplover: I bet John Henry could smash Stockfish...literally
Aug-27-14  Taxman: It's interesting to see odds games between a top computer and a top human grandmaster. Perhaps such matches will become more popular as the machines continue to grow in playing strength.

I wonder what the “break-even” odds are; i.e. the level of odds at which the best human players will always win (at the time limit used for the Nakamura games), even against the top chess programs from the far future, running on the fastest hardware.

For example, I doubt that any computer will ever be able to give the top human player queen odds and still win. However, based solely on the Nakamura games (very limited evidence, I realise), it appears likely that the break-even odds are somewhat more than a pawn.

Two pawns? A Knight?

What do people think?

Aug-27-14  nimh: There's no one-to-one relationship between human and engine rating systems. When one goes up the CCRL rating ladder from 2400 to 3200, it doesn't mean performance against humans increases by the same amount. This stems from the fact that humans and engines have completely opposite accuracy-versus-rating relationships. Last year I wrote a paper in which I tried to establish a link between the two rating systems, based on comparing the accuracy of play across the FIDE and CCRL rating scales.

http://www.chessanalysis.ee/CCRL%20...

Whether CCRL 3200 really corresponds to FIDE 2910 is debatable, and further research will surely shed more light on this. The relatively low accuracy in engine games is explained by 3x shorter time controls and a CPU roughly 8x slower than the fastest one that can be bought for less than $500.

The main finding is that there's no linear relationship. The further up you go along the CCRL rating ladder, the less there is to gain against humans.

Reasons for such behaviour? Well, I must admit I have no idea :)

Aug-27-14  Kinghunt: Thanks for sharing your paper, <nimh>. Very interesting read. However, I have a few objections:

<First>, it appears that for both humans and computers, the games chosen to benchmark accuracy were from "average" GMs and computers, and you are extrapolating your regression to data your model was not built with (ie, 2800+ players and 3100+ programs). This is a dangerous way to model, and such extrapolations are never trusted in statistical analysis.

Looking at the plots demonstrates exactly why: the claim is made that a 2900 human player would have an expected error of 0.05, but the lowest error rate we have is 0.12. We don't know if it's even humanly possible to have an error rate that low. Moreover, this is especially concerning given the unusual data in the 2400 range - small changes in model parameters can result in large changes of prediction.
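
A rough sketch of why such extrapolation is risky (the rating/error numbers and the two model choices below are invented purely for illustration, not taken from the paper): two regressions that fit the observed 2200-2700 range almost equally well can give very different predictions once pushed out to 2900.

import numpy as np

# Hypothetical accuracy data: average error per move by rating cohort.
# These values are made up just to illustrate extrapolation sensitivity.
ratings = np.array([2200, 2300, 2400, 2500, 2600, 2700])
errors  = np.array([0.30, 0.26, 0.27, 0.20, 0.16, 0.13])

linear    = np.polyfit(ratings, errors, 1)   # degree-1 model
quadratic = np.polyfit(ratings, errors, 2)   # degree-2 model

# Inside the fitted range the two models agree closely; outside it (2900),
# small differences in model form produce very different predictions.
for rating in (2500, 2900):
    print(rating,
          round(np.polyval(linear, rating), 3),
          round(np.polyval(quadratic, rating), 3))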

I feel this issue could be resolved fairly easily by analyzing games by 2800 players (say, Carlsen as the closest we have to 2900, along with Aronian and Topalov to represent several styles of play) and 3200 engines (say, Stockfish, Komodo, and Rybka, the last one being important to make sure stronger engines do actually score better than weaker "Rybka-like" engines).

<Second>, the issue of time control and hardware cannot be ignored, and can be addressed in a similar manner to my above recommendations. Pick some TCEC games, which are played at classical time controls on strong hardware (and by very strong engines), and analyze them in the same way.

Finally, I also have to object to your conclusion:

<But it nevertheless turns out that expectations that top engines on up-to-date desktop machines are supposed to perform 3100-3200 against humans are a myth.>

Given what I pointed out above, I do not believe there is sufficient evidence to support such a claim. To the contrary, we have this match (as well as older odds matches) as evidence that computers, in fact, do perform 3100-3200 against humans.

Please do not take anything I said the wrong way - I think the chess world would benefit greatly from more people like you doing this kind of analysis. I am giving these comments because as much as I like what you've done, I think it can be even better and hope this work can be continued and made more convincing.

Aug-27-14  bobthebob: <I wonder what the “break-even” odds are; i.e. the level of odds at which the best human players will always win>

I wouldn't define the break-even odds like that, but rather as the point at which the expected outcome is close to even.

In that case, I would think that if Naka had gone in with the goal of drawing every game, then with these odds he would have been able to do that.

The thing about piece odds is that they throw theory out of the window on move 1, and as a result the computer, with its superior calculating ability, would not be at as much of a disadvantage when down a piece as a human would be.

Interesting discussion.

Aug-28-14  QueentakesKing: Excellent patience by Stockfish!
Aug-28-14  QueentakesKing: I think it is high time to look for another promising woodpusher to represent the U.S. in the World Championship.
Aug-28-14  nimh: <and you are extrapolating your regression to data your model was not built with (ie, 2800+ players and 3100+ programs). This is a dangerous way to model, and such extrapolations are never trusted in statistical analysis. >

Extrapolations and interpolations are fairly common in statistical analyses, so it never occurred to me that it could be wrong to use them. At the time I performed the analysis and collected the data, there were not enough players rated 2800 and above. I did not include 3000-rated engines, because I was afraid the strength advantage of my engine, hardware and time-per-move combination was not big enough. For the same reason I deliberately decided not to use games from TCEC. As a rule, the strength of the analysis must exceed the strength of the player; the bigger the gap, the more trustworthy the results.

<Looking at the plots demonstrates exactly why: the claim is made that a 2900 human player would have an expected error of 0.05, but the lowest error rate we have is 0.12. We don't know if it's even humanly possible to have an error rate that low.>

Fischer against Taimanov and Larsen had an absolute average error of 0.056 and an expected error of 0.054; Carlsen at Nanjing 2009 had 0.067 and 0.065 respectively. So it isn't impossible at all; one just has to play slightly better than those two did. :)

<Moreover, this is especially concerning given the unusual data in the 2400 range - small changes in model parameters can result in large changes of prediction.>

It just shows that my datasets were too small (400-500 positions per cohort), and it reflects the common notion that humans are very unstable players compared to engines. Larger datasets are too time-consuming to analyze.

<I feel this issue could be resolved fairly easily by analyzing games by 2800 players (say, Carlsen as the closest we have to 2900, along with Aronian and Topalov to represent several styles of play) and 3200 engines (say, Stockfish, Komodo, and Rybka, the last one being important to make sure stronger engines do actually score better than weaker "Rybka-like" engines).>

That's what I was planning to do with updated hardware, engine and more time per move, once I've finished with my current research paper.

<Please do not take anything I said the wrong way - I think the chess world would benefit greatly from more people like you doing this kind of analysis. >

Thanks, and no, I'm not mad at you. I'm grateful, because your critical remarks allowed me to better explain the methods I used.

Aug-28-14  Landman: I'd be interested in a slightly different experiment: Stockfish vs. Human + Stockfish. Same hardware specs for both computers. Historically +Human was always an advantage (cf. Advanced or Freestyle chess), but will we reach a point where that is no longer true? How strong a player (or more precisely, a chess computer operator) do you have to be to tilt the result in a positive direction? I'd imagine few could pull it off, and that number is slowly dwindling.

Has anyone seen recent explorations along these lines?

Aug-29-14  SHJR: I was very interested to read nimh's paper, as well as the second paper he links to within it. Thank you for sharing. I read a paper on the same subject by Dr. K. Regan (an IM himself) from the University of Buffalo at the following link; if you don't know it, I am sure you will enjoy it:

www.cse.buffalo.edu/~regan/papers/pdf/RMH11.pdf

Interestingly, he draws a completely different conclusion from statistical engine studies of human games throughout history.

One of his main conclusions is that there is likely no inflation at all within FIDE ratings, and that the numerous players rated above 2700 Elo today really deserve those ratings in comparison with players of the 60s and 70s!

In the same paper he introduces the notion of Intrinsic Performance Rating (IPR), and he infers from his data that perfect play should have a rating limit, probably around 3600 Elo (speaking, of course, of FIDE Elo). Apparently, according to this paper and some further studies, the best engines today are really thought to be around 3200 Elo on the FIDE scale!

Aug-30-14  nimh: I'm well aware of Regan's work. There's no doubt that he has a high reputation in computer science, and his papers on engine analysis of human games are detailed and scientifically well presented.

But what I don't approve of is the manner in which he has approached the subject. He completely ignores the fact that chess skill is not the only factor that determines the accuracy of play; he operates purely on the level of absolute accuracy.

That's why his papers have a built-in bias in favour of players who had simpler positions (Rubinstein, Smyslov, Kramnik), more thinking time (players of the past) and played objectively (Capablanca, Fine, Fischer), and against players who had more difficult positions (Morphy, Alekhine, Anand), less thinking time (players of today) and played practically (Lasker, Tal, Kasparov).

A good example is comparing performances of Lasker New York 1924 and Capablanca New York 1927.

If we were to believe Regan, Lasker had an IPR of 2580 and Capablanca 2936, a 356-point difference.

But according to Chessmetrics, Lasker performed at 2828 and Capablanca at 2827. Both had very similar time controls, and the overall quality of play could not have improved significantly within a mere three years. Both tournaments had 20 rounds. So the quality of the moves was very probably closely correlated with the performances.

Aug-31-14  SHJR: Thank you very much for your very detailed answer!
You are perfectly right, and you know this fascinating subject far better than I do: I only started reading papers on it this summer! As you pointed out, at this stage Regan's paper uses only a simple linear relation between Elo rating and precision of play. In the paper you shared in this forum, the logarithmic relation is much more elaborate, taking into account the numerous factors that change the difficulty of real play... I am eager to read further papers on this subject!
Thank you again, Nimh!
Aug-31-14  hoodrobin: <Landman: I'd be interested in a slightly different experiment: Stockfish vs. Human + Stockfish.> Looks intriguing, but how can you prevent the human from just operating his SF?
Sep-01-14  OhioChessFan: <bob: I wouldn't define the break-even odds like that, but rather as the point at which the expected outcome is close to even.

In that case, I would think that if Naka had gone in with the goal of drawing every game, then with these odds he would have been able to do that.>

Easily, I think. Or maybe there are just so many mistakes waiting to be made that the tactics overwhelm the chessboard.

Sep-01-14  Kinghunt: I think the more interesting metric wouldn't be what odds can the computer give and still win the match, but rather what odds can the computer give and not <lose> the match. Yes, Nakamura could have drawn these games if that was his goal, but playing for a draw just goes against the spirit of chess.

I, for one, would like to see a match with an even greater handicap, with the human's goal being to win (not draw) the match. For example, pawn and three moves, but if the final match score is a tie, it counts as a victory for the engine, and we need to increase the handicap more. If the human has a material advantage, they should be required to play for a win.

Sep-01-14  john barleycorn: Back in the days when Genius2 and Mephisto3 were all the craze, I found it hard to win outside the pet lines where you could count on the engine to go wrong every time you played them.

However, I read about some poster who is constructing (or dreaming of constructing) an engine with "inbuilt positional knowledge", which should make it superior to currently available engines. Looking at this match, it looks like a pipe dream.

Sep-10-14  paramount: <Kinghunt: I think the more interesting metric wouldn't be what odds can the computer give and still win the match, but rather what odds can the computer give and not <lose> the match. Yes, Nakamura could have drawn these games if that was his goal, BUT PLAYING FOR A DRAW JUST GOES AGAINST THE SPIRIT OF CHESS >

Hmmm... first of all, pardon me if I put forward some rather pointed objections to the words I have "caps locked" above.

I don't agree with that. Your "spirit of chess" would bring success against weaker opponents, but against stronger opponents that "always play for the win" spirit would bring swift destruction and certain defeat.

Don't forget that Naka played Stockfish... yes, one of the giants among chess engines, alongside Rybka, Houdini and Critter.

If he had followed your spirit and played for a win, he would have played more openly, sharply and tactically rather than positionally, and then he SURELY would have ended up losing much faster than in 90+ moves (say, 40 moves).

So besides the spirit, you have to use your brain... yes, BRAIN!!

Why in the world didn't the Vietnamese fight mano a mano, with conventional attacks in a "gentle" way? Because they weren't as strong as the US in terms of weaponry, tactics or human resources.

So they chose irregular warfare: ambushes, sabotage, raids, hit-and-run tactics, petty warfare and mobility. And in the end, can we say the US won that war??? ... NOPE!!!

So playing for a draw in chess IS NOT against the spirit of chess. In any game, as in life, playing for a draw is a strategy, depending on the opponent. In some cases even playing for a loss is a strategy.

What really goes against the spirit of the game is CHEATING. That's what you have to avoid.

cheers

Nov-12-14  nimh: I posted this on my forum, but I think it is appropriate to post it here too.

Five years ago I posted this on my forum, announcing the start of another analysis project:

<I have started a new analysis study, where I compare roughly 2650-rated players according to chessmetrics elo rating throughout the history of chess. The aim is to find which level of play corresponds to a certain rating in different decades. The methods are as follows:

I analyze games between players rated 2600-2700. Moves by both sides are the subject of the analysis. Results are displayed by decade, from the 1860s to the 2000s. In addition, I also analyze games by 2200-, 2400- and 2600-rated players from 2008-09 for comparison.

The games are randomly chosen from the decades.

The starting point of the analysis of each game depends on the period: 1860s-70s the 8th move, 1880s-90s the 9th, 1900s-10s the 10th, 1920s-30s the 11th, 1940s-50s the 12th, 1960s-70s the 13th, 1980s-90s the 14th and 2000s the 15th move.

The minimum length of a game is 20 moves plus the starting point; depending on the period, this varies from 28 to 35 moves.

Rybka 3 Default with default settings is used for the purpose of the analysis. The time per position is 5 minutes.

Moves outside [-2; 2] interval are discarded.

All blunders valued more than 2.00 are considered as 2.00.

A separate engine, Stockfish 1.4, is used to calculate the complexity of positions. The complexity of a position is determined over depths 2 to 15, by adding the difference between the evaluations of the best and second-best moves each time Stockfish 1.4 finds a new best move. Differences found at depths 10-15 are multiplied by 2.>
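
For anyone who wants to reproduce the idea, here is a minimal Python sketch of those two metrics, assuming the per-depth evaluations (in pawns, from the mover's point of view) have already been extracted from the engine; the function names, data layout and the tiny example at the end are mine, purely for illustration, and "finds a new best move" is read as "the preferred move changes from one depth to the next".

def capped_move_error(eval_best, eval_played):
    # Error of the played move vs. the engine's best move, capped at 2.00.
    # Positions whose best evaluation falls outside [-2, 2] are discarded.
    if not -2.0 <= eval_best <= 2.0:
        return None
    return min(max(eval_best - eval_played, 0.0), 2.0)

def position_complexity(per_depth_lines):
    # per_depth_lines: list of (depth, best_move, best_eval, second_eval)
    # for depths 2..15. Each time the preferred move changes from the
    # previous depth, add the gap between the best and second-best evals;
    # gaps found at depths 10-15 are counted double.
    complexity, previous_move = 0.0, None
    for depth, best_move, best_eval, second_eval in per_depth_lines:
        if previous_move is not None and best_move != previous_move:
            gap = abs(best_eval - second_eval)
            complexity += 2 * gap if depth >= 10 else gap
        previous_move = best_move
    return complexity

# Tiny made-up example: the engine switches its choice at depths 6 and 12.
lines = [(d, 'Nf3', 0.30, 0.10) for d in range(2, 6)] + \
        [(d, 'e4',  0.35, 0.20) for d in range(6, 12)] + \
        [(d, 'd4',  0.40, 0.15) for d in range(12, 16)]
print(position_complexity(lines))      # 0.15 + 2 * 0.25 = 0.65
print(capped_move_error(0.40, -1.80))  # an error of 2.20, capped to 2.00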

In 2010 I completed and uploaded the first part:
http://www.chessanalysis.ee/a%20stu...

Now I can proudly say that the long work is finally over and the final version of the study has been completed!

http://www.chessanalysis.ee/Quality...

It is divided into five sections.
Section 1 introduces the subject and offers some theoretical background. Section 2 describes the methodology in greater detail. Section 3 provides detailed results of the study. Section 4 offers several miscellaneous conclusions. Section 5 concludes the whole work and gives some ideas for future research.

The main conclusions the study provides are as follows:

1) In the middle of the 19th century players were of around 2200-2300 strength; by the end of the century they were already playing as well as modern 2500-rated players. The 2600 level was first reached in the first decades of the 20th century, the 2700 level in the 1940s. Lasker may have been the first player comparable to modern GM strength.

2) Carlsen's 3001 TPR at Nanjing 2009 appears to be overstated, because his game with the white pieces against Wang Yue was unusually inaccurate. But he still played better than Fischer against Larsen and Taimanov in 1971, or Kasparov at Linares 1999.

3) There is no simple relationship between rating systems of humans and engines.

4) Since 1970, the FIDE rating has inflated by 5 points per decade with respect to absolute strength of play, whereas the Chessmetrics rating has deflated by 38 points per decade.

5) For higher-rated players, intuition and knowledge play a relatively bigger role; stronger engines, on the other hand, rely more on the search function than on the evaluation function.

6) The biggest source of inaccuracy is errors of around 0.20, not blunders.

Nov-12-14  PinnedPiece: <nimh> Well done. I read through your study and am most impressed by your care for detail, consideration of "noise" in the data, allowance for external factors, compensation for various rating methods, etc. etc.

Your report is thorough, and exceedingly competent. The graphic data is brilliantly presented.

It is wonderful to see a person take on their shoulders a task that answers questions that thousands of people have but cannot answer; in this case: "What is the truth about chess rating inflation?"

You have certainly shown the way to determine the answer, if not actually produced the answer.

I hope you receive widespread acknowledgement of your achievement. So many on this site make claims about their various talents; few have demonstrated them to the degree you have with your chess analysis.

Again, well done.


Nov-13-14  nimh: Thanks! :)

However, I don't think it merits being called the final truth on the subject. There are still areas where improvements can be made.

As more experienced and wiser people know, when the author of a study or an article claims he or she has found the truth, it clearly must be taken with caution.
