Talk:Sunway TaihuLight

Comparison to CELL, adapteva (and others)

I've wanted to mention these here because, in my experience, laymen think of more common CPUs and GPUs when trying to understand things, and these kinds of 'scratchpad'-based chips are quite different... reading about CELL will help you understand how this chip is different, how these cores can't be compared to x86 cores, and why they aren't everywhere. They're quite different to program.

Is there a collective term for these? If so, that might help: "it is a blah-blah-blah machine (with scratchpads instead of a traditional cache hierarchy)", and then we can add all those chips to such a category. We've already got 'many core', but many-core covers Xeon Phi, which is a cache-based machine. There's a dividing line.

               |   multicore      |   manycore
  -------------+------------------+----------------------
  Cache        |   i7, Xeon       |   Xeon Phi, GPU
  -------------+------------------+----------------------
  no cache     |   CELL           |   Adapteva, SW26010

(And CELL and the SW26010 have an L2 cache, of course, but the main horsepower comes from the SPUs/"CPEs", which work out of scratchpads.)

If I'm wrong in my assessment based on currently available information, I'd be interested to see any description of architectural features I may have missed. (Do those units have some as-yet-undescribed capability? Unlike CELL, Adapteva can load/store outside of its scratchpad, but reads have huge latencies.)
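
To make the 'scratchpad' point concrete, here is a minimal sketch of the programming model I mean, assuming made-up transfer helpers (dma_get/dma_put below are stand-ins implemented with memcpy so the sketch compiles anywhere; they are not the real Cell or SW26010 intrinsics):

    #include <stddef.h>
    #include <string.h>

    /* Stand-ins for real DMA transfers between main memory and the per-core
       local store; here they just memcpy so this compiles on any machine. */
    static void dma_get(void *ls, const void *mem, size_t bytes) { memcpy(ls, mem, bytes); }
    static void dma_put(void *mem, const void *ls, size_t bytes) { memcpy(mem, ls, bytes); }

    #define TILE 1024   /* how many doubles fit in the per-core scratchpad */

    /* Scratchpad-style kernel: data is staged into a small local store by
       explicit transfers, processed there, then written back. A cache-based
       core (i7, Xeon Phi) would simply index main_mem directly and let the
       cache hierarchy handle the data movement. */
    void scale_array(double *main_mem, long n, double k)
    {
        static double tile[TILE];                 /* lives in the local store */
        for (long i = 0; i < n; i += TILE) {
            long len = (n - i < TILE) ? (n - i) : TILE;
            dma_get(tile, &main_mem[i], len * sizeof(double));
            for (long j = 0; j < len; j++)
                tile[j] *= k;                     /* compute out of the local store */
            dma_put(&main_mem[i], tile, len * sizeof(double));
        }
    }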

Fmadd (talk) 18:16, 21 June 2016 (UTC)

Paper

There's a paper being published about this, according to Jack Dongarra's report: “The Sunway TaihuLight Supercomputer: System and Applications”, by Fu H H, Liao J F, Yang J Z, et al., to appear in Sci. China Inf. Sci., 2016, 59(7): 072001, doi: 10.1007/s11432-016-5588-7. Open access here: http://engine.scichina.com/publisher/scp/journal/SCIS/59/7/10.1007/s11432-016-5588-7?slug=abstract . It also looks like the paper may be free content: it has a watermark saying "All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported license." -- The Anome (talk) 08:44, 22 June 2016 (UTC)

Reading the paper: it looks like they use OpenACC 2.0 for parallelization of code. Very nice. The SW26010 chip also has DDR3 controllers on it, so main memory hangs directly off the CPU chips. It also looks like the local per-core scratchpad can be used either as a "software-emulated" cache, albeit with poor performance, or as a local store, at presumably very high performance. So (speculation on my part) it might be quite possible that they're running a stripped-down Linux OS on the management core in each 65-core in-chip cluster, and dispatching the 64 other cores in that cluster as GPU-core-like compute workers, perhaps directly on the bare metal. The numbers would certainly work out just about right for that if they're taking the OpenACC approach. -- The Anome (talk) 09:14, 22 June 2016 (UTC)
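
For anyone unfamiliar with OpenACC, here is a generic OpenACC 2.0 sketch of that host-directed style (not code from the paper; the loop and array names are invented for illustration). The idea is that the serial parts run on the management core and the annotated loop is farmed out to the 64 compute cores, or to plain threads on any other target:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        for (long i = 0; i < N; i++)
            a[i] = (double)i;

        /* copyin/copyout express the explicit data movement between main
           memory and the accelerator; "parallel loop" spreads the iterations
           across the available compute cores. Without an OpenACC compiler the
           pragma is ignored and the loop simply runs serially. */
        #pragma acc parallel loop copyin(a[0:N]) copyout(b[0:N])
        for (long i = 0; i < N; i++)
            b[i] = 2.0 * a[i] + 1.0;

        printf("b[42] = %f\n", b[42]);
        return 0;
    }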

Yeah, reading the paper, they show a timeline of code running on it, talking about prefetching of data ("next wind block") happening on the CPU. I also remember similar software-emulated-cache code being given out in desperation for the CELL... and the immense performance difference between properly re-worked code and regular code. I think scientific code deals more with multi-dimensional arrays? That might be easier. I just remember how much of a shock it was moving gamedev workloads onto the CELL (starting with code architected for multicore CPUs, then having to move CPU-like code onto the scratchpad architecture because you only had one CPU-like core...). GPU-like code as a starting point might be better (inherently parallel), but memory addressing is still very different... GPU hardware deals with the latencies through threading and by accumulating many small reads into cache lines for you. On another note, I wonder if the SIMD engines have gather or strided load or anything... not all do. Fmadd (talk) 09:47, 22 June 2016 (UTC)
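
The "prefetch the next block while computing on the current one" pattern is essentially double buffering out of the scratchpad. A rough sketch of the idea, assuming hypothetical async_get/wait_get helpers (modelled here as a plain memcpy and a no-op, not the real Sunway DMA interface):

    #include <stddef.h>
    #include <string.h>

    static void async_get(void *ls, const void *mem, size_t bytes) { memcpy(ls, mem, bytes); }
    static void wait_get(void) { /* on real hardware: block until the transfer completes */ }

    #define TILE 512    /* doubles per scratchpad buffer */

    /* Sum nblocks tiles of TILE doubles, fetching tile b+1 while summing tile b. */
    double sum_blocks(const double *blocks, long nblocks)
    {
        static double buf[2][TILE];               /* two scratchpad buffers */
        double sum = 0.0;
        async_get(buf[0], &blocks[0], sizeof buf[0]);
        for (long b = 0; b < nblocks; b++) {
            wait_get();                           /* tile b is now resident */
            if (b + 1 < nblocks)                  /* start fetching the next tile */
                async_get(buf[(b + 1) & 1], &blocks[(b + 1) * TILE], sizeof buf[0]);
            const double *cur = buf[b & 1];
            for (long j = 0; j < TILE; j++)
                sum += cur[j];                    /* compute overlaps the next fetch */
        }
        return sum;
    }
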
Yes, it's absolutely fascinating. I think the Chinese team have done something extremely clever here; this is a very nice halfway house between a conventional NUMA cache-coherent CPU cluster architecture and a GPU-like architecture, and if I'm guessing right, they can probably make it operate in regimes more or less anywhere between the two, just by programming it. The system-wide interconnect is going to be very interesting: I wonder if it has things like cache-coherency logic and synchronization primitives built into it, and I also wonder if it and the network-on-a-chip have been explicitly designed to work together? -- The Anome (talk) 10:00, 22 June 2016 (UTC)
Reading the paper, it looks like they're taking a software, not hardware, approach to synchronization. That would make sense: provided you've already translated your problem into the right form, why use ASIC logic to do this when you already have CPUs everywhere? -- The Anome (talk) 10:15, 22 June 2016 (UTC)
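
To illustrate what a software approach to synchronization can look like, here is a generic sense-reversing barrier built from ordinary C11 atomics; this is just the textbook idea, nothing Sunway-specific:

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;     /* threads that have arrived at the barrier */
        atomic_int sense;     /* flips each time the barrier opens */
        int nthreads;
    } sw_barrier;

    void sw_barrier_init(sw_barrier *b, int nthreads)
    {
        atomic_init(&b->count, 0);
        atomic_init(&b->sense, 0);
        b->nthreads = nthreads;
    }

    /* Each thread keeps its own local_sense (initially 0) and flips it on
       every barrier episode. */
    void sw_barrier_wait(sw_barrier *b, int *local_sense)
    {
        *local_sense = !*local_sense;
        if (atomic_fetch_add(&b->count, 1) + 1 == b->nthreads) {
            atomic_store(&b->count, 0);            /* last arrival resets... */
            atomic_store(&b->sense, *local_sense); /* ...and releases everyone */
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;                                  /* spin until released */
        }
    }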

"Sunway" vs. "ShenWei"[edit]

In the article, both the names "Sunway" and "ShenWei" are used to describe computer/chip architectures. I suspect these are actually the same term in Chinese, and if so, we should settle on one particular standard English translation for the term and use it consistently throughout the various articles related to this topic.

In particular, this paper seems to use the term "Sunway" for both the CPU and supercomputer architecture; since this appears to be the term used by the Sunway team themselves, I suggest we should standardize on that use. -- The Anome (talk) 11:13, 29 June 2016 (UTC)

ShenWei is simply the romanized pinyin pronunciation of 神威. Since the paper writes it as Sunway, that would be the official English term, though it might not be a bad idea to put the pinyin next to the Chinese characters (神威·太湖之光 shénwēi·tàihú zhī guāng). 神威 means "God's might", and 太湖之光 means "the light of Taihu/Lake Tai" (one of the largest and most famous freshwater lakes in China). SifaV6 (talk) 05:16, 9 February 2017 (UTC)

Optimal wording

In the architecture section, the first half of the second sentence is worded confusingly. "Each processor chip contains 256 processing cores, and an additional four auxiliary cores for system management that are also RISC cores just more fully featured..." Is there a way to make this clearer? It's quite difficult to discern what the intended meaning is. 98.211.62.86 (talk) 21:14, 18 October 2016 (UTC)

Website link broken

Website link in the quick facts table is broken. Needs to be fixed. — Preceding unsigned comment added by D0TheMath6.28 (talk • contribs) 03:39, 20 March 2020 (UTC)