Hi,
I'd like to run KataGo on my M1 MacBook with native Core ML support, but unfortunately I don't know the framework and have no experience with macOS development.
Hence I'd like to open this issue and ask the community whether they would be willing to financially support the development of a Core ML / Metal GPU backend. @lightvector said it would be very time-consuming.
Thoughts?
P.S.: Here is the benchmark output for g170-b40c256x2-s5095420928-d1229425124.bin.gz
on my MacBook Pro 14" M1 Pro:
```
katago benchmark -config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-08-01 10:52:07+0200: Loading model and initializing benchmark...
2022-08-01 10:52:07+0200: Testing with default positions for board size: 19
2022-08-01 10:52:07+0200: nnRandSeed0 = 17240075635628857784
2022-08-01 10:52:07+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:07+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:08+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:08+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:08+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:08+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:10+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
2022-08-01 10:52:10+0200: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-01 10:52:10+0200: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-08-01 10:52:10+0200: GPU -1 finishing, processed 5 rows 5 batches
2022-08-01 10:52:10+0200: nnRandSeed0 = 5768574494763223581
2022-08-01 10:52:10+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:10+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:11+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:11+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:11+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:11+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:13+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs)
numSearchThreads = 8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs)
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs) (EloDiff baseline)
numSearchThreads = 8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs) (EloDiff +73)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs) (EloDiff +80)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs) (EloDiff +88)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs) (EloDiff +70)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 5: (baseline)
numSearchThreads = 8: +73 Elo
numSearchThreads = 10: +80 Elo
numSearchThreads = 12: +78 Elo
numSearchThreads = 16: +88 Elo (recommended)
numSearchThreads = 20: +70 Elo
If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-01 10:57:39+0200: GPU -1 finishing, processed 40619 rows 8472 batches
```
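For reference, acting on the benchmark's recommendation amounts to changing a single line in the config it points at. This is just a sketch of what I'd change, assuming the Homebrew paths from my run above and the thread count (16) that scored best in this particular benchmark:

```
# /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
# was: numSearchThreads = 6
numSearchThreads = 16
```

After editing, the log above suggests re-running the benchmark with the `-time` flag set to the expected seconds per move if longer searches are intended, since the best thread count depends on the search budget.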