Hi,
I'd like to run KataGo on my M1 MacBook with native Core ML support, but unfortunately I don't know the framework and have no experience with macOS development.
Hence I'd like to open this issue and ask the community whether they would be willing to financially support the development of a Core ML / Metal GPU backend. @lightvector said it would be very time-consuming.
Thoughts?
P.S.: Here is the benchmark output for g170-b40c256x2-s5095420928-d1229425124.bin.gz
on my MacBook Pro 14" M1 Pro:
```
katago benchmark -config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg -model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
2022-08-01 10:52:07+0200: Loading model and initializing benchmark...
2022-08-01 10:52:07+0200: Testing with default positions for board size: 19
2022-08-01 10:52:07+0200: nnRandSeed0 = 17240075635628857784
2022-08-01 10:52:07+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:07+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:08+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:08+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:08+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:08+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:08+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:08+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:10+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
2022-08-01 10:52:10+0200: Loaded config /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-01 10:52:10+0200: Loaded model /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz
Testing using 800 visits.
If you have a good GPU, you might increase this using "-visits N" to get more accurate results.
If you have a weak GPU and this is taking forever, you can decrease it instead to finish the benchmark faster.
You are currently using the OpenCL version of KataGo.
If you have a strong GPU capable of FP16 tensor cores (e.g. RTX2080), using the Cuda version of KataGo instead may give a mild performance boost.
Your GTP config is currently set to use numSearchThreads = 6
Automatically trying different numbers of threads to home in on the best (board size 19x19):
2022-08-01 10:52:10+0200: GPU -1 finishing, processed 5 rows 5 batches
2022-08-01 10:52:10+0200: nnRandSeed0 = 5768574494763223581
2022-08-01 10:52:10+0200: After dedups: nnModelFile0 = /opt/homebrew/Cellar/katago/1.11.0/share/katago/g170-b40c256x2-s5095420928-d1229425124.bin.gz useFP16 auto useNHWC auto
2022-08-01 10:52:10+0200: Initializing neural net buffer to be size 19 * 19 exactly
2022-08-01 10:52:11+0200: Found OpenCL Platform 0: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2022-08-01 10:52:11+0200: Found OpenCL Device 0: Apple M1 Pro (Apple) (score 1000102)
2022-08-01 10:52:11+0200: Creating context for OpenCL Platform: Apple (Apple) (OpenCL 1.2 (Jun 17 2022 18:58:24))
2022-08-01 10:52:11+0200: Using OpenCL Device 0: Apple M1 Pro (Apple) OpenCL 1.2 (Extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images )
2022-08-01 10:52:11+0200: Loaded tuning parameters from: /Users/hologos/.katago/opencltuning/tune8_gpuAppleM1Pro_x19_y19_c256_mv8.txt
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model version 8
2022-08-01 10:52:11+0200: OpenCL backend thread 0: Model name: g170-b40c256x2-s5095420928-d1229425124
2022-08-01 10:52:13+0200: OpenCL backend thread 0: FP16Storage true FP16Compute false FP16TensorCores false
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs)
numSearchThreads = 8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs)
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 112.16 nnEvals/s = 95.35 nnBatches/s = 38.29 avgBatchSize = 2.49 (71.7 secs) (EloDiff baseline)
numSearchThreads = 8: 10 / 10 positions, visits/s = 144.41 nnEvals/s = 118.71 nnBatches/s = 29.93 avgBatchSize = 3.97 (55.9 secs) (EloDiff +73)
numSearchThreads = 10: 10 / 10 positions, visits/s = 152.56 nnEvals/s = 124.96 nnBatches/s = 25.26 avgBatchSize = 4.95 (53.0 secs) (EloDiff +80)
numSearchThreads = 12: 10 / 10 positions, visits/s = 157.41 nnEvals/s = 130.69 nnBatches/s = 22.07 avgBatchSize = 5.92 (51.5 secs) (EloDiff +78)
numSearchThreads = 16: 10 / 10 positions, visits/s = 172.77 nnEvals/s = 144.29 nnBatches/s = 18.30 avgBatchSize = 7.89 (47.2 secs) (EloDiff +88)
numSearchThreads = 20: 10 / 10 positions, visits/s = 176.72 nnEvals/s = 150.78 nnBatches/s = 15.34 avgBatchSize = 9.83 (46.3 secs) (EloDiff +70)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 5: (baseline)
numSearchThreads = 8: +73 Elo
numSearchThreads = 10: +80 Elo
numSearchThreads = 12: +78 Elo
numSearchThreads = 16: +88 Elo (recommended)
numSearchThreads = 20: +70 Elo
If you care about performance, you may want to edit numSearchThreads in /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg based on the above results!
If you intend to do much longer searches, configure the seconds per game move you expect with the '-time' flag and benchmark again.
If you intend to do short or fixed-visit searches, use lower numSearchThreads for better strength, high threads will weaken strength.
If interested see also other notes about performance and mem usage in the top of /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
2022-08-01 10:57:39+0200: GPU -1 finishing, processed 40619 rows 8472 batches
```
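For reference, acting on the benchmark's recommendation amounts to changing a single line in the config it points at. This is just a sketch of what I'd change, assuming the Homebrew paths from my run above and the thread count (16) that scored best in this particular benchmark:

```
# /opt/homebrew/Cellar/katago/1.11.0/share/katago/configs/gtp_example.cfg
# was: numSearchThreads = 6
numSearchThreads = 16
```

After editing, the log above suggests re-running the benchmark with the `-time` flag set to the expected seconds per move if longer searches are intended, since the best thread count depends on the search budget.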