Speeding things up with table lookups - Cellでがんばってみたログ

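To improve processing speed, I brought back the table-based processing (though not all of it). These lookup tables are an optimization that lame has always had, but last time I left them out to free up space in the LS (local store). This time I free up LS space by transferring the PCM data and the generated MP3 data through the MFC while processing, so that no full-size buffer has to be held. The transfers are double-buffered, so I believe this also gives the usual latency reduction.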

The patch is here.

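Added May 16
The patch above has a bug: it never runs the log_table initialization. This also affects execution speed (it makes it slower), so the results below are not reliable figures. Please keep that in mind.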

Key points of the changes

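The area holding the PCM data and the area holding the generated MP3 data were both quite large, so I changed the code to process the data while transferring it with double buffering. This reduces LS usage and also reduces latency, although in this case latency is a negligible part of the overall time. I also reduced LS usage elsewhere by transferring data from main memory only when it is needed.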
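The double-buffering pattern described above can be sketched in portable C as follows. This is only an illustration under stated assumptions, not the code from the patch: `CHUNK`, `dma_get_start`, `dma_wait`, `process_chunk` and `process_stream` are hypothetical names, and `memcpy` stands in for the asynchronous MFC transfer (on a real SPU this would be something like `mfc_get` followed by a tag-status wait from `spu_mfcio.h`).

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 1024  /* hypothetical chunk size in samples */

/* Stand-in for starting an asynchronous DMA from main memory into LS.
   Assumption: on an SPU this would be mfc_get(); memcpy finishes at once. */
static void dma_get_start(short *ls_dst, const short *ea_src, size_t n)
{
    memcpy(ls_dst, ea_src, n * sizeof(short));
}

/* Stand-in for waiting on a DMA tag; nothing to wait for with memcpy. */
static void dma_wait(int tag) { (void)tag; }

/* Hypothetical per-chunk work: sums the samples instead of encoding them. */
static long process_chunk(const short *buf, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}

/* Walk the whole stream while holding only two CHUNK-sized buffers. */
long process_stream(const short *pcm, size_t total)
{
    static short buf[2][CHUNK];
    long result = 0;
    size_t offset = 0;
    size_t cur_n = total < CHUNK ? total : CHUNK;
    int cur = 0;

    if (total == 0)
        return 0;

    dma_get_start(buf[cur], pcm, cur_n);          /* fetch the first chunk */

    while (cur_n > 0) {
        size_t next_off = offset + cur_n;
        size_t remaining = total - next_off;
        size_t next_n = remaining < CHUNK ? remaining : CHUNK;
        int nxt = cur ^ 1;

        if (next_n > 0)                           /* start fetching the next chunk */
            dma_get_start(buf[nxt], pcm + next_off, next_n);

        dma_wait(cur);                            /* current chunk must have arrived */
        result += process_chunk(buf[cur], cur_n); /* work overlaps the next transfer */

        offset = next_off;
        cur_n = next_n;
        cur = nxt;
    }
    return result;
}
```

The LS footprint is two chunks regardless of the input length, which is the point of the change. The overlap of transfer and computation only appears once the stand-ins are replaced by real asynchronous DMA.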

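As a result of these changes, the main table-based optimizations in lame are back. The pow43 table is large for how little it appears to be used, so that part has not been switched to a table lookup.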
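As a rough illustration of the trade-off between table lookups and LS space, here is a minimal sketch of a precomputed table. It is not one of lame's actual tables: `GAIN_STEPS`, `gain_table` and the functions below are hypothetical, and the values 2^(i/4) were chosen only because they can be generated without the math library.

```c
#include <stddef.h>

#define GAIN_STEPS 256                        /* hypothetical number of entries */
#define QUARTER_ROOT_OF_2 1.189207115002721   /* 2^(1/4) */

static float gain_table[GAIN_STEPS];

/* Slow path: recompute 2^(i/4) on every call by repeated multiplication. */
float gain_compute(int i)
{
    double g = 1.0;
    for (int k = 0; k < i; k++)
        g *= QUARTER_ROOT_OF_2;
    return (float)g;
}

/* One-time setup. If this call is skipped, the table stays all zeros. */
void gain_table_init(void)
{
    for (int i = 0; i < GAIN_STEPS; i++)
        gain_table[i] = gain_compute(i);
}

/* Fast path: a single load replaces the loop above. */
float gain_lookup(int i)
{
    return gain_table[i];
}

/* What the table costs in local store, in bytes. */
size_t gain_table_bytes(void)
{
    return sizeof gain_table;
}
```

Each table trades a fixed amount of LS for fewer arithmetic instructions per lookup, so a large table that is rarely read is a poor bargain when LS is tight.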

As before, only one SPU is used, and the PPU and SPU do not run in parallel.

Results

The results this time are as follows. Of course, I used the same wav file as last time.

376993507357: (246620604): LAME version 3.96.1 (http://lame.sourceforge.net/)
376993532028: (246645220): Using polyphase lowpass filter, transition band: 17249 Hz - 17782 Hz
376993541245: (246654413): Encoding starwars.wav to starwars.wav.mp3
376993548630: (246661778): Encoding as 44.1 kHz 128 kbps j-stereo MPEG-1 Layer III (11x) qval=3
376993677593: (246790387):     Frame          |  CPU time/estim | REAL time/estim | play/CPU |    ETA
376993682024: (246794807):      0/       ( 0%)|    0:00/     :  |    0:00/     :  |         x|     :
376994041887: (247154575):      0/117    ( 0%)|    0:00/    0:00|    0:00/    0:00|   0.0000x|    0:00
377346413049: (599981776):     50/117    (43%)|    0:00/    0:00|    0:00/    0:00|   130.61x|    0:00
377764491603: (1018405980):    100/117    (85%)|    0:00/    0:00|    0:00/    0:00|   261.22x|    0:00
377898663193: (1152690216):    114/117    (97%)|    0:00/    0:00|    0:00/    0:00|   297.80x|    0:00
377898758535: (1152785300): Writing LAME Tag...done
377898764143: (1152790890): ReplayGain: -7.5dB

(157073253726-156167967751)/3000000000=0.302 [sec]

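It does not reach the speed of the PPU, but at roughly 10x real time it is a considerable improvement over last time. lame's original optimizations are not to be underestimated.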

The performance analysis results are below. This time I used the same 3-second wav file as above. The Athlon64 is nice and fast.

systemsim % mysim spu 6 display statistics
SPU DD3.0
***
Total Cycle count               1491419225
Total Instruction count         643
Total CPI                       2319469.88
***
Performance Cycle count         1491419225
Performance Instruction count   864859095 (807032455)
Performance CPI                 1.72 (1.85)

Branch instructions             39562467
Branch taken                    32060387
Branch not taken                7502080

Hint instructions               9964429
Hint hit                        24711617

Contention at LS between Load/Store and Prefetch 24902328

Single cycle                                         550683203 ( 36.9%)
Dual cycle                                           128174626 (  8.6%)
Nop cycle                                             17392180 (  1.2%)
Stall due to branch miss                             154456872 ( 10.4%)
Stall due to prefetch miss                                1422 (  0.0%)
Stall due to dependency                              575093941 ( 38.6%)
Stall due to fp resource conflict                            0 (  0.0%)
Stall due to waiting for hint target                  22498559 (  1.5%)
Stall due to dp pipeline                                     6 (  0.0%)
Channel stall cycle                                   43118407 (  2.9%)
SPU Initialization cycle                                     9 (  0.0%)
-----------------------------------------------------------------------
Total cycle                                         1491419225 (100.0%)

Stall cycles due to dependency on each pipelines
 FX2        62338407 ( 10.8% of all dependency stalls)
 SHUF       126397818 ( 22.0% of all dependency stalls)
 FX3        13718179 (  2.4% of all dependency stalls)
 LS         170538397 ( 29.7% of all dependency stalls)
 BR         158320 (  0.0% of all dependency stalls)
 SPR        1320 (  0.0% of all dependency stalls)
 LNOP       0 (  0.0% of all dependency stalls)
 NOP        0 (  0.0% of all dependency stalls)
 FXB        0 (  0.0% of all dependency stalls)
 FP6        146239895 ( 25.4% of all dependency stalls)
 FP7        55701599 (  9.7% of all dependency stalls)
 FPD        6 (  0.0% of all dependency stalls)

The number of used registers are 128, the used ratio is 100.00
dumped pipeline stats
systemsim %

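Here the run takes 1491419225 cycles (about 0.5 seconds, so roughly 6x real time), which is worse than the result above. As for which one to trust, this one is probably closer to real hardware.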
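The conversion from cycle count to a real-time factor can be written out as a small sketch. The function names are hypothetical, and the 3.0 GHz clock is an assumption taken from the 3000000000 divisor used in the earlier timing calculation.

```c
/* Seconds taken for a given cycle count at a given clock frequency. */
double cycles_to_seconds(double cycles, double clock_hz)
{
    return cycles / clock_hz;
}

/* How many times faster than playback speed the encode ran. */
double realtime_factor(double audio_seconds, double encode_seconds)
{
    return audio_seconds / encode_seconds;
}
```

With the simulator figure, `cycles_to_seconds(1491419225.0, 3.0e9)` comes to just under 0.5 seconds, and `realtime_factor(3.0, ...)` on that value gives about 6 for the 3-second wav file. The measured 0.302 seconds corresponds to about 10 in the same way.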

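CPI improved slightly over last time and the share of dual-issue cycles also went up, but CPI is still 1.72 and dual issue accounts for only about 8.6% of all cycles. As before, more than half of the cycles are stalls. Put positively, there is still plenty of room for speedup. Incidentally, adding the -funroll-loops option makes the binary too large to link.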
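The figures quoted here can be rederived from the statistics dump with two small helpers. This is just a sketch for checking the arithmetic; `cpi` and `cycle_share` are hypothetical names, and the inputs are the values printed by the simulator for the 3-second file.

```c
/* Cycles per instruction. */
double cpi(unsigned long long cycles, unsigned long long instructions)
{
    return (double)cycles / (double)instructions;
}

/* Fraction of all cycles spent in one category (0.0 to 1.0). */
double cycle_share(unsigned long long part, unsigned long long total)
{
    return (double)part / (double)total;
}
```

For example, `cpi(1491419225ULL, 864859095ULL)` is about 1.72. Adding up the stall lines (branch miss, prefetch miss, dependency, fp resource conflict, hint target, dp pipeline, channel) gives 795169207 cycles, and `cycle_share(795169207ULL, 1491419225ULL)` is about 0.53, which is where "more than half" comes from.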

To allow the individual figures to be compared, I am also posting the performance analysis for the same 0.5-second wav file used last time.

systemsim % mysim spu 7 display statistics
SPU DD3.0
***
Total Cycle count               223957771
Total Instruction count         643
Total CPI                       348301.34
***
Performance Cycle count         223957771
Performance Instruction count   126329269 (117534141)
Performance CPI                 1.77 (1.91)

Branch instructions             5943491
Branch taken                    4706876
Branch not taken                1236615

Hint instructions               1537589
Hint hit                        3513938

Contention at LS between Load/Store and Prefetch 3658139

Single cycle                                          78213275 ( 34.9%)
Dual cycle                                            19660433 (  8.8%)
Nop cycle                                              2790498 (  1.2%)
Stall due to branch miss                              25270000 ( 11.3%)
Stall due to prefetch miss                                 282 (  0.0%)
Stall due to dependency                               86015584 ( 38.4%)
Stall due to fp resource conflict                            0 (  0.0%)
Stall due to waiting for hint target                   3208465 (  1.4%)
Stall due to dp pipeline                                     6 (  0.0%)
Channel stall cycle                                    8799219 (  3.9%)
SPU Initialization cycle                                     9 (  0.0%)
-----------------------------------------------------------------------
Total cycle                                          223957771 (100.0%)

Stall cycles due to dependency on each pipelines
 FX2        8040486 (  9.3% of all dependency stalls)
 SHUF       19898499 ( 23.1% of all dependency stalls)
 FX3        2212448 (  2.6% of all dependency stalls)
 LS         25301434 ( 29.4% of all dependency stalls)
 BR         44546 (  0.1% of all dependency stalls)
 SPR        275 (  0.0% of all dependency stalls)
 LNOP       0 (  0.0% of all dependency stalls)
 NOP        0 (  0.0% of all dependency stalls)
 FXB        0 (  0.0% of all dependency stalls)
 FP6        23202814 ( 27.0% of all dependency stalls)
 FP7        7315076 (  8.5% of all dependency stalls)
 FPD        6 (  0.0% of all dependency stalls)

The number of used registers are 128, the used ratio is 100.00
dumped pipeline stats
systemsim %

Discussion

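Last time about 40% of all cycles were dependency stalls, and about 60% of those were floating-point dependencies. This time the floating-point share is not that high: FP6 and FP7 together account for about 35% of the dependency stalls. Perhaps that is because the table lookups reduced the amount of floating-point arithmetic itself.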

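What stands out this time is "Stall due to branch miss". I am told that __builtin_expect can be used to give branch-prediction hints, so I will try that. After that, I suppose the next step is to find the bottlenecks and then do loop unrolling and SIMD conversion.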
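For reference, a typical way of using __builtin_expect looks like the sketch below. The `likely`/`unlikely` macros are a common convention rather than part of GCC, and `clip_samples` is a hypothetical function, not code from lame. Whether such hints actually reduce the branch-miss stalls here is something that still has to be measured.

```c
/* Common wrappers around the GCC builtin; !! normalizes the value to 0 or 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical hot loop: clamp samples to [-limit, limit], assuming that
   out-of-range samples are rare. Returns how many samples were clamped. */
int clip_samples(short *samples, int n, short limit)
{
    int clipped = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(samples[i] > limit)) {
            samples[i] = limit;
            clipped++;
        } else if (unlikely(samples[i] < -limit)) {
            samples[i] = (short)-limit;
            clipped++;
        }
    }
    return clipped;
}
```

The hint does not change what the code computes. It only tells the compiler which path to treat as the common one when laying out branches.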