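Introduction
Previously, we covered RTL simulation of bare-metal programs. Those programs were small, only a few KB in size.
We have also run an RTL simulation of the SoC for the O_board (http://blog.csdn.net/rill_zhen/article/details/21190757). In this section, we run an RTL simulation of Linux on the ML501 platform.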
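1. Modifying the DDR2 simulation model
In the ORPSoC project for the ML501, the default DDR2 simulation model does not match the parameters of the DDR2 SDRAM actually fitted on the board, so we need to modify it.
a. Actual memory parameters
To modify the DDR2 SDRAM simulation model, we first need to understand a few concepts.
Rank, bank, row, and column: these are all logical concepts.
In addition, there are physical concepts such as channel, module, chip, and device.
The DDR2 SDRAM used on the ML501 has the following parameters: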
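The label on the memory module reads: MT4HTF3264HY-667F1 1RX16 256MB PC-5300S.
Here, 3264 denotes the module's organization, 32Meg x 64; x64 means the module's data bus (DQ) is 64 bits wide.
667 is the module's speed grade. PC-5300 is also a speed grade.
1RX16 indicates a single rank of x16 devices: the module carries 4 devices, each with a 16-bit data width, and 16 x 4 is exactly 64 bits.
256MB is, of course, the module's capacity: 256M bytes.
The label alone gives us a lot of information. The data sheet provides more detailed parameters: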
RANK: single rank.
BANK: BA is 2 bits wide, so there are 4 banks, each 256MB / 4 = 64MB.
row: address width [12:0], 13 bits in total.
column: address width [9:0], 10 bits in total.
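The figures above can be cross-checked with a few lines of Verilog. This is only a sketch: the module name ddr2_geometry_check and its localparam names are made up for illustration, and the values are the ones read off the label and the data sheet above.

```verilog
// Sanity check of the DDR2 geometry described above.
// Hypothetical helper module; not part of ORPSoC or the Micron model.
module ddr2_geometry_check;
  localparam integer BA_BITS  = 2;   // bank address bits -> 4 banks
  localparam integer ROW_BITS = 13;  // row address [12:0]
  localparam integer COL_BITS = 10;  // column address [9:0]
  localparam integer DQ_BITS  = 16;  // data width of one x16 device
  localparam integer DEVICES  = 4;   // 4 devices x 16 bits = 64-bit DQ bus

  // Addressable locations per device; each location is DQ_BITS wide.
  localparam integer LOCATIONS    = 1 << (BA_BITS + ROW_BITS + COL_BITS);
  localparam integer DEVICE_BYTES = LOCATIONS * (DQ_BITS / 8);
  localparam integer MODULE_BYTES = DEVICE_BYTES * DEVICES;
  localparam integer BANK_BYTES   = MODULE_BYTES >> BA_BITS;

  initial begin
    $display("device: %0d MB", DEVICE_BYTES >> 20);
    $display("module: %0d MB", MODULE_BYTES >> 20);
    $display("bank  : %0d MB (spread over all devices)", BANK_BYTES >> 20);
    $finish;
  end
endmodule
```

With the values above, the device and the bank both come out at 64MB and the module at 256MB, which agrees with the label.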
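b. Simulation model parameters
With the parameters of the actual memory module established, we can modify the simulation model accordingly.
Note that ddr2_model.v is only a timing model; the actual storage has to be sized by us to match the real hardware.
The parameter to change is MEM_BITS. ddr2_model.v models a single device, and each device holds a quarter of each of the 4 banks, 64MB in total. So, for the following definition: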
    // Memory Storage
`ifdef MAX_MEM
    reg     [BL_MAX*DQ_BITS-1:0] memory  [0:`MAX_SIZE-1];
`else // [8 * 16 -1:0] [0:(1<<22) -1]==>26bit==>64MB
    reg     [BL_MAX*DQ_BITS-1:0] memory  [0:`MEM_SIZE-1];
    reg     [`MAX_BITS-1:0]      address [0:`MEM_SIZE-1];
    reg     [MEM_BITS:0]         memory_index;
    reg     [MEM_BITS:0]         memory_used;
`endif
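we need to set MEM_BITS to 22: 2^22 entries of BL_MAX*DQ_BITS = 128 bits (16 bytes) each give 2^26 bytes, that is, 64MB per device.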
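As a rough check of that value, the sketch below derives MEM_BITS from the device capacity and the width of one storage entry. The module name mem_bits_check and the helper function log2ceil are hypothetical; BL_MAX and DQ_BITS take the values used in the model's parameter file.

```verilog
// Derive MEM_BITS from the device size and the storage entry width.
// Hypothetical helper module; not part of ddr2_model.v.
module mem_bits_check;
  localparam integer BL_MAX       = 8;                 // maximum burst length
  localparam integer DQ_BITS      = 16;                // x16 device
  localparam integer DEVICE_BYTES = 64 * 1024 * 1024;  // 64MB per device
  localparam integer ENTRY_BYTES  = BL_MAX * DQ_BITS / 8;       // 16 bytes
  localparam integer ENTRIES      = DEVICE_BYTES / ENTRY_BYTES; // 2^22

  // ceil(log2(n)) written out by hand so it also runs on
  // simulators without $clog2.
  function integer log2ceil(input integer n);
    integer v;
    begin
      v = n - 1;
      for (log2ceil = 0; v > 0; log2ceil = log2ceil + 1)
        v = v >> 1;
    end
  endfunction

  initial begin
    $display("entries  = %0d", ENTRIES);
    $display("MEM_BITS = %0d", log2ceil(ENTRIES));
    $finish;
  end
endmodule
```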
The complete parameter file is shown below:
/****************************************************************************************
*
*    Disclaimer   This software code and all associated documentation, comments or other
*  of Warranty:   information (collectively "Software") is provided "AS IS" without
*                 warranty of any kind. MICRON TECHNOLOGY, INC. ("MTI") EXPRESSLY
*                 DISCLAIMS ALL WARRANTIES EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
*                 TO, NONINFRINGEMENT OF THIRD PARTY RIGHTS, AND ANY IMPLIED WARRANTIES
*                 OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. MTI DOES NOT
*                 WARRANT THAT THE SOFTWARE WILL MEET YOUR REQUIREMENTS, OR THAT THE
*                 OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE.
*                 FURTHERMORE, MTI DOES NOT MAKE ANY REPRESENTATIONS REGARDING THE USE OR
*                 THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS OF ITS CORRECTNESS,
*                 ACCURACY, RELIABILITY, OR OTHERWISE. THE ENTIRE RISK ARISING OUT OF USE
*                 OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU. IN NO EVENT SHALL MTI,
*                 ITS AFFILIATED COMPANIES OR THEIR SUPPLIERS BE LIABLE FOR ANY DIRECT,
*                 INDIRECT, CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES (INCLUDING,
*                 WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION,
*                 OR LOSS OF INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE
*                 THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
*                 DAMAGES. Because some jurisdictions prohibit the exclusion or
*                 limitation of liability for consequential or incidental damages, the
*                 above limitation may not apply to you.
*
*                 Copyright 2003 Micron Technology, Inc. All rights reserved.
*
****************************************************************************************/

    // Parameters current with 512Mb datasheet rev N

    // Timing parameters based on Speed Grade

                                      // SYMBOL UNITS DESCRIPTION
`define sg37E
`define x16
//`define MAX_MEM

`ifdef sg37E
    parameter TCK_MIN          =    3750; // tCK        ps    Minimum Clock Cycle Time
    parameter TJIT_PER         =     125; // tJIT(per)  ps    Period JItter
    parameter TJIT_DUTY        =     125; // tJIT(duty) ps    Half Period Jitter
    parameter TJIT_CC          =     250; // tJIT(cc)   ps    Cycle to Cycle jitter
    parameter TERR_2PER        =     175; // tERR(nper) ps    Accumulated Error (2-cycle)
    parameter TERR_3PER        =     225; // tERR(nper) ps    Accumulated Error (3-cycle)
    parameter TERR_4PER        =     250; // tERR(nper) ps    Accumulated Error (4-cycle)
    parameter TERR_5PER        =     250; // tERR(nper) ps    Accumulated Error (5-cycle)
    parameter TERR_N1PER       =     350; // tERR(nper) ps    Accumulated Error (6-10-cycle)
    parameter TERR_N2PER       =     450; // tERR(nper) ps    Accumulated Error (11-50-cycle)
    parameter TQHS             =     400; // tQHS       ps    Data hold skew factor
    parameter TAC              =     500; // tAC        ps    DQ output access time from CK/CK#
    parameter TDS              =     100; // tDS        ps    DQ and DM input setup time relative to DQS
    parameter TDH              =     225; // tDH        ps    DQ and DM input hold time relative to DQS
    parameter TDQSCK           =     450; // tDQSCK     ps    DQS output access time from CK/CK#
    parameter TDQSQ            =     300; // tDQSQ      ps    DQS-DQ skew, DQS to last DQ valid, per group, per access
    parameter TIS              =     250; // tIS        ps    Input Setup Time
    parameter TIH              =     375; // tIH        ps    Input Hold Time
    parameter TRC              =   55000; // tRC        ps    Active to Active/Auto Refresh command time
    parameter TRCD             =   15000; // tRCD       ps    Active to Read/Write command time
    parameter TWTR             =    7500; // tWTR       ps    Write to Read command delay
    parameter TRP              =   15000; // tRP        ps    Precharge command period
    parameter TRPA             =   15000; // tRPA       ps    Precharge All period
    parameter TXARDS           =       6; // tXARDS     tCK   Exit low power active power down to a read command
    parameter TXARD            =       2; // tXARD      tCK   Exit active power down to a read command
    parameter TXP              =       2; // tXP        tCK   Exit power down to a non-read command
    parameter TANPD            =       3; // tANPD      tCK   ODT to power-down entry latency
    parameter TAXPD            =       8; // tAXPD      tCK   ODT power-down exit latency
    parameter CL_TIME          =   15000; // CL         ps    Minimum CAS Latency
`endif
                                      // ------ ----- -----------

`ifdef x16
  `ifdef sg37E
    parameter TFAW             =   50000; // tFAW       ps    Four Bank Activate window
  `endif
`endif

    // Timing Parameters

    // Mode Register
    parameter AL_MIN           =       0; // AL         tCK   Minimum Additive Latency
    parameter AL_MAX           =       6; // AL         tCK   Maximum Additive Latency
    parameter CL_MIN           =       3; // CL         tCK   Minimum CAS Latency
    parameter CL_MAX           =       7; // CL         tCK   Maximum CAS Latency
    parameter WR_MIN           =       2; // WR         tCK   Minimum Write Recovery
    parameter WR_MAX           =       8; // WR         tCK   Maximum Write Recovery
    parameter BL_MIN           =       4; // BL         tCK   Minimum Burst Length
    parameter BL_MAX           =       8; // BL         tCK   Minimum Burst Length
    // Clock
    parameter TCK_MAX          =    8000; // tCK        ps    Maximum Clock Cycle Time
    parameter TCH_MIN          =    0.48; // tCH        tCK   Minimum Clock High-Level Pulse Width
    parameter TCH_MAX          =    0.52; // tCH        tCK   Maximum Clock High-Level Pulse Width
    parameter TCL_MIN          =    0.48; // tCL        tCK   Minimum Clock Low-Level Pulse Width
    parameter TCL_MAX          =    0.52; // tCL        tCK   Maximum Clock Low-Level Pulse Width
    // Data
    parameter TLZ              =     TAC; // tLZ        ps    Data-out low-impedance window from CK/CK#
    parameter THZ              =     TAC; // tHZ        ps    Data-out high impedance window from CK/CK#
    parameter TDIPW            =    0.35; // tDIPW      tCK   DQ and DM input Pulse Width
    // Data Strobe
    parameter TDQSH            =    0.35; // tDQSH      tCK   DQS input High Pulse Width
    parameter TDQSL            =    0.35; // tDQSL      tCK   DQS input Low Pulse Width
    parameter TDSS             =    0.20; // tDSS       tCK   DQS falling edge to CLK rising (setup time)
    parameter TDSH             =    0.20; // tDSH       tCK   DQS falling edge from CLK rising (hold time)
    parameter TWPRE            =    0.35; // tWPRE      tCK   DQS Write Preamble
    parameter TWPST            =    0.40; // tWPST      tCK   DQS Write Postamble
    parameter TDQSS            =    0.25; // tDQSS      tCK   Rising clock edge to DQS/DQS# latching transition
    // Command and Address
    parameter TIPW             =     0.6; // tIPW       tCK   Control and Address input Pulse Width
    parameter TCCD             =       2; // tCCD       tCK   Cas to Cas command delay
    parameter TRAS_MIN         =   40000; // tRAS       ps    Minimum Active to Precharge command time
    parameter TRAS_MAX         =70000000; // tRAS       ps    Maximum Active to Precharge command time
    parameter TRTP             =    7500; // tRTP       ps    Read to Precharge command delay
    parameter TWR              =   15000; // tWR        ps    Write recovery time
    parameter TMRD             =       2; // tMRD       tCK   Load Mode Register command cycle time
    parameter TDLLK            =     200; // tDLLK      tCK   DLL locking time
    // Refresh
    parameter TRFC_MIN         =  105000; // tRFC       ps    Refresh to Refresh Command interval minimum value
    parameter TRFC_MAX         =70000000; // tRFC       ps    Refresh to Refresh Command Interval maximum value
    // Self Refresh
    parameter TXSNR   = TRFC_MIN + 10000; // tXSNR      ps    Exit self refesh to a non-read command
    parameter TXSRD            =     200; // tXSRD      tCK   Exit self refresh to a read command
    parameter TISXR            =     TIS; // tISXR      ps    CKE setup time during self refresh exit.
    // ODT
    parameter TAOND            =       2; // tAOND      tCK   ODT turn-on delay
    parameter TAOFD            =     2.5; // tAOFD      tCK   ODT turn-off delay
    parameter TAONPD           =    2000; // tAONPD     ps    ODT turn-on (precharge power-down mode)
    parameter TAOFPD           =    2000; // tAOFPD     ps    ODT turn-off (precharge power-down mode)
    parameter TMOD             =   12000; // tMOD       ps    ODT enable in EMR to ODT pin transition
    // Power Down
    parameter TCKE             =       3; // tCKE       tCK   CKE minimum high or low pulse width

    // Size Parameters based on Part Width

`ifdef x16
    parameter ADDR_BITS        =      13; // Address Bits
    parameter ROW_BITS         =      13; // Number of Address bits
    parameter COL_BITS         =      10; // Number of Column bits
    parameter DM_BITS          =       2; // Number of Data Mask bits
    parameter DQ_BITS          =      16; // Number of Data bits
    parameter DQS_BITS         =       2; // Number of Dqs bits
    parameter TRRD             =   10000; // tRRD       Active bank a to Active bank b command time
`endif

`ifdef QUAD_RANK
    `define DUAL_RANK // also define DUAL_RANK
    parameter CS_BITS          =       4; // Number of Chip Select Bits
    parameter RANKS            =       4; // Number of Chip Select Bits
`else `ifdef DUAL_RANK
    parameter CS_BITS          =       2; // Number of Chip Select Bits
    parameter RANKS            =       2; // Number of Chip Select Bits
`else
    parameter CS_BITS          =       2; // Number of Chip Select Bits
    parameter RANKS            =       1; // Number of Chip Select Bits
`endif `endif

    // Size Parameters
    parameter BA_BITS          =       2; // Set this parmaeter to control how many Bank Address bits
// if MEM_BITS== 14, a DQ=16 each part, DQ=64 total (4 parts) => 1MB total (256KB each)
// if MEM_BITS== 15, a DQ=16 each part, DQ=64 total (4 parts) => 2MB total (512KB each)
// if MEM_BITS== 16, a DQ=16 each part, DQ=64 total (4 parts) => 4MB total (1MB each)
// if MEM_BITS== 17, a DQ=16 each part, DQ=64 total (4 parts) => 8MB total (2MB each)
    //parameter MEM_BITS       =      14; // Number of write data bursts can be stored in memory. The default is 2^10=1024.
    parameter MEM_BITS         =      22; // Number of write data bursts can be stored in memory.
                                          //256MB total(64MB each),Rill modify from 17 to 22 140410
    parameter AP               =      10; // the address bit that controls auto-precharge and precharge-all
    parameter BL_BITS          =       3; // the number of bits required to count to MAX_BL
    parameter BO_BITS          =       2; // the number of Burst Order Bits

    // Simulation parameters
    parameter STOP_ON_ERROR    =       1; // If set to 1, the model will halt on command sequence/major errors
    parameter DEBUG            =       0; // Turn on Debug messages
    parameter BUS_DELAY        =       0; // delay in nanoseconds
    parameter RANDOM_OUT_DELAY =       0; // If set to 1, the model will put a random amount of delay on DQ/DQS during reads
    parameter RANDOM_SEED      = 711689044; //seed value for random generator.

    parameter RDQSEN_PRE       =       2; // DQS driving time prior to first read strobe
    parameter RDQSEN_PST       =       1; // DQS driving time after last read strobe
    parameter RDQS_PRE         =       2; // DQS low time prior to first read strobe
    parameter RDQS_PST         =       1; // DQS low time after last valid read strobe
    parameter RDQEN_PRE        =       0; // DQ/DM driving time prior to first read data
    parameter RDQEN_PST        =       0; // DQ/DM driving time after last read data
    parameter WDQS_PRE         =       1; // DQS half clock periods prior to first write strobe
    parameter WDQS_PST         =       1; // DQS half clock periods after last valid write strobe
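c. Modifying the preload
We now have a simulation model that matches the real hardware. Before simulating, however, we must first load the Linux image into the model. This requires understanding the internal organization of DDR2 SDRAM, the exact meaning of parameters such as BL_MAX, BL_BITS, and DQ_BITS, and the DDR2 SDRAM read/write process and timing. For these topics, see the book "Memory Systems: Cache, DRAM, Disk"; they are not repeated here.
For the Linux simulation, the memory size specified at compile time is 32MB, so I preload only 32MB. One bank is 64MB, so we only need to load bank 0. Note, however, that bank 0 is spread across the 4 devices.
Below is part of the modified orpsoc_testbench.v: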
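The claim that 32MB fits inside bank 0 can be checked with the following sketch. The module name preload_span_check and the localparam names are invented for this example. The sketch counts in "burst lines", meaning one burst of BL_MAX beats across all 4 devices, which is the unit that one iteration of the preload loop writes.

```verilog
// How many burst lines does a 32MB image need, and does it fit in bank 0?
// Hypothetical helper module; not part of the ORPSoC testbench.
module preload_span_check;
  localparam integer IMAGE_BYTES = 32 * 1024 * 1024; // memory size given to Linux
  localparam integer DEVICES     = 4;
  localparam integer BL_MAX      = 8;
  localparam integer DQ_BITS     = 16;
  localparam integer ROW_BITS    = 13;
  localparam integer COL_BITS    = 10;

  // One burst line = BL_MAX beats on the 64-bit bus = 64 bytes.
  localparam integer LINE_BYTES     = DEVICES * BL_MAX * DQ_BITS / 8;
  localparam integer LINES_NEEDED   = (IMAGE_BYTES + LINE_BYTES - 1) / LINE_BYTES;
  // A bank has 2^(ROW_BITS+COL_BITS) columns; a burst line covers BL_MAX of them.
  localparam integer LINES_PER_BANK = (1 << (ROW_BITS + COL_BITS)) / BL_MAX;

  initial begin
    $display("bytes per burst line : %0d", LINE_BYTES);
    $display("burst lines needed   : %0d", LINES_NEEDED);
    $display("burst lines per bank : %0d", LINES_PER_BANK);
    if (LINES_NEEDED <= LINES_PER_BANK)
      $display("image fits in bank 0");
    else
      $display("image spills past bank 0");
    $finish;
  end
endmodule
```

By the same arithmetic, a loop bound of 64*1024 burst lines covers 64*1024 x 64 bytes = 4MB, so the bound in ddr2_model_preload.v is worth checking against the size of the image actually being loaded.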
`ifdef XILINX_DDR2
 `ifndef GATE_SIM
   defparam dut.xilinx_ddr2_0.xilinx_ddr2_if0.ddr2_mig0.SIM_ONLY = 1;
 `endif

   always @( * ) begin
      ddr2_ck_sdram        <=  #(TPROP_PCB_CTRL) ddr2_ck_fpga;
      ddr2_ck_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_ck_n_fpga;
      ddr2_a_sdram         <=  #(TPROP_PCB_CTRL) ddr2_a_fpga;
      ddr2_ba_sdram        <=  #(TPROP_PCB_CTRL) ddr2_ba_fpga;
      ddr2_ras_n_sdram     <=  #(TPROP_PCB_CTRL) ddr2_ras_n_fpga;
      ddr2_cas_n_sdram     <=  #(TPROP_PCB_CTRL) ddr2_cas_n_fpga;
      ddr2_we_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_we_n_fpga;
      ddr2_cs_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_cs_n_fpga;
      ddr2_cke_sdram       <=  #(TPROP_PCB_CTRL) ddr2_cke_fpga;
      ddr2_odt_sdram       <=  #(TPROP_PCB_CTRL) ddr2_odt_fpga;
      ddr2_dm_sdram_tmp    <=  #(TPROP_PCB_DATA) ddr2_dm_fpga;//DM signal generation
   end // always @ ( * )

   // Model delays on bi-directional BUS
   genvar dqwd;
   generate
      for (dqwd = 0;dqwd < DQ_WIDTH;dqwd = dqwd+1) begin : dq_delay
         wiredelay #
           (
            .Delay_g     (TPROP_PCB_DATA),
            .Delay_rd    (TPROP_PCB_DATA_RD)
            )
         u_delay_dq
           (
            .A           (ddr2_dq_fpga[dqwd]),
            .B           (ddr2_dq_sdram[dqwd]),
            .reset       (rst_n)
            );
      end
   endgenerate

   genvar dqswd;
   generate
      for (dqswd = 0;dqswd < DQS_WIDTH;dqswd = dqswd+1) begin : dqs_delay
         wiredelay #
           (
            .Delay_g     (TPROP_DQS),
            .Delay_rd    (TPROP_DQS_RD)
            )
         u_delay_dqs
           (
            .A           (ddr2_dqs_fpga[dqswd]),
            .B           (ddr2_dqs_sdram[dqswd]),
            .reset       (rst_n)
            );

         wiredelay #
           (
            .Delay_g     (TPROP_DQS),
            .Delay_rd    (TPROP_DQS_RD)
            )
         u_delay_dqs_n
           (
            .A           (ddr2_dqs_n_fpga[dqswd]),
            .B           (ddr2_dqs_n_sdram[dqswd]),
            .reset       (rst_n)
            );
      end
   endgenerate

   assign ddr2_dm_sdram = ddr2_dm_sdram_tmp;

   //parameter NUM_PROGRAM_WORDS=1048576;
   parameter NUM_PROGRAM_WORDS=8388608; //Rill modify from 1048576
   integer ram_ptr, program_word_ptr, k;
   reg [31:0] tmp_program_word;
   reg [31:0] program_array [0:NUM_PROGRAM_WORDS-1]; // 1M words = 4MB//8M words = 32MB
   reg [8*16-1:0] ddr2_ram_mem_line; //8*16-bits= 8 shorts (half-words)

   genvar i, j;
   generate
      // if the data width is multiple of 16
      for(j = 0; j < CS_NUM; j = j+1) begin : gen_cs // Loop of 1
         for(i = 0; i < DQS_WIDTH/2; i = i+1) begin : gen // Loop of 4 (DQS_WIDTH=8)
            initial
              begin
 `ifdef PRELOAD_RAM
  `include "ddr2_model_preload.v"
 `endif
              end

            ddr2_model u_mem0
              (
               .ck        (ddr2_ck_sdram[CLK_WIDTH*i/DQS_WIDTH]),
               .ck_n      (ddr2_ck_n_sdram[CLK_WIDTH*i/DQS_WIDTH]),
               .cke       (ddr2_cke_sdram[j]),
               .cs_n      (ddr2_cs_n_sdram[CS_WIDTH*i/DQS_WIDTH]),
               .ras_n     (ddr2_ras_n_sdram),
               .cas_n     (ddr2_cas_n_sdram),
               .we_n      (ddr2_we_n_sdram),
               .dm_rdqs   (ddr2_dm_sdram[(2*(i+1))-1 : i*2]),
               .ba        (ddr2_ba_sdram),
               .addr      (ddr2_a_sdram),
               .dq        (ddr2_dq_sdram[(16*(i+1))-1 : i*16]),
               .dqs       (ddr2_dqs_sdram[(2*(i+1))-1 : i*2]),
               .dqs_n     (ddr2_dqs_n_sdram[(2*(i+1))-1 : i*2]),
               .rdqs_n    (),
               .odt       (ddr2_odt_sdram[ODT_WIDTH*i/DQS_WIDTH])
               );
         end
      end
   endgenerate
`endif
Below is the modified ddr2_model_preload.v:
// File intended to be included in the generate statement for each DDR2 part.
// The following loads a vmem file, "sram.vmem" by default, into the SDRAM.

// Wait until the DDR memory is initialised, and then magically
// load it
$display("%t: wait phy_init_done",$time);
@(posedge dut.xilinx_ddr2_0.xilinx_ddr2_if0.phy_init_done);
$display("%t: Loading DDR2",$time);

$readmemh("sram.vmem", program_array);
/* Now dish it out to the DDR2 model's memory */
// Each ram_ptr value is one burst line: 8 beats x 64 bits = 64 bytes across
// the 4 devices (16 words of program_array). A bound of 64*1024 therefore
// covers 4MB; covering the whole 32MB array would take 512*1024.
for(ram_ptr = 0 ; ram_ptr < 64*1024/*4096*/ ; ram_ptr = ram_ptr + 1)
  begin
     // Construct the burst line, with every second word from where we
     // started, and picking the correct half of the word with i%2
     program_word_ptr = ram_ptr * 16 + (i/2) ; // Start on word0 or word1
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[15:0] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[31:16] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[47:32] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[63:48] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[79:64] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[95:80] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[111:96] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[127:112] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     // Put this assembled line into the RAM using its memory writing TASK
     //                  (bank ,row            , { col }                 , data
     u_mem0.memory_write(2'b00,ram_ptr[19:7], {ram_ptr[6:0],3'b000},ddr2_ram_mem_line);

     //$display("Writing 0x%h, ramline=%d",ddr2_ram_mem_line, ram_ptr);

  end // for (ram_ptr = 0 ; ram_ptr < ...
$display("(%t) * DDR2 RAM %1d preloaded",$time, i);
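Two points deserve attention here:
First, program_array[] is contiguous and linear, but the 4 devices are not organized that way, so the data must be rearranged into the actual DDR2 SDRAM organization before calling memory_write().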
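The rearrangement can be seen in isolation with a small standalone sketch. The module lane_mapping_demo, the function lane_halfword, and the test pattern are all hypothetical; only the index arithmetic (16 words per burst line, two words per beat, device index selecting the word and the half) follows ddr2_model_preload.v.

```verilog
// Which 16-bit half-word does each x16 device receive on each beat?
// Hypothetical demo module; only the index arithmetic mirrors the preload file.
module lane_mapping_demo;
  reg [31:0]  program_array [0:15]; // one 64-byte burst line = 16 words
  reg [127:0] line;                 // 8 beats x 16 bits for one device
  integer     dev, beat, w;

  function [15:0] lane_halfword(input integer ram_ptr,
                                input integer dev_idx,
                                input integer beat_idx);
    reg [31:0] word;
    begin
      // Each beat consumes two 32-bit words (64 bits across 4 devices):
      // devices 0 and 1 share the even word, devices 2 and 3 the odd word.
      word = program_array[ram_ptr * 16 + beat_idx * 2 + dev_idx / 2];
      // Even devices take the lower half, odd devices the upper half.
      lane_halfword = (dev_idx % 2) ? word[31:16] : word[15:0];
    end
  endfunction

  initial begin
    // Tag each half-word so its origin is visible:
    // upper half = 0x1000 + word index, lower half = word index.
    for (w = 0; w < 16; w = w + 1)
      program_array[w] = ((w + 'h1000) << 16) | w;

    for (dev = 0; dev < 4; dev = dev + 1) begin
      for (beat = 0; beat < 8; beat = beat + 1)
        line[beat*16 +: 16] = lane_halfword(0, dev, beat);
      $display("device %0d line = %h", dev, line);
    end
    $finish;
  end
endmodule
```

In this sketch, device 0 collects the lower halves of words 0, 2, ..., 14 and device 3 collects the upper halves of words 1, 3, ..., 15, which is the interleaving that the unrolled assignments in the preload file spell out by hand.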
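Second, since we preload only 32MB, less than one bank, the bank address is always 2'b00. If the program being simulated later grows beyond one bank, the bank address will have to be changed accordingly.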
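One possible way to extend the addressing beyond bank 0 is sketched below. It assumes that the memory controller places the bank bits directly above the row bits, so that a linear burst-line index splits as {bank, row, column}. That mapping is an assumption and should be confirmed against the controller configuration before relying on it. The module bank_split_demo and the task show are hypothetical names.

```verilog
// Split a linear burst-line index into bank, row, and column.
// ASSUMPTION: linear order is {bank[1:0], row[12:0], col[9:3]}, i.e. the
// bank bits sit above the row bits. Verify this against the controller.
module bank_split_demo;
  task show(input [21:0] ptr);
    begin
      $display("line %0d -> bank %0d, row %0d, col %0d",
               ptr, ptr[21:20], ptr[19:7], {ptr[6:0], 3'b000});
    end
  endtask

  initial begin
    show(22'd0);        // first line of bank 0
    show(22'd127);      // last line of row 0
    show(22'd128);      // first line of row 1
    show(22'h0FFFFF);   // last line of bank 0
    show(22'h100000);   // first line of the next bank under this assumption
    $finish;
  end
endmodule
```

Under that assumption, the call in the preload file would pass ram_ptr[21:20] instead of the constant 2'b00, and program_array and the loop bound would need to grow to match the larger image.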
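2. Verification
Modify orpsocv2/sw/makefile.inc so that it uses an existing ELF file to generate the vmem file. The modification was described earlier and is not repeated here.
Run: make rtl-test TEST=linux PRELOAD_RAM=1
This produces the Linux simulation result, which matches what is observed on the actual board.
Needless to say, because Linux is a large program, waiting for it to finish booting takes a very long time.
Below is part of the output: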
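3. Summary
When I worked on embedded systems, I was very familiar with the Linux boot messages, but knowing the state of every device on the board at every clock during Linux boot was almost impossible. Now, with RTL simulation, it can be done.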
enjoy!