BTC FPGA Miner Challenge -- Best Hashrate, Lowest power per hash

  • Status: Open
  • Prize: $840
  • Entries Received: 12

Contest Brief

See ALL Comments.

For 10 years, FPGA BTC mining implementations have been poor: excessively slow, power-hungry designs. Researchers have presented dozens of papers on how to make this better. This is your chance to get it right. Read this paper, then look at their Verilog here to get a good understanding of state-of-the-art FPGA BTC mining in Verilog. Then apply that to YOUR FORK of the old standard in with an updated proxy for getwork.

Clues follow to make FPGA BTC mining faster, smaller, and lower power, so that you will have REAL bragging rights for the fastest, smallest, lowest-power FPGA miners.

1) The SHA256 compression is seeded with 256 bits of very random constants and forms a large shift register, as the seed text and W[n] expansion results are mixed in during the next 64+64 rounds of the 2nd and 3rd SHA256 passes. The Double-CME-SHA256 paper shows how to factor unnecessary work away for a leaner pipelined design, but it misses removing a completely wasteful group of registers that do a simple shift operation. As such, the minimum fully unrolled digest requires about 256*(64+64) = 32,768 registers, of which 25% capture the round compression results and 75% simply copy data in the shift operation. See GitHub nalex87/Verilog-SHA256-1.

2) With heavy pipelining, register setup and hold times become a significant part of each clock cycle, and they dominate as the designer tries to reach excessively high clock rates. This fails when high bandwidth, low resource use, and low power are the three critical optimization metrics for a successful reconfigurable computing project. EVERY CLOCK on pipeline registers burns power, so a power-optimal design should use combinatorial logic with the least routing losses and the fewest clock cycles per hash. There is a sweet spot in this combinatorial length: power is wasted when the path becomes too long and cascaded gates oscillate on multiple uncertain inputs.

3) Cascaded expressions create unnecessary time delays that may not be recognized and optimized out by the tools. Expressions like A + B + C + D + E + F + G + H (7 serial adder delays) should be written as (((A + B) + (C + D)) + ((E + F) + (G + H))) (3 serial adder delays), with each matched pair of additions done in parallel. Most synthesized arithmetic expressions are built from 3-2 or 4-2 full adder compressors (A+B+Carry), which even in tree form can still generate some uncertainty oscillations. On an FPGA with 6-input LUTs and a hardware carry circuit, a lower-power design implements 6-3 or 7-3 full adder compressors, plus carry look-ahead, whenever three or more sequential operators can be combined in parallel. Carefully map out and optimize the latency paths caused by cascaded operations and carry propagation. Use floor planning to minimize routing latencies.

4) To extract the last few percent of bandwidth, resource, and power optimization, take word-wide synthesis of expressions completely out of the Verilog and reduce each bit lane to ANF (algebraic normal form), with ANF product terms shared across all expressions. This needs specialized synthesis and floor planning.

5) Use Gray code for the nonce and other counters, for stable, lower peak currents at clock edges.
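
To make clue 5 concrete, here is a minimal C sketch that compares bit toggles per increment for a plain binary counter and a Gray-coded one. The helper names (bin_to_gray, gray_to_bin, count_toggles) are mine, not from any of the referenced miners, and this only models toggle counts in software. Whether the hasher consumes the Gray value directly or converts it back to binary is a design choice left open here.

```c
#include <stdint.h>

/* Convert a binary count to reflected Gray code. */
static uint32_t bin_to_gray(uint32_t n)
{
    return n ^ (n >> 1);
}

/* Convert Gray code back to binary with a prefix XOR. */
static uint32_t gray_to_bin(uint32_t g)
{
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}

/* Count set bits, i.e. how many register bits toggle. */
static unsigned popcount32(uint32_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Total bit toggles over `steps` increments starting at `start`,
   for a binary counter and for the same count held in Gray code. */
static void count_toggles(uint32_t start, uint32_t steps,
                          uint64_t *binary_toggles, uint64_t *gray_toggles)
{
    *binary_toggles = 0;
    *gray_toggles = 0;
    for (uint32_t i = 0; i < steps; i++) {
        uint32_t a = start + i;
        uint32_t b = a + 1;
        *binary_toggles += popcount32(a ^ b);
        *gray_toggles += popcount32(bin_to_gray(a) ^ bin_to_gray(b));
    }
}
```

The Gray counter flips exactly one bit per increment, while the binary counter averages close to two and occasionally flips many bits at once, which is where the peak current comes from.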

The best entry wins on averages across the Xilinx XC7Z010, Altera 10M08 Dev Kit, and GOWIN GW1NR-UV9 (Tang Nano 9K), with an RPI Pico controller. Weighting: 10% speed and 10% power for each FPGA vendor, then real mining with all three boards concurrently served by the RPI controller: 20% solo mining and 20% pool mining, for 12 hours each (report the mining average and total hashes). Show your wiring diagram for concurrent mining in the report on your GitHub ReadMe page.

Claim YOUR best engineer bragging rights!

Post your GitHub link as your contest entry graphic. Only ONE entry per team. Non-conforming entries will be rejected.

Recommended Skills

Public Clarification Board

  • TotallyLost
    Contest Holder
    • 3 days ago

    Step 1: Start by cloning and building for the three FPGA target devices, using an updated gateway and mining proxy. Deliver a completed GitHub repo with sources, prebuilt RPI/FPGA images, a 3-board wiring diagram, and the required testing report in the GitHub ReadMe page. Post this GitHub link as your contest entry graphic. Entries without this minimum will be rejected. (3-20 hrs)

    Step 2: Incorporate the round folding found in nalex87/Verilog-SHA256-1/blob/master/main.v (2-8hrs)

    Step 3: Use the floor planner on all three devices to optimize for the best hash rate at the lowest power. (10+ hrs)

    Step 4: Apply additional improvements to each target device: LUT packing, worst-case delay management.

    Step 5: Update the public GitHub sources, install images working/tested on all three platforms, and the ReadMe project report for peer review. Use an RPI or PetaLinux proxy. Make sure your contest entry graphic displays your GitHub link.

    Step 6: Peer review ranks the top 3 entries.

  • TotallyLost
    Contest Holder
    • 3 days ago

    Your successful submission requires 3 implementations using Altera, GOWIN, and Xilinx student/hobby boards:

    Tang Nano 9K board with GOWIN GW1NR-9 FPGA (about $15 from Sipeed on AliExpress or eBay)
    Intel Dev Kit EK10M08E144 with Altera 10M08 FPGA (about $52 from Mouser or eBay)
    Xilinx XC7Z010 Development Board (about $22 from Shengzhi on AliExpress)
    Raspberry Pi Pico W for mining controller (about $5-10 from AliExpress or eBay)

    These may use a heat sink and fan for best hash rate.

    The Raspberry Pi Pico is the controller that sets up the FPGA work and handles the Wi-Fi communication for managing Solo or Pool work assignments. Alternatively, omit the RPI Pico and use the EBAZ4205 PS section as the cluster controller with Ethernet (EBAZ4205 wired to the GOWIN and Altera boards).

    BTC mining lotto randomly gives away $100,000+ every 10 minutes. More boards, better odds.

    $22 XC7Z010 board:
    or Digilent Arty Z7-10 with Xilinx XC7Z010 (about $200 from Digilent)

  • TotallyLost
    Contest Holder
    • 3 days ago

    Ok ... extended contest time again, and added more money to the prize :)

    FYI: resources for the EBAZ4205 board with XC7Z010

    In one of the first comments for this project, I opened the discussion about collapsing rounds to remove registers and bring more combinatorial logic into the rounds. I had done this in a similar SHA256 project back in 2012 ... here is another SHA256 designer that did a similar design in 2017:

    My 2012 design included reordering operations and LUT packing to optimize the worst-case delay path. It moved the WK[] = W[] + K[] operation out of the compressor and into the expander, plus a few other optimizations.
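
For readers who want to see what moving WK[] = W[] + K[] into the expander looks like, here is a minimal C sketch of the standard SHA-256 message schedule with the K addition folded in. It is not the 2012 design. The function name expand_wk and its signature are assumptions made for illustration, and the caller supplies the 64 round constants.

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define sigma0(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ ((x) >> 3))
#define sigma1(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ ((x) >> 10))

/* Expand a 16-word message block into 64 schedule words and emit
   WK[i] = W[i] + K[i], so the compressor adds one operand per round
   instead of two. K points at the caller's 64 round constants. */
static void expand_wk(const uint32_t block[16], const uint32_t K[64],
                      uint32_t W[64], uint32_t WK[64])
{
    for (int i = 0; i < 16; i++)
        W[i] = block[i];

    /* Standard SHA-256 message schedule recurrence. */
    for (int i = 16; i < 64; i++)
        W[i] = sigma1(W[i - 2]) + W[i - 7] + sigma0(W[i - 15]) + W[i - 16];

    /* The addition that would otherwise sit in the compressor path. */
    for (int i = 0; i < 64; i++)
        WK[i] = W[i] + K[i];
}
```

In hardware the point is that this addition leaves the compressor's critical path. The C version only shows the data flow.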

  • TotallyLost
    Contest Holder
    • 2 months ago

    For those not familiar with implementing SHA256, these are the functions/macros in the expander and compressor.

    #define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
    #define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

    #define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
    #define Maj(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))
    #define SIGMA0(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
    #define SIGMA1(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))
    #define sigma0(x) (ROTR((x), 7) ^ ROTR((x), 18) ^ ((x) >> 3))
    #define sigma1(x) (ROTR((x), 17) ^ ROTR((x), 19) ^ ((x) >> 10))
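
As a cross-check on the Ch and Maj macros above, here is a small C sketch that compares them against their algebraic normal form (XOR of AND product terms), the per-bit form that clue 4 in the brief refers to. The function names ch_anf, maj_anf and anf_matches_macros are mine. Which product terms can actually be shared depends on the surrounding expressions, so this only shows the form.

```c
#include <stdint.h>

/* Word-level forms, as given in the macros above. */
#define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define Maj(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))

/* Algebraic normal form: XOR of AND product terms, per bit lane. */
static uint32_t ch_anf(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ z;
}

static uint32_t maj_anf(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* The functions are bitwise, so checking all 8 single-bit input
   combinations (replicated across the word) is exhaustive.
   Returns 1 when both ANF forms agree with the macros. */
static int anf_matches_macros(void)
{
    for (uint32_t v = 0; v < 8; v++) {
        uint32_t x = (v & 1u) ? 0xFFFFFFFFu : 0u;
        uint32_t y = (v & 2u) ? 0xFFFFFFFFu : 0u;
        uint32_t z = (v & 4u) ? 0xFFFFFFFFu : 0u;
        if (ch_anf(x, y, z) != Ch(x, y, z))
            return 0;
        if (maj_anf(x, y, z) != Maj(x, y, z))
            return 0;
    }
    return 1;
}
```

Given the same three inputs, the x & y and x & z products appear in both forms, which is the kind of sharing an ANF-level flow would look for.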

    A good, tight, fast, low power design can implement these as manually placed IP blocks created in the floor planner, and call the IP blocks out in the Verilog rather than use word level operator synthesis in Verilog.

    Likewise each pipeline round can be reduced to an IP block that is manually placed using the floor planner. This will minimize routing length delays/power.

    Good P&R is an exponential, NP-hard problem.

    1. AbhishekEG
      • 1 month ago

      Each pipeline round can also be reduced to an IP block that is manually placed using the floor planner, which can help to further optimize the design. However, the process of optimizing the design using place-and-route (P&R) tools is complex and computationally intensive, and finding a good, tight, fast, and low-power design is an exponential problem that is known to be NP-hard.

  • TotallyLost
    Contest Holder
    • 3 days ago

    The report outline is here, continued in a few replies due to character limits:

    Team Report: [your github/repo name]
    BTC FPGA Miner Challenge
    Best Hashrate, Lowest power per hash


    Tang Nano 9K hash rate: 0.0 MH/sec at 0.0 MHz clock rate is 0.0 W/MH
    Altera 10M08 hash rate: 0.0 MH/sec at 0.0 MHz clock rate is 0.0 W/MH
    Xilinx XC7Z010 hash rate: 0.0 MH/sec at 0.0 MHz clock rate is 0.0 W/MH

    Tang Nano 9K stable temp: 0.0C at 0.0CFM with [XXX] heatsink attached
    Altera 10M08 stable temp: 0.0C at 0.0CFM with [XXX] heatsink attached
    Xilinx XC7Z010 stable temp: 0.0C at 0.0CFM with [XXX] heatsink attached

    Tang Nano 9K dynamic power: 0.0A at 0.0Volts is 0.0Watts
    Altera 10M08 dynamic power: 0.0A at 0.0Volts is 0.0Watts
    Xilinx XC7Z010 dynamic power: 0.0A at 0.0Volts is 0.0Watts

    1. TotallyLost
      Contest Holder
      • 1 month ago

      Report summary continues with:

      Tang Nano 9K static power: 0.0A at 0.0Volts is 0.0Watts
      Altera 10M08 static power: 0.0A at 0.0Volts is 0.0Watts
      Xilinx XC7Z010 static power: 0.0A at 0.0Volts is 0.0Watts

      Solo Mining tested with node: [Node name and IP address] with average rate of 0.0 MH/sec
      Pool Mining tested with pool: [Pool name and IP address] with average rate of 0.0 MH/sec

      We have verified that each device meets all setup and hold times at idle, with operation inside vendor specified worst case operating conditions for our designs. Solo and Pool Mining results are the sum of one each Tang, Altera, and Xilinx device operating concurrently from the RPI Pico W mining controller(s).

      Team Lead: [Your name]
      Team Members: [Team member list]

    2. TotallyLost
      Contest Holder
      • 1 month ago

      The main section of the report will include, at minimum:

      Project Design Methodology

      [describe the overall architecture for your design common to all devices]

      [describe the methods used to improve performance on the Tang Nano 9K device]

      [describe the methods used to improve performance on the Altera 10M08 device]

      [describe the methods used to improve performance on the Xilinx XC7Z010 device]

  • TotallyLost
    Contest Holder
    • 1 month ago

    Googling for variations of "Bitcoin FPGA mining" gets a lot of hits. Some are pretty cool, as this has been a fun project for a lot of engineers over the years.

    On GitHub this is another gem ... kramble/DE0-Nano-BitCoin-Miner

    And a lot more

  • TotallyLost
    Contest Holder
    • 2 months ago

    This diagram is a good conceptualization tool, which highlights the 7-input adder that is central to the SHA256 compression rounds:

    Implementing this function:
    b[i+1].a = (b[i].h + SIGMA1(b[i].e) + Ch(b[i].e, b[i].f, b[i].g) + b[i].K + b[i].W) + (SIGMA0(b[i].a) + Maj(b[i].a, b[i].b, b[i].c));

    N-input adders are an interesting topic that many people ignore.
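
One common way to build the 7-input adder is a tree of 3-2 carry-save compressors followed by a single carry-propagate add. The C sketch below models that bit-exactly. The names csa32, add7_csa and add7_plain are mine, and an FPGA implementation might instead use 6-3 or 7-3 compressors packed into LUTs, as suggested in the contest brief.

```c
#include <stdint.h>

/* 3-2 carry-save compressor: three words in, a sum word and a carry
   word out, with no carry rippling between bit lanes.
   s + c == a + b + d (mod 2^32). */
static void csa32(uint32_t a, uint32_t b, uint32_t d,
                  uint32_t *s, uint32_t *c)
{
    *s = a ^ b ^ d;
    *c = ((a & b) | (a & d) | (b & d)) << 1;
}

/* Sum seven operands with four levels of 3-2 compressors and one
   final carry-propagate addition. */
static uint32_t add7_csa(const uint32_t in[7])
{
    uint32_t s1, c1, s2, c2, s3, c3, s4, c4, s5, c5;
    csa32(in[0], in[1], in[2], &s1, &c1);   /* level 1 */
    csa32(in[3], in[4], in[5], &s2, &c2);   /* level 1 */
    csa32(s1, c1, in[6], &s3, &c3);         /* level 2 */
    csa32(s2, c2, s3, &s4, &c4);            /* level 3 */
    csa32(s4, c4, c3, &s5, &c5);            /* level 4 */
    return s5 + c5;                         /* the only carry chain */
}

/* Reference: six chained carry-propagate additions. */
static uint32_t add7_plain(const uint32_t in[7])
{
    uint32_t sum = 0;
    for (int i = 0; i < 7; i++)
        sum += in[i];
    return sum;
}
```

The compressor levels are only a few gates deep each, so the long carry chain is paid once rather than six times.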

  • TotallyLost
    Contest Holder
    • 2 months ago

    With crypto projects, using lots of small FPGAs for high-performance reconfigurable computing is a simple solution to the impossible power and thermal management of large FPGAs.

    Early (and even some current) FPGA designs target traditional control and glue logic, where only 5-15% of the logic is "actively switching". Dense crypto algorithms have an average toggle rate of roughly 50% of gates, simply because the algorithms attempt to fully randomize bits. Each gate is a coin toss: it retains its previous value about 50% of the time and toggles about 50% of the time.

    Toggling consumes power to charge the parasitic capacitance in gates and routing, or to shunt that charge to the ground rail. ASIC miner chips face the same problem: they are all relatively small chips, and a miner uses a lot of them to distribute the heat across many chips.

    A fast, power-efficient design is critical for usable mining rigs.
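
The roughly 50% toggle rate can be illustrated with a small C sketch that measures how many bits flip between successive pseudo-random 32-bit words. The xorshift32 generator is only a stand-in for hash-like data, not part of SHA256, and average_toggle_rate is a name chosen for this example.

```c
#include <stdint.h>

/* Marsaglia xorshift32: a simple stand-in source of random-looking
   words. The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Count set bits. */
static unsigned popcount32(uint32_t x)
{
    unsigned c = 0;
    while (x) {
        x &= x - 1;
        c++;
    }
    return c;
}

/* Average fraction of the 32 bits that toggle between successive
   words in the stream. */
static double average_toggle_rate(uint32_t seed, unsigned samples)
{
    uint32_t state = seed ? seed : 1u;
    uint32_t prev;
    uint64_t toggles = 0;

    if (samples == 0)
        return 0.0;
    prev = xorshift32(&state);
    for (unsigned i = 0; i < samples; i++) {
        uint32_t next = xorshift32(&state);
        toggles += popcount32(prev ^ next);
        prev = next;
    }
    return (double)toggles / (32.0 * (double)samples);
}
```

Over a few thousand samples the result should sit close to 0.5, in line with the coin-toss argument above.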

  • TotallyLost
    Contest Holder
    • 2 months ago

    Sipeed has offered to refund a Tang Nano 9K board for contestants that complete the contest. A small but generous sponsor offer for this contest's developers.

    Thank you for your support for our Tang product!
    We can return the Tang FPGA board fees for developers who have successfully submit your mining contest.

    吴才泽 / Caesar Wu
    深圳矽速科技有限公司 Shenzhen Sipeed Tech Ltd

  • TotallyLost
    Contest Holder
    • 2 months ago

    There are more than a dozen area vs time design solutions, allowing a mix of different implementations to best fill a device to capacity for the best maximum hash rate. So let's explore architectures that may allow using idle resources in an FPGA, and/or better packing.

    1) Using block ram as memories, with a simple ALU design.

    2) Using LUT ram as memories, possibly dual port, with a simple ALU design.

    3) Using block ram as sequencers, including main counters.

    4) Using LUT ROMs as sequencers to compact control logic.

    5) Factoring control logic and K memories to feed and control multiple 'slimmer' expander/compressors.

    Each of these different architectural approaches can be mixed and matched, giving multiple very different solutions, even for the smallest devices. With more than a dozen block memories available, combining them with a LUT-based ALU and sequencer provides valuable extra hashers. Finding the optimal ratios is a simple linear programming problem.
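
As a rough illustration of the "optimal ratios" problem, the C sketch below does an exhaustive integer search over two hasher variants within a LUT and block RAM budget, rather than a real linear programming solve. All names here (hasher_kind, mix, best_mix) are hypothetical, and any numbers you feed in are your own measurements, not values from this contest.

```c
#include <stdint.h>

/* Hypothetical resource cost and throughput of one hasher variant. */
struct hasher_kind {
    int luts;   /* LUTs used per instance */
    int brams;  /* block RAMs used per instance */
    int rate;   /* hash rate per instance, in any consistent unit */
};

/* Search result: instance counts and the combined rate. */
struct mix {
    int count_a;
    int count_b;
    int total_rate;
};

/* Try every integer combination of the two kinds that fits the
   budgets and keep the one with the highest total rate.
   Assumes non-negative costs, with each kind using some resource. */
static struct mix best_mix(struct hasher_kind a, struct hasher_kind b,
                           int lut_budget, int bram_budget)
{
    struct mix best = {0, 0, 0};

    if (a.luts + a.brams <= 0 || b.luts + b.brams <= 0)
        return best;   /* a free hasher would make the loops endless */

    for (int na = 0;
         na * a.luts <= lut_budget && na * a.brams <= bram_budget;
         na++) {
        for (int nb = 0; ; nb++) {
            int luts = na * a.luts + nb * b.luts;
            int brams = na * a.brams + nb * b.brams;
            int rate = na * a.rate + nb * b.rate;
            if (luts > lut_budget || brams > bram_budget)
                break;
            if (rate > best.total_rate)
                best = (struct mix){na, nb, rate};
        }
    }
    return best;
}
```

For example, with made-up costs of 40 LUTs at rate 10 for a LUT-only hasher and 15 LUTs plus 4 block RAMs at rate 3 for a block RAM hasher, a budget of 100 LUTs and 10 block RAMs is best filled by two of the first and one of the second.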

    Multiple team members help here.

  • TotallyLost
    Contest Holder
    • 2 months ago

    AND ... there are other significant optimization strategies NOT provided ... use your skills. Put a good team together, pool your formal training, interests, expertise and experience. WIN YOUR bragging rights, and EARN your TOP job in the FPGA reconfigurable computing accelerator industry!

    The skills learned and demonstrated in this project are extremely valuable when applied to real-world algorithm implementation for FPGA-accelerated data centers. This project should become a good resume builder as reconfigurable computing moves from research to production. Other POW algorithms for blockchain can also best be served with FPGA reconfigurable computing, since expensive ASIC implementations are not very flexible. BTC is just one of many that are easily implemented on an FPGA.

    I'll take donations from other entities, vendors, and mentors to increase the prize for this contest. I'm semi-retired, and the nearly $600 for this contest, with fees, is the limit of my personal budget.


  • TotallyLost
    Contest Holder
    • 2 months ago

    In C:
    struct buf {
        unsigned int K,W,a,b,c,d,e,f,g,h;
    } b[65];

    with rounds looking like:

    b[i+1].a = (b[i+0].h + SIGMA1(b[i+0].e) + Ch(b[i+0].e, b[i+0].f, b[i+0].g) + b[i+0].K + b[i+0].W) +
    (SIGMA0(b[i+0].a) + Maj(b[i+0].a, b[i+0].b, b[i+0].c));
    b[i+1].b = b[i+0].a;
    b[i+1].c = b[i+0].b;
    b[i+1].d = b[i+0].c;
    b[i+1].e = b[i+0].d + (b[i+0].h + SIGMA1(b[i+0].e) + Ch(b[i+0].e, b[i+0].f, b[i+0].g) + b[i+0].K + b[i+0].W);
    b[i+1].f = b[i+0].e;
    b[i+1].g = b[i+0].f;
    b[i+1].h = b[i+0].g;

    In Verilog, remove the registers from 3 of every 4 rounds, or 7 of every 8, and let the combinatorial logic cascade. This reduces the number of pipeline stages and registers by 75% or 87.5% respectively, lowering footprint and dynamic power. The combinatorial path is now longer, doing more work per clock with a lower percentage of time lost to routing, setup, and hold delays. The result is a higher hash rate, even with a slower clock.
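
The folding can be modeled in C to confirm that it changes only where the registers sit, not the result. This is a functional sketch under my own naming (round_comb, run_folded, state_equal). It says nothing about timing, which only the FPGA tools can report. The wk values passed in can be arbitrary words, since the point is equivalence between fold factors.

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define Maj(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))
#define SIGMA0(x) (ROTR((x), 2) ^ ROTR((x), 13) ^ ROTR((x), 22))
#define SIGMA1(x) (ROTR((x), 6) ^ ROTR((x), 11) ^ ROTR((x), 25))

struct state { uint32_t a, b, c, d, e, f, g, h; };

/* One compression round as pure combinational logic (no register).
   wk is the precomputed W + K word for this round. */
static struct state round_comb(struct state s, uint32_t wk)
{
    uint32_t t1 = s.h + SIGMA1(s.e) + Ch(s.e, s.f, s.g) + wk;
    uint32_t t2 = SIGMA0(s.a) + Maj(s.a, s.b, s.c);
    struct state n = { t1 + t2, s.a, s.b, s.c, s.d + t1, s.e, s.f, s.g };
    return n;
}

/* Run 64 rounds, capturing into a register only after every `fold`
   rounds. *stages receives the number of register stages used. */
static struct state run_folded(struct state s, const uint32_t wk[64],
                               int fold, int *stages)
{
    struct state reg = s;              /* registered pipeline value */
    if (fold < 1)
        fold = 1;
    *stages = 0;
    for (int i = 0; i < 64; i += fold) {
        struct state wire = reg;       /* combinational cascade */
        for (int j = 0; j < fold && i + j < 64; j++)
            wire = round_comb(wire, wk[i + j]);
        reg = wire;                    /* one clock edge */
        (*stages)++;
    }
    return reg;
}

/* Field-by-field comparison of two states. */
static int state_equal(struct state x, struct state y)
{
    return x.a == y.a && x.b == y.b && x.c == y.c && x.d == y.d &&
           x.e == y.e && x.f == y.f && x.g == y.g && x.h == y.h;
}
```

With fold set to 1, 4 and 8 the final state is identical, while the stage count drops from 64 to 16 to 8, matching the 75% and 87.5% register reductions.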

  • TotallyLost
    Contest Holder
    • 2 months ago

    Big gains for a 10-year-old, widely studied and used algorithm. The CME design reduced the Goldstrike 1 LUT count from 49,145 to 46,013 and the register count from 54,674 to 52,428 (95.9% of the original register count, a 4.1% net gain).

    Compacting 8 rounds into 1 shrinks the compressor by about 28,672 registers, so we now have a target implementation size of 52,428-28,672 = 23,756 registers (45.3% of CME and 43.5% of Goldstrike 1, a 56.5% net gain). There is a smaller additional gain from also compacting the expander in the same 8:1 shrink.

    This is a 56.5%/4.1% = roughly 13.8x improvement over CME's effort to reduce register count.

    LUT count reductions are not likely to be quite as substantial, but with a switch to 7-3 compressors they should be significant, as that opens the door to packing additional logic besides the adders into LUTs. Hand packing functions should have a substantial effect on area, routing length, power, delays, and clock speed.
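
The register arithmetic above can be rechecked with a few lines of C. The helper names are mine. The 32,768 figure is the 256*(64+64) compressor register estimate from the contest brief, and the other counts are the ones quoted in this comment.

```c
#include <stdint.h>

/* Registers removed from the compressor when `fold` rounds share one
   register stage: (fold - 1) of every `fold` stages disappear. */
static uint32_t compressor_regs_removed(uint32_t compressor_regs,
                                        uint32_t fold)
{
    if (fold == 0)
        return 0;
    return compressor_regs / fold * (fold - 1);
}

/* Percentage of `base` that `value` represents. */
static double percent_of(uint32_t value, uint32_t base)
{
    return 100.0 * (double)value / (double)base;
}
```

For example, compressor_regs_removed(32768, 8) gives 28,672, and percent_of can then be applied to the remaining 23,756 registers against either the CME or the Goldstrike 1 count.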
