Bitcoin Forum
May 28, 2018, 05:50:14 AM *
News: Latest stable version of Bitcoin Core: 0.16.0  [Torrent]. (New!)
 
Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 [31] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 »
Author Topic: Large Bitcoin Collider (Collision Finders Pool)  (Read 163445 times)
rico666
Legendary
*
Offline Offline

Activity: 952
Merit: 1006


฿ → ∞


View Profile WWW
March 18, 2017, 08:54:35 PM
 #601

And what about AMD? are you gonna do implementations for those too?

AFAIK some users operate their GPU client on AMD cards. Myself, I haven't been successful so far, but Jude Austin says he was (Ubuntu 14.04 with fglrx).


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  BURST Activities
GoldTiger69
Hero Member
*****
Offline Offline

Activity: 503
Merit: 500


View Profile WWW
March 19, 2017, 02:57:58 AM
 #602

Hi, I'm trying to use LBC with the following parameters:
./LBC --address 1M4QXX...XXXX --cpus 2

But I get in return: Server don't like us, wrong secret.
What do I need to do to participate in the project?
Thanks in advance.

Edit: LBC - Large Bitcoin Collider v. 1.031
        client fingerprint: db030a.........XXXX
        AMD FX-4300 on Ubuntu 15.04

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 19, 2017, 03:39:04 AM
 #603

Hi, I'm trying to use LBC with the following parameters:
./LBC --address 1M4QXX...XXXX --cpus 2

But I get in return: Server don't like us, wrong secret.
What do I need to do to participate in the project?
Thanks in advance.

Edit: LBC - Large Bitcoin Collider v. 1.031
        client fingerprint: db030a.........XXXX
        AMD FX-4300 on Ubuntu 15.04

Set your secret first by doing --secret x:yoursecret; then you can just do --secret yoursecret.

I have AMD cards working, but only on 14.04; after 14.04, Ubuntu no longer supports fglrx, so it won't work.

Let me know if you need any help.

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
digaran
Hero Member
*****
Offline Offline

Activity: 812
Merit: 579


View Profile
March 19, 2017, 04:05:25 AM
 #604

When you find a correct private key, how do you know that it's the right one? On an iPhone, if you enter the wrong password more than 5 times, you get locked out for some minutes before you can try again. Is there any way to implement such a mechanism in Bitcoin?
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 19, 2017, 04:13:51 AM
 #605

-snip- Is there any way to implement such a mechanism in Bitcoin?

No, it doesn't work like that.

Imagine a normal key and lock: LBC isn't generating just keys and trying them on one specific lock; LBC is generating the key AND the lock at the same time.

Each key-and-lock combination produces a public hash, which is what LBC is actually looking for, by comparing it against a list of existing public hashes with balances.

Implementing that kind of lockout system in Bitcoin is not feasible and would actually make it less secure.

Rico,

Feel free to chime in with your genius, lol.

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 19, 2017, 05:47:19 AM
 #606

It's official: 7 million pages on directory.io per second

My understanding is, most of this is done by one man and with CPUs.


Pool operation is seamless so far. I've seen a 13-second network hiccup yesterday (which all clients handled well within 2 retries), and today I experienced a 500 error when calling the stats page. This too seems to have been only transient, although there may be some race condition at the bottom of it. => Pool operation purring like a cat

At the moment I'm completely dissecting the GPU client, as the segmentation faults I've been observing (read: that have been driving me mad) for the past couple of days are 100% not my programming fault, but some internal error of the Nvidia OpenCL implementation. I'm trying to find a workaround and/or a thorough internal analysis report to submit to Nvidia.


Rico

May I have a go at fixing this?

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
GoldTiger69
Hero Member
*****
Offline Offline

Activity: 503
Merit: 500


View Profile WWW
March 19, 2017, 05:55:45 AM
 #607

-snip-
Set your secret first by doing --secret x:yoursecret; then you can just do --secret yoursecret.
-snip-

Thanks for the answer, I'll try that later on. First, I'll try just with CPU because, AFAIK, to be able to use the GPU we need to be authorized first, don't we?

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 19, 2017, 06:01:24 AM
 #608

-snip-
Thanks for the answer, I'll try that later on. First, I'll try just with CPU because, AFAIK, to be able to use the GPU we need to be authorized first, don't we?

No problem.

Yeah, you will need to be authorized.

The LBC client will tell you when you use the --gpu argument if you are authorized or not.

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
GoldTiger69
Hero Member
*****
Offline Offline

Activity: 503
Merit: 500


View Profile WWW
March 19, 2017, 06:43:25 AM
 #609

-snip-
Yeah, you will need to be authorized.

The LBC client will tell you when you use the --gpu argument if you are authorized or not.

How can I get such authorization? (besides the 0.1 btc)

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 19, 2017, 07:20:53 AM
 #610

-snip-

How can I get such authorization? (besides the 0.1 btc)

Pray Rico is feeling generous when he sees this post.  Tongue

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
shorena
Copper Member
Legendary
*
Offline Offline

Activity: 1456
Merit: 1223


No I dont escrow anymore.


View Profile WWW
March 19, 2017, 07:47:43 AM
 #611

-snip-
How can I get such authorization? (besides the 0.1 btc)

Pray Rico is feeling generous when he sees this post.  Tongue

or get into the top 30 with your CPU(s)

GoldTiger69
Hero Member
*****
Offline Offline

Activity: 503
Merit: 500


View Profile WWW
March 19, 2017, 07:58:22 AM
 #612

Thanks a lot Jude and Shorena! I'll hope for the first one and try the second one Smiley

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
rico666
Legendary
*
Offline Offline

Activity: 952
Merit: 1006


฿ → ∞


View Profile WWW
March 19, 2017, 09:32:32 AM
 #613

The mechanism for setting or changing a password (=secret) is the same:

Code:
-s oldsecret:newsecret

Obviously, if you already had a password, you are changing it. If you had no password before, you are setting it.

"But what is oldsecret when I am setting?" you may ask.

Simple answer: anything!

So as was mentioned here already - if you're setting your secret for the 1st time, just use x (or really anything else) for the oldsecret:

Code:
-s x:newsecret

and later you just give your
Code:
-s newsecret
to identify yourself to the server.



There is this guy from the Centre de Calcul el-Khawarizmi (CCK), Tunisia. The logs say he has 160 tries (so far) of giving a wrong password for his id. May this short HowTo help him.


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  BURST Activities
unknownhostname
Member
**
Offline Offline

Activity: 61
Merit: 10


View Profile
March 19, 2017, 11:30:27 AM
 #614

Rico,

You mentioned that someone wanted a way to get notifications of found addresses...

Why not use Pushbullet?

I use it for some other stuff I do and I like it a lot.

Check it out: https://www.pushbullet.com/

Thanks,
Jude

I could use that if rico could implement it in the main LBC ... without creating the hook-find things.
rico666
Legendary
*
Offline Offline

Activity: 952
Merit: 1006


฿ → ∞


View Profile WWW
March 19, 2017, 11:56:13 AM
 #615

Observe this code snippet from the GPU client. It is a small part of the Jacobi -> Affine transformation.

I know that hrd256k1_fe_sqr and hrd256k1_fe_mul work correctly. I know that I am getting the right values into my GPU (az, jpubkey).
However, this code doesn't even run the printf when hrd256k1_fe_mul is in place. It does when I comment out the hrd256k1_fe_mul call:

Code:
  hrd256k1_fe_sqr(&zi2, &az);

  apubkey2.infinity = jpubkey.infinity;

  hrd256k1_fe_mul(&apubkey2.x, &jpubkey.x, &zi2);

  printf("GPU %d\nA:%016lx %016lx %016lx %016lx %016lx\nZ:%016lx %016lx %016lx %016lx %016lx\n---\n",
         idx,
         apubkey2.x.n[0],apubkey2.x.n[1],apubkey2.x.n[2],apubkey2.x.n[3],apubkey2.x.n[4],
         apubkey2.y.n[0],apubkey2.y.n[1],apubkey2.y.n[2],apubkey2.y.n[3],apubkey2.y.n[4]
         );

OK, a simple apubkey2 = jpubkey works. So what is it that causes this weird behavior? To investigate, I wrote a small synthetic hrd256k1_fe_mul2:

Code:
static void
hrd256k1_fe_mul2(hrd256k1_fe *r, const hrd256k1_fe *a, const hrd256k1_fe *b) {
  r->n[0] = a->n[0] + b->n[0];
  r->n[1] = a->n[1] + b->n[1];
  r->n[2] = a->n[2] + b->n[2];
  r->n[3] = a->n[3] + b->n[3];
  r->n[4] = a->n[4] + b->n[4];
}

Guess what? Same problem (doesn't even printf). Now if I comment out ANY of the r->n = a->n + b->n lines, it works!
If I even do

Code:
static void
hrd256k1_fe_mul2(hrd256k1_fe *r, const hrd256k1_fe *a, const hrd256k1_fe *b) {
  r->n[0] = a->n[0]; // + b->n[0];
  r->n[1] = a->n[1] + b->n[1];
  r->n[2] = a->n[2] + b->n[2];
  r->n[3] = a->n[3] + b->n[3];
  r->n[4] = a->n[4] + b->n[4];
}

It still works! What is going on???  Huh

Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  BURST Activities
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 20, 2017, 04:44:11 AM
 #616

-snip-
It still works! What is going on???  Huh

What happens when you try to just printf the line you commented out?

Also this: http://stackoverflow.com/questions/1255099/whats-the-proper-use-of-printf-to-display-pointers-padded-with-0s

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
rico666
Legendary
*
Offline Offline

Activity: 952
Merit: 1006


฿ → ∞


View Profile WWW
March 20, 2017, 03:20:40 PM
 #617

What happens when you try to just printf the line you commented out?

Also this: http://stackoverflow.com/questions/1255099/whats-the-proper-use-of-printf-to-display-pointers-padded-with-0s


So lessons learned and progress:

Never try to impose a data size on the GPU that it was not built for. Today's GPUs are 32-bit; using 64-bit data types carries a performance penalty, as the GPU internally transforms them into sequences of 32-bit operations. Moreover, defining your own 128-bit arithmetic library using 64-bit types on the GPU ... will eventually work, after you do things to the GPU that can only be described as abuse, but the GPU will not like it and will show performance consistent with its dislike...

It turns out there is a maximum number of assembler instructions per kernel, and of course I ran into it with my glorious 128-bit GPU library. When that happens, the kernel simply crashes, or your host application gets a segmentation fault (from the OpenCL library), or <insert undefined behavior here>.

Printing from the GPU is nothing but a straw for the desperate GPU developer to clutch at.


Back to the drawing board. I'm left with a highly optimized 64-bit ECC library on the CPU and the need for a (highly optimized) 32-bit library on the GPU, at least as long as I have parts of the computation on the CPU and parts on the GPU. Sounds like Frankenstein's monster? It is!

Computing with 5x52 fields on the CPU, pushing data to the GPU, converting 5x52 -> 10x26 there, followed by 32-bit computations.

But it is surprisingly fast - so far, as the conversion (I hope) is a mere:

Code:
static void hrd256k1_fe_52to26(hrd256k1_fe32 *out, const hrd256k1_fe *in) {
  out->n[1] = in->n[0] & 0x3FFFFFFUL;
  out->n[0] = in->n[0] >> 26;
  out->n[3] = in->n[1] & 0x3FFFFFFUL;
  out->n[2] = in->n[1] >> 26;
  out->n[5] = in->n[2] & 0x3FFFFFFUL;
  out->n[4] = in->n[2] >> 26;
  out->n[7] = in->n[3] & 0x3FFFFFFUL;
  out->n[6] = in->n[3] >> 26;
  out->n[9] = in->n[4] & 0x3FFFFFFUL;
  out->n[8] = in->n[4] >> 26;
}

And the subsequent fe_mul etc. are done using GPU native data format. We'll see.


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  BURST Activities
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 21, 2017, 06:18:35 AM
 #618

-snip-

Rico,

What is the performance cost of emulating 64 bit as 32 bit?

Does it double the cost? For example, does a 64 bit word emulated as 32 bit use 100% of the GPU while 32 bit words use 50%?

And am I wrong assuming that even 32 bit is emulated, specifically on Pascal/Maxwell chips? I read the white paper and it says it does half integers also.

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b
rico666
Legendary
*
Offline Offline

Activity: 952
Merit: 1006


฿ → ∞


View Profile WWW
March 21, 2017, 08:03:11 AM
 #619

What is the performance cost of emulating 64 bit as 32 bit?

Does it double the cost? For example, does a 64 bit word emulated as 32 bit use 100% of the GPU while 32 bit words use 50%?

Ok, let me elaborate on this a little and give you some numbers for a better estimate of where we are and where we're going:

In my CPU/GPU combination, one CPU core puts 8% load on the GPU, and that is a situation where a fairly strong CPU meets a midrange GPU (a 2.8 - 3.7 GHz Skylake E3 Xeon firing at a Quadro M2000M - see http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html). With a stronger GPU (1080), it's quite possible that the CPU can put only 5-6% load on the GPU.

The current development version of the generator gives me 9 Mkeys/s for all 4 physical cores running, whereas the published version (the one you can download from FTP) gives 7.5 Mkeys/s.

The main difference is that the bloom filter search is done on the GPU in the development version, along with the final affine -> normalization -> 64-bytes step, resulting in an overall speed improvement of about 375,000 keys/s per core.

Up to now, the GPU has behaved like a magic wand: giving it the bloom filter work didn't raise GPU load, but it raised the key rate. The explanation could be that the time the GPU needs to do the bloom filter search is basically the time it would otherwise need to transfer the hashed data back to the CPU (which does the bloom filter search in the current public version). Same with the affine transformation.

There is nothing left on the CPU except the (heavily optimized) EC computations, so any further speed improvement means pushing those to the GPU too.
In terms of time, one 16M block currently takes around 6.25 seconds on my machine (if I let it compute 8 blocks, it takes 50 seconds, which mitigates the startup cost).

So I thought I'd emulate what's going on on the CPU and move the code piece by piece. Going backwards, the step before the affine transformation is the Jacobian -> affine transformation, where you compute the square and the cube of the inverted Jacobian Z coordinate and multiply X by the former and Y by the latter. All in all: one field-element sqr and 3 FE mul operations.

I did that with my 128-bit library (based on 64-bit data types) on the GPU and behold! GPU load went to 100% and the time per block went to 16 seconds! Uh. Operation successful, patient dead.
-> Back to the drawing board.

Now the same with 32-bit data types currently gives 12% GPU load and 5.4 seconds per block (per CPU core). Very promising, but I'm hitting little/big-endianness brainwarp hell, so I have to figure out how to do it more elegantly.

Also, the new version will demand a more GPU-heavy setup before I can release it. As the bloom filter search is done on the GPU, an additional 512 MB of GPU memory is used per process. Running 4 processes on my Maxwell GPU with its 4 GB VRAM is just fine (and as the memory can be freed from the CPU part of the generator, it takes only 100 MB of host memory), but I also experienced segmentation faults with the Kepler machines on the Amazon cloud.

So the goal is really to have one CPU core being able to put at least 50% load on one GPU.

It's no small engineering feat, but at the moment LBC is the fastest key generator on the planet (some 20% faster than oclvanitygen), and I believe twice the speed of oclvanitygen is achievable. That's my goal and motivation, and I still have 65% of my GPU capacity left to tap to get there.

Quote
And am I wrong assuming that even 32 bit is emulated, specifically on Pascal/Maxwell chips? I read the white paper and it says it does half integers also.

I'm not familiar in detail with the specific hardware internals. At the moment I have a Maxwell chip for my testing, and I tend to support newer architectures/chip families rather than the old stuff. Another way to put it: I will not sacrifice any speed to support some "old" chip from 2009. ;-)

Sidenote:

If anyone wants to be at the true forefront of development and have a great workstation-replacement notebook, buy a Lenovo P50 (maybe a P51 to be slightly ahead), because that's what I am developing on and LBC will naturally be slightly tailored to it. E.g. it also has an Intel GPU, which I am using for display. So currently I can work with the notebook basically without any limitations (the Intel graphics are untouched, and as I have the 4 logical cores for my interaction, I can watch videos, browse etc.) while the notebook is churning out 9 Mkeys/s. Ok, the fan noise is distracting, because normally the notebook is fine with passive cooling. Wink



Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  BURST Activities
Jude Austin
Legendary
*
Offline Offline

Activity: 1124
Merit: 1000


The Real Jude Austin


View Profile WWW
March 22, 2017, 07:29:21 AM
 #620

-snip-

Rico,

Why does it require a BF for each process? Couldn't the BF just be loaded into VRAM once, with each process referencing that one instance?

I am not questioning your work, just digging for information for a better understanding.

And shit, those are nice workstations at a pretty decent price, kind of pissed I bought an MSI GS70 Stealth Pro.... ~6 Mkeys

Thanks,
Jude

Get paid BTC to sign up for free tokens: http://earn.com/judeaustin/referral/?a=d3euriwoffdrlv4b