Author Topic: Large Bitcoin Collider (Collision Finders Pool)  (Read 193123 times)
rico666 (OP)
Legendary
Activity: 1120
Merit: 1037
฿ → ∞
March 18, 2017, 08:54:35 PM
#601

And what about AMD? Are you going to do implementations for those too?

AFAIK some users operate the GPU client on AMD cards. I haven't been successful myself so far - but Jude Austin says he has (Ubuntu 14.04 with fglrx).


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  Past BURST Activities
GoldTiger69
Hero Member
Activity: 582
Merit: 502
March 19, 2017, 02:57:58 AM
Last edit: March 19, 2017, 03:11:43 AM by GoldTiger69
#602

Hi, I'm trying to use LBC with the following parameters:
./LBC --address 1M4QXX...XXXX --cpus 2

But I get in return: Server don't like us, wrong secret.
What do I need to do to participate in the project?
Thanks in advance.

Edit: LBC - Large Bitcoin Collider v. 1.031
        client fingerprint: db030a.........XXXX
        AMD FX-4300 on Ubuntu 15.04

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 03:39:04 AM
#603

Hi, I'm trying to use LBC with the following parameters:
./LBC --address 1M4QXX...XXXX --cpus 2

But I get in return: Server don't like us, wrong secret.
What do I need to do to participate in the project?
Thanks in advance.

Edit: LBC - Large Bitcoin Collider v. 1.031
        client fingerprint: db030a.........XXXX
        AMD FX-4300 on Ubuntu 15.04

Set your secret first by doing --secret x:yoursecret, then you can just do --secret yoursecret.

I have AMD cards working, but only on 14.04; after 14.04, Ubuntu no longer supports fglrx, so it won't work.

Let me know if you need any help.

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!
digaran
Copper Member
Hero Member
Activity: 1330
Merit: 899
🖤😏
March 19, 2017, 04:05:25 AM
#604

When you find a correct private key, how do you know that it's the right one? On an iPhone, if you enter the wrong password more than 5 times you get locked out for a few minutes before you can try again. Is there any way to implement such a mechanism in Bitcoin?

🖤😏
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 04:13:51 AM
#605

When you find a correct private key, how do you know that it's the right one? On an iPhone, if you enter the wrong password more than 5 times you get locked out for a few minutes before you can try again. Is there any way to implement such a mechanism in Bitcoin?

No, it doesn't work like that.

Imagine a normal key and lock: LBC isn't just generating keys and trying them on one specific lock, it is generating the key AND the lock at the same time.

The key and lock combination produces a public hash, which is what LBC is actually looking for, by comparing it against a list of existing public hashes with balances.

Building that kind of lockout mechanism into Bitcoin is not feasible and would actually make it less secure.
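To make that concrete, here is a rough sketch of the check being described - this is not LBC's actual code, just an illustration assuming libsecp256k1 and OpenSSL are available: derive the hash160 of the compressed public key for a candidate private key and look it up in a table of funded hash160s. (The real client uses a bloom filter rather than a linear scan, as rico describes later in the thread.)

Code:
#include <string.h>
#include <secp256k1.h>
#include <openssl/sha.h>
#include <openssl/ripemd.h>

/* Illustration only - NOT LBC code. */
static void hash160(const unsigned char *data, size_t len, unsigned char out[20]) {
    unsigned char sha[SHA256_DIGEST_LENGTH];
    SHA256(data, len, sha);           /* first SHA-256 ...      */
    RIPEMD160(sha, sizeof(sha), out); /* ... then RIPEMD-160    */
}

/* Return 1 if the hash160 of the candidate key's compressed pubkey
 * matches any entry in the list of funded hash160s. */
int candidate_hits(const unsigned char privkey[32],
                   const unsigned char (*funded)[20], size_t n_funded) {
    secp256k1_context *ctx = secp256k1_context_create(SECP256K1_CONTEXT_SIGN);
    secp256k1_pubkey pub;
    unsigned char ser[33], h160[20];
    size_t serlen = sizeof(ser);
    int hit = 0;

    /* "key and lock at the same time": the private key determines the
     * public key, whose hash is what gets compared to the target list */
    if (secp256k1_ec_pubkey_create(ctx, &pub, privkey)) {
        secp256k1_ec_pubkey_serialize(ctx, ser, &serlen, &pub, SECP256K1_EC_COMPRESSED);
        hash160(ser, serlen, h160);
        for (size_t i = 0; i < n_funded; i++)
            if (memcmp(h160, funded[i], 20) == 0) { hit = 1; break; }
    }
    secp256k1_context_destroy(ctx);
    return hit;
}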

Rico,

Feel free to chime in with your genius, lol.

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 05:47:19 AM
#606

It's official: 7 million pages on directory.io per second

My understanding is that most of this is done by one man, and with CPUs.


Pool operation is seamless so far. I saw a 13-second network hiccup yesterday (which all clients handled well within 2 retries), and today I experienced a 500 error when calling the stats page. This too seems to have been only transient, although there may be a race condition at the bottom of it. => Pool operation purring like a cat

At the moment I'm completely dissecting the GPU client, as the segmentation faults I've been observing (read: that have been driving me mad) for the past couple of days are 100% not my programming fault, but some internal error of the Nvidia OpenCL implementation. I'm trying to find a workaround and/or put together a thorough internal analysis report to submit to Nvidia.


Rico

May I have a go at fixing this?

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!
GoldTiger69
Hero Member
Activity: 582
Merit: 502
March 19, 2017, 05:55:45 AM
#607

-snip-
Set your secret first by doing --secret x:yoursecret, then you can just do --secret yoursecret.

I have AMD cards working, but only on 14.04; after 14.04, Ubuntu no longer supports fglrx, so it won't work.

Let me know if you need any help.

Thanks,
Jude

Thanks for the answer, I'll try that later on. First, I'll try just with CPU because, AFAIK, to be able to use the GPU we need to be authorized first, don't we?

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 06:01:24 AM
#608

-snip-
Thanks for the answer, I'll try that later on. First, I'll try just with CPU because, AFAIK, to be able to use the GPU we need to be authorized first, don't we?

No problem.

Yeah, you will need to be authorized.

When you use the --gpu argument, the LBC client will tell you whether you are authorized or not.

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!
GoldTiger69
Hero Member
Activity: 582
Merit: 502
March 19, 2017, 06:43:25 AM
#609

-snip-
When you use the --gpu argument, the LBC client will tell you whether you are authorized or not.

Thanks,
Jude

How can I get such authorization? (besides the 0.1 btc)

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 07:20:53 AM
#610

-snip-
How can I get such authorization? (besides the 0.1 btc)

Pray Rico is feeling generous when he sees this post.  Tongue

Buy or sell $100 of Crypto and get $10!
shorena
Copper Member
Legendary
Activity: 1498
Merit: 1520
No I dont escrow anymore.
March 19, 2017, 07:47:43 AM
#611

-snip-
How can I get such authorization? (besides the 0.1 btc)

Pray Rico is feeling generous when he sees this post.  Tongue

Or get into the top 30 with your CPU(s).

Im not really here, its just your imagination.
GoldTiger69
Hero Member
Activity: 582
Merit: 502
March 19, 2017, 07:58:22 AM
#612

Thanks a lot Jude and Shorena! I'll hope for the first one and try the second one Smiley

I can help you to restore/recover your wallet or password.
https://bitcointalk.org/index.php?topic=1234619.0
rico666 (OP)
Legendary
Activity: 1120
Merit: 1037
฿ → ∞
March 19, 2017, 09:32:32 AM
#613

The mechanism for setting or changing a password (= secret) is the same:

Code:
-s oldsecret:newsecret

Obviously, if you already had a password, you are changing it. If you had no password before, you are setting it.

"But what is oldsecret when I am setting?" you may ask.

Simple answer: anything!

So, as was mentioned here already - if you're setting your secret for the 1st time, just use x (or really anything else) for the oldsecret:

Code:
-s x:newsecret

and later you just give your
Code:
-s newsecret
to identify yourself to the server.
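For illustration (reusing the hypothetical address and flags from GoldTiger69's post above; mynewsecret is just a placeholder):

Code:
# first run: set the secret (the part before the colon can be anything)
./LBC --address 1M4QXX...XXXX --secret x:mynewsecret --cpus 2

# later runs: identify yourself with the secret you set
./LBC --address 1M4QXX...XXXX --secret mynewsecret --cpus 2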



There is this guy from the Centre de Calcul el-Khawarizmi - CCK - Tunisia. The logs say he has had 160 tries (so far) of giving a wrong password for his ID. May this short HowTo help him.


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  Past BURST Activities
unknownhostname
Member
Activity: 62
Merit: 10
March 19, 2017, 11:30:27 AM
#614

Rico,

You mentioned that someone wanted a way to get notifications of found addresses...

Why not use Pushbullet?

I use it for some other stuff I do and I like it a lot.

Check it out: https://www.pushbullet.com/

Thanks,
Jude

I could use that if rico could implement it in the main LBC ... without creating the hook-find things.
rico666 (OP)
Legendary
Activity: 1120
Merit: 1037
฿ → ∞
March 19, 2017, 11:56:13 AM
#615

Observe this code snippet from the GPU client. It is a small part of the Jacobi -> Affine transformation.

I know that hrd256k1_fe_sqr and hrd256k1_fe_mul work correctly. I know that I am getting the right values into my GPU (az, jpubkey).
However, this code doesn't even run the printf when the hrd256k1_fe_mul call is in place. It does when I comment out the hrd256k1_fe_mul call:

Code:
  hrd256k1_fe_sqr(&zi2, &az);

  apubkey2.infinity = jpubkey.infinity;

  hrd256k1_fe_mul(&apubkey2.x, &jpubkey.x, &zi2);

  printf("GPU %d\nA:%016lx %016lx %016lx %016lx %016lx\nZ:%016lx %016lx %016lx %016lx %016lx\n---\n",
         idx,
         apubkey2.x.n[0],apubkey2.x.n[1],apubkey2.x.n[2],apubkey2.x.n[3],apubkey2.x.n[4],
         apubkey2.y.n[0],apubkey2.y.n[1],apubkey2.y.n[2],apubkey2.y.n[3],apubkey2.y.n[4]
         );

OK, a simple apubkey2 = jpubkey works. So what is it that causes this weird behavior? To investigate, I wrote a small synthetic hrd256k1_fe_mul2:

Code:
static void
hrd256k1_fe_mul2(hrd256k1_fe *r, const hrd256k1_fe *a, const hrd256k1_fe *b) {
  r->n[0] = a->n[0] + b->n[0];
  r->n[1] = a->n[1] + b->n[1];
  r->n[2] = a->n[2] + b->n[2];
  r->n[3] = a->n[3] + b->n[3];
  r->n[4] = a->n[4] + b->n[4];
}

Guess what? Same problem (it doesn't even printf). Now if I comment out ANY of the r->n[i] = a->n[i] + b->n[i] lines, it works!
Even if I do

Code:
static void
hrd256k1_fe_mul2(hrd256k1_fe *r, const hrd256k1_fe *a, const hrd256k1_fe *b) {
  r->n[0] = a->n[0]; // + b->n[0];
  r->n[1] = a->n[1] + b->n[1];
  r->n[2] = a->n[2] + b->n[2];
  r->n[3] = a->n[3] + b->n[3];
  r->n[4] = a->n[4] + b->n[4];
}

It still works! What is going on???  Huh

Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  Past BURST Activities
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 20, 2017, 04:44:11 AM
#616

-snip-
It still works! What is going on???  Huh

Rico

What happens when you try to just printf the line you commented out?

Also this: http://stackoverflow.com/questions/1255099/whats-the-proper-use-of-printf-to-display-pointers-padded-with-0s

Buy or sell $100 of Crypto and get $10!
rico666 (OP)
Legendary
Activity: 1120
Merit: 1037
฿ → ∞
March 20, 2017, 03:20:40 PM
#617

What happens when you try to just printf the line you commented out?

Also this: http://stackoverflow.com/questions/1255099/whats-the-proper-use-of-printf-to-display-pointers-padded-with-0s


So, lessons learned and progress:

Never try to impose a data size on the GPU that it was not built for. Today's GPUs are 32-bit. Using 64-bit data types is a performance penalty (the GPU internally transforms them into sequences of 32-bit operations). Moreover, defining your own 128-bit arithmetic library using 64-bit types on the GPU ... will eventually work, after you do things to the GPU it was never designed for, but the GPU will not like it and will show performance consistent with its dislike...

Turns out there is a maximum number of assembler instructions per kernel, and of course I ran into it with my glorious 128-bit GPU library. Then the kernel simply crashes, or your host application gets a segmentation fault (from the OpenCL library), or <insert undefined behavior here>.

Printf on the GPU is nothing but a straw for the desperate GPU developer to clutch at.
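Not from the LBC sources, but for anyone hitting similar kernel trouble: dumping the OpenCL compiler log right after clBuildProgram() is one of the few host-side diagnostics available when a kernel exceeds resource limits or miscompiles. A minimal sketch using the standard OpenCL API; prog and dev are assumed to come from the usual clCreateProgramWithSource / clGetDeviceIDs setup:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the kernel compiler log - illustration only. */
static void dump_build_log(cl_program prog, cl_device_id dev) {
    size_t log_size = 0;
    /* first call: query the size of the log */
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = malloc(log_size + 1);
    /* second call: fetch the log itself */
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    log[log_size] = '\0';
    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}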


Back at the drawing board, I'm left with a highly optimized 64-bit ECC library on the CPU and the need for a (highly optimized) 32-bit library on the GPU - at least as long as I have parts of the computation on the CPU and parts on the GPU. Sounds like Frankenstein's monster? It is!

Computing with 5x52 fields on the CPU, pushing the data to the GPU, doing a 5x52 -> 10x26 conversion there, followed by 32-bit computations.

But it is surprisingly fast - so far - as the conversion is (I hope) a mere:

Code:
static void hrd256k1_fe_52to26(hrd256k1_fe32 *out, const hrd256k1_fe *in) {
  out->n[1] = in->n[0] & 0x3FFFFFFUL;
  out->n[0] = in->n[0] >> 26;
  out->n[3] = in->n[1] & 0x3FFFFFFUL;
  out->n[2] = in->n[1] >> 26;
  out->n[5] = in->n[2] & 0x3FFFFFFUL;
  out->n[4] = in->n[2] >> 26;
  out->n[7] = in->n[3] & 0x3FFFFFFUL;
  out->n[6] = in->n[3] >> 26;
  out->n[9] = in->n[4] & 0x3FFFFFFUL;
  out->n[8] = in->n[4] >> 26;
}

And the subsequent fe_mul etc. are done using the GPU-native data format. We'll see.
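A quick way to sanity-check the split is to repack in the other direction and compare against the original field element; a hypothetical inverse helper (not from the LBC sources, same field layouts as above):

Code:
/* Hypothetical inverse of hrd256k1_fe_52to26 - repack 10x26 limbs back
 * into 5x52 limbs; assumes the fe32 limbs are 32-bit and the fe limbs
 * are 64-bit, as in the snippet above. */
static void hrd256k1_fe_26to52(hrd256k1_fe *out, const hrd256k1_fe32 *in) {
  out->n[0] = ((uint64_t)in->n[0] << 26) | in->n[1];
  out->n[1] = ((uint64_t)in->n[2] << 26) | in->n[3];
  out->n[2] = ((uint64_t)in->n[4] << 26) | in->n[5];
  out->n[3] = ((uint64_t)in->n[6] << 26) | in->n[7];
  out->n[4] = ((uint64_t)in->n[8] << 26) | in->n[9];
}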


Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  Past BURST Activities
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 21, 2017, 06:18:35 AM
#618

-snip-

Rico,

What is the performance cost of emulating 64-bit as 32-bit?

Does it double the cost? For example, does a 64-bit word emulated in 32-bit operations use 100% of the GPU while native 32-bit words use 50%?

And am I wrong in assuming that even 32-bit is emulated, specifically on Pascal/Maxwell chips? I read the white paper and it says they do half integers as well.

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!
rico666 (OP)
Legendary
Activity: 1120
Merit: 1037
฿ → ∞
March 21, 2017, 08:03:11 AM
Last edit: March 21, 2017, 08:19:20 AM by rico666
#619

What is the performance cost of emulating 64-bit as 32-bit?

Does it double the cost? For example, does a 64-bit word emulated in 32-bit operations use 100% of the GPU while native 32-bit words use 50%?

OK, let me elaborate on this a little bit and give you some numbers, for better estimates of where we are and where we're going:

In my CPU/GPU combination, one CPU core puts an 8% load on the GPU, and that is a situation where a fairly strong CPU meets a midrange GPU (a 2.8 - 3.7 GHz Skylake E3 Xeon firing at a Quadro M2000M - see http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html). With a stronger GPU (a 1080) it's quite possible that one CPU core can put only a 5-6% load on the GPU.

The current development version of the generator gives me 9 Mkeys/s with all 4 physical cores running, whereas the published version (the one you can download from FTP) gives 7.5 Mkeys/s.

The main difference is that in the development version the bloom filter search is done on the GPU, and the final affine -> normalization -> 64-bytes step has also moved to the GPU, resulting in an overall speed improvement of about 375,000 keys/s per core (1.5 Mkeys/s across the 4 cores).

Up to now, the GPU has behaved like a "magic wand": giving it the bloom filter work didn't raise the GPU load, but it raised the key rate. This can be explained by the fact that the time the GPU needs to do the bloom filter search is basically the time it would otherwise need to transfer the hashed data back to the CPU (which does the bloom filter search in the current public version). Same with the affine transformation.

There is nothing left on the CPU except the (heavily optimized) EC computations, so any further speed improvement needs to push those to the GPU.
In terms of time, one 16M block currently takes around 6.25 seconds on my machine (if I let it compute 8 blocks, it takes 50 seconds - this mitigates the startup cost).

So I thought I'd emulate what's going on on the CPU and move the code over piece by piece. Going backwards, the step before the affine transformation is the Jacobi -> Affine transformation, where you need to compute the square and the cube of the Jacobi Z coordinate and multiply X by the former and Y by the latter. All in all, one field element sqr and 3 FE mul operations.

I did that with my 128-bit library (based on 64-bit data types) on the GPU and behold: GPU load went to 100% and the time per block went up to 16 seconds! Operation successful, patient dead.
-> Back to the drawing board.

The same with 32-bit data types is currently at 12% GPU load and 5.4 seconds per block (per CPU core). Very promising, but I'm hitting little/big-endianness brainwarp hell, so I have to figure out how to do it more elegantly.

Also, the new version will demand a more GPU-heavy setup before I can release it. As the bloom filter search is done on the GPU, an additional 512MB of GPU memory is used per process. Running 4 processes on my Maxwell GPU with its 4GB of VRAM is just fine (and as the memory can be freed from the CPU part of the generator, it takes only 100MB of host memory), but I also experienced segmentation faults with the Kepler machines on the Amazon cloud.

So the goal is really to have one CPU core able to put at least a 50% load on one GPU.

It's no small engineering feat, but at the moment LBC is the fastest key generator on the planet (some 20% faster than oclvanitygen), and I believe twice the speed of oclvanitygen is achievable. That's my goal and motivation, and I still have some 65% of my GPU capacity left to tap to get there.

Quote
And am I wrong in assuming that even 32-bit is emulated, specifically on Pascal/Maxwell chips? I read the white paper and it says they do half integers as well.

I'm not familiar in detail with the specific hardware internals. At the moment I have a Maxwell chip for my testing, and I will tend to support newer architectures/chip families rather than the old stuff. Another way to put it: I will not sacrifice any speed to support "old" chips from 2009. ;-)

Sidenote:

If anyone wants to be at the true forefront of development and have a great workstation-replacement notebook, buy a Lenovo P50 (maybe a P51 to be slightly ahead), because that's what I am developing on and LBC will naturally be slightly tailored to it. For example, it also has an Intel GPU, which I am using for the display. So currently I can work with the notebook basically without any limitations (the Intel graphics are untouched, and as I have the 4 logical cores for my interaction I can watch videos, browse etc.) while the notebook is churning out 9 Mkeys/s. OK, the fan noise is distracting, because normally the notebook is fine with passive cooling. Wink



Rico

all non self-referential signatures except mine are lame ... oh wait ...   ·  LBC Thread (News)  ·  Past BURST Activities
Jude Austin
Legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 22, 2017, 07:29:21 AM
#620

-snip-

Rico,

Why does it require a bloom filter for each process? Couldn't the BF be loaded into VRAM once and then each process reference that single instance?

I am not questioning your work, just digging for information to get a better understanding.

And shit, those are nice workstations at a pretty decent price; kind of pissed I bought an MSI GS70 Stealth Pro... ~6 Mkeys/s.

Thanks,
Jude

Buy or sell $100 of Crypto and get $10!