Bitcoin Forum
April 26, 2024, 02:20:29 PM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: [1]
  Print  
Author Topic: Something odd  (Read 964 times)
dree12 (OP)
Legendary
*
Offline Offline

Activity: 1246
Merit: 1077



View Profile
July 26, 2013, 09:16:53 PM
 #1

I was making some expansions to this thread recently. When I saved the post, it was cut off. However, the post is well under the 65535-character limit:

(firefox)
Code:
[17:16:06.713] post.length
[17:16:06.719] 62075

No notice came up; the post was just cut off. The post preview worked as expected.

Why, then, was the post cut off?
1714141229
Hero Member
*
Offline Offline

Posts: 1714141229

View Profile Personal Message (Offline)

Ignore
1714141229
Reply with quote  #2

1714141229
Report to moderator
The grue lurks in the darkest places of the earth. Its favorite diet is adventurers, but its insatiable appetite is tempered by its fear of light. No grue has ever been seen by the light of day, and few have survived its fearsome jaws to tell the tale.
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1714141229
Hero Member
*
Offline Offline

Posts: 1714141229

View Profile Personal Message (Offline)

Ignore
1714141229
Reply with quote  #2

1714141229
Report to moderator
1714141229
Hero Member
*
Offline Offline

Posts: 1714141229

View Profile Personal Message (Offline)

Ignore
1714141229
Reply with quote  #2

1714141229
Report to moderator
theymos
Administrator
Legendary
*
Offline Offline

Activity: 5180
Merit: 12900


View Profile
July 26, 2013, 10:47:48 PM
 #2

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
dree12 (OP)
Legendary
*
Offline Offline

Activity: 1246
Merit: 1077



View Profile
July 26, 2013, 11:59:33 PM
 #3

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.
theymos
Administrator
Legendary
*
Offline Offline

Activity: 5180
Merit: 12900


View Profile
July 27, 2013, 12:16:21 AM
 #4

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
nimda
Hero Member
*****
Offline Offline

Activity: 784
Merit: 1000


0xFB0D8D1534241423


View Profile
July 27, 2013, 12:18:26 AM
 #5

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.
Remember, you're going into HTML. For example, an ampersand (one byte in UTF-8) must become "&amp;" (5 bytes) or "&#38;" (5 bytes). Newlines are encoded as "<br />" which is 6 bytes.
dree12 (OP)
Legendary
*
Offline Offline

Activity: 1246
Merit: 1077



View Profile
July 27, 2013, 12:38:22 AM
 #6

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
theymos
Administrator
Legendary
*
Offline Offline

Activity: 5180
Merit: 12900


View Profile
July 27, 2013, 12:43:24 AM
 #7

No, numbers aren't translated.

1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
btcton
Legendary
*
Offline Offline

Activity: 1288
Merit: 1007


View Profile
July 27, 2013, 07:45:05 PM
 #8

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

The signature campaign posters adding useless redundant fluff to their posts to reach their minimum word count are lowering my IQ.
Foxpup
Legendary
*
Offline Offline

Activity: 4340
Merit: 3042


Vile Vixen and Miss Bitcointalk 2021-2023


View Profile
July 28, 2013, 03:32:13 AM
 #9

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "©". You've probably seen this before on sites that don't perform this conversion correctly.

Will pretend to do unspeakable things (while actually eating a taco) for bitcoins: 1K6d1EviQKX3SVKjPYmJGyWBb1avbmCFM4
I am not on the scammers' paradise known as Telegram! Do not believe anyone claiming to be me off-forum without a signed message from the above address! Accept no excuses and make no exceptions!
dree12 (OP)
Legendary
*
Offline Offline

Activity: 1246
Merit: 1077



View Profile
July 28, 2013, 03:35:49 AM
 #10

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "©". You've probably seen this before on sites that don't perform this conversion correctly.

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.
Foxpup
Legendary
*
Offline Offline

Activity: 4340
Merit: 3042


Vile Vixen and Miss Bitcointalk 2021-2023


View Profile
July 28, 2013, 03:49:33 AM
 #11

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages. If it actually does store posts in UTF-8 (or any other encoding), it would have to perform the conversion every time a page is requested, which seems rather wasteful.

Will pretend to do unspeakable things (while actually eating a taco) for bitcoins: 1K6d1EviQKX3SVKjPYmJGyWBb1avbmCFM4
I am not on the scammers' paradise known as Telegram! Do not believe anyone claiming to be me off-forum without a signed message from the above address! Accept no excuses and make no exceptions!
justusranvier
Legendary
*
Offline Offline

Activity: 1400
Merit: 1009



View Profile
July 28, 2013, 04:02:24 AM
 #12

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
nimda
Hero Member
*****
Offline Offline

Activity: 784
Merit: 1000


0xFB0D8D1534241423


View Profile
July 28, 2013, 04:16:47 AM
 #13

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Code:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
btcton
Legendary
*
Offline Offline

Activity: 1288
Merit: 1007


View Profile
July 28, 2013, 04:21:51 AM
 #14

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "©". You've probably seen this before on sites that don't perform this conversion correctly.
Oh, I see. That's weird, nowadays quite a few websites use Unicode.

The signature campaign posters adding useless redundant fluff to their posts to reach their minimum word count are lowering my IQ.
Foxpup
Legendary
*
Offline Offline

Activity: 4340
Merit: 3042


Vile Vixen and Miss Bitcointalk 2021-2023


View Profile
July 28, 2013, 04:34:47 AM
 #15

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.

Will pretend to do unspeakable things (while actually eating a taco) for bitcoins: 1K6d1EviQKX3SVKjPYmJGyWBb1avbmCFM4
I am not on the scammers' paradise known as Telegram! Do not believe anyone claiming to be me off-forum without a signed message from the above address! Accept no excuses and make no exceptions!
dree12 (OP)
Legendary
*
Offline Offline

Activity: 1246
Merit: 1077



View Profile
July 28, 2013, 02:23:56 PM
 #16

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.

Again, its character encoding doesn't support Unicode, but the forum does use Unicode. HTML entities are a form of Unicode encoding too.
Pages: [1]
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!