Bitcoin Forum

Other => Meta => Topic started by: dree12 on July 26, 2013, 09:16:53 PM



Title: Something odd
Post by: dree12 on July 26, 2013, 09:16:53 PM
I was making some expansions to this thread (https://bitcointalk.org/index.php?topic=83794.msg923918#msg923918) recently. When I saved the post, it was cut off. However, the post is well under the 65535-character limit:

(firefox)
Code:
[17:16:06.713] post.length
[17:16:06.719] 62075

No notice came up; the post was just cut off. The post preview worked as expected.

Why, then, was the post cut off?


Title: Re: Something odd
Post by: theymos on July 26, 2013, 10:47:48 PM
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.


Title: Re: Something odd
Post by: dree12 on July 26, 2013, 11:59:33 PM
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.


Title: Re: Something odd
Post by: theymos on July 27, 2013, 12:16:21 AM
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.


Title: Re: Something odd
Post by: nimda on July 27, 2013, 12:18:26 AM
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.
Remember, you're going into HTML. For example, an ampersand (one byte in UTF-8) must become "&amp;" (5 bytes) or "&#38;" (5 bytes). Newlines are encoded as "<br />" which is 6 bytes.
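
A rough sketch of that arithmetic in Python. This is not SMF's actual conversion code: `html.escape` and the `xmlcharrefreplace` error handler stand in for it, and `stored_form` and `stored_size` are hypothetical names used only for illustration. The 65535-byte limit is the one quoted earlier in the thread.

```python
import html

LIMIT = 65535  # byte limit quoted in the thread


def stored_form(post: str) -> str:
    """Approximate the text as stored (assumption: not SMF's real code)."""
    # Escape HTML specials first, so the "&" in later entities is not re-escaped.
    escaped = html.escape(post, quote=True)
    # Replace every non-ASCII character with a numeric entity such as &#169;.
    escaped = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Each newline becomes the 6-byte string "<br />".
    return escaped.replace("\n", "<br />")


def stored_size(post: str) -> int:
    """Size in bytes of the approximated stored text."""
    return len(stored_form(post).encode("ascii"))


if __name__ == "__main__":
    sample = "a & b\n"
    print(len(sample), stored_size(sample), stored_size(sample) <= LIMIT)
```

Under these assumptions letters and digits stay at one byte each, so a 62075-character post only overflows if it has enough newlines, markup characters, or non-ASCII characters to add roughly 3500 bytes.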


Title: Re: Something odd
Post by: dree12 on July 27, 2013, 12:38:22 AM
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
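
A minimal sketch of that idea, assuming a toy renderer. `render` and `StoredPost` are hypothetical names, and the renderer handles only `[b]` tags and newlines; it does not reflect how SMF or any other real software stores posts.

```python
import html
from dataclasses import dataclass, field


def render(bbcode: str) -> str:
    """Hypothetical renderer: escapes HTML, handles [b]...[/b] and newlines only."""
    out = html.escape(bbcode)
    out = out.replace("[b]", "<b>").replace("[/b]", "</b>")
    return out.replace("\n", "<br />")


@dataclass
class StoredPost:
    source: str                                # original BBCode, kept as typed
    cached_html: str = field(init=False)       # rendered once per save
    renders: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self._rerender()

    def _rerender(self) -> None:
        self.cached_html = render(self.source)
        self.renders += 1

    def edit(self, new_source: str) -> None:
        # The only place a translation happens after creation.
        self.source = new_source
        self._rerender()

    def view(self) -> str:
        # Page views read the cache and never re-render.
        return self.cached_html


if __name__ == "__main__":
    post = StoredPost("[b]1 < 2[/b]\n")
    print(post.view(), post.renders)
```

The cost is storing two copies of every post; the gain is that opening the edit form needs no reverse translation and viewing needs no translation at all.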


Title: Re: Something odd
Post by: theymos on July 27, 2013, 12:43:24 AM
No, numbers aren't translated.


Title: Re: Something odd
Post by: btcton on July 27, 2013, 07:45:05 PM
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
Only stuff that conflicts with HTML such as "<" or sometimes JavaScript needs to be translated. Normal characters should be no more than one byte each.


Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 03:32:13 AM
Only stuff that conflicts with HTML such as "<" or sometimes JavaScript needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.
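
The effect can be reproduced in a few lines of Python. This is only an illustration of the encoding mismatch, assuming the text arrives as UTF-8 bytes and is read as ISO-8859-1; the function names are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_entity(text: str) -> str:
    """Replace non-ASCII characters with numeric HTML entities."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print("©".encode("utf-8"))      # the two bytes UTF-8 uses for the symbol
    print(misread_as_latin1("©"))   # two characters: A-circumflex, then the symbol
    print(to_entity("©"))           # pure-ASCII entity form
```

The copyright sign is two bytes in UTF-8 (0xC2 0xA9), and ISO-8859-1 maps each byte to its own character, which is where the stray "Â" comes from.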


Title: Re: Something odd
Post by: dree12 on July 28, 2013, 03:35:49 AM
Only stuff that conflicts with HTML such as "<" or sometimes JavaScript needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

That should say "UTF-8" rather than "Unicode". Just a minor correction, as the forum does indeed use Unicode, but its character encoding cannot represent most Unicode characters directly.


Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 03:49:33 AM
That should say "UTF-8" rather than "Unicode". Just a minor correction, as the forum does indeed use Unicode, but its character encoding cannot represent most Unicode characters directly.
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages. If it actually does store posts in UTF-8 (or any other encoding), it would have to perform the conversion every time a page is requested, which seems rather wasteful.


Title: Re: Something odd
Post by: justusranvier on July 28, 2013, 04:02:24 AM
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?


Title: Re: Something odd
Post by: nimda on July 28, 2013, 04:16:47 AM
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Code:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />


Title: Re: Something odd
Post by: btcton on July 28, 2013, 04:21:51 AM
Only stuff that conflicts with HTML such as "<" or sometimes JavaScript needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.
Oh, I see. That's weird; nowadays quite a few websites use Unicode.


Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 04:34:47 AM
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.
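
The 8-byte figure can be checked with Python's built-in `xmlcharrefreplace` error handler, which writes decimal numeric entities. `as_ascii_entities` is a hypothetical helper name; the sample character is U+4E2D, a common Chinese character.

```python
def as_ascii_entities(text: str) -> bytes:
    """Encode text as ASCII, writing non-ASCII characters as &#NNNNN; entities."""
    return text.encode("ascii", "xmlcharrefreplace")


if __name__ == "__main__":
    ch = "\u4e2d"
    entity = as_ascii_entities(ch)
    print(entity, len(entity))        # entity form and its byte count
    print(len(ch.encode("utf-8")))    # byte count of the same character in UTF-8
```

Any code point with a five-digit decimal value takes 8 bytes this way, against 3 bytes in UTF-8 for most Chinese characters. Code points at or above 100000 would need 9 or more.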


Title: Re: Something odd
Post by: dree12 on July 28, 2013, 02:23:56 PM
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.

Again, its character encoding doesn't support Unicode, but the forum does use Unicode. HTML entities are a form of Unicode encoding too.