Title: Something odd
Post by: dree12 on July 26, 2013, 09:16:53 PM

I was making some expansions to this thread (https://bitcointalk.org/index.php?topic=83794.msg923918#msg923918) recently. When I saved the post, it was cut off. However, the post is well under the 65535-character limit:
(firefox)
Code:
[17:16:06.713] post.length

No notice came up; the post was just cut off. The post preview worked as expected. Why, then, was the post cut off?

Title: Re: Something odd
Post by: theymos on July 26, 2013, 10:47:48 PM

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.
There should be some sort of warning if you trigger this.

Title: Re: Something odd
Post by: dree12 on July 26, 2013, 11:59:33 PM

Quote from: theymos
    There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines. There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines. Thank you for inserting my post, though.

Title: Re: Something odd
Post by: theymos on July 27, 2013, 12:16:21 AM

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.
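
The gap between the character count and the stored size can be illustrated with a rough sketch. This is not SMF's actual code: `stored_form`, `stored_size`, and `fits` are hypothetical names, and the conversion rules are an assumption based only on the description above (HTML specials and non-ASCII characters become entities, newlines become `<br />`).

```python
import html

LIMIT_BYTES = 65535  # the limit quoted in this thread


def stored_form(post: str) -> str:
    """Approximate the text as it might be stored (assumed rules, not SMF's code)."""
    # Escape &, <, >, and quotes first so the <br /> added next is left alone.
    escaped = html.escape(post, quote=True)
    # Each newline becomes the six-character string "<br />".
    escaped = escaped.replace("\n", "<br />")
    # Every non-ASCII character becomes a numeric entity such as "&#169;".
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def stored_size(post: str) -> int:
    """Size in bytes of the approximated stored form (pure ASCII, so len works)."""
    return len(stored_form(post))


def fits(post: str) -> bool:
    """True if the approximated stored form stays within the limit."""
    return stored_size(post) <= LIMIT_BYTES


if __name__ == "__main__":
    post = "line\n" * 10000           # 50000 characters as typed
    print(len(post), stored_size(post), fits(post))
```

Under these assumptions a post can look comfortably short in the browser (`post.length`) and still exceed the limit once every newline has grown to six bytes.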

Title: Re: Something odd
Post by: nimda on July 27, 2013, 12:18:26 AM

Quote from: dree12
    Quote from: theymos
        There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines. There should be some sort of warning if you trigger this.
    This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines. Thank you for inserting my post, though.

Title: Re: Something odd
Post by: dree12 on July 27, 2013, 12:38:22 AM

Quote from: theymos
    SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.

Title: Re: Something odd
Post by: theymos on July 27, 2013, 12:43:24 AM

No, numbers aren't translated.
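
dree12's suggestion above (keep the source text and a cached HTML rendering, and re-render only when the post is edited) could look roughly like this. It is a minimal sketch, not how SMF or MediaWiki is implemented: `render` is a hypothetical stand-in for a real BBCode parser, and `Post` and `render_count` exist only for illustration.

```python
import html
from dataclasses import dataclass, field


def render(source: str) -> str:
    """Hypothetical stand-in for a BBCode-to-HTML renderer."""
    return html.escape(source).replace("\n", "<br />")


@dataclass
class Post:
    source: str                                   # text as the author typed it
    rendered: str = field(init=False)             # cached HTML translation
    render_count: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        self._rerender()

    def _rerender(self) -> None:
        self.rendered = render(self.source)
        self.render_count += 1

    def edit(self, new_source: str) -> None:
        """Store the new source and refresh the cache once."""
        self.source = new_source
        self._rerender()

    def view(self) -> str:
        """Serving a page view only reads the cache; nothing is re-rendered."""
        return self.rendered


if __name__ == "__main__":
    post = Post("1 < 2\nstill true")
    for _ in range(1000):
        post.view()
    print(post.render_count)
```

The trade-off is the one dree12 names: the text is stored twice, in exchange for one translation per edit instead of one per page view.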

Title: Re: Something odd
Post by: btcton on July 27, 2013, 07:45:05 PM

Quote from: dree12
    Quote from: theymos
        SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.
    Are numbers translated too? Because if so, it would seem they are translated back...
    Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.

Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 03:32:13 AM

Quote
    Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Title: Re: Something odd
Post by: dree12 on July 28, 2013, 03:35:49 AM

Quote from: Foxpup
    Quote
        Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.
    The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.
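
The display problem Foxpup describes can be reproduced in a few lines. This sketch assumes the usual cause of that symptom: the character is written out as UTF-8 bytes, and those bytes are then read as ISO-8859-1. The function names are made up for the example.

```python
import html


def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as ISO-8859-1 (the mismatch)."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(misread_as_latin1("©"))      # the two UTF-8 bytes show up as two characters
    print(to_numeric_entities("©"))    # pure ASCII, safe under either charset
    # The named and numeric entities refer to the same character.
    print(html.unescape("&copy;") == html.unescape("&#169;"))
```

Because the entity form is plain ASCII, it survives regardless of which of the two charsets the page is read with.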

Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 03:49:33 AM

Quote from: dree12
    Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages. If it actually does store posts in UTF-8 (or any other encoding), it would have to perform the conversion every time a page is requested, which seems rather wasteful.

Title: Re: Something odd
Post by: justusranvier on July 28, 2013, 04:02:24 AM

Quote from: Foxpup
    The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.

Really? In 2013?

Title: Re: Something odd
Post by: nimda on July 28, 2013, 04:16:47 AM

Quote from: justusranvier
    Quote from: Foxpup
        The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
    Really? In 2013?

Code:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

Title: Re: Something odd
Post by: btcton on July 28, 2013, 04:21:51 AM

Quote from: Foxpup
    Quote
        Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.
    The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Title: Re: Something odd
Post by: Foxpup on July 28, 2013, 04:34:47 AM

Quote from: justusranvier
    Quote from: Foxpup
        The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
    Really? In 2013?

Title: Re: Something odd
Post by: dree12 on July 28, 2013, 02:23:56 PM

Quote from: justusranvier
    Quote from: Foxpup
        The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
    Really? In 2013?

Again, its character encoding doesn't support Unicode, but the forum does use Unicode. HTML entities are a form of Unicode encoding too.
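
dree12's closing point, that numeric HTML entities name Unicode code points even on a page served as ISO-8859-1, can be checked with a small round trip. This is an illustrative sketch with hypothetical function names, and it is simplified: it only escapes characters the charset lacks, whereas the forum reportedly converts all non-ASCII characters, and it ignores escaping of literal `&`.

```python
import html


def fits_latin1(ch: str) -> bool:
    """True if the character has a byte of its own in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def page_bytes(text: str) -> bytes:
    """Serialize text for an ISO-8859-1 page, falling back to &#NNN; entities."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def browser_view(raw: bytes) -> str:
    """Decode the page bytes as ISO-8859-1, then resolve character references."""
    return html.unescape(raw.decode("iso-8859-1"))


if __name__ == "__main__":
    text = "© €"
    raw = page_bytes(text)
    print(fits_latin1("©"), fits_latin1("€"))
    print(raw)
    print(browser_view(raw) == text)
```

The number inside the entity is the character's Unicode code point (`ord("€")`), which is why a page in a single-byte charset can still display characters that charset cannot encode directly.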