Bitcoin Forum

Other => Off-topic => Topic started by: theymos on March 21, 2013, 05:40:16 AM



Title: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: theymos on March 21, 2013, 05:40:16 AM
Look at this page:
https://bitcointalk.org/test.php

The form is pre-filled with a &nbsp;. If I submit it, I get "a0", indicating that the browser sent the special 0xA0 "non-breaking space" character instead of a regular space. This isn't normal, and it's causing several problems for the forum software. This behavior started when I upgraded PHP and switched to nginx before switching servers, and it has persisted after the server switch. So it's probably a problem with the nginx or PHP configuration.

Any ideas on how to fix this?

The code for test.php:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title>asdf</title>
</head>
<body>
<form action="" method="post">
<input name="test" type="text" value="&nbsp;" />
<input type="submit" />
</form>
<?php
if(isset($_REQUEST['test']))
        echo '<p>'.bin2hex($_REQUEST['test']).'</p>';
?>

</body></html>


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: gweedo on March 21, 2013, 05:48:36 AM
I am using PHP 5.3 on Apache on Mac OS X 10.8 (dev server). I just tried that snippet and it gives me the same thing, so this is probably a PHP 5+ problem.


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: gweedo on March 21, 2013, 05:51:09 AM
I just tried

Code:
<!DOCTYPE HTML>
<html lang="en">
<title>asdf</title>
</head>
<body>
<form action="" method="post">
<input name="test" type="text" value="&nbsp;" />
<input type="submit" />
</form>
<?php
if(isset($_REQUEST['test']))
        echo '<p>'.bin2hex($_REQUEST['test']).'</p>';
?>

</body></html>

which is HTML5, and I got c2a0.
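
The two results, "a0" and "c2a0", look like the same character (U+00A0) in two different encodings. The minimal sketch below assumes the first page was submitted as ISO-8859-1 (as its meta tag declares) and that the second page, which declares no charset, was treated as UTF-8. It decodes &nbsp; with PHP's html_entity_decode under each charset; the variable names are only illustrative.

```php
<?php
// Decode the &nbsp; entity under the two charsets assumed above.
$latin1 = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1');
$utf8   = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

echo bin2hex($latin1), "\n"; // a0   (one byte in ISO-8859-1)
echo bin2hex($utf8), "\n";   // c2a0 (two bytes in UTF-8)
echo bin2hex(' '), "\n";     // 20   (a regular space, for comparison)
```

Either way the browser is sending a no-break space, never the 0x20 of a regular space.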


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: theymos on March 21, 2013, 06:05:01 AM
I looked at the XHTML/HTML standards, and &nbsp; actually is defined as being 0xA0. But I'm pretty sure that my browser didn't use to behave this way...


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: theymos on March 21, 2013, 06:18:49 AM
This causes problems because SMF converts multiple spaces into a series of &nbsp; entities. So "  " becomes "&nbsp; ". When the entities are converted to 0xA0, it causes at least these problems:
- When quoting a PM which has multiple spaces, you will end up submitting a message containing 0xA0 characters. This confuses the forum's mail processing and the recipient ends up being sent a PM notification email with an empty message. If you receive a lot of PMs you've probably noticed this.
- There is no way for the forum to correctly display a clearsigned document if it has multiple spaces, even with [code] tags. Copying it will copy the non-breaking spaces (though only on some systems, I think) and it won't sign/verify consistently.
- I can't use the forum's file browser because submitting a file with any 0xA0s messes things up for some reason.
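
The round trip described above can be sketched roughly as follows. This is not SMF's code: both function names are hypothetical, the first only mimics the "two spaces become &nbsp; plus a space" rule mentioned in the post, and the second is one possible clean-up that assumes single-byte ISO-8859-1 input (on UTF-8 input a bare 0xA0 replacement would corrupt multi-byte characters).

```php
<?php
// Hypothetical stand-in for the forum's conversion: each pair of spaces
// becomes "&nbsp; ", as described in the post above.
function spaces_to_nbsp_entities($text)
{
    return str_replace('  ', '&nbsp; ', $text);
}

// Hypothetical clean-up for ISO-8859-1 input only: turn the single byte
// 0xA0 back into a regular space (0x20).
function nbsp_bytes_to_spaces($latin1Text)
{
    return str_replace("\xA0", ' ', $latin1Text);
}

$html = spaces_to_nbsp_entities('a  b');
echo $html, "\n"; // a&nbsp; b

// What a browser would post back from an ISO-8859-1 page: 'a', 0xA0, ' ', 'b'.
$posted = "a\xA0 b";
echo bin2hex($posted), "\n";                       // 61a02062
echo bin2hex(nbsp_bytes_to_spaces($posted)), "\n"; // 61202062
```

Whether normalizing like this is appropriate depends on where the text is used; a clearsigned document, for example, needs the original bytes preserved exactly.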


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: chmod755 on March 21, 2013, 07:07:26 AM
Did you try the same

  • in a different browser?
  • with a different charset?
(utf-8 or something)


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: 🏰 TradeFortress 🏰 on March 21, 2013, 08:18:02 AM
Quote from: chmod755 on March 21, 2013, 07:07:26 AM
    Did you try the same

      • in a different browser?
      • with a different charset?
    (utf-8 or something)

It's intended behavior (&nbsp; -> 0xA0), but SMF doesn't like it.

What about removing SMF's conversion of multiple spaces to &nbsp;?


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: davout on March 21, 2013, 08:22:36 AM
nbsp -> non-breakable space oO
That's like... by design


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: Bitsky on March 21, 2013, 06:36:55 PM
It's sent like that by the browser. Just run tcpdump and you'll see.

Firefox/Opera send the POST request string "test=%C2%A0", while curl sends "test=&nbsp;".

http://en.wikipedia.org/wiki/Non-breaking_space#Encodings
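
For reference, the percent-encoded form seen in the capture can be reproduced with PHP's own URL functions. This is a small sketch, assuming the browser encoded the field as UTF-8; the variable names are only illustrative.

```php
<?php
// A no-break space as UTF-8 bytes and as a single ISO-8859-1 byte.
$utf8Nbsp   = "\xC2\xA0";
$latin1Nbsp = "\xA0";

echo rawurlencode($utf8Nbsp), "\n";   // %C2%A0
echo rawurlencode($latin1Nbsp), "\n"; // %A0

// The literal six-character text "&nbsp;" is a different thing entirely:
// it is an ampersand, four letters and a semicolon, not a no-break space.
echo rawurlencode('&nbsp;'), "\n";    // %26nbsp%3B

// Decoding the browser's form value gives back the raw bytes.
echo bin2hex(urldecode('%C2%A0')), "\n"; // c2a0
```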


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: Foxpup on March 22, 2013, 10:17:09 AM
This is exactly what's supposed to happen. The default value is a non-breaking space, so the browser sends a non-breaking space. What did you expect? No browser will send a regular space in this situation. Perhaps something in the old software was silently converting non-breaking spaces to regular spaces? Otherwise it should never have worked in the first place if non-breaking spaces are such a problem.


Title: Re: Why is HTML &nbsp; being converted to the 0xA0 character instead of a space?
Post by: theymos on March 23, 2013, 04:01:02 AM
Ha! I figured out what causes most of these problems. (Though I guess &nbsp; has been the special non-breaking space character in all browsers for at least several years.)

From the PHP manual, on the encoding argument of htmlspecialchars:
    If omitted, the default value for this argument is ISO-8859-1 in versions of PHP prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.

With PHP < 5.4.0, 0xA0 was passed through normally. Now, a lone 0xA0 byte is considered invalid UTF-8, so htmlspecialchars discards the entire input and returns an empty string.
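
The behavior described above can be reproduced with a minimal sketch like the one below. It is not SMF's code; the input string is made up, and the charset is passed explicitly so the result does not depend on which PHP version's default applies.

```php
<?php
// 'a', a lone 0xA0 byte, 'b': valid ISO-8859-1, but not valid UTF-8.
$input = "a\xA0b";

// Treated as UTF-8, the invalid byte makes the whole result an empty string.
var_dump(htmlspecialchars($input, ENT_QUOTES, 'UTF-8')); // string(0) ""

// Naming the real charset passes the byte through unchanged.
echo bin2hex(htmlspecialchars($input, ENT_QUOTES, 'ISO-8859-1')), "\n"; // 61a062

// ENT_SUBSTITUTE (PHP 5.4+) keeps the rest of the text and replaces the
// invalid byte with U+FFFD instead of discarding everything.
echo bin2hex(htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')), "\n"; // 61efbfbd62
```

Passing the charset explicitly at each htmlspecialchars call is one way such code could be made independent of the changed default, though that still means touching every call site.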

This needs to be fixed in SMF. It affects even 2.x. I fixed it for the PM emails, but it'd be too much work to fix it everywhere.

I'm getting pretty sick of dealing with SMF's escaping insanity... Who thought it was a good idea to have every function take strings with different degrees of escaping?