About proxying wget http 1.0 using http 1.1 headers

Sep 21, 2011   #HTTP  #Node.js  #Wget 

When a friend tried to syndicate my blog on his, his server was unable to complete the sync: the page http://blog.jtlebi.fr/feed/ simply timed out. After quite a few tests, we noticed that this issue never happened with a browser like Firefox. Oddly enough, Wget hung for 2 minutes after downloading more data than Firefox did. Strange.

In my previous post, I explained that WordPress is hosted behind Apache2, with Apache2 itself reachable behind my home-made reverse proxy. The main goal is to host all services on port 80.

                            |<-----> Apache (WordPress and more)
Client <----> Reverse-Proxy |<-----> Etherpad
                            |<-----> Cloud 9
                            |...

Using tcpdump, we noticed that the packet with the “FIN” flag set was never sent by the client. Even stranger, Wget received more data than Firefox.

After a few hours of investigation, it appeared that Wget was sending a “Keep-Alive” header to keep the connection open while using HTTP version 1.0. “Keep-Alive” is an illegal header in HTTP 1.0, as it was only introduced with the 1.1 revision. This is actually a known bug. A workaround is to invoke Wget with the “--no-http-keep-alive” command-line option.
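
For reference, the client-side workaround is simply:

    wget --no-http-keep-alive http://blog.jtlebi.fr/feed/

and the offending request looks roughly like this (a typical Wget 1.12 request reconstructed from memory rather than copied from the capture, so the exact values may differ):

    GET /feed/ HTTP/1.0
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Host: blog.jtlebi.fr
    Connection: Keep-Alive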

Wireshark shows that Wget uses the illegal “Keep-Alive” header with HTTP version 1.0

The real reason why Wget avoids version 1.1 is that it does not understand the “Transfer-Encoding: chunked” header, which is a shame, by the way. Since the answer was encoded this way, it embedded chunk-size information that Wget interpreted as regular content, making the resulting file both bigger and corrupted.
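
To illustrate, here is the textbook example of a chunked body (a simplified illustration, not taken from my actual feed): the payload “Wikipedia” is split into two chunks whose sizes are announced in hexadecimal, each line being terminated by CRLF:

    4
    Wiki
    5
    pedia
    0

A client that only speaks HTTP/1.0 and ignores the Transfer-Encoding header keeps the “4”, “5” and “0” size markers and the extra line endings as if they were part of the document. That is exactly the kind of bigger, corrupted file Wget was producing.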

Since I cannot force every visitor who wants to wget from my website to use this workaround, I had to “hardcode” a way to force HTTP/1.0 when proxying for an HTTP/1.0 client. According to the “http” module documentation of node.js, the server automatically adapts its response to the request’s protocol version. This is great. But since my reverse-proxy implementation just streams raw answers back to the client, I needed a way to forward the request with the same version in order to get a compatible answer. Sadly, this is not (yet) possible.
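
To make the problem concrete, here is a minimal sketch in the spirit of my proxy (not my actual code: the backend address and port are made up, and the raw-socket forwarding is only there for illustration). The incoming request is parsed by the http module, re-emitted towards the backend, and the raw answer is piped straight back to the client socket:

    var http = require('http');
    var net  = require('net');

    http.createServer(function (req, res) {
      // Raw TCP connection to the backend Apache2 (address and port are assumptions)
      var backend = net.connect(8080, '127.0.0.1');

      // The request is always re-emitted as HTTP/1.1, even when
      // req.httpVersion is '1.0' -- this is the missing knob.
      backend.write(req.method + ' ' + req.url + ' HTTP/1.1\r\n');
      Object.keys(req.headers).forEach(function (name) {
        backend.write(name + ': ' + req.headers[name] + '\r\n');
      });
      backend.write('\r\n');
      req.pipe(backend);

      // The raw answer is streamed back untouched: if Apache2 replies with
      // "Transfer-Encoding: chunked", an HTTP/1.0-only client such as Wget
      // cannot interpret it.
      backend.pipe(req.connection);
    }).listen(80);

Using node’s http module for the outgoing request instead does not help: the client side of the http module always speaks HTTP/1.1 to the backend, regardless of what the original client sent.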

I suggested a fix on GitHub, which is currently under review by the node.js team for the master and v0.4 branches, to address this missing feature. Wait and see :)

UPDATE 25/09/2011:

Here is a copy of a comment I posted on GitHub which sums the situation up well:

Actually, the “end” event seems to be fired when the data stream ends. It is not linked to the underlying socket. Etherpad (like most software) relies on the ability of HTTP/1.1 to keep a connection alive. This is the behaviour broken by the “destroy” called on the “end” event.

The actual bug I was facing was hidden deep inside. It occurred only when proxying HTTP/1.0 requests. To make things even more complicated, Wget cheats and uses the HTTP/1.1 “Keep-Alive” header. I tried to clarify all this in this blog post: http://blog.jtlebi.fr/2011/09/21/about-proxying-wget-http-1-0-using-http-1-1-headers/

Currently, I rely on a patch I wrote to add HTTP/1.0 support to node.js’s http library to fix this. My pull request will probably never be merged, as this truly is a “legacy feature” :) The best solution would be to fix Wget 😀

If a need appears for a real long-term fix, I can either embed a patched version of the http library or implement a real HTTP/1.0 <----> HTTP/1.1 proxy. The biggest concern is “Transfer-Encoding: chunked”, which is almost always used by Apache2 but is not available in HTTP/1.0. Answering an HTTP/1.0 client therefore requires the full file size to be known when the transfer starts, which means the proxy has to buffer (cache) the whole response whenever this transfer encoding is used.

Let me know if a better fix would be a good idea.
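
For the record, the buffering side of such an HTTP/1.0 <----> HTTP/1.1 proxy could look roughly like the sketch below. This is nothing more than an idea, written against a more recent node API than the v0.4 discussed above, with a made-up backend address and port: the whole (possibly chunked) answer is buffered so that a Content-Length can be announced to the HTTP/1.0 client.

    var http = require('http');

    http.createServer(function (req, res) {
      var proxyReq = http.request({
        host: '127.0.0.1',   // backend Apache2 (assumption)
        port: 8080,          // assumption
        method: req.method,
        path: req.url,
        headers: req.headers
      }, function (proxyRes) {
        // Buffer the entire answer: HTTP/1.0 has no chunked encoding,
        // so the size must be known before the transfer starts.
        var chunks = [];
        proxyRes.on('data', function (chunk) { chunks.push(chunk); });
        proxyRes.on('end', function () {
          var body = Buffer.concat(chunks);
          var headers = proxyRes.headers;
          delete headers['transfer-encoding'];
          headers['content-length'] = body.length;
          res.writeHead(proxyRes.statusCode, headers);
          res.end(body);
        });
      });

      req.pipe(proxyReq);
    }).listen(80);

This trades memory for compatibility, which is exactly the caching concern mentioned above.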