Tracking file downloads

Posted by plattapuss on March 7th, 2007

Today a client wanted to track file downloads. Sure, I told them, that's easy with Urchin, which they happen to have installed already. No, no, no, they told me, it's broken; it is giving bogus download numbers. Sure enough, when I took a look, the numbers Urchin was reporting were way too high. The numbers being used, though, were pageviews, not file downloads. The reason is that my client uses an .htaccess rewrite to hide the true location of the files and offer a little bit of anti-leech protection.

My client had set the filenames in the rewrite rule to simple names without extensions. Here is what the .htaccess file looked like:

CODE:
  RewriteEngine on
  RewriteRule ^dlfiles/file_one$ /realfiles/subfolder/one.pdf [L]

So when someone entered http://mydomain.com/dlfiles/file_one into the browser, they would download the file http://mydomain.com/realfiles/subfolder/one.pdf
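To see what the rule does to a request path, here is a minimal Python sketch that mimics the pattern match with the standard re module. It only illustrates the regular expression and is not how mod_rewrite is implemented; the rewrite function is a made-up name for this example. In an .htaccess file the leading slash is stripped before the pattern is tested, which is why the pattern starts with dlfiles rather than /dlfiles.

```python
import re

# Pattern and target copied from the .htaccess rule above.
RULE = (re.compile(r"^dlfiles/file_one$"), "/realfiles/subfolder/one.pdf")


def rewrite(url_path):
    """Return the path a request would be served from (hypothetical helper)."""
    # In .htaccess context the leading slash is removed before matching.
    candidate = url_path.lstrip("/")
    pattern, target = RULE
    if pattern.search(candidate):
        return target
    return url_path  # no rule matched, so the path is left unchanged


print(rewrite("/dlfiles/file_one"))
print(rewrite("/dlfiles/other"))
```

The visitor only ever sees the /dlfiles/ address; the substitution happens inside the server, so the request line written to the log is the one the visitor typed.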

This is great and works as expected. But what if someone does not actually download the file after entering the URL, and then comes back a few seconds later, enters the URL again and downloads the file this time? Two entries would be made in the Apache logs, looking similar to this:

CODE:
  999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one HTTP/1.1" 206 123456 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
  999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one HTTP/1.1" 200 345678 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Urchin would read the logs and see that the file /dlfiles/file_one was touched twice. Since the file doesn't have an expected file extension--in this case .pdf--matching its list of files that are considered downloads, Urchin won't enter these as file downloads.
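Roughly, the check comes down to whether the requested path ends in an extension from a list of download types. Here is a minimal Python sketch of that idea. The extension list and the is_download helper are assumptions for illustration, not Urchin's actual configuration or code.

```python
import posixpath

# Assumed list of extensions treated as downloads; Urchin's real list may differ.
DOWNLOAD_EXTENSIONS = {".pdf", ".zip", ".exe", ".doc"}


def is_download(request_path):
    """Classify a request by its file extension (hypothetical helper)."""
    _, ext = posixpath.splitext(request_path)
    return ext.lower() in DOWNLOAD_EXTENSIONS


# A path with no extension falls through to the pageview bucket.
print(is_download("/dlfiles/file_one"))
# A path ending in .pdf is recognised as a download.
print(is_download("/dlfiles/file_one.pdf"))
```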

When my client would look at the log files, they would see the file as being looked at twice, which is not the same as being downloaded twice. To solve the problem, we simply changed the .htaccess file to look like this:

CODE:
  RewriteEngine on
  RewriteRule ^dlfiles/file_one\.pdf$ /realfiles/subfolder/one.pdf [L]

And the logs now look like this using the same scenario as above, where the end user doesn't download the file the first time, but comes back and gets it the second time:

CODE:
  999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one.pdf HTTP/1.1" 206 123456 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
  999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one.pdf HTTP/1.1" 200 345678 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

The big difference is that Urchin now sees the .pdf extension on the file name. And how does Urchin tell if the file was actually downloaded? Simple: see the 200 and 206 right after the request in each log line? That is the status code from Apache. 200 (OK) means the request succeeded and the whole file was sent; 206 (Partial Content) means only part of the file was sent, which I am treating here as an incomplete download.

Now when my client looks in Urchin, they go to the downloads section and see the proper number of downloads for each of their files.
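If you want to check the status codes yourself rather than rely on a report, the request path and status can be pulled out of each log line with a regular expression. The sketch below assumes log lines in the format shown above; parse_line is an illustrative name, not part of any tool mentioned here.

```python
import re

# Matches the quoted request line, then the status code and the byte count.
LOG_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return (path, status) for one log line, or None if it does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("status"))


# Sample entry copied from the log excerpt above.
sample = (
    '999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] '
    '"GET /dlfiles/file_one.pdf HTTP/1.1" 206 123456 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)
print(parse_line(sample))
```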

Below is a list of the HTTP status codes you may see in your Apache logs, in case you wish to explore them some more. The HTTP/1.1 specification (RFC 2616, section 10) has a complete description of each status code.

Successful Client Requests
200 OK
201 Created
202 Accepted
203 Non-Authoritative Information
204 No Content
205 Reset Content
206 Partial Content

Client Request Redirected
300 Multiple Choices
301 Moved Permanently
302 Moved Temporarily
303 See Other
304 Not Modified
305 Use Proxy
307 Temporary Redirect

Client Request Errors
400 Bad Request
401 Authorization Required
402 Payment Required (not used yet)
403 Forbidden
404 Not Found
405 Method Not Allowed
406 Not Acceptable (encoding)
407 Proxy Authentication Required
408 Request Timed Out
409 Conflict
410 Gone
411 Content Length Required
412 Precondition Failed
413 Request Entity Too Large
414 Request URI Too Long
415 Unsupported Media Type
416 Requested Range Not Satisfiable
417 Expectation Failed

Server Errors
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported
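
The first digit of a status code tells you which of the groups above it belongs to, so a log script can bucket responses without a full lookup table. A small Python sketch that reuses the group names from the list; the status_group helper is an illustrative name.

```python
# Group names taken from the headings in the list above, keyed by first digit.
GROUPS = {
    2: "Successful Client Requests",
    3: "Client Request Redirected",
    4: "Client Request Errors",
    5: "Server Errors",
}


def status_group(code):
    """Map a numeric status code to its group by its hundreds digit."""
    return GROUPS.get(code // 100, "Other")


print(status_group(206))
print(status_group(404))
```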



Reader Comments

You might want to take a second look at the interpretation of the 206 code. The Acrobat plug-in for web browsers makes partial file requests in order to speed up the initial page display. So the 206 entries come from the plug-in requesting the next “chunk”. The longer the PDF, the more chunks and the more log entries. This article (http://www.panalysis.com/pdf_counting.php) suggests ignoring 206 lines for a more accurate count.
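
Following that suggestion, a count that skips the 206 lines could look something like the Python sketch below. It assumes log lines in the format shown in the post and only counts GET requests for paths ending in .pdf. The count_full_downloads name and the shortened user-agent strings in the sample lines are made up for illustration, and this is not necessarily how Urchin counts.

```python
import re
from collections import Counter

# Captures the requested path and the status code from a log line.
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_full_downloads(lines, extension=".pdf"):
    """Count 200 responses per file, ignoring 206 partial responses."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a GET request line we recognise
        path = match.group("path")
        if match.group("status") == "200" and path.lower().endswith(extension):
            counts[path] += 1
    return counts


# Sample lines modelled on the post's log excerpt (user agent shortened).
log = [
    '999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one.pdf HTTP/1.1" 206 123456 "-" "Mozilla/4.0"',
    '999.123.46.12 - - [07/Mar/2007:09:31:20 -0800] "GET /dlfiles/file_one.pdf HTTP/1.1" 200 345678 "-" "Mozilla/4.0"',
]
print(count_full_downloads(log))
```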