THE XPEDIA ORANGE BLOG: 2012

Thursday, November 22, 2012

varchar vs nvarchar (and text, ntext, char, nchar)

Everyone who's ever created a table in SQL has probably at some point wanted to add a field to store text values and wondered about the difference between all the various options. These are numerous. The most commonly used are varchar, nvarchar, char, nchar, text, and ntext. You could also store text in binary, nbinary, or image columns. Actually you could store text in almost any data type if you're willing to put yourself through the hassle of very complicated conversions. Let's stick with the original six for now: varchar, nvarchar, char, nchar, text, ntext.

The n prefix denotes unicode (more about this later), so ignoring that, we're dealing with three data types here: varchar, char, and text. Varchar is by far the most commonly used and the rest of this article will focus on this type, but in short these three can be differentiated as follows:

char: Fixed length string. In other words, char(4) will always have four characters. Thus if you want to store "A" it will become " A" (A padded with 3 spaces). This is not used very often, but can be useful in certain scenarios.
text: This stores the data in a blob (i.e. not in the row data of the SQL table). This is useful when you expect to store huge text values as there's no limit on the size of the data. However, this type is being deprecated and is being replaced (sort of) by varchar(max).
varchar: Up to 8000 characters this stores text in the row data (similar to char) but not always of fixed length. In other words if you create a varchar(4) column and store a value "A" in it, this will use just one byte to store the A. AA will use two bytes, AAA will use 3 and so on. This implies that the length of various rows in the table will vary from one row to the next. varchar always uses an additional 2 bytes for storage, essentially for meta data so that SQL server knows how much data to expect in each specific instance. That means if you know you will always need exactly 4 characters, char is much more efficient in terms of storage requirements and processing overhead, since it doesn't need the additional 2 bytes of meta data nor any processing power to interpret that meta data. Having said that, in the majority of modern applications, this type of gain from using char instead of varchar is trivial. In the olden days when memory, processor cycles, and storage space were more limited, this was a serious consideration though. In previous versions of SQL Server, varchar could only go up to 8000 characters, but as of SQL 2005 varchar(max) will allow you to store any amount of data. This works like varchar described above up to the 8000 character limit and once data size goes over this it switches over to a blob to store the data and behaves more like the text data type (presumably this is also why the text data type is no longer necessary and being deprecated).

Now, on to varchar vs nvarchar. These two types are essentially the same, except that varchar is non-Unicode and nvarchar is Unicode. What does this mean?

Unicode is a standard for encoding text that was specifically designed to allow for storage of a larger number of different characters and character sets to cater for the requirements of different languages. The wikipedia article for Unicode describes this in more detail.

In SQL Server, this means nvarchar supports multiple languages. If your data is only going to be in English, this is not necessary. If there's any chance that your data will contain other language data, especially languages using non-English character sets, like Arabic, Russian, etc., etc. then you will need to use nvarchar. The cost of using nvarchar is that it almost doubles the amount of storage used to store the data as it uses two bytes for each character of data, where varchar uses only one byte per character. Both have the same overhead of 2 characters, so effectively varchar will use 3 bytes to store "A" and nvarchar will use 4 bytes for the same thing. To store "QWERTY", six characters of data, varchar will use 8 bytes (2+6) and nvarchar will use 14 bytes (2+6*2).

My recommendation: If you're absolutely 100% sure that you will not need non-English data, and you will not need it in the foreseeable future either, then use varchar. In all other cases, use nvarchar.

Monday, November 12, 2012

Software I can recommend

This list isn't meant to be exhaustive, but I wanted to highlight a few things I use regularly that I can highly recommend. I also use a lot of other stuff (like MS Office) that is really good, but most people already know about those, so these are the lesser known ones that I think is worth a try:

Pixlr: Great for quickly editing a picture or logo. Works like a charm, completely web-based, no install necessary. And it's free.
VIPRE: Vipre is my anti-virus of choice. Over the years I've tried various products including TrendMicro, AVG, Norton, and McAfee, but VIPRE stands out. I haven't had any issues with viruses since I started using it about 2 years ago. It's really lightweight and resource-friendly compared to some of the others, and I've been very impressed with their responsive support, especially on chat and e-mail.
Xara: Not one of the mainstream design software names, but very good and competitively priced. I haven't used it very extensively, but from what I've seen it does the job and comes in much cheaper than it's main competitors. I've also received good support from them on something that they really didn't need to help with.
Google apps: Perhaps a bit too mainstream for this list, but it's so good that I really have to mention it. Xpedia started using Google apps a few years ago and I cannot recommend it highly enough. I use forms, spreadsheets, and sites regularly and I simply don't think there is an easier/cheaper way of doing what you can do with these tools. The mail is fantastic too. A lot of our users still use Outlook, but I don't recommend this because the Gmail web interface is just so good that it isn't necessary to have a mail client anymore. While Word/Excel are more feature rich than the comparative tools in Google apps, the ease of collaboration and sharing in Google apps counters this nicely. I use a combination of both, so it depends on what you need to achieve, but I think a lot of organizations can get great value from Google apps. The aforementioned tools are just a fraction of what you get when you sign up for Google apps though, so well worth checking out. And hey, if you're small enough, it's free!
Spybot S&D: Spybot is a great tool for keeping your PC free of malware and spyware. I've been using it for years and it really is a great project. Easy to use and works well with most AV products.
Mozy: I've been using Mozy's backup service for years and the backup part has always impressed me. It's easy to use and set up and doesn't interfere with your daily work (i.e. not a resource hog). Then I had a crash recently and had to "test" restoring. This could not have been easier. What a pleasure. All my backups, 100% up to date, available easily and immediately.
Red-Gate: More specifically Red-Gate's SQL tools. The ones I use often are Compare and Data Compare, but every one of their tools that I've tried has been simply fantastic. Easy to use software that just works. For people in the MS SQL world this is probably also considered mainstream, but in case you haven't tried it, do yourself a favour.
WinRar. Yes, I'm one of those people who actually bought a compression agent! WinRar is really good though. Supports a huge list of compression file formats, works quickly and well.

Monday, November 5, 2012

The cloud: Hype or real?

The ever-increasing hype about the cloud is getting to me. My personal experience is that a lot of people in our industry love throwing around the buzzword, but simply have no idea what it means. To dispel some myths, I recently (ok it was months ago, but I'm only getting around to blogging about it now...), did some research and gave an internal presentation at one of our company functions.

I've made the slideshow accompanying my presentation available for download here. Some of the slides are very specific to our business and these were designed to accompany my talk so it isn't intended as a comprehensive narrative in itself, but hopefully something that can be of use to readers anyway.

The highlights:

The term "cloud" actually comes from network diagrams that used clouds to represent portions of the network that did not need to be known for purposes of the diagram. In this context "cloud" means something like "Unknown magic happens here".
In most modern contexts "the cloud" typically means the Internet or some subset thereof.
A lot of people seem to refer to cloud when they actually mean cloud computing. This can be defined as data, services, or applications hosted on anonymous Internet-based infrastructure. Peder Ulander of Citrix gives a more comprehensive and pretty good description in this video: http://www.youtube.com/watch?v=C4krCK1PZHY
Cloud computing can be categorized by service type as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Software as a Service (SaaS).
While PaaS (e.g. Azure, Amazon Web Services, etc.) is fairly new, IaaS and SaaS are as old as the Internet itself. IaaS is really nothing but hosting services and web hosting companies have been doing this for years. It has changed and evolved over time though and there is a lot of new stuff out there in this space, but it's still the same concept. Earliest SaaS examples are probably web mail services like hotmail. This field is growing very quickly with ever more advanced apps. Good examples of business applications in this area are Salesforce.com, Netsuite, and even our very own NuQ HR/Payroll system (see www.nuq.co.za).
Cloud computing has advantages and disadvantages.

For me, the biggest advantage is not having to worry about infrastructure. Your application can scale with your business without you even having to think about it, much less buy dedicated hardware and software to support it. This is based purely on my personal experience and probably more relevant to SaaS than IaaS or PaaS.
The biggest disadvantage (again only in my personal experience) is that support is often impersonal. Services are designed to scale well and that means personal service is usually not a priority. When everything works well this isn't an issue, but when you do have a problem that really affects your business this can be extremely frustrating.

In conclusion, the cloud is just hype, but it's also real.

It's hype, because it's existed for far longer than the cliched terminology that the buzzword brigade so love to use today. There's nothing new about the concept and it's simply an ongoing evolution of services available on the Internet. It's also very real, because this evolution has reached a point where the services on offer are becoming rapidly more useful in our day-to-day business and personal lives.

In my experience when you meet someone who goes on about the wonders of the cloud, this is a good indication that they've heard the hype, but don't understand what it is. In most cases you can replace "cloud" with "Internet" in their every sentence.

Wednesday, May 2, 2012

TRANSACT-SQL: JOIN...ON condition or WHERE condition - which is better?

A colleague suggested I write about this topic as she often wonders about it, but doesn't really understand the implications and has never had a chance to test it.

Basically, the question is whether or not it's better to add a condition in a query in the WHERE clause at the end of the query or to make it part of an ON clause in the JOIN when writing queries in Transact-SQL.

The very short answer is: It depends.

The short answer is:

INNER JOIN: It really doesn't matter and is a case of personal preference more than anything else.
OUTER JOIN (LEFT, RIGHT, or FULL): There's a logical difference that in some cases will cause result sets to differ, so it depends on what you're trying to achieve with your query.
CROSS JOIN: Has no ON clause and therefore isn't relevant for this topic.

Now comes the long answer.

Let's look at INNER JOIN first.

By means of example, say you have two tables called Manufacturer and Product respectively. You want to run a query that returns a list of all products with the manufacturer's name next to it. This is fairly easy to achieve like so:
SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID

Now, suppose that you want to add a condition to this. For instance, you only want a list of products for Manufacturers from a specific country, which you specify in a string literal. This could be done in one of two ways.
Option 1, adding a WHERE clause:
SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID WHERE Manufacturer.Country = 'Country1'

or Option 2, adding a condition to the JOIN clause:

SELECT Product.Code, Product.Name, Manufacturer.Name FROM Product INNER JOIN Manufacturer ON Product.ManufacturerID = Manufacturer.ID AND Manufacturer.Country = 'Country1'

The difference is subtle and the result set will be exactly the same. So which is better?

The answer is neither, or both. At least in terms of performance there is no difference between these queries. If you look at the query execution plan for these two queries, you will find that they are exactly the same. Basically SQL server will try to optimize the query regardless of where you put the condition.

To prove the point, I created some random data in the Product and Manufacturer tables, using 500 manufacturers, 150 countries and 1000000 products and then ran the query for one country using both methods. I did this twenty times for each query and the results were very closely matched. This wasn't a very well controlled test environment (I'm simply running it on my laptop with lots of other background services busy all the time), so saw a fair bit of variance and initially it looked like the option 2 might be faster, but over time the two averages converged. Most significantly the fastest times for each option were identical down to the last millisecond (i.e. on the odd occasion that nothing else was consuming processing power on my laptop, they were really identical).

Personally, my preference is generally to put these kinds of conditions in the ON clause of the join. This makes more logical sense to me as it feels like I'm limiting the result set earlier in the logical process and therefore I find this easier to read, understand and maintain. However, in practice it actually makes no difference, so it really is personal preference. There might be a good argument for putting all your conditions in one place (the WHERE clause) so you can easily see all the limitations that the code is putting on the result set even when dealing with multiple joins.

Now lets look at OUTER JOINS. I'm not going to go into all three types of outer join as for purposes of this article they will all behave similarly, so let's stick to the one that is most frequently used, LEFT JOIN.

With a left join, the two options are logically quite different. Looking at the above two queries, a left join will deliver vastly different results in the two respective options.

Option 1 will produce the exact same result set as the inner join does. This is partly because the Product table is first in the query and the foreign key constraint on Product.ManufacturerID and the unique key on Manufacturer.ID ensures that there will be no NULL countries nor multiple rows for a single product (i.e. every Product row will successfully be matched to a unique row in Manufacturer) and partly because the WHERE clause narrows the selection down to a specific country only.

However, option 2 will produce a completely different result. In this instance, the result set will be a complete list of products. Rows where the country condition is met will appear in the result set with a manufacturer name, but rows that do not meet this condition will still appear in the list, only with a NULL value instead of an actual manufacturer name.

In conclusion, for LEFT JOIN (and the same goes for RIGHT and FULL OUTER JOIN), the decision rests on what you're trying to achieve. Whether you want rows not meeting the condition to appear in the result set with NULL values for the secondary table or whether you prefer if these values are not there.

Note: Testing for this article was done on SQL Server 2008 R2 developer edition.

Monday, January 16, 2012

SSRS no longer using IIS

For those who haven't discovered this yet: Since SQL Server 2008, Microsoft decided that the reporting services web sites should no longer run in IIS.

This may seem a strange decision, but there is some logic to it. For one thing, you don't need to install IIS in order to run reporting services. I suppose it also prevents guys like me from fiddling with IIS settings that could break the report server, but now I've lost an aspect of control, which is rather frightening. And I don't think I'm a control freak.

It is what it is though and I cannot change that. So after a bit of research, I uncovered that this is done by SSRS using http.sys directly. Which begs the obvious question of whether this will cause a port conflict (if one assumes most sites and report server installs will both use the default port 80).

But alas it seems not. This technet article explains how http.sys handles these apparent port conflicts. It seems the only time there might be an issue is if you choose to install another virtual directory using the same name as the report server. I cannot think of any scenario where you would have a good and unavoidable reason to want to do this, so that should not be much of a problem.

Also see this post by Andrew Fryer highlighting some of the benefits of doing it this way. Ultimately it makes good sense and I don't imagine MS would change course any time soon.

Editorial note: This article was written some time ago for a blog that I never really got into very much, so just transferring it here to get all my content in one place. I know it's old news!