The SharePoint content repository: It's just a database
SharePoint's lifeline is the content repository, where all the content is housed. Learn how the database works as well as its challenges.
TABLE OF CONTENTS
SharePoint vision | Content repository | Content dilemmas |
Storage deep dive | Business goals
SharePoint’s repository -- where all its content lives, is indexed, and is version-controlled -- isn’t some special data construct. It’s just a database -- a SQL Server database, to be specific. Modern versions of SharePoint are well-tuned to support content databases in the multi-terabyte range, meaning SharePoint should be able to handle whatever you throw at it in terms of storage. In fact, the main reason that companies split their SharePoint content across multiple databases is to reduce things like backup and recovery time, not because SharePoint can’t scale to handle their content storage needs.
So let’s just be clear on one point: From a technical perspective, SharePoint can handle a lot of data. Probably in the tens of terabytes. Database size limitations are not the problem with the SharePoint content repository. But let’s step deeper inside for a moment, and look at how that database works.
SQL Server stores data on disk in 8KB chunks called pages. When SQL Server needs to change a single byte of data, it loads -- at a minimum -- 8KB of data off disk, makes the change in memory, then writes that 8KB page back to disk. Normally, SQL Server has to store an entire table row within a single page, meaning a single row of data can’t exceed that 8KB limit (the actual number is slightly smaller, since each page has a small amount of overhead for management data).
However, SQL Server does allow a row of data to contain a pointer to larger pieces of data, which can then be spread across multiple pages. Figure 1.3 illustrates this storage mechanism, with a single row of data on the first page, containing a pointer to several sequential pages that contain a large string of data -- perhaps a photo, a Word document, or some other large piece of information.
Figure 1.3: Storing data on pages in SQL Server.
SQL Server refers to these large objects, which are stored as binary data, as binary large objects (BLOBs) -- surely one of the most charming acronyms in IT!
Note: We’ll dive deeper into SQL Server’s storage mechanisms, and some of the other subtle problems that BLOBs can create, later in this excerpt.
It turns out, however, that SQL Server isn’t as amazing with BLOBs as it is with the smaller pieces of data it normally deals with. Streaming a BLOB into the database, or reading it out of the database, simply isn’t what SQL Server is best at. That’s not to suggest SQL Server’s performance is horrible or anything, but even Microsoft has spent years trying to come up with alternative ways of storing the information that are faster and more efficient. In SQL Server 2008, for example, Microsoft added the FILESTREAM data type to SQL Server, which allows BLOBs to be stored as simple files on the file system, with a pointer inside the database. The idea is that Windows’ file system excels at reading and writing large files, so why not let it do that? Of course, with some of the data living outside the actual database, tasks like replication, backup, and recovery can become more complicated, but the upside is increased performance.
A deeper problem with large SharePoint databases -- whether they’re full of BLOBs or not -- is that they simply take up a lot of room on disk, and data center-quality disk storage still isn’t cheap. You might be able to buy a dozen 1TB desktop-class hard drives for $1800, but just the cabinet for a 12-bay storage area network (SAN) can run $14,000 -- a fully-populated 12TB SAN can run to $30,000. So, although SharePoint’s database might not flinch at storing that much data, doing so can cost a lot. Plus, you’re going to have to find a way to back it all up, and be able to recover it quickly in the event of a disaster or failure.
A more subtle challenge with SharePoint storage is when you start enabling version control. Every time someone modifies a SharePoint-based file, you’re creating a new version of that file -- and the old version remains in the database. So the database can get quite large, quite quickly. SharePoint also needs database storage to index the file so that it can quickly locate files based on keyword searches by users. We want those features -- it would just be nice if we could find a way to have them take up a bit less space.
The idea, then, is to identify the specific problems associated with specific types of “problem content,” and to find ways to address those problems while still meeting the SharePoint vision of “everything in one place.” The general phrase for what we’re trying to do is SharePoint storage optimization, meaning we’re seeking to optimize our use of SharePoint storage to reduce our storage costs, while still maintaining a fully-functional SharePoint infrastructure that offers all the benefits that SharePoint offers.
TABLE OF CONTENTS
SharePoint vision | Content repository | Content dilemmas |
Storage deep dive | Business goals
This chapter is an excerpt from the book, Intelligently Reducing SharePoint Costs Through Storage Optimization, authored by Don Jones, and published by Realtime Publishers, November 2010, ISBN 978-1-935581-25-3, Copyright 2010 by Realtime Publishers. Download the complete book for free at Realtime Nexus Digital Library. |