Monday, August 31, 2009

Oprofile for IO bound apps

I am using OProfile to look into the performance of a Mysql storage engine. Reading further into the OProfile docs, I find:

"OProfile is oriented to finding problems with CPU-limited processes. OProfile does not identify processes that are asleep because they are waiting on locks or for some other event to occur (for example an I/O device to finish an operation)."

That is kind of a drag, since the biggest bottlenecks I expect to see are IO and waiting on locks.

From my experience with gprof I seem to remember the same limitation: it is primarily a CPU profiler.

Does anyone know of a good off-the-shelf tool that can integrate CPU profiling with IO profiling? I don't even expect to find anything that can also account for time spent waiting on locks, since that is not generic functionality.

Thursday, April 23, 2009

Highly Available Storage (without high prices)

One of the most interesting themes I have been paying attention to at this year's Mysql users conference is techniques for creating highly available storage volumes without spending a million dollars on SAN or NAS infrastructure from companies like EMC, Network Appliance, or IBM.

At least 3 options exist that I was not aware of before:

Amazon Elastic Block Store: as part of Amazon's EC2 web services you can have a virtual block-level device available from your EC2 instance. Using this block-level device you can either mount a typical linux filesystem and access the device with standard file-access system calls, or you can even do raw IO against the device without a filesystem. The data is stored in Amazon's cloud and is thus relatively highly available. As with all Amazon services you only pay for what you use. I was quoted performance numbers around 100 MB/sec, which seems quite reasonable. You can only mount the storage on one instance at a time for the moment, but you could set up NFS between instances if you really wanted to.

Rackspace Virtualized Storage: I talked briefly with some guys from Rackspace and they said they have a service backed by a Network Appliance NAS farm that allows hosting clients to access NetApp volumes on a rental basis. This sounds pretty cool in that you can have NetApp storage space without actually buying the hardware. NetApp filers are usually highly available, so you don't have to worry, as you do with commodity linux boxes, that the storage may go down at any time. However, when I went to the Rackspace site I couldn't really find the exact offering they were talking about, so this option needs some more research.

DRBD: DRBD seems to be a fairly popular product that I had not heard about until now. It allows you to have a volume on one machine that appears to be a standard filesystem volume but is actually replicated by DRBD to another machine. There seem to be a few modes: one allows your fsync calls to block until the data is flushed to both the local disk and the remote disk; another blocks only until the data is in memory on the other machine but not yet flushed to its disk; and so on. Choosing a mode, and finding out the exact characteristics of write and fsync under each mode and the relative performance of these combinations, will be important details (hopefully with no devil in there). At their booth they were quoting numbers very similar to the IO throughput you would get on a commodity box, around 100 MB/sec, but again this all depends on your config.

More info is available in these links:
http://aws.amazon.com/ec2/#functionality
http://www.mosso.com/?CMP=rackspacehome
http://www.drbd.org/

Friday, March 13, 2009

Auto Increment Stability

Over the last few years I have come to realize that auto-increment is a difficult problem for RDBMS vendors to solve. When I was working with Sybase, the company I was with had a rule against using auto-increment for stability reasons.

Mysql storage engines seem to be a bit troublesome with auto increment as well. I believe that auto increment in Innodb was implemented using some global variables and mutexes, an approach that was only recently patched.

For some fun with auto increment, try executing the seemingly simple SQL on Innodb listed at the end of this article, which gives what appear to me to be unintuitive results.

The message here is not that the implementers of auto-increment have not done a good job, but that creating this functionality is actually really difficult. Personally I would recommend against using auto-increment in large-scale complex applications, unless you really know what you are doing and accept the limitations that exist.

INTERESTING QUERY

drop table if exists t1;
CREATE TABLE t1
( id integer auto_increment primary key,
k integer NOT NULL,
INDEX k(k)
) engine = innodb;

insert into t1 (k) values (0);
insert into t1 (k) values (1);
insert into t1 (k) values (2);
insert into t1 (k) values (3);
insert into t1 (k) values (4);
insert into t1 (k) values (5);
insert into t1 (k) values (6);
insert into t1 (k) values (7);
insert into t1 (k) values (8);
insert into t1 (k) values (9);
insert into t1 (k) select k from t1;
insert into t1 (k) select k from t1;
select * from t1 order by id;

--------------------------------------------------------------
Here are the results I get. Notice the gap in the id values: 21 through 25 are skipped, presumably because Innodb reserves auto-increment values in blocks for the multi-row insert ... select statements and discards the values it does not use:

+----+---+
| id | k |
+----+---+
|  1 | 0 |
|  2 | 1 |
|  3 | 2 |
|  4 | 3 |
|  5 | 4 |
|  6 | 5 |
|  7 | 6 |
|  8 | 7 |
|  9 | 8 |
| 10 | 9 |
| 11 | 0 |
| 12 | 1 |
| 13 | 2 |
| 14 | 3 |
| 15 | 4 |
| 16 | 5 |
| 17 | 6 |
| 18 | 7 |
| 19 | 8 |
| 20 | 9 |
| 26 | 0 |
| 27 | 0 |
| 28 | 1 |
| 29 | 1 |
| 30 | 2 |
| 31 | 2 |
| 32 | 3 |
| 33 | 3 |
| 34 | 4 |
| 35 | 4 |
| 36 | 5 |
| 37 | 5 |
| 38 | 6 |
| 39 | 6 |
| 40 | 7 |
| 41 | 7 |
| 42 | 8 |
| 43 | 8 |
| 44 | 9 |
| 45 | 9 |
+----+---+

Thursday, February 19, 2009

Understanding UNIV_LIKELY and other Optimization Macros in the Innodb Codebase

When you first start browsing the code for Innodb (a Mysql storage engine) you will notice several strange macros like UNIV_LIKELY popping up all over the place. It's good to get clear on what they mean so they don't distract you from understanding the actual logic of the code you are reading.

The macros described below all wrap GCC builtin functions used for performance optimization. On platforms that do not support these builtins, the macros compile down to no-ops: the expect macros simply evaluate their expression, and the prefetch macros do nothing. The macros fall into two categories, the expect macros and the prefetch macros, which I will describe separately:


EXPECT MACROS

These macros take expressions and return exactly the same value the expression would return without the macro. From the user's point of view, they are the same as enclosing the expression in an extra set of parentheses. However, the macros provide hints to the optimizer about the expected value of the expression, which lets it optimize branches.

UNIV_EXPECT(expr, constant)
This is a wrapper around the GNU builtin function '__builtin_expect'. The macro is intended to be used within an if condition or other branching statement to tell the compiler the expected outcome of the expression. The compiler then uses that information to optimize the generated code for that outcome. The arguments to UNIV_EXPECT are an expression that will be evaluated and the expected outcome, which must be a compile-time constant integer. Here is an example of how to use this macro:
if (UNIV_EXPECT(retCode, 0)) {
    fprintf(stderr, "failure, got a non-zero return code\n");
}

This evaluates the same as plain if (retCode), but the optimizer will understand that retCode is most likely to be 0.

UNIV_LIKELY(cond)
The macro returns the value of the condition and hints to the compiler that the condition is likely to be TRUE (1).

UNIV_UNLIKELY(cond)
The macro returns the value of the condition and hints to the compiler that the condition is likely to be FALSE (0).

UNIV_LIKELY_NULL(ptr)
The macro returns the value of the pointer and hints to the compiler that the pointer is likely to be NULL.
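
To make the expect macros concrete, here is a minimal sketch of how they can be defined on top of '__builtin_expect'. This illustrates the technique and is not necessarily the exact definitions found in the Innodb headers:

#ifdef __GNUC__
#define UNIV_EXPECT(expr, constant)  __builtin_expect((expr), (constant))
#else
#define UNIV_EXPECT(expr, constant)  (expr)  /* just evaluate the expression elsewhere */
#endif

#define UNIV_LIKELY(cond)      UNIV_EXPECT((cond), 1)                 /* condition is usually true  */
#define UNIV_UNLIKELY(cond)    UNIV_EXPECT((cond), 0)                 /* condition is usually false */
#define UNIV_LIKELY_NULL(ptr)  UNIV_EXPECT((unsigned long) (ptr), 0)  /* pointer is usually NULL    */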



Prefetch macros

There are two prefetch macros in Innodb, both wrappers around GCC's '__builtin_prefetch'. By telling the compiler that you will be reading or writing a given address in the near future, the compiler/hardware can try to prefetch the data at that address into cache, making the actual memory accesses faster.

UNIV_PREFETCH_R(addr)
Give a hint to the compiler that the addr specified will soon be accessed for reading.

UNIV_PREFETCH_RW(addr)
Give a hint to the compiler that the addr specified will soon be accessed for writing.
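
As a rough sketch (again, not necessarily the exact Innodb definitions), these could plausibly be defined as follows; the second argument of '__builtin_prefetch' selects read (0) or write (1) access:

#if defined(__GNUC__)
#define UNIV_PREFETCH_R(addr)   __builtin_prefetch((addr), 0)  /* prefetch for reading */
#define UNIV_PREFETCH_RW(addr)  __builtin_prefetch((addr), 1)  /* prefetch for writing */
#else
#define UNIV_PREFETCH_R(addr)   ((void) 0)  /* no-op where unsupported */
#define UNIV_PREFETCH_RW(addr)  ((void) 0)
#endif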



More details on the inner workings of these Macros can be found in the GCC documentation, Builtin Functions section. Just look for info about '__builtin_expect' and '__builtin_prefetch'.

Now that you will not be distracted by these macros anymore, Happy Browsing,

Ivan Novick

Wednesday, February 4, 2009

Understanding the InnoDB Source Code Structure

Innodb source code is organized into directories, where each directory holds the C source code files for a given module. Within each directory are one or more files that are part of the module. The file names follow a structure which I will describe below.

The first part of the file name indicates the module name. The module name is followed by a '0' character, which is a separator. The second part of the file name represents the sub-module. Most modules have one file where the sub-module has the same name as the module; this file is the primary file of the module.

For example, the main file for the Btree is located in the btr directory and named: btr0btr.c
The first btr indicates the module name, the 0 is a separator, and the second btr is the sub-module name, which is the same as the module name for the primary file.

The file that handles the Btree cursor is also located in the btr directory and named: btr0cur.c
The first btr is the module name, the 0 is a separator, and the cur is the sub-module name, which stands for cursor in this case.

There is also an include directory which contains all the header files. The header files are named in a similar way.

For example, the file 'btr0cur.h' is the header file for the Btree cursor sub-module. Note that the include directory contains both '.h' files and '.ic' files, the latter being included C files that hold inline code.
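
To make the convention concrete, here is roughly how the btr module's files map onto its sub-modules (Btree cursor, persistent cursor, adaptive search); paths are relative to the Innodb source tree:

btr/btr0btr.c      -- primary Btree code
btr/btr0cur.c      -- Btree cursor
btr/btr0pcur.c     -- Btree persistent cursor
btr/btr0sea.c      -- Btree adaptive search
include/btr0btr.h  -- header for the primary Btree file
include/btr0btr.ic -- included C file with inline functions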

Now that we understand the naming conventions, I am going to give you a list of all the modules in the Innodb source code base and a description of what they are for. The list below is sorted by lines of code in the module, with the biggest modules coming first.

INNODB MODULES

row
Row Abstraction, 19,768 lines
sub-modules: Updates, Undo, Undo Modify, Undo Insert, Select, Purge, Mysql Interface
The logic for the mysql row formatting and the innodb row formatting is quite lengthy. This module also seems to have a lot of the high-level business logic for Innodb.

trx
Transactions, 13,138 lines
sub-modules: Rollback, Rollback Segment, Undo, Log Records, Purge
As Innodb is a transactional storage engine there is a lot of logic to implement this

btr
Btree data structure, 12,228 lines
sub-modules: Btree Cursor, Btree Persistent Cursor, Btree Adaptive Search
Btree is the index of Innodb and is core functionality

pars
SQL parser, 11,691 lines
sub-modules: Symbol Table, Optimizer, Lexical Analyzer, Grammar
I am pretty sure you can ignore this directory if you are using innodb with mysql, as it is dead code in that case (please do correct me if I am wrong)

dict
Data Dictionary (meta-data), 10,446 lines
sub-modules: Boot, Creation, Load, Memory
Table names, column names, key names, etc. all in this code

handler
Mysql Storage Engine Interface, 8498 lines
This is the primary interface between Mysql and the innodb storage engine and the entry point for all mysql API calls.

log
Database Log, 8379 lines
sub-modules: Recovery
Database logging is core functionality

buf
Buffer Pool, 7784 lines
sub-modules: Buffer Flush Algorithm, Buffer Replacement Algorithm, Buffer Read Abstraction

os
Operating System Interface, 7659 lines
sub-modules: Files, Processes, Threads, Synchronization
This is the fun stuff, all the low level OS specific code

lock
Transaction Lock System, 6224 lines
sub-modules: Lock Queue Iterator

page
Index Page, 5675 lines
sub-modules: Page Cursor

srv
Main Server Driver, 5469 lines
sub-modules: Startup, Query Execution
Look here for configuration option handling coding and other startup issues

sync
Synchronization, 5361 lines
sub-modules: ReadWrite Lock, Wait Array

fil
Table Space Memory Cache, 5282 lines

rem
Records Abstraction, 4965 lines
sub-modules: Record Manager, Record Comparison Service

fsp
File Space Management, 4405 lines

ibuf
Insert Buffer, 4125 lines

ut
Utilities, 4113 lines
sub-modules: Vector, Random Numbers, Memory, List, Debug, Byte Manipulation, Work Queue

mem
Memory Management, 3598 lines
sub-modules: Memory Debug, Memory Pool

data
Data Element Abstraction, 2867 lines
sub-modules: Data Types

que
Query Graph, 2255 lines

mtr
Mini-transaction Buffer, 1967 lines
sub-modules: Mini-transaction Log

eval
SQL evaluator, 1603 lines
sub-modules: Stored Procedures

ha
Hash table, 1422 lines

mach
Machine Dependent Utilities, 1198 lines

fut
File Based Utilities, 951 lines
sub-modules: File Based List

read
Cursor Read, 788 lines

dyn
Dynamically Allocated Array, 560 lines

thr
Threads, 302 lines
sub-modules: Thread Local Storage

usr
Sessions, 163 lines

Monday, February 2, 2009

Portable condition variables in Mysql codebase

How can you write good multi-threaded code without using condition variables?

On most flavors of unix you have the pthread API, which gives you pthread_cond_t structures and the related functions. However, on all versions of Windows before Windows Vista and Windows Server 2008, the Windows APIs did not come with any kind of condition variable. This makes it really hard to write multi-threaded code that works on both Windows and Unix systems, especially if you want the code to behave in a similar way on all platforms.
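
For reference, here is the standard pthread condition variable pattern that any Windows emulation has to reproduce. This is a generic sketch of the API's intended use, not code from the Mysql source:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

/* Waiting side: always re-check the predicate in a loop,
   because spurious wakeups are allowed. */
void wait_until_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)
        pthread_cond_wait(&cond, &lock); /* atomically releases the mutex and sleeps */
    pthread_mutex_unlock(&lock);
}

/* Signaling side: change the predicate under the mutex, then wake a waiter. */
void mark_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond); /* or pthread_cond_broadcast to wake all waiters */
    pthread_mutex_unlock(&lock);
}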

Well, Mysql developers solved this issue by writing their own version of pthread_cond_t and the related functions, which can be found in the codebase in the file my_wincond.c. The code is enabled via #ifdef on Windows platforms; otherwise the standard pthread API is used. The implementation of condition variables in the mysql codebase uses one CRITICAL_SECTION variable and three arrays of Windows Event objects. The code is short but hard to truly understand. Please do take a look at this code.

There have been several other attempts to create Windows versions of condition variables, which you can find by searching the web. Almost always these solutions come with notes about how they are not perfect.

Thankfully, Microsoft has released condition variables as part of their latest OS versions, supported as part of the OS and system APIs. I don't know if Mysql developers are going to add #ifdefs to use these new APIs when available, but I sure think it would be a good idea.

Ivan

Thursday, January 29, 2009

NULL fields in Mysql row format

As a storage engine developer you need to take rows of data given to you by Mysql and interpret their meaning. You also need to take data in your storage engine's format and convert it to Mysql format before returning it to the server. One important aspect of this is determining which fields have NULL values. Here is how NULL values work in the Mysql internal row format:

There is a table object that is part of the storage engine handler class:

struct st_table *table;

This table object contains a field member, which is an array of fields in the table:

Field **field;

Any individual field within a table has the following members:

uchar* null_ptr;
uchar null_bit;

The null_ptr member points to the byte in the mysql row where the NULL bit mask for this field is located.

The null_bit is a bit mask for a single byte that can be used to set or test the NULL value for this field. So for example if you want to test the value of a specific field to see if it is NULL, you do it like this:

if (*table->field[0]->null_ptr & table->field[0]->null_bit) {
    // field 0 is null
}

If you want to set a field to be null you can do it like this:

*table->field[0]->null_ptr |= table->field[0]->null_bit;
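
Putting this together, here is a small sketch that walks every field of a row and reports which ones are NULL. It assumes the 5.1-era handler API (public null_ptr and null_bit members, table->s->fields as the field count, field_name as a plain C string) and that the usual storage engine headers are already included:

void print_null_fields(TABLE *table)
{
    for (uint i = 0; i < table->s->fields; i++) {
        Field *f = table->field[i];

        /* Fields declared NOT NULL have no NULL bit, so null_ptr is 0 for them. */
        if (f->null_ptr && (*f->null_ptr & f->null_bit))
            fprintf(stderr, "field %u (%s) is NULL\n", i, f->field_name);
    }
}

In practice most engine code calls the convenience method field->is_null() instead of touching the bits directly, but the bit-level view above is essentially what that method does under the hood.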

Hope this helps.

Ivan Novick