Home MySQL Blogs

My MySQL tipsvalid-rss-rogers

MySQL 2012 Percona conference day 2 part 1 PDF Print E-mail
Written by Marco Tusa   
Monday, 16 April 2012 21:05

Using and benchmarking Galera in different architectures


What I was interested most during the second day was again, synchronous replication and Replication solutions provide from Continuent.

The first I attend in the day was the Galera one, done Henrik and Alexey.

The presentation was going to talk about:

"We will present results from benchmarking a MySQL Galera cluster under various workloads and also compare them to how other MySQL high-availability approaches perform. We will also go through the different ways you can setup Galera, some of its architectures are unique among MySQL clustering solutions.

* MySQL Galera

** Synchronous multi-master clustering, what does it mean?

** Load balancing and other options

** WAN replication

** How split brain is handled

** How split brain is handled in WAN replication

* How does it perform?

** In memory workload

** Scale-out for writes - how is it possible?

** Disk bound workload

** WAN replication

** Parallel slave threads

** Allowing slave to replicate (commit) out-of-order


I know how passionate is Henrik when talking about Galera and I partially shares the feeling. I said partially because I am still not fully convinced about the numbers, but I am working on that.

Anyhow, a side the part related to bench marching, I have found interesting the combination of blocks and element for the HA solution.

Including redundant load balancer and use MySQL JDBC with Galera is a simple but efficient way to provide HA.

Also important we can finally drop the DRBD solution that has being for too long the only syncronous solution for MySQL. DRBD was forcing to have one PRIMARY (RW) node, and one SECONDARY completely useless.

I have also appreciated the honesty from Alexey about scaling.

Galera will not scale to infinite as some foul state, but it could have a decent number of nodes.

Now the limit of it is obviously to discover and calibrate against the real load pushed against the nodes, it cannot be define as an absolute abstract limit.

Interesting also how the Galera team is managing by "quorum" the server synchronization. In short having 3 nodes if one will not be able to access the other two but still get writes (split brain), at the moment of re-union, the other two will take over by "quorum".

Obvious and immediate problem is in case of having 3 datacenter one with 6 nodes and the others with 2 nodes each. If the DC with 6 nodes gets disconnected, the valid data will be in the 2 remaining data centre, but at the moment of reunion, the DC with 6 nodes will take the majority, and all data set will become invalid.

Alexey is working on a way to calculate the "weight" by proximity to fix this issue.

Honestly, I am not sure that Galera is production ready, but is for sure the most interesting and easy solution for simple write scale.

Reference for Galera at http://codership.com/

Last Updated on Sunday, 18 August 2013 17:30
My day at MySQL conference 2012 in Santa Clara day 1. PDF Print E-mail
Written by Marco Tusa   
Thursday, 12 April 2012 10:19

Key Notes

It was simply amazing for me as ex-MySQL AB to be at the conference today.
I was really emoted seeing so much people most of the ones I know, all together again.
The spirit was again the right one, with the will to say WE ARE HERE!
Impressive, and I am happy to say once more, “I was there”.
I am not going to comment the keynote speeches, but want to share the Baron message.
We are here to share, and help each other to make better, help each other to go beyond our current limit.
The spirit was the right one, the people there probably the most smart in the field, so why not.
I have only one world AMAZING.

Topic: Measuring Scalability and Performance With TCP

What if you had all the data you needed to measure system performance
and scalability at any tier, discover performance and stability
problems before they happen, and plan for capacity and performance by
modeling the system's behavior at greater load than you currently have?
Now it is as easy as running tcpdump and processing the result with a
tool. In this two-part talk you will first learn how to do black-box
performance analysis to discover hidden problems in your systems. In
the second part you will learn about mathematical performance and
scalability models, how the inputs can be computed from TCP packet
headers, and how to derive and interpret the results with free tools
from Percona Toolkit.



Good Talk, as we are now used to get from Baron, also if the topic was touched on the Percona Live event, Baron had review and refine the slides, which are now much more clear.
The proposed method for the quick analysis of the performance using TCP dump is simple and efficient.
Honesty we do use it already but Baron add the scientific notations that makes an empiric measurement more objective.
Specially in regards to the immediate issue identification and the concurrency calculation. On this specific topic I still need to digest/elaborate.
Like the formula for the concurrency:

GOOD reading:
Neil J. Gunther's book
? Guerrilla Capacity Planning

Topic: Hibernate and Connector/J Tuning

Many Java developers using MySQL as a data backend rely on Hibernate to bridge their OO designs with the relational database world.
This talk will review Hibernate and some of it's related projects, with a focus on performance.
We will also cover performance related considerations about Connector/J, discussing settings and usage scenarios that will be useful even for Java developers not using Hibernate.


I did attend this talk, hoping in something more and less.
More focus on Hibernate possible problems, that we find every day because customers have no idea how to use Hibernate.
Less because it was going too much in details of few Select, and was too fast in describing the solutions.
Anyhow, given my huge background in programming, I was not really enlighten by the information, and was able to follow the flow the information, which are base on good sense in using the standard feature and definitions in Hibernate, regarding the Lazy load of the collections, and the way SELECT … JOIN(s) needs to be done.
Finally a good review of what the MySQL JDBC can really do, which is not common given the most of the user just

Topic:Build simple and complex replication clusters with Tungsten Replicator

Tired of the intricacies of circular replication? Dreaming of real multiple masters solutions for MySQL replication?
Dream no more. Tungsten Replicator, a free and open source replacement for MySQL replication,
can build clusters of asynchronous nodes in a matter of minutes.
This workshop will explain the basics of Tungsten Replicator, and it will show how to start your multiple master cluster in a few minutes.
There will be examples of several topologies: from the simplest bi-directional replication to the ambitious all-to-all (every node is a master),
fan-in (multiple masters to a single slave), the star (a central hub connected to several bi-directional masters).


Mysql 5.6 is going to be GA soon, probably at the end of September, by then most product that use customize replication solutions will be obsolete, but not the replicator.
Continuent has develop a good solution for the multi master/multi master single slave solution that will remain valid in the time.
Also replicator offers MySQL -> Oracle replication, and Oracle -> MySQL replication.
It is going to be the perfect solution for many customers that will need to have scalable replication solution, and/or relation with Oracle databases.
The parallel replication is and will remain by schema, also no real mechanism to guarantee the data integrity between masters/slave given the checksum will be calculated on the command and not on the data.
Installation was facilitated a lot with the replicator installer.
Last but not least the product has already in place the possibility to support “FILTERS” develop in Java or JavaScript, this will allow the implementation of possible DATA transformation at replication level, which is a very important factor.
I was already discussing how this solution could solve several issue for some of our customers.

Topic:MySQL Optimizer Standoff MySQL 5.6 and MariaDB 5.3

Both MySQL 5.6 and MariaDB 5.3 introduced advanced game changing optimizer features.
In this presentation we will look in details and comparison on these changes as well as perform benchmarks to show which version is able to handle complex queries better.
If you're working with application using complex queries with MySQL this presentation is for you.


This speech was some how a little bit strange.
From one side Peter presenting his results on the evolution on the optimizer, on the other side few of developers from MariaDB team discussing most of the result.
The most important point is and remain that the MySQL Optimizer, one of the most important element of the MySQL DB platform, is finally revisited from both sides MariaDB and Oracle MySQL.
Optimizer was revisited in full to be included in the MySQL 6.0. This MySQL version was never released, as consequences all the improvements done so far on the Optimizer, where forgotten and leave aside the release delivery.
Optimizer is the core of any DB platform, it mainly decide how to physically access the data, reading the SQL statements, and translating it in to action plan against Indexes, and table(s) reads. The optimizer goes for lower cost not for execution time, to do so the optimizer use statistics, if a DBA do not collect accurate statistics the optimizer will not be able to identify an efficient action plan. Is a fact that changes to the optimizer are always scaring, giving they can overturn successful SQL statements in very bad SQL statement.
The MySQL optimizer still has serious limitations like not using prepare statement to execute the queries, such that any query will invoke/involve the optimizer.
From the many improvements done both side, Maria DB and MySQL 5.6 are very much more efficient then 5.5, of an order of 5.5. 900 sec 5.6 maria 180 sec.
Not only in many cases MariaDB is much more efficient then MySQL 5.6
Speaker: Peter Zaitsev

Last Updated on Sunday, 18 August 2013 17:31
Pythian Community Dinner at Pedro PDF Print E-mail
Written by Marco Tusa   
Sunday, 08 April 2012 18:17

This is my 2nd year in Pythian, and the first in Santa Clara as part of the Pythian company, but I will not be alone.

This year Pythian will have a good number of MySQL members. We will wait for you at Pedro's restaurant , if you are not registered yet, please do so NOW!!!  Register yourself and join us.

The official announcement from the company:

"Pythian organizing an event that by now may be considered a tradition: The MySQL community dinner at Pedro’s! This dinner is open to all MySQL community members as many of you will be in town for the MySQL Conference that week."

We will wait there for all of you....

Last Updated on Sunday, 18 August 2013 17:31
InnoDB data flush and write for dummy. PDF Print E-mail
Written by Marco Tusa   
Friday, 23 March 2012 21:23


That is trying to explain myself a quite complicate process in few word/paragraphs, don’t expect anything special, this write start from one talk at the office, where I was drive in confusion by many talks, mine include.

So I decide to try to get a better logical path on what is going on InnoDB process when we modify a record. It was not my purpose to do an in depth review of every process, just overview.

Most of what I have written come from my understanding of other people writing, and code reading.

InnoDB maxi experts forgive me if I will not be precise, and correct me please when I will be wrong.



Let us start from the beginning.


InnoDB is one of the storage engines present in MySQL, one of its characteristics is that it is ACID compliant; this means it has atomic, consistent, isolated, durable transactions, and so on …. We know that let us go ahead, if you don’t know what I am talking about STOP here and read: http://dev.mysql.com/doc/refman/5.5/en/innodb-storage-engine.html.


To satisfy those requirements InnoDB use several components like:

Buffer Pool

Rollback log (undo insert, undo update)

redo log



Finally InnoDB internally organizes the data in memory page of 16K default dimension. How all those blocks are combined and used is not always very clear.

But we know so move it on …


I will try to present a logical path that explains the different blocks and moments.


Let us begin starting MySQL, on start the MYSQLD demon will invoke InnoDB storge engine initialization, which has to perform several operations, like check the files presence and if they are readable/writable, check the catalog in the system table space, read the LSN from the data files and compare it with the last apply in the double-buffer and REDO log, initialize the Buffer Pool and other buffers.


If we are lucky and all the operation will be successfully executed, InnoDB will initialize correctly and will come online.


At this point our data-set still reside in the data file and BP is empty, system areas like undo insert, undo update (rollback log) are empty as well given no transaction are really in place.


Let assume we want to do a smart thing like warming-our database loading in the BP the most used datasets, to do so we simply run a select on those dataset and InnoDB will load the data from disk to BP storing record in pages of 16Kb (5.6 will allow an easy way to change/manage the page size).


Anyhow, running SELECT * FROM COUNTRY will load in the BP all the table in once, and data will from now on accessible from BP instead of the disk. This is true until data is frequently accessed. In this case InnoDB will maintain the pages in memory available for reading, in the case no one will ask for it, the pages containing the data will be marked as old. InnoDB use the LRU approach (least recently used) to identify which page can be removed from the BP to make space for new blocks. Also it will place the new created block(s), in the middle of the LRU list.


So far so good and easy, and if we need to tune the BP for efficient use of the LRU list reading the http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html gives good explanation.


But BP for InnoDB is not only a buffer we use to cache data, it is a real working area, where data is also manipulated.


Here the fun starts.


Let us assume we will have 3 different transactions accessing the COUNTRY table, for modification, what will happen?


First of all the first transaction (there is always someone coming first), will store the values its going to change in a undo log record, this could fit in one page or more, structure will follow the standard InnoDB way using cursor, tuple and page. In the undo log LSN reference is present as well for comparison with the data set.


The rollback log is mainly pided in undo insert and undo update. It is a log allocate PER transaction and located in the system tables (in 5.6 is possible to move it out). The number of possible buffers/transactions from InnoDB 1.1 is 123*1023 approximately 123K, before InnoDB 1.1 the maximum number was 1023.


Finally the log will contain the initial status of the page(s) and the versioning of the modifications done by the transaction, organize in cursors, record and page(s).


The combination of the different buffers/transaction and the level of interaction are defined by the Isolation level (Repeatable reads; read committed; read uncommitted etc).


We can have many transactions accessing the rollback log but what is not evident at the first shot is:


When rows are modified or deleted, the rows and associated undo logs are not physically removed immediately, or even immediately after the transaction commits. The old data is preserved until transactions that started earlier or concurrently are finished, so that those transactions can access the previous state of modified or deleted rows.

Thus, a long-running transaction can prevent InnoDB from purging data that was changed by a different transaction.

When rows are modified or deleted within a long-running transaction, other transactions using the READ COMMITTED and REPEATABLE READ isolation levels have to do more work to reconstruct the older data if they read those same rows.

When a long-running transaction modifies a table, queries against that table from other transactions do not make use of the covering index technique. Queries that normally could retrieve all the result columns from a secondary index, instead look up the appropriate values from the table data.


Seems clear enough, but let me make it clearer.


IF a transaction is modifying a table and this table is accessed by another transaction, the secondary index IS NOT USED for returning the value, but IS USED for the search, the value is retrieve by table look-up instead, so INDEX ->PK ->Table look-up.


It is not clear to me how undo log it manages the different LSN position, in case of difference between records/cursors. Can we have them on the same page? Or must reside on different space?  Also how I retrieve them in order to follow the correct order.


Reading the comments in trx0undo.c done by Heikki I suppose:

If each transaction is given a whole page for its update undo log, file space consumption can be 10 times higher than necessary.

Therefore, partly filled update undo log pages should be reusable.

But then there is no way inpidual pages can be ordered so that the ordering agrees with the serialization numbers of the transactions on the pages.

Thus, the history list must be formed of undo logs, not their header pages as it was in the old implementation.

However, on a single header page the transactions are placed in

the order of their serialization numbers.

As old versions are purged, we may free the page when the last transaction on the page has been purged.

A problem is that the purge has to go through the transactions in the serialization order. This means that we have to look through all rollback segments for the one that has the smallest transaction number in its history list.


So from what I read here, we will have a History list based on undo log (s) themselves. But the transactions will also be ordered at page level.


I am not sue I get it right… so don’t take it as a statement.


Ok back to our COUNTRY table, assume we have first transaction doing ”UPDATE COUNTRY set gov_type='dictatorship' where iso3=ITA;”

Then another doing   “UPDATE COUNTRY set people_status='unhappy' where gov_type='dictatorship'”;

Finally the 3td doing :”INSERT INTO COUNTY (<fields>)  VALUES (values);


Assuming they happens in short period of time but without overlap so no issue as the one describe above, what happens internally?


First InnoDB need to have a transaction (ib_trx_t), then it will create a “tuple” to put the value I want to change, then a cursor to represent the record structure (ib_crsr_t),  get the tuple value from the cursor, clone the tuple, modify it put back in the cursor, destroy tuple(s) and finally commit the transaction.


All operations until the commit, are done at tuple and cursor level. Nothing is really written in the BP or REDO log.


Both updates will work the same while the insert will have a different path:  Transaction and cursor creation will be similar but then it will need a tuple array and it will populate it with the available values, if value not present for the specific tuple NULL will apply, if that tuple represent a NOT NULL attribute, an error will occur.

So far so good, so nothing is really written outside the cursor until the commit, so what happen after the commit?


Many things.


On commit, value is pushed from cursor to the BP and the relevant pages modified and marked dirty.

Also the dirty page will be located in the flush list, which contains all of the dirty pages in LSN order, least-recently-modified first

HERE the optimistic approach will happen, means that InnoDB assume that the change will be only at row/Page level, without the need of an index/B-tree reorganization.


Note that at this point InnoDB is not in a Consistent state anymore, and thinking to flush the page(s) to the data file could take to long leaving it in an Inconsistent state for too long.

Solution is that as part of the commit InnoDB flush the information to the REDO log, which “the facto” represent the CD of ACID for InnoDB.


There is a lot of confusion in the documentation here, which sometimes contradict itself in the same paragraph.

This one is a good example “InnoDB, like any other ACID-compliant database engine, flushes the redo log of a transaction before it is committed. …  With group commit, InnoDB issues a single write to the log file to perform the commit action for multiple user transactions that commit at about the same time, significantly improving throughput.


Ok so IF InnoDB flush uncommitted transaction how can it then group by committed transaction in X of time.


More does it make ANY sense to flush not committed transaction at all?

I read all this in the following way (code reading will follow and let you know).

InnoDB will process all the COMMITTED transaction in memory on REDOLOG, but those transactions are not committed on the data-file(s) yet, following the write-ahead-logging concept.

To me seems more logic, and it seems to follow more the InnoDB logic.


Back to REDO logs or Iblog, they are written sequentially, where information is previously store in the buffer then flush on disk/file.

They contains only the LSN, and the information related to the changes.


Is important to mention that from InnoDB 1.1 the Group commit is back and that is the way to ensure that Binary log AND Iblog inserts follow the same order. To achieve that InnoDB use the Two Phase Commit protocol, pushing writes to Iblog and Binary log, but it needs to have sync_binlog disable.


At this point if the server crash our data will be recover reading the REDO log.

Restarting the InnoDB will start reading the Iblog from the latest checkpoint comparing the LSN number in the log with the last related to the table, all the LSN less then the log LSNs will be then apply.


So REDO logs are quite relevant not only for recovery but also for the durability, unfortunately we all know that the REDO logs are also a huge bottleneck, and if not tuned correctly they represent a serious issue for the MySQL effiency.


See Vadim blog for discussion about checkpoints (http://www.mysqlperformanceblog.com/2011/04/04/innodb-flushing-theory-and-solutions/).

But that is not all, we can have many other possible issue related to REDO logs like:

1) Iblog needs to have enough space to write the incoming transactions or your transactions process will be stop soon.

2) InnoDB will not cancel information from the log, if the relevant pages are still not flushed from BP to data-files. It will stop any insert and start to flush the dirty pages from the BP.


In short, keep innodb_max_dirty_pages_pct to the default value of 75% can KILL your server when using large BP.


After that we still have the data in the data-files not matching the data in memory, so now InnoDB needs to align that as well.

InnoDB use both sharp checkpoint, and fuzzy checkpoint.


The sharp checkpoint takes place when InnoDB align all the modification performed in committed transactions and when all LSN references are align as well.

Example is when InnoDB shuts down.


The fuzzy checkpoint take place very frequently, given InnoDB try to keep the dirty page to minimum, simplifying and citing:

When InnoDB flushes dirty pages to disk, it finds the oldest dirty page’s LSN and treats that as the checkpoint low-water mark. It then writes this to the transaction log header. You can see this in the functions log_checkpoint_margin() and log_checkpoint().

Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system. And that is how continual fuzzy checkpointing is implemented without ever “doing a checkpoint” as a separate event. If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.

(tks Baron for the simplification)


Finally InnoDB to perform the real writes use the doublebuffer, only after the write and the flush to the doublewrite buffer have completed, InnoDB write the pages to their proper positions in the data file.


If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite during crash recovery.


With that the process is far to be complete and I skip points that needs to be addressed, I know it very well.


My intention was just to put in order, simplifying a lot, what happens in the “main” flow.

Many open questions remain, what happen in the case of non optimistic (or pessimistic) writes that require a rebuild of innoDB b-tree index?

What in the B-tree reorganization, what about the page flush, insert buffer … and many more.


WHAT it will be very nice to have is a book explaining how it works from A to Z, driving us poor guys working with MySQL and InnoDB also in each file/function of the code, or at least on the main sections of it.


Checkpoint     As changes are made to data pages that are cached in the buffer pool, those changes are written to the data files sometime later, a process known as flushing. The checkpoint is a record of the latest changes (represented by an LSN value) that have been successfully written to the data files.

LRU     An acronym for “least recently used”, a common method for managing storage areas. The items that have not been used recently are evicted when space is needed to cache newer items. InnoDB uses the LRU mechanism by default to manage the pages within the buffer pool, but makes exceptions in cases where a page might be read only a single time, such as during a full table scan. This variation of the LRU algorithm is called the midpoint insertion strategy. The ways in which the buffer pool management differs from the traditional LRU algorithm is fine-tuned by the options innodb_old_blocks_pct, innodb_old_blocks_time, and the new MySQL 5.6 options innodb_lru_scan_depth and innodb_flush_neighbors.

LSN     Acronym for “log sequence number”. This arbitrary, ever-increasing value represents a point in time corresponding to operations recorded in the redo log. (This point in time is regardless of transaction boundaries; it can fall in the middle of one or more transactions.) It is used internally by InnoDB during crash recovery and for managing the buffer pool.

In the MySQL Enterprise Backup product, you can specify an LSN to represent the point in time from which to take an incremental backup. The relevant LSN is displayed by the output of the ibbackup command. Once you have the LSN corresponding to the time of a full backup, you can specify that value to take a subsequent incremental backup, whose output contains another LSN for the next incremental backup.

Flush     To write changes to the database files, that had been buffered in a memory area or a temporary disk storage area. The InnoDB storage structures that are periodically flushed include the redo log, the undo log, and the buffer pool.     Flushing can happen because a memory area becomes full and the system needs to free some space, because a commit operation means the changes from a transaction can be finalized, or because a slow shutdown operation means that all outstanding work should be finalized. When it is not critical to flush all the buffered data at once, InnoDB can use a technique called fuzzy checkpointing to flush small batches of pages to spread out the I/O overhead.


A SQL statement that ends a transaction, making permanent any changes made by the transaction. It is the opposite of rollback, which undoes any changes made in the transaction.     InnoDB uses an optimistic mechanism for commits, so that changes can be written to the data files before the commit actually occurs. This technique makes the commit itself faster, with the tradeoff that more work is required in case of a rollback.

Page     A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.     One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.     When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.     All the InnoDB disk data structures within a MySQL instance share the same page size.

Extent     A group of pages within a tablespace totaling 1 megabyte. With the default page size of 16KB, an extent contains 64 pages. In MySQL 5.6, the page size can also be 4KB or 8KB, in which case an extent contains more pages, still adding up to 1MB.     InnoDB features such as segments, read-ahead requests and the doublewrite buffer use I/O operations that read, write, allocate, or free data one extent at a time.

Dirty page

A page in the InnoDB buffer pool that has been updated in memory, where the changes are not yet written (flushed) to the data files.

Dirty read     An operation that retrieves unreliable data, data that was updated by another transaction but not yet committed. It is only possible with the isolation level known as read uncommitted.     This kind of operation does not adhere to the ACID principle of database design. It is considered very risky, because the data could be rolled back, or updated further before being committed; then, the transaction doing the dirty read would be using data that was never confirmed as accurate.     Its polar opposite is consistent read, where InnoDB goes to great lengths to ensure that a transaction does not read information updated by another transaction, even if the other transaction commits in the meantime.

Crash recovery     The cleanup activities that occur when InnoDB is started again after a crash. Changes that were committed before the crash, but not yet written into the tablespace files, are reconstructed from the doublewrite buffer. When the database is shut down normally, this type of activity is performed during shutdown by the purge operation.     During normal operation, committed data can be stored in the insert buffer for a period of time before being written to the tablespace files. There is always a tradeoff between keeping the tablespace files up-to-date, which introduces performance overhead during normal operation, and buffering the data, which can make shutdown and crash recovery take longer.

Doublewrite buffer     InnoDB uses a novel file flush technique called doublewrite. Before writing pages to the data files, InnoDB first writes them to a contiguous area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed, does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during crash recovery.     Although data is always written twice, the doublewrite buffer does not require twice as much I/O overhead or twice as many I/O operations. Data is written to the buffer itself as a large sequential chunk, with a single fsync() call to the operating system.

Write combining     An optimization technique that reduces write operations when dirty pages are flushed from the InnoDB buffer pool. If a row in a page is updated multiple times, or multiple rows on the same page are updated, all of those changes are stored to the data files in a single write operation rather than one write for each change.

Undo log     A storage area that holds copies of data modified by active transactions. If another transaction needs to see the original data (as part of a consistent read operation), the unmodified data is retrieved from this storage area. This area is physically part of the system tablespace. It is split into separate portions, the insert undo buffer and the update undo buffer. Collectively, these parts are also known as the rollback segment, a familiar term for Oracle DBAs.

Redo log     A set of files, typically named ib_logfile0 and ib_logfile1, that record statements that attempt to change data in InnoDB tables. These statements are replayed automatically to correct data written by incomplete transactions, on startup following a crash. The passage of data through the redo logs is represented by the ever-increasing LSN value. The 4GB limit on maximum size for the redo log is raised in MySQL 5.6.     The disk layout of the redo log is influenced by the configuration options innodb_log_file_size, innodb_log_group_home_dir, and (rarely) innodb_log_files_in_group. The performance of redo log operations is also affected by the log buffer, which is controlled by the innodb_log_buffer_size configuration option.

Tuple     In the Embedded InnoDB product, a data structure holding one or more table columns. The tuple is an intermediate holding area where query results are stored before the columns are extracted, or where column values are stored before being used in an insert or update operation.

Cursor     An internal data structure that is used to represent the result set of a query, or other operation that performs a search using a SQL WHERE clause. It works like an iterator in other high-level languages, producing each value from the result set as requested.     Although usually SQL handles the processing of cursors for you, you might delve into the inner workings when dealing with performance-critical code. Embedded InnoDB requires you to be familiar with cursors and to manipulate them using C API calls.


Baron http://www.xaprb.com/blog/2011/01/29/how-innodb-performs-a-checkpoint/

Vadim http://www.mysqlperformanceblog.com/2011/04/04/innodb-flushing-theory-and-solutions/

Ewen Fortune http://www.mysqlperformanceblog.com/2011/02/03/how-innodb-handles-redo-logging/

Mark Challagan  http://www.facebook.com/note.php?note_id=408059000932

Domas http://dom.as/2011/07/03/innodb-index-lock/


And the code … good reading


Finally I thank Danil Zburivsky who raised few good questions.


Last Updated on Sunday, 18 August 2013 17:32
GALERA on Red Hat is 1,2,3? (Part 1) PDF Print E-mail
Written by Marco Tusa   
Wednesday, 23 November 2011 01:04


Let me say NO, but nothing really bad just annoying and thanks to Oli (always great) less annoying then how it could be.

Anyhow, I had to put all up in a RED HAT box on EC2.

So I start to take a look here an there to see what is the best way to do it, I also take the configurator from Severalnines, but for the moment I am not going to use it,

and in any case I never use it for the my.cnf parameters.

But what was the scope of all this?



I have a customer that could be interested in using a scaling MySQL solution, for his business.

Requirements as usual are 100% HA (OK we know this is not possible), scaling in writes as much as possible, geographic distribution.

Data set will be starting from 40G (peanuts) to 6 Tb (well not so peanuts) in 2 years.

More the scalability need to be "out" and "in" so architecture need to be able to scale as we like in relation to the "seasons".


What from there?


Well first of all define the generic architecture that should serve better, then find few good candidates that could fit in from the technological part

do some POCs and take numbers...

So here we are POCs and Numbers but first of all Installation.

As said I need to do that using REDHAT 6 EL so no way to use the pre-compile version, but taking a try will not harm.

Time to test and see how it works.

What I want to have as start:

3 nodes as start

data replicate from to each node

do stress test

download and installation


Excluding the demo package (that I don't want to use)

Any how I have found the site confuse, what I have to take? the package with ONLY the wsrep patch, or also the galera provider, all none both?

wget http://launchpad.net/codership-mysql/5.5/5.5.15-21.2/+download/mysql-5.5.15-wsrep_21.2-linux-x86_64.tar.gz

I had to download and extract to see that the galera libraries where not in the wsrep version, noting bad having it clear and on the same page could help.

Then I took the file from several nine using the configurator ...

DO I REALLY need it ??

mmm let us see

use http://www.severalnines.com/galera-configurator/map

You will deploy on ec2, using redhat, and x86_64 architecture. You can also go into s9s-galera-xyz/scripts/install/.s9s/config

to make adjustments (but be careful) - if you get it wrong, refresh this page and start over with fresh deployment package.

To install:

Upload your keypair to /root/.ssh/Availng.pem on the cluster-control server

tar xvfz s9s-galera-1.0.0.tar.gz

cd s9s-galera-1.0.0/mysql/scripts/install

./deploy.sh 2>&1 |tee cc.log

A detailed log is also written to $HOME/s9s_deploy.log

As said I decide not to use it for the moment.

My directory layout will be:


+-- bin

+-- etc
+-- galera -> /opt/galera_templates/galera¦   +-- bi

¦   +-- lib64

¦   +-- shar

+-- games

+-- include
+-- lib
+-- lib64
¦   +-- perl5
+-- libexec
+-- mysql -> /opt/mysql
¦   +-- bi

¦   +-- data

¦   +-- docs

¦   +-- include

¦   +-- lib

¦   +-- man

¦   +-- mysql-test

¦   +-- scripts

¦   +-- share

¦   +-- sql-bench

¦   +-- support-files

+-- sbin

+-- share

¦   +-- applications

¦   +-- info

¦   +-- man

¦   +-- perl5
+-- src


Next I had to do the usual thing:

setup MySQL user and group

setup directories

Configuring my.cnf


Start Galera


I am in rush (never good), so I did start and configure MySQL, but then when I run it:

111118 17:34:55 [Note] Event Scheduler: Loaded 0 events
111118 17:34:55 [Note] WSREP: wsrep_load(): loading provider library '/usr/local/galera/lib64/galera/libgalera_smm.so'
111118 17:34:55 [ERROR] WSREP: wsrep_load(): dlopen(): libssl.so.6: cannot open shared object file: No such file or directory
111118 17:34:55 [ERROR] WSREP: wsrep_load(/usr/local/galera/lib64/galera/libgalera_smm.so) failed..
111118 17:34:55 [Note] WSREP: wsrep_load(): loading provider library 'none'
111118 17:34:55 [ERROR] Aborting
111118 17:34:55 [Note] WSREP: Service disconnected.
111118 17:34:56 [Note] WSREP: Some threads may fail to exit.
111118 17:34:56  InnoDB: Starting shutdown...
111118 17:34:57  InnoDB: Shutdown completed; log sequence number 1595682
111118 17:34:57 [Note] /usr/local/mysql/bin/mysqld: Shutdown complete

Ok Ok I know they said that so no way I need to recompile and install.

As Oli wrote issue is set up all before so let me see for RED HAT:

Take MySQL as for Oli instructions http://www.fromdual.com/building-galera-replication-from-scratch

and patch: wget http://launchpad.net/codership-mysql/5.5/5.5.15-21.2/+download/mysql-5.5.15-wsrep_21.2.patch

apply patch and download the galera replicator

wget http://launchpad.net/galera/1.x/21.1.0/+download/galera-21.1.0-src.tar.gz



yum -y install openssl

yum -y install boost-devel

yum -y install openssl-devel.x86_64

yum -y install check-devel.x86_64


Then run :


Have a coffee or two ...

cp libgalera_smm.so /opt/mysql/lib/plugin/

cp garb/garbd /opt/mysql/bin/



11122 18:19:53 [Note] WSREP: Flow-control interval: [8, 16]
111122 18:19:53 [Note] WSREP: Restored state OPEN -> JOINED (0)
111122 18:19:53 [Note] WSREP: Member 0 (galera1) synced with group.
111122 18:19:53 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
111122 18:19:53 [Note] WSREP: New cluster view: group UUID: 7bd4ac4b-1560-11e1-0800-19942f7ebcc0, conf# 1: Primary, number of nodes: 1, my index: 0, first seqno: 1, protocol version 1
111122 18:19:53 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
111122 18:19:53 [Note] WSREP: Assign initial position for certification: 0, protocol version: 1
111122 18:19:53 [Note] WSREP: Synchronized with group, ready for connections
111122 18:19:53 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.











mysql> SHOW global STATUS LIKE "wsrep%";
| Variable_name | Value |
| wsrep_local_state_uuid | 7bd4ac4b-1560-11e1-0800-19942f7ebcc0 |
| wsrep_protocol_version | 1 |
| wsrep_last_committed | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 135 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 0 |
| wsrep_cluster_conf_id | 1 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 7bd4ac4b-1560-11e1-0800-19942f7ebcc0 |
| wsrep_cluster_status | PRIMARY |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 1.dev(rXXXX) |
| wsrep_ready | ON |
38 rows IN SET (0.00 sec)



As state in the internal text file a note about the LIMITATIONS

1) Currently replication works only with InnoDB storage engine. Any writes to
tables of other types, including system (mysql.*) tables are not replicated.
However, DDL statements are replicated in statement level, and changes
to mysql.* tables will get replicated that way.
So, you can safely issue: CREATE USER...,
but issuing: INSERT INTO mysql.user..., will not be replicated.

2) DELETE operation is unsupported on tables without primary keys.
Rows in tables without primary keys may appear in different order on
different nodes. As a result SELECT...LIMIT... may return slightly
different sets.

3) Unsupported queries:
* LOCK/UNLOCK tables is not supported in multimaster configuration.
* lock functions (GET_LOCK(), RELEASE_LOCK()... )

4) Query log cannot be directed to table. If you enable query logging,
you must forward the log to a file:
log_output = FILE
Use general_log and general_log_file to choose query logging and the
log file name

5) Maximum allowed transaction size is defined by wsrep_max_ws_rows and
wsrep_max_ws_size. Anything bigger (e.g. huge LOAD DATA) will be rejected.

6) Due to cluster level optimistic concurrency control, transaction issuing
COMMIT may still be aborted at that stage. There can be two transactions
writing to same rows and committing in separate cluster nodes, and only one
of the them can successfully commit. The failing one will be aborted.
For cluster level aborts, MySQL/galera cluster gives back deadlock error
code (Error: 1213 SQLSTATE: 40001  (ER_LOCK_DEADLOCK)).

7) XA transactions can not be supported due to possible rollback on commit.

8) Currently supported only in master-slave mode without parallel applying
(will be fixed in the nearest release):
* cascading foreign key constraints,
* secondary unique indexes and
* primary varchar indexes with non-binary collation.

Creating the full environment

Now that I have all up and working, I can finally use my current EC2 as template and add boxes to my cluster.
From here the real Fun will start ...



-> Expanding the cluster
-> Starting my dataset
-> Pushing on throttle and take numbers









Last Updated on Sunday, 18 August 2013 17:33

Page 8 of 12

Who's Online

We have 134 guests online