Cross Posting Woes of Micro Blogging Messages

February 9th, 2010

The concept of web services to interoperate and broaden the ecosystem is a beautiful thing. I agree to this concept but lately I’ve found myself being frustrated to a certain subset of this concept. Where does my frustration come from? It comes from cross-posts of micro blogging messages. To be more specific, seeing significant amount of Twitter updates on my Facebook news feed.

Living in Tokyo, I usually check social updates and RSS feeds on the train. It often begins from checking unread tweets then firing up whatever I feel like checking next (usually Facebook, Google Reader or Mixi). What disappoints me here is having to rip through tweets that I’ve already looked at on Facebook. Tokyo is a busy place so it’s important to gain information efficiently.

On Facebook I have various types of connections from childhood friends to acquaintances. Some of them are on Facebook and not on Twitter (and vice versa). I guess this is to do with user demographics but the important thing here is that I’m gradually finding it hard to pickup content from my friends outside of IT. I’m saying IT because it seems from observing my news feed that it’s mostly my friends in the IT industry that have setup cross posting.

There is probably some sort of content balancing gimick in the news feed code but this seems to not work well against my heavy Japanese twitter user friends. It’s a shame because my cross posting friends are completely innocent and aren’t trying to deliberately cause noise. Last thing I want to do is defriend people just because they are innocently causing noise.

Further Thoughts

This is what user specified content filters are for! If there’s a ‘block updates from xxx service’ option I would use it. Or perhaps I’m missing something and there is a way to filter out content from certain web services in the news feed.

If there is such an option, I would love to be enlightened!

Toru Maesaka random, webservice , ,

Notes on HEAP/MyISAM Index Key Handling on WRITE

January 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s sourcecode in Drizzle.

Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I’ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I’m only going to cover key generation on write in this post. Otherwise this post is going to be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines HEAP has a structure for keeping Key definition (parts, type, logic and etc). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned to at table creation or when the table is opened. The assigned function depends on the data structure of the index and it can be either of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), either of the following functions are used for both data structures.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect it is similar the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By meaning “appropriately” it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The final byte that is copied to the buffer is the address of the chunk where the record lives.

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than HEAP.

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert() which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields accordingly to the characteristic of the data type. The output buffer belongs to MyISAM’s hander (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.

Toru Maesaka drizzle, knowledge, oss , ,

Further thoughts on BlitzDB’s Index Handling

January 15th, 2010

I’ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I’ve been getting with BlitzDB’s index design. I’m actually so frustrated with myself at the moment that I want to hit my head against a wall or something.

So, I’m writing this entry to clear up my mind. Heh, my blog is slowly becoming BlitzDB’s design document draft. This should hopefully be good though since by blogging it, people can tell me whether I’m moving towards a stupid direction or not.

Collation Importance

When writing database software that is intended for International use, it is important to handle textual data by respecting collation order. It is arguable that most people are only interested in English lexicographic ordering but unfortunately the world is not so standard.

Internal Primary Key Handling

I want to motivate people to actively define a PRIMARY KEY with BlitzDB. I plan to make this attractive by providing the best performance when PK is defined. In December 2009, my answer to this was to write the PK value as the key for the data dictionary (where actual rows are stored in BlitzDB). This allows BlitzDB to do a direct lookup on the data dictionary for PK based lookup, instead of consulting the B+Tree index. I’m still fond of this lookup optimization approach but it introduces problems too.

Problem 1. Consider the following textual keys: “key” and “KEY”. They obviously have different binary representations but in certain cases they can be logically equivalent. Because the data dictionary is a hash database, this is a problem. The solution that instantly pops up is to normalize the key before writing or reading it. This however, causes a problem in cases where the two keys are inequivalent. Perhaps Drizzle/MySQL provides an internal normalization function that respects this. I still need to study this area of the storage subsystem.

Problem 2. Directly writing a PK to the data dictionary means fast lookup but because of the data structure, it’s not possible to fetch the next “logical” key, meaning I can’t implement index scanning on PK as it is. Quick solution for users is to create an index on the PK column (this would create a separate B+Tree for it) but this is not so friendly because it requires the user to have prior knowledge of all this. So, my plan is to provide the best of both worlds. I’ll elaborate on how I’m planning on tackling this problem next.

Current Primary Key Read/Write Behavior

In general, keys of BlitzDB’s data dictionary is a unique 8 byte integer. The idea is that BlitzDB writes this unique ID along with the key to the B+Tree Index so that it can later identify that row. The difference with PK is that, if a PK is present in a table, BlitzDB will not generate an internal unique ID and use PK for the data dictionary’s key instead. BlitzDB won’t create a B+Tree index for PK at the time I wrote this blog entry.

Next Step

Create a B+Tree index for PK anyway. BlitzDB will still use the PK value as the key for data dictionary if it exists. For PK based lookup requests, BlitzDB will look directly at the data dictionary and for PK based requests that involve index scanning, BlitzDB will look at the B+Tree index.

This approach can consume more space when textual data is used for keys but I think it’s worth it. At the same time, you can save space if you use use types that are smaller than 8 bytes for PK. For example, using a 4 byte integer would reduce BlitzDB’s key space by 50%.

Hmm, I think my mind has cleared a little.

Toru Maesaka drizzle, oss ,

Drizzle, BlitzDB and HTON_STATS_RECORDS_IS_EXACT

January 13th, 2010

Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So it’s a great thing in general.

Something Broke

After I enabled HTON_STATS_RECORDS_IS_EXACT, I noticed that issuing SELECT statement on a table with 1 row would no longer return a resultset. Weird indeed! after investigating with GDB, I noticed that rnd_next() is only called once instead of twice on a table with 1 row (second time is to find EOF) when HTON_STATS_RECORDS_IS_EXACT is enabled. This makes sense because the kernel knows that there is only 1 row and therefore it doesn’t need to keep scanning for EOF. However, this made me scratch my head since this shouldn’t break BlitzDB’s table scanner.

Remedy

Logically, I was confident that BlitzDB’s table scanner was functioning properly so I decided to look at what was going on beyond the engine API. Turns out that join_read_system() in sql_select.cc looks at the table->status value and decides that it’s an error if 0 isn’t assigned to it. What’d you know? I realized that I wasn’t assigning anything to the status variable. It’s more that I didn’t know that I was meant to update an internal structure. You’d think that engine developers aren’t meant to touch those. It’s not mentioned in the Engine Documentation at MySQL Forge either. Nevertheless, the important thing is that it works now. Oh and SELECT COUNT is fast now too.

Eye Opener

This experience among other occasions where I had to read the kernel’s source made me think that it would be nice to provide an intensive up to date documentation on how to develop storage engines for Drizzle in the future (when the API becomes stable). Needless to say, this would be co-ordinated within the Drizzle community. I’m not a license person but it should hopefully be provided with a freely available license too.

Toru Maesaka drizzle, oss ,

How to Recover a Tokyo Cabinet Database

January 8th, 2010

Recently Mark Callaghan had asked me whether BlitzDB is crash safe since he was aware that Tokyo Cabinet isn’t crash safe (unless used with transactions). For Tokyo Cabinet and Tyrant’s defense, I should mention that this is intentional. The idea is to reduce durability in return for higher throughput. The author’s philosophy is that data availability should be secured by replication. This makes sense since the design of TC and TT are influenced by mixi’s high traffic (we need single instances to handle over 10k requests per sec).

So with that said, let’s move on to the main topic. The honest answer is that BlitzDB is not crash safe either (transaction support is still a long way to go). If the admin is lucky, she would be able to repair the table(s) using the REPAIR TABLE syntax. BlitzDB’s crash safety strategy is the same as Tokyo Tyrant – You should use replication. The question is, how do you repair a broken Tokyo Cabinet file?

The answer is pretty simple and it’s documented in the Japanese TC documentation. Unfortunately it’s not not present in the English documentation. So allow me to go through it with demo code in this post. There are two ways to attempt to recover a Tokyo Cabinet database:

  1. By using the Tokyo Cabinet API.
  2. By using Tokyo Cabinet’s command line tool.

Let’s first go through how to confirm that your database is broken. I’ve also covered how to comprehend the errors.

How to confirm that your Database is broken

Simply use the command line tools installed with Tokyo Cabinet. Look at the “additional flags” line on the output of “tchmgr inform” or “tcbmgr inform” depending on your database type. If it says, “fetal” then your file is really broken. If it says “open”, it means that your application died or exited without closing the database. A file in the “open” state is still usable but your most recent records are most likely unavailable. This is because TC connects the hash chain after it has confirmed that a write operation was successful. If your application died before the record is chained, then it’s not accessible in the database.

Furthermore, the records that weren’t sync’d by the kernel won’t be present on power failure. If the disaster was a process failure, then the written data will hopefully be in the kernel’s write buffer so you won’t lose that data. For pedantic people, TC provides a way to sync the database from your application. Whether to call this function (and how often) is up to your application’s policy.

Using the Tokyo Cabinet API

(1) Open the database file without the lock option. Meaning, supply HDBONOLCK or BDBONOLCK to the open function of the appropriate database type (TCHDB or TCBDB).

/* This is for TCHDB */
TCHDB *hdb = tchdbnew();
 
if (!tchdbopen(hdb, "/path/to/broken_file", HDBONOLCK | HDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}
 
/* This is for TCBDB */
TCBDB *btree = tcbdbnew();
 
if (!tcbdbopen(btree, "/path/to/broken_file", BDBONOLCK | BDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}

(2) Run tchdboptmize() or tcbdboptimize() depending on the database type. You might wonder what you should give as the parameter for the optimize function. Conveniently, TC stores the tuning parameters of the database when you first opened it so you can just provide -1 as an argument _but_ the final one. This is because the final argument is an unsigned integer (uint8_t). What you want to provide instead is UINT8_MAX for this.

/* This if for TCHDB */
if (!tchdboptimize(hdb, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This hash database can't be rescued. */
}
 
/* This if for TCBDB */
if (!tcbdboptimize(btree, -1, -1, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This b+tree database can't be rescued. */
}

If you’re lucky, the above would repair the database that is associated with TC’s database object.

Using TC’s command line tool

This approach is more towards database admins since I’m sure the last thing they want to do is write their own program to get their work done. Lazyness is good.

TC provides a utility program called tchmgr (for a hash database) and tcbmgr (for a b+tree database) which allows you to run optimize on a database file. So if you wanted to repair a TC hash database, you would do the following:

$ tchmgr optimize -nl /path/to/broken_file

and the following for the B+Tree Database:

$ tcbmgr optimize -nl /path/to/broken_file

For those that are interested, the “-nl” option means “No Lock” which is required to repair a database file.

Well, I guess this sums up this blog post. I hope this post will help you administrate Tokyo Tyrant and/or your Tokyo Cabinet based application!

Toru Maesaka oss ,

Congrats to Kyoto Cabinet’s Alpha Release

January 1st, 2010

Writing this blog entry to congratulate Mikio for releasing Kyoto Cabinet (alpha release), which is positioned as a successor project of Tokyo Cabinet.

From development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. ASFAIK, C++03 was adopted so that KC can run on broader platforms than POSIX oriented systems (So, theoretically it can build on Windows). From project perspective, KC is licensed under GPLv3 whereas TC is licensed under LGPL.

For my projects, I currently have no interest in moving to KC from TC since I’m convinced that TC is better suited for my target platforms (UNIX/Linux). Another reason (VERY personal reason) is that I like C99 more than C++. Although… I don’t have the right to say this since I’m far from proficient at C++. For more details on the project differences, you should take a look at KC’s project page that Mikio had prepared:

For those that are interested in KC’s performance, I’ll do some benchmarks against TC when I get back from my end of year vacation. Hope everyone is enjoying New Years!

Toru Maesaka oss ,

Last Minute Christmas Shopping

December 24th, 2009

Last minute Christmas Shopping for myself. 23″ Full HD monitor with HDMI input (Mitsubishi RDT231WLM). I’ve been looking for a reasonable new monitor with HDMI for sometime so I’m pretty happy with this purchase. This is going to be used for PS3 and watching movies.

Last Minute Purchase

Merry Christmas.

Toru Maesaka random ,

End of Year Progress on BlitzDB

December 24th, 2009

FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling

My open source friends might have noticed that I’ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn’t look like I can reach that goal. However, looking at the brightside I got a lot done in the past few weeks so allow me to “journal” it in this blog post.

Agony of Knowing

The more I understood Drizzle’s storage mechanism and Tokyo Cabinet’s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB’s codebase. I was using pthread’s rwlock for concurrency control but I decided to design and write BlitzDB’s own lock mechanism to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you’d hope won’t be executed that often (people should use indexes!) but needless to say, it’s an important component of a relational storage engine so I’ve put in a lot of effort there.

Rewriting the Table Scanner

In the process of rewriting the table scanner, Jay Pipes’ gave me a fantastic advise on using Drizzle’s internal atomic type (drizzled::atomics). He gave me this advise because he noticed that my atomic ID generator was securing atomicity with pthread’s mutex. It is debatable that this mutex was only enabled for only few CPU instructions but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler’s builtin function can gain you 3x throughput.

Hacking on Index Support

I’ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB’s index is a dense clustered b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for “needle in a haystack” like queries only. An example of a needle in a haystack like query is:

SELECT * FROM TABLE WHERE primary_key_column = whatever;

Saying that, I don’t like to force people to do things the way I like so I plan on providing best of both worlds by supporting both data structures for PKs in Version-2:

CREATE TABLE t1 (id int, PRIMARY KEY(id) USING btree) ENGINE=blitzdb;
CREATE TABLE t1 (id int, PRIMARY KEY(id) USING hash) ENGINE=blitzdb;

BlitzDB’s default configuration will use PK as a “direct” data dictionary index. If you wish to do range queries on PK, the solution is to create a index on the PK column.

Primary Key lookup Performance

So, how does my implementation perform? Here’s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:

CREATE TABLE t1 (id int PRIMARY KEY, a int, b int) ENGINE=blitzdb;

and the query looks like this:

SELECT * FROM t1 WHERE id = random_number_under_one_million;

The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2×4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah… this sucks so I can’t claim that this benchmark is 100% creditable.

Here is the result I obtained from skyload. Please only view it as a guideline to BlitzDB’s lookup performance. I’ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.

[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1

As you can see, “needle in a haystack” queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of 0.058 milliseconds to process a query.

Future Plans

Admittedly, primary key support isn’t completely done so I’ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I’ll ask the Drizzle Community to consider merging BlitzDB into Drizzle’s trunk. This is my goal for BlitzDB at the moment.

I also happen to own blitzdb.com so I’m planning on putting user documentation (including tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won’t happen until I get Version-1 released. All I can say about the release schedule at the moment is, “before the MySQL conference in april”.

So, that’s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don’t trip on ice :)

Toru Maesaka drizzle, oss , ,