Wednesday, July 2, 2014

Taming Zookeeper - introduction and “how to” for Java and Python

Zookeeper is one of the most versatile projects out there, yet it remains somewhat underutilized and feared; however, almost every software project could benefit from using it as a distributed, fault-tolerant, hierarchical data store.

What really is Zookeeper?

Zookeeper is a free and open source, distributed, fault-tolerant, hierarchical value store - pretty much like a network share that stores small files in directories and subdirectories. These “files”, called nodes in Zookeeper, are small - the size of a node is limited to 1 MB - so the platform is not really a BLOB storage.

Storing and reading these nodes is extremely fast, so any information can be stored in Zookeeper without second thoughts – active keys in a cache, currently active sessions, the list of active servers, and so on.

A Zookeeper cluster can consist of multiple servers and there is no predefined master in this topology – the platform will make sure that the storage and retrieval of the nodes are synchronized properly.

Moreover, Zookeeper guarantees consistency: it doesn’t matter which Zookeeper server the client is talking to, the returned values are always consistent. However, it’s good to keep in mind that consistency is reached on a scale of tens of seconds - it’s usually much faster, but Zookeeper doesn’t consider the service to be down within that window.

Cross platform

The core of the Zookeeper server is written in pure Java and consists of only 190 classes and about 42k lines of code. That is reasonably small for such a system, and as the code is nicely written, it is not hard to read it and understand the internals of the system.

Zookeeper uses a protocol named “Jute” to communicate with its clients, which is somewhat similar to Google’s Protobuf. Interestingly, the Jute compiler, which turns the zookeeper.jute definition into Java classes, is included in the Zookeeper source code too.

The protocol is quite simple, so even if there is no direct compilation to other platforms, sending the required data across the wire should be straightforward on most systems. For instance, the Python Kazoo implementation simply recreated the required bindings manually.

Ephemeral Nodes

One of the most interesting features of Zookeeper is the Ephemeral Node. While permanent nodes are stored until they are explicitly deleted, an Ephemeral Node is deleted immediately when the client that created it disconnects from the server. It is one of the best ways to keep track of actual “live” data, like which services or servers are up and running, who is online on our site, or who joined or left a chat room.

Storage and cleanup

Zookeeper stores the data on disk, so even if the server is restarted, all nodes will be available again. As the system is optimized for performance rather than storage, cleaning up old and deleted data is not part of the Zookeeper server’s normal job.

Earlier versions required a scheduled job to purge old data; newer versions can do it periodically, but this is not enabled by default. To enable purging old data from the log files, add or uncomment the following in the zoo.cfg file:

# keeps the 3 most recent snapshots of the data
autopurge.snapRetainCount=3
# runs every 1 hour
autopurge.purgeInterval=1 

Watching data changes

Zookeeper can notify clients when a watched node changes, but every watcher fires only once. If we need to keep watching for changes, we have to create a new watcher; however, it is important to keep in mind that while we will receive the latest state of the node, we may miss intermediate changes that happen while we are re-creating the watch or are temporarily disconnected from the server.

Starting the server

For development purposes it’s really easy to start using Zookeeper: after downloading the binaries, simply run
./bin/zkServer.sh start-foreground
In production it might be a good idea to run it as a service, or through supervisord, to make sure it is restarted if it crashes for any reason.

Java client

The Curator framework provides high-level access to the Zookeeper server. Even though the protocol is simple enough to re-implement quickly, frameworks like this typically add extra features such as automatic reconnection to the server with a predefined exponential backoff.

To use the Curator framework, simply add the following dependency to the Maven pom.xml file:

<dependency>
  <groupId>org.apache.curator</groupId>
  <artifactId>curator-framework</artifactId>
  <version>2.5.0</version>
</dependency>

To read string data from an existing node, only the following is required:

final RetryPolicy retryPolicy =
        new ExponentialBackoffRetry(BASE_SLEEP_TIME_MS, MAX_RETRIES);
final CuratorFramework client =
        CuratorFrameworkFactory.newClient(CONNECTION_STRING, retryPolicy);
client.start();

final GetDataBuilder getDataBuilder = client.getData();
final byte[] configBytes = getDataBuilder.forPath(CONFIG_PATH);
final String config = new String(configBytes);

System.out.println("Data: " + config);

For a full fledged demo with watchers, deletion, and Zookeeper ACLs, check out the java source code here.

Python client

The easiest way to access Zookeeper is to use the Kazoo client library, which is written purely in Python. It can be installed system wide or just used as a module in a solution and deployed as part of the application.

The steps are very similar to the Java client, so we need to create and start a connection and read the data we are looking for:
from kazoo.client import KazooClient
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

print('Data in node: %s' % zk.get('/rabbit/config')[0])
As Kazoo is a high-level framework, we do not need to recreate the watcher after every change; it stays active.

For the full source code with an embedded Kazoo client, check out the python source code here.

Performance

The Zookeeper server is surprisingly fast, reaching above 20k operations / second on a single-server setup with 2 cores and 10 simulated clients. The average latency is typically less than 1 millisecond, so in most circumstances Zookeeper won’t be the bottleneck in the system.

For a detailed performance review of the system under different configurations (cores and machines), check out this Zookeeper performance article.

Conclusion

Even though Zookeeper is usually used as part of a larger software package like Apache Hadoop or Apache Storm, it is a great standalone product.

Unfortunately, it is a little mystified by articles that go into great depth on the leader election algorithm of the server nodes, which doesn’t really help in understanding how and when to use Zookeeper.

Storing simple hierarchical information in a fault-tolerant, distributed fashion, with the live tracking capability of ephemeral nodes, makes Zookeeper a really handy tool across a lot of different software projects. Give it a go, get the Zookeeper binaries!

Friday, June 6, 2014

Beating the binary search algorithm – interpolation search, galloping search

Binary search is one of the simplest yet most efficient algorithms out there for looking up data in sorted arrays. The question is, can it be beaten by more sophisticated methods? Let’s have a look at the alternatives.

In some cases hashing the whole dataset is not feasible or the search needs to find the location as well as the data itself. In these cases the O(1) runtime cannot be achieved with hash tables, but O(log(n)) worst case runtime is generally available on sorted arrays with different divide and conquer approaches.

Before jumping to conclusions, it is worth noting that there are a lot of ways to “beat” an algorithm: required space, required runtime, and the number of accesses to the underlying data structure can all be different priorities. For the following runtime and comparison tests, random arrays between 10,000 and 81,920,000 items were created with 4-byte integer elements. The keys were evenly distributed with an average increment of 5 between them.

Binary search

Binary search is a guaranteed-runtime algorithm: the search space is halved at each step, so searching for a specific item is guaranteed to finish in O(log(n)) time - or sooner, if the middle point happens to be selected luckily. This means that an array of 81,920,000 elements needs at most 27 iterations to find an element’s location.
Because of its random jumps, binary search is not cache friendly, so some fine-tuned versions switch to linear search once the search space shrinks below a specified size (typically 64 or fewer elements). However, this threshold is very much architecture dependent, so most frameworks don’t include this optimization.
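As a baseline for the comparisons below, a plain binary search that also counts its probes can be sketched in a few lines (Python here for brevity; the measured implementations were Java):

```python
def binary_search(a, key):
    # Classic halving loop; returns (index, probes), index is -1 if absent.
    lo, hi, probes = 0, len(a) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if a[mid] == key:
            return mid, probes
        elif a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

# Evenly spaced keys, like the test arrays described above.
a = list(range(0, 100, 5))
print(binary_search(a, 45))   # (9, 1): lucky middle point, first probe hits
```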

Galloping search; galloping search with binary search fallback

If the length of the array is unknown for some reason, galloping search can identify the initial range of the search scope. The algorithm starts at the first element and keeps doubling the upper limit of the search range until the value there is larger than the searched key. After this, depending on the implementation, the search either falls back to a standard binary search on the selected range, or restarts another galloping search. The former guarantees an O(log(n)) runtime; the latter is closer to O(n).
Galloping search is efficient if we expect to find the element closer to the beginning of the array.
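A sketch of the binary-fallback variant (Python, for illustration):

```python
def galloping_search(a, key):
    # Phase 1: keep doubling the upper bound until it overshoots the key.
    bound = 1
    while bound < len(a) and a[bound] < key:
        bound *= 2
    # Phase 2: standard binary search inside [bound // 2, bound].
    lo, hi = bound // 2, min(bound, len(a) - 1)
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == key:
            return mid
        elif a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

a = list(range(0, 1000, 5))
print(galloping_search(a, 35))   # 7: three doublings, then binary search
```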

Sampling search

The sampling search is somewhat similar to binary search, but it takes several samples across the array before deciding which region to focus on. As a final step, once the range is small enough, it falls back to a standard binary search to identify the exact location of the element.
The theory is quite interesting, but in practice the algorithm doesn’t perform too well.
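The description leaves room for implementation choices; one possible sketch (Python, using the 20 samples and the 2000-element binary-search cutoff mentioned in the test setup below) looks like this:

```python
def sampling_search(a, key, samples=20, cutoff=2000):
    lo, hi = 0, len(a) - 1
    while hi - lo + 1 > cutoff:
        # Probe evenly spaced positions and keep the bracket holding the key.
        step = (hi - lo) // (samples - 1)
        prev = lo
        for i in range(1, samples):
            probe = min(lo + i * step, hi)
            if a[probe] >= key:
                lo, hi = prev, probe
                break
            prev = probe
        else:
            lo = prev  # key lies beyond the last sample
    # Range is small enough: plain binary search.
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == key:
            return mid
        elif a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

a = list(range(0, 50000, 5))   # 10,000 evenly spaced keys
print(sampling_search(a, 31415))   # 6283
```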

Interpolation search; interpolation search with sequential fallback

The interpolation search is supposed to be the “smartest” among the tested algorithms. It resembles the way humans use a phone book, as it tries to guess the location of the element by assuming that the elements are evenly distributed in the array.
As a first step it samples the beginning and the end of the search space and then guesses the element’s location. It keeps repeating this step until the element is found. If the guesses are accurate, the number of comparisons can be around O(log(log(n))) and the runtime around O(log(n)), but unlucky guesses easily push it up to O(n).
The smarter version switches to linear search as soon as the guessed location of the element is presumably close to the final location. As every iteration is computationally expensive compared to binary search, falling back to linear search for the last step (a region of around 10 elements) can easily outperform the complex calculations needed to guess the element’s location.
One of the big confusions around interpolation search is the assumption that O(log(log(n))) comparisons imply O(log(log(n))) runtime. This is not the case, as there is a big tradeoff between storage access time and the CPU time needed to calculate the next guess. If the data is huge and storage access time is significant, as on an actual disk, interpolation search will easily beat binary search. However, as the tests show, if access time is very short, as in RAM, it may not yield any benefit at all.
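The interpolation + sequential variant can be sketched in Python like this (the 10-element fallback threshold matches the tests below):

```python
def interpolation_search(a, key, cutoff=10):
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if hi - lo <= cutoff or a[hi] == a[lo]:
            # Presumably close to the target: finish with a linear scan.
            for i in range(lo, hi + 1):
                if a[i] == key:
                    return i
            return -1
        # Guess the position, assuming evenly distributed keys.
        guess = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[guess] == key:
            return guess
        elif a[guess] < key:
            lo = guess + 1
        else:
            hi = guess - 1
    return -1

a = list(range(0, 100000, 5))   # perfectly even distribution: the ideal input
print(interpolation_search(a, 54320))   # 10864, found on the very first guess
```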

Test results

All the code for the tests was hand-written in Java (source code at the end); each test was run 10 times on the same array; the arrays were random, in-memory integer arrays.

The interpolation search fell back to linear search if the assumed distance was 10 or fewer items, while the sampling search took 20 samples across the search space before deciding where to continue the search. Also, when the key space was less than 2000 items, it fell back to standard binary search.

As a reference, Java’s default Arrays.binarySearch was added to compare its runtime to the custom implementations.

Average search time / element, given the array size


Average comparisons / search, given the array size

Despite the high expectations for interpolation search, its actual runtime did not really beat the default binary search. When storage access time is long, a combination of hashing and a B+ tree would probably be a better choice, but it is worth noting that on uniformly distributed arrays, interpolation search combined with sequential search always beats binary search on the number of comparisons required. It’s also interesting how efficient the platform’s built-in binary search was - for most cases it’s probably not worth replacing it with something more sophisticated.

Raw data – average runtime per search

Size       | Arrays.binarySearch | Interpolation+Seq | Interpolation | Sampling    | Binary      | Gallop      | Gallop+Binary
10,000     | 1.50E-04 ms | 1.60E-04 ms | 2.50E-04 ms | 3.20E-04 ms | 5.00E-05 ms | 1.50E-04 ms | 1.00E-04 ms
20,000     | 5.00E-05 ms | 5.50E-05 ms | 1.05E-04 ms | 2.35E-04 ms | 7.00E-05 ms | 1.15E-04 ms | 6.50E-05 ms
40,000     | 4.75E-05 ms | 5.00E-05 ms | 9.00E-05 ms | 1.30E-04 ms | 5.25E-05 ms | 1.33E-04 ms | 8.75E-05 ms
80,000     | 4.88E-05 ms | 5.88E-05 ms | 9.88E-05 ms | 1.95E-04 ms | 6.38E-05 ms | 1.53E-04 ms | 9.00E-05 ms
160,000    | 5.25E-05 ms | 5.94E-05 ms | 1.01E-04 ms | 2.53E-04 ms | 6.56E-05 ms | 1.81E-04 ms | 9.38E-05 ms
320,000    | 5.16E-05 ms | 6.13E-05 ms | 1.22E-04 ms | 2.19E-04 ms | 6.31E-05 ms | 2.45E-04 ms | 1.04E-04 ms
640,000    | 5.30E-05 ms | 6.06E-05 ms | 9.61E-05 ms | 2.12E-04 ms | 7.27E-05 ms | 2.31E-04 ms | 1.16E-04 ms
1,280,000  | 5.39E-05 ms | 6.06E-05 ms | 9.72E-05 ms | 2.59E-04 ms | 7.52E-05 ms | 2.72E-04 ms | 1.18E-04 ms
2,560,000  | 5.53E-05 ms | 6.40E-05 ms | 1.11E-04 ms | 2.57E-04 ms | 7.37E-05 ms | 2.75E-04 ms | 1.05E-04 ms
5,120,000  | 5.53E-05 ms | 6.30E-05 ms | 1.26E-04 ms | 2.69E-04 ms | 7.66E-05 ms | 3.32E-04 ms | 1.18E-04 ms
10,240,000 | 5.66E-05 ms | 6.59E-05 ms | 1.22E-04 ms | 2.92E-04 ms | 8.07E-05 ms | 4.27E-04 ms | 1.42E-04 ms
20,480,000 | 5.95E-05 ms | 6.54E-05 ms | 1.18E-04 ms | 3.50E-04 ms | 8.31E-05 ms | 4.88E-04 ms | 1.49E-04 ms
40,960,000 | 5.87E-05 ms | 6.58E-05 ms | 1.15E-04 ms | 3.76E-04 ms | 8.59E-05 ms | 5.72E-04 ms | 1.75E-04 ms
81,920,000 | 6.75E-05 ms | 6.83E-05 ms | 1.04E-04 ms | 3.86E-04 ms | 8.66E-05 ms | 6.89E-04 ms | 2.15E-04 ms

Raw data – average number of comparisons per search

Size       | Arrays.binarySearch | Interpolation+Seq | Interpolation | Sampling | Binary | Gallop | Gallop+Binary
10,000     | ? | 10.6 | 17.6 | 19.0 | 12.2 | 58.2  | 13.2
20,000     | ? | 11.3 | 20.7 | 19.0 | 13.2 | 66.3  | 14.2
40,000     | ? | 11.0 | 16.9 | 20.9 | 14.2 | 74.9  | 15.2
80,000     | ? | 12.1 | 19.9 | 38.0 | 15.2 | 84.0  | 16.2
160,000    | ? | 11.7 | 18.3 | 38.0 | 16.2 | 93.6  | 17.2
320,000    | ? | 12.4 | 25.3 | 38.2 | 17.2 | 103.8 | 18.2
640,000    | ? | 12.4 | 19.0 | 41.6 | 18.2 | 114.4 | 19.2
1,280,000  | ? | 12.5 | 20.2 | 57.0 | 19.2 | 125.5 | 20.2
2,560,000  | ? | 12.8 | 22.7 | 57.0 | 20.2 | 137.1 | 21.2
5,120,000  | ? | 12.7 | 26.5 | 57.5 | 21.2 | 149.2 | 22.2
10,240,000 | ? | 13.2 | 25.2 | 62.1 | 22.2 | 161.8 | 23.2
20,480,000 | ? | 13.4 | 23.4 | 76.0 | 23.2 | 175.0 | 24.2
40,960,000 | ? | 13.4 | 21.9 | 76.1 | 24.2 | 188.6 | 25.2
81,920,000 | ? | 14.0 | 19.7 | 77.0 | 25.2 | 202.7 | 26.2

Source code

The full Java source code for the search algorithms can be found here. Keep in mind, the code is not production quality; for instance, it may have too many or too few range checks in some cases!

Monday, May 26, 2014

Java 8 parallel sort - internals

Java 8 is quite a big update to the platform, and luckily it focuses on multicore systems too. The new Arrays.parallelSort() implementation is quite interesting: it uses quicksort, merge sort, and Timsort to achieve the best performance, depending on the underlying data type and size.

Arrays.parallelSort() – not always parallel

The first thing to note is that it’s not always a parallel sort. If the array is small enough (<= 8192 elements) or the number of reported logical CPUs is 1, it always falls back to a dual-pivot quicksort. Note that “logical CPU” here includes hyper-threading, so an average i7 dual-core CPU will report 4 logical CPUs to the Java environment.

To override the number of reported CPUs for the sort algorithm, we simply need to pass in any value to the following system property when starting the JVM:
java.util.concurrent.ForkJoinPool.common.parallelism

Sorting the chunks and chunk sizes

The algorithm (Arrays.java) decides on the chunk size before submitting the array to the sorting helper (ArraysParallelSortHelpers.java). The helper also receives a working buffer, as this parallel sort implementation requires the same amount of memory as the original array. This property alone may make it unsuitable for certain cases, like almost sorted arrays or extremely large arrays.

The chunk size that will be distributed among the CPUs is at least 8192 items, or array size / (4 * reported CPUs) if the latter is larger.
So, an array of 500,000 items on a standard 2-core machine with 4 virtual cores would have:

500,000 / (4 * 4) = 31,250 items / chunk
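The rule above is easy to check with a few lines of Python (a sketch of the stated formula, not the actual Java code):

```python
MIN_CHUNK = 8192  # hard-wired minimum unit of parallel work, per the text above

def chunk_size(n, logical_cpus):
    # At least 8192 items, or n / (4 * reported CPUs) if that is larger.
    return max(MIN_CHUNK, n // (4 * logical_cpus))

print(chunk_size(500_000, 4))   # 31250, matching the example above
print(chunk_size(100_000, 4))   # 8192: the array is too small to split further
```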

The original array is split into these 4 logical blocks, which are sorted and then merged together. With the 4 logical processing units, each quarter is split into quarters again (16ths), then sorted and merged back into the original array.

The sorting itself is always the dual-pivot quicksort for primitive types (int, byte, short…) and Timsort for everything else.

Interestingly, the helper class has a separate implementation for each primitive type for performance reasons, instead of using generics.

But why does the 8192 chunk size work so well?

From the Arrays.java documentation it seems like a totally empirical value: “Using smaller sizes typically results in memory contention across tasks that makes parallel speedups unlikely.” So contention is definitely a reason, but if we look at the current CPU architectures, there is something interesting about L1 cache sizes too:

Haswell (current): 64 kB L1 cache (32 kB data, 32 kB instructions) per core.
Skylake (future): 128 kB L1 cache (64 kB + 64 kB) per core.

Fitting everything in the L1 cache would be ideal, but 32 kB of data cache on a typical 64-bit system would hold only 4096 pointers (items in the array):
32768 bytes * 8 / 64 bits = 4096

So why does 8192 work so well? The answer lies more in the Java runtime: the
-XX:+UseCompressedOops

flag has been enabled by default on 64-bit JVMs with less than 32 GB of memory since Java SE 6u23. This flag “compresses” pointers, storing the 64-bit pointers as 32 bits and sparing significant memory.
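With compressed 32-bit references, the same L1 arithmetic lands exactly on the magic number:

```python
l1_data_bytes = 32 * 1024      # 32 kB L1 data cache per core
print(l1_data_bytes // 8)      # plain 64-bit references: 4096 fit
print(l1_data_bytes // 4)      # compressed 32-bit references: 8192 fit
```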

How about Skylake and later?

The chunk size works very well for current CPU architectures, but as L1 cache sizes grow, this may no longer be the perfect size. Unfortunately, the 8192 chunk size is hard-wired and cannot be overridden, unlike the number of virtual CPUs. This means that on newer generations of CPUs, allocating the working buffer, distributing the workload across CPUs, and re-merging the results may not beat single-threaded sorting for moderately sized arrays.

Timsort

Timsort has been part of Python since version 2.3 and Java since version 7. The idea behind the algorithm is that real-world data is almost never random – it contains smaller and larger sorted sub-arrays, called “runs”. The algorithm keeps identifying these runs and merges them together – hence it needs a working buffer for the merge operations.
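The core idea of identifying runs can be sketched in a few lines of Python (real Timsort also reverses strictly descending runs and enforces a minimum run length, which this sketch skips):

```python
def find_runs(a):
    # Split the array into maximal non-descending runs.
    runs, start = [], 0
    for i in range(1, len(a) + 1):
        if i == len(a) or a[i] < a[i - 1]:
            runs.append(a[start:i])
            start = i
    return runs

print(find_runs([1, 2, 5, 3, 4, 9, 2]))   # [[1, 2, 5], [3, 4, 9], [2]]
```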

Although it’s not possible to beat the O(n*log(n)) lower bound of comparison sorting, Timsort can be very efficient when the array is already sorted or contains large sorted runs. Its best case is O(n), unlike quicksort’s O(n*log(n)); its worst case is still O(n*log(n)), unlike quicksort’s O(n^2).

Apart from needing a working buffer, the other issue with Timsort is that the implementation is quite complex: finding the runs, merging them, then finding the correct place for the next run using galloping search requires a much more involved implementation than quicksort. The Java implementation is almost 1000 lines, while a quicksort implementation can fit on a screen (around 100 lines).

For source code reference, have a look at Arrays.java and ArraysParallelSortHelpers.java

Sunday, May 4, 2014

Git cheat sheet - the most useful commands

After a little practice, Git can be fully utilised from the command line without any GUI. Adding, committing, pulling, and pushing are part of the daily work; however, there are other commands that are useful but not used as frequently. Here is my collection of those:

# Create and checkout new branch based on the current branch
git checkout -b <newBranch>

# Undo all local pending changes
git reset --hard HEAD

# Erase the last 3 commits
git reset --hard HEAD~3

# Undo a single file change
git checkout -- <fileName>

# Push new branch to server
git push origin <branchName>

# Delete remote branch
git push origin --delete <branchName>

# List all local and remote branches
git branch -a

# Remove remotely deleted branches - will not delete local branches, only copies of remotes
git remote prune origin

# Pull remote master branch and merge it into current branch
git pull origin master

# Undo last commit but keep changes unstaged
git reset HEAD~1

# Rename current branch
git branch -m <newname>

# Save changes temporarily - branch independent
git stash

# List temporary saves
git stash list

# Apply temporarily saved changes
git stash apply

# Delete temporary saves
git stash drop

# Resolve conflict by accepting their changes
git checkout --theirs <file>

# Interactive stage / unstage
git add -i

#display all changes with additions/deletions between two branches
git diff --stat master..<otherBranch>

Saturday, October 19, 2013

Creating a simple TCP server in Java and using it from Python – to use the Stanford POS tagger

As advanced network protocols flood the market, we tend to forget how easy it is to create a simple server and client when we want to integrate two different platforms – Java and Python in this case. The standard TCP socket library is available on practically every platform, so connecting to a simple TCP server, or creating one, takes just a couple of lines even in C. If we need to, we can build more complex protocols and use more complex document formats, but it's important to remember that we don't need them all the time.


POS tagging

For one of my projects, I needed to use the Stanford POS tagger to parse a large text corpus. Even though there are Python POS taggers, my favourite one is by far the Java based Stanford implementation. Usually I use it directly from Java, but in this case the input file was a bit tricky to parse and Python did it very well so I just wanted to do the POS tagging in Java and everything else in Python.

My first thought was a file based interaction between them, but that wasn't as responsive as I wanted it to be – this type of batch processing wasn't too appealing. Then for a moment I considered more advanced techniques like Apache Thrift or Google Protobuf, but why would I need them if I just want to send a sentence over the wire and receive the POS tagged version of it?

Creating the Java server

The server part seemed to be the trickiest, as once in a while I received exceptions I wasn't really foreseeing. As I didn't really care about those, putting the server routine in a simple try-catch solved all the issues.
This is the simplified version of the server code:

// Only listen on localhost, no remote connection allowed.
ServerSocket serverSocket = 
  new ServerSocket(10007, 0, InetAddress.getByName("localhost"));

while(true) // We never really terminate it. CTRL+C is enough.
{
    System.out.println ("Waiting for connection.....");
    Socket clientSocket = serverSocket.accept();
    System.out.println ("Waiting for input.....");

    // Create IO streams.
    PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
    BufferedReader in = 
      new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));

    String inputLine;

    try
    {
        while ((inputLine = in.readLine()) != null)
        {
            String tagged = tagger.tagString(inputLine); // POS tag the input.
            out.println(tagged);  // Send back the tagged output.
        }

        out.close(); // The client decided to close the connection, cleanup.
        in.close();
        clientSocket.close();
    }
    catch(Exception ex)
    {
    }
    finally
    {
        System.out.println("Client has disconnected.");
    }
}
// serverSocket.close(); // We never reach this.

There are a couple of interesting things I've found. One is that the Java server gets upset and throws an exception if the client disappears instead of cleanly closing the socket, so a try-catch-finally was required. The other is that both reading and writing have to be buffered, otherwise the performance would be really poor. The PrintWriter is buffered, so it writes the whole line to the wire at once instead of byte by byte.

In my case the input and output was standard English text, but for Unicode some kind of encoding might be required.

Connecting from Python

The Python client was unexpectedly easy; not counting the import, it was three lines altogether:

import telnetlib

HOST = "localhost"
PORT = 10007
tn = telnetlib.Telnet(HOST, PORT)

tn.write(title)
response = tn.read_until("\n")
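The whole request/response loop can also be exercised end to end in pure Python; in this sketch a minimal line-based server stands in for the Java tagger (the uppercasing handler is just a placeholder for real tagging):

```python
import socket
import socketserver
import threading

class LineHandler(socketserver.StreamRequestHandler):
    # Reads newline-terminated requests and answers each one,
    # mirroring the Java readLine()/println() loop above.
    def handle(self):
        for line in self.rfile:
            reply = line.decode().strip().upper()  # placeholder for tagging
            self.wfile.write((reply + "\n").encode())

server = socketserver.ThreadingTCPServer(("localhost", 0), LineHandler)
port = server.server_address[1]  # port 0 above means "pick a free port"
threading.Thread(target=server.serve_forever, daemon=True).start()

with socket.create_connection(("localhost", port)) as conn:
    conn.sendall(b"hello world\n")
    response = conn.makefile().readline().strip()

server.shutdown()
server.server_close()
print(response)   # HELLO WORLD
```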

Performance and issues

The text corpus I was parsing was quite big (10+ GB), so I quickly realised it wouldn't be extremely fast to parse all the text. I used an old Linux server (512 MB RAM, Core2Duo) to parse the data overnight. The total process took around 20 hours, averaging ~100 tagged sentences a second.

The only issue I faced was that the Stanford POS tagger occasionally ran out of heap memory, so I had to increase the heap size, but 500 MB seemed to be enough (-mx500m).
Otherwise the process was surprisingly stable and performed well even on that really old machine. The POS tagger did not leak memory at all, so I wouldn't hesitate to run the same setup again next time.

Wednesday, September 4, 2013

Drawing circle and calculating sinus function without trigonometry, power, or root

The formula of a circle is quite straightforward (r=sqrt(x^2+y^2)) but it's not trivial how to draw a circle or calculate the trigonometric functions without advanced math. However, an interesting finding from 1972 makes it really easy.

Minsky discovered (by a mistake) that the following loop will draw an almost perfect circle on the screen:

loop:
    x = x - epsilon * y
    y = y + epsilon * x # NOTE: the x is the new x value here

It turns out that epsilon in the equation is practically the rotation angle in radians, so the above loop will gradually rotate the x and y coordinates around a circle.

Calculating sinus

If we can draw this circle, we can easily estimate sine values: for the current angle, which is basically the sum of the epsilons so far, we have a height (y), which just needs to be normalized to the 0-1 range (divided by the radius) to get the actual sine for the angle.

The smaller the steps (epsilon) are, the more accurate the formula will be. However, because it's not a perfect circle, it can never be a very accurate estimation. If epsilon is large, the algorithm will draw a visible ellipse instead of a circle (slightly tilted left).
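The estimate is easy to verify numerically; here is a Python sketch of the loop above (unit radius, so y itself is the sine estimate):

```python
import math

def minsky_sin(angle, eps=0.001):
    # x' = x - eps*y; y' = y + eps*x'  (note: the freshly updated x)
    x, y = 1.0, 0.0
    for _ in range(int(angle / eps)):
        x -= eps * y
        y += eps * x
    return y  # radius is 1, so no further normalization is needed

print(abs(minsky_sin(1.0) - math.sin(1.0)) < 0.01)   # True
```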



The code in JavaScript

I've used a simple canvas object to draw on, so the core logic of the drawing looks like this:

for(var i = 0; i < steps; i++)
{
    x -= y/epsilon;
    y += x/epsilon;

    // Draws circle.
    context.fillRect(middleX+x, middleY+y, 1, 1);

    var deg = i / 2 / epsilon / Math.PI * 360; // Angle in degrees.
    var rad = i / epsilon; // Angle in radians.
    var sinVal = y / 100; // Normalising to 0-1

    // Draws sinus wave.
    context.fillRect(sinX+deg, sinY+y, 1, 1);

    if(i % parseInt(steps / sinMarkerCount) == 0)
    {
        var text = "Sin(" + parseInt(deg) + ")=" + sinVal.toFixed(2);
        // Draws couple sinus values on sinus wave.
        context.fillText(text, sinX+deg, sinY+y);
    }
}


And the result is this:



To grab the full source, go here. Or have a look at the live demo here

Monday, September 2, 2013

Dutch national flag problem - performance in Python and C (two pivot partition, three way partition)

One of the typical interview questions is the three way partitioning, also known as the Dutch national flag problem: given an array with three different values, sort it in a way that all values are grouped together (like a three colored flag) in linear time without extra memory. The problem was first described by Edsger Dijkstra.

As it's a typical sorting problem, any sorting algorithm would do, but that would need O(n*log(n)) time. There are two good linear-time solutions: a counting sort, which requires two passes over the array; or placing the elements in their correct location using two pointers.

The latter uses a reader, a lower, and an upper pointer to do the sorting. The reader and lower pointers start from one end of the array; the upper pointer starts at the other end. The algorithm goes like this:

- If it's a small value, swap it with the lower pointer and step lower pointer one up
- If it's a middle value, do nothing with it and step reader pointer one up
- If it's a larger value, swap it with the upper pointer and step upper pointer one down

After the reader reaches the upper pointer, we are done: all smaller values have been swapped to the lower end of the array, all larger values pushed to the upper end so middle values necessarily occupy the middle of the array.
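The three steps above translate almost directly into Python; here is a sketch for the fixed values 0, 1, 2:

```python
def dutch_flag(a):
    # lower/upper mark the sorted ends; reader scans the unknown middle.
    lower, reader, upper = 0, 0, len(a) - 1
    while reader <= upper:
        if a[reader] == 0:
            a[lower], a[reader] = a[reader], a[lower]
            lower += 1
            reader += 1  # the value swapped in from lower is never a 2
        elif a[reader] == 2:
            a[upper], a[reader] = a[reader], a[upper]
            upper -= 1   # the swapped-in value is unknown: re-examine it
        else:
            reader += 1
    return a

print(dutch_flag([2, 0, 1, 2, 1, 0, 0, 2, 1]))   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```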

Two pivot partitioning

Java 7 introduced an idea with its traditional quicksort implementation: it uses the above method to create three partitions instead of two. The new partitioning phase is still linear time but produces three subproblems to solve instead of two, like the traditional quicksort does. In practice this little modification is usually faster than the two-way partitioning quicksort.

When we don't have predefined values like the flag problem above, we can simply take two random values (a, b) from the array and sort it like this:

  <a   a-b   b<

I've created two simple implementations in Python and C to compare them:

Python implementation:

def partition(array):
  if array is None or len(array) == 0:
    raise TypeError

  small = array[0]
  large = array[len(array)-1]

  if small > large:
    small, large = large, small

  wr = len(array) - 1
  wl = 0
  reader = 0

  while reader <= wr:
    v = array[reader]

    if v < small:
      array[wl], array[reader] = v, array[wl]
      wl += 1
    elif v > large:
      array[wr], array[reader] = v, array[wr]
      wr -= 1
    else:
      reader += 1

  return (wl, wr)

C implementation:
int partition(int array[], int n, 
  int* left, int* right)
{
  if(!array || !n)
  {
    return 0;
  }

  int small = array[0];
  int large = array[n-1];
  if(small > large)
  {
    int tmp = large;
    large = small;
    small = tmp;
  }

  int wr = n-1;
  int wl = 0;

  for(int reader = 0; reader <= wr; )
  {
    int v = array[reader];

    if(v < small)
    {
      array[reader] = array[wl];
      array[wl] = v;
      wl++;
    }
    else if(v > large)
    {
      array[reader] = array[wr];
      array[wr] = v;
      wr--;
    }
    else
    {
      reader++;
    }
  }
  *left = wl;
  *right = wr;

  return 1;
}
Runner code for Python:

start = time.time()

for j in range(1000 * 1000):
  arr = [0] * 10
  for i in range(10):
    # much slower, ~20 sec:
    # arr[i] = random.randint(0,10)
    arr[i] = int(random.random()*10)
  l, r = partition(arr)

end = time.time()

Runner code for C:

int n = 10;
int arr[n];
srand(time(NULL));

int start = clock();
for(int j = 0; j < 1000 * 1000; j++)
{
    for(int i = 0; i < n; i++)
    {
        arr[i] = rand() % n;
    }
    int l, r;
    partition(arr, n, &l, &r);
}
int end = clock();
cout << ((float)end - start) / 
  CLOCKS_PER_SEC;
// Compiled with 
// gcc threads.cc -lstdc++ -std=c++11


Both implementations are readable, but Python seems much less "noisy" and easier to read, so I really wanted it to be as good and as fast as the C implementation. To my disappointment, the Python implementation is about 50 times slower than the C counterpart (and even slower if randint() is used instead of random()). This is a bit surprising, as only very basic array arithmetic is involved, so the slowdown feels somewhat unjustified.

Performance of the Python and C implementations of the two pivot partition, run 10^6 times on a random integer array with 10 elements.



Sunday, August 18, 2013

Introsort algorithm in .NET 4.5 – internal implementation details

Even though introspective sort was invented in 1997 (by David Musser) and was adopted by many frameworks not long after, Microsoft decided to add it to the .NET platform only in 2012. However, it is worth having a look at the implementation; it's quite smartly written.

Introsort

In practice, quicksort is typically the fastest general-purpose sorting algorithm, but it has a couple of edge cases where it performs really poorly – in O(n^2) time instead of the typical O(n*log(n)) runtime. Even though smart pivot selection makes an O(n^2) runtime very unlikely, "evil" input can still force the sorting into it, making the running time much longer than expected.

Musser had two ideas to speed up the typical quicksort algorithm:

  • If the algorithm runs longer than expected on any sub-array (the recursion gets deeper than log(array size)), switch to heapsort, which is guaranteed to finish in O(n*log(n)) time
  • If the number of elements is small enough, don't use quicksort; simply do an insertion sort, which is faster on small data sets

These two optimizations allow introsort to outperform a naive quicksort implementation on most arrays (by a factor of ~200 on artificial "evil" arrays of 100,000 elements).

Microsoft implementation

Interestingly, Microsoft decided not to use the fully optimized introsort in every case: the framework checks the current runtime version, which has to be at least .NET 4.5 desktop (full). This means that for some reason, on mobile devices and older platforms, the framework will use a "limited depth quicksort" but will not optimize the sorting of small arrays. My guess is that Microsoft wanted to show a big performance gain with 4.5, though from an engineering perspective it's a bit hard to justify.

Let’s have a look at the pseudo code of the fully optimized implementation:

Introsort()
{
    while(subarray)
    {
        if (length <= 16)
        {
            if (length == 1)
            {
                already sorted, return
            }
            if (length == 2)
            {
                SwapIfGreater left, right then return
            }
            if (length == 3)
            {
                SwapIfGreater left, middle
                SwapIfGreater left, right
                SwapIfGreater middle, right
                return
            }
            InsertionSort(subarray)
            return
        }
        else
        {
            if (depthLimit == 0)
            {
                Heapsort(subarray)
                return
            }
            depthLimit--;
            pivot = QuickSortPartition(subarray)
            recurse IntroSort(right of pivot)
            Set subarray to left of pivot
        }
    }
}

There are a couple of interesting parts to look at. The first is the way the algorithm sorts arrays of size 2 and 3. Size 2 is pretty trivial (swap if not in order), but size 3 is a bit more subtle. The first line makes sure that [0] <= [1] holds. The second makes sure [0] <= [2] is true, while the third makes sure [1] <= [2] holds as well, ultimately sorting the array.

The other key thing is the way the recursion is implemented. Unlike textbook implementations, which call recurse(left), recurse(right), this uses a single recursive call. Before the left side would be recursed into, the loop simply runs the upper part of the code on it again, which also re-checks that the sub-array is still larger than 16 elements. Note that this is not a tail call: there is still executable code after the recursive call, so the compiler cannot optimize it away.
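The pseudo code above translates fairly directly into Python. The sketch below keeps the same single-recursion structure; to stay short, it uses a simple middle-element pivot (the real implementation uses median-of-three) and lets the stdlib sort stand in for the heapsort fallback – both are assumptions of this sketch, not details of the .NET code:

```python
import math

SMALL = 16  # below this size, insertion sort takes over

def insertion_sort(arr, lo, hi):
    # sort arr[lo..hi] in place (hi inclusive)
    for i in range(lo + 1, hi + 1):
        v, j = arr[i], i - 1
        while j >= lo and arr[j] > v:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = v

def quicksort_partition(arr, lo, hi):
    # middle element as pivot, moved to the end, then Lomuto partition
    mid = (lo + hi) // 2
    arr[mid], arr[hi] = arr[hi], arr[mid]
    pivot, store = arr[hi], lo
    for i in range(lo, hi):
        if arr[i] < pivot:
            arr[i], arr[store] = arr[store], arr[i]
            store += 1
    arr[store], arr[hi] = arr[hi], arr[store]
    return store

def introsort(arr, lo=0, hi=None, depth=None):
    if hi is None:
        hi = len(arr) - 1
        depth = 2 * max(1, int(math.log2(max(1, len(arr)))))
    while hi - lo + 1 > SMALL:
        if depth == 0:
            # depth limit hit: the .NET code falls back to heapsort here;
            # the stdlib sort stands in for it to keep the sketch short
            arr[lo:hi + 1] = sorted(arr[lo:hi + 1])
            return
        depth -= 1
        p = quicksort_partition(arr, lo, hi)
        introsort(arr, p + 1, hi, depth)  # single recursion: right side
        hi = p - 1                        # loop again on the left side
    insertion_sort(arr, lo, hi)
```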

Heapsort

Finally, it's really worth having a look at the heapsort implementation, as it's beautifully written:

Heapsort()
{
    for middle down to 1
    {
        DownHeap element
    }

    for end down to 0
    {
        Swap first element with current
        DownHeap first element
    }
}

DownHeap()
{
    while elementPos has a child
    {
        int childPos = elementPos * 2
        if arr[childPos + 1] > arr[childPos]
        {
            childPos = childPos + 1
        }

        if element >= arr[childPos]
        {
            break
        }
        arr[elementPos] = arr[childPos]
        elementPos = childPos
    }
    arr[elementPos] = element
}


The DownHeap is a standard heap-correcting routine, pushing the held item down as long as one of its children is larger than it – as a side effect, it builds a max heap, meaning the first element in the array is the largest and each element's children are smaller than or equal to it ([0] >= [1], [0] >= [2], but there is no known relation between [1] and [2]).

However, the HeapSort implementation is very concise: the first loop, from the middle down to the first element, heapifies the array, meaning the array will store a max heap where the two children of the element at (1-based) position pos can be found at arr[pos*2] and arr[pos*2+1], with the properties described above.

The second loop relies on this property: it keeps looping from the end, swapping the current item with the first in the array. Since it's a max heap, arr[0] is always the largest remaining element. Whatever is moved to arr[0] from the last position is then shifted down to its proper place by the DownHeap call, maintaining a max heap all the way through.

This means the arr[0..pos] subarray is always a max heap, while arr[pos+1..end] is a sorted subarray with the largest elements at the end.
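For the curious, the Heapsort/DownHeap pair described above can be sketched in Python like this (1-based heap positions as in the description; the "hold the element, shift the larger child up" trick replaces repeated swaps):

```python
def down_heap(arr, pos, size):
    # pos is a 1-based heap position; arr[pos - 1] holds the element
    # being sifted down. Children of pos live at 2*pos and 2*pos + 1.
    element = arr[pos - 1]
    while pos * 2 <= size:
        child = pos * 2
        # pick the larger of the two children
        if child < size and arr[child] > arr[child - 1]:
            child += 1
        if element >= arr[child - 1]:
            break
        arr[pos - 1] = arr[child - 1]  # shift the larger child up
        pos = child
    arr[pos - 1] = element  # drop the held element into its final slot

def heapsort(arr):
    n = len(arr)
    # heapify: sift down every non-leaf node, from the middle backwards
    for i in range(n // 2, 0, -1):
        down_heap(arr, i, n)
    # repeatedly move the current maximum (arr[0]) to the end of the
    # shrinking heap, then restore the heap property
    for end in range(n, 1, -1):
        arr[0], arr[end - 1] = arr[end - 1], arr[0]
        down_heap(arr, 1, end - 1)
```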

Conclusion

In 2012, Microsoft slightly changed the default sorting algorithm on the .NET platform, which may give better performance even on average arrays. However, the biggest change concerns artificially crafted "evil" arrays that previously could cause O(n^2) running time: the new implementation is guaranteed to finish in O(n*log(n)) time. To have a look at the actual implementation, you can use ILSpy.

Thursday, August 15, 2013

Async-Await in .NET is not for performance – it’s for readability

While I consider LINQ one of the best, if not the best, language features on the .NET platform, I think the async-await idea is more smoke and mirrors than anything else. Actually, it's even worse: as the underlying operations are not trivial, in some cases we are better off with linear code.

Idea based on desktop apps

The whole async-await idea mostly comes from the cumbersomeness of desktop GUI applications and UI threads: only a single thread can modify the UI, so when a long running operation finishes we need to ask the UI thread to display the result for us; the background operation itself cannot update the UI. It's not a language problem; it's a Windows/framework problem. However, Microsoft decided to "solve" it with a language feature, because:

  • Concurrent programming is a hot topic and Microsoft wanted to “be there”
  • C# language evolves fast so adding a new feature is quite easy


While the latter is a great thing (hey Java, can you hear me?), adding a language feature because of marketing reasons is just plain wrong.

Myth #1: Await will make the code run faster

Async-await is a unique language feature because it's implemented more as a code generator than anything else. Microsoft decided to support the async-await keywords with generated classes and a huge, somewhat undocumented framework that just "solves your problems". The underlying implementation is so well hidden that we tend to believe it will just make the code faster somehow.

It will not. It reorganizes your code so you don't have to write callbacks to complete the rest of the function when a long running operation finishes. The rewriting of your linear code to emulate those callbacks is massive; it's worth having a look at it with ILSpy. Anyhow, the Task<> created by an async function was there before; nothing new about it: it will be scheduled to run on the ThreadPool, which, again, has been there since 1.0. However, in certain cases (like awaiting in a loop) the execution will be essentially sequential. No parallelism, no speedup.
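The "await in a loop runs sequentially" pitfall is easy to demonstrate outside .NET as well. This Python asyncio sketch (an analogy, not the C# machinery) times the two patterns; the sequential version takes roughly five times as long:

```python
import asyncio
import time

async def work(i):
    # stand-in for a long running operation (e.g. an I/O call)
    await asyncio.sleep(0.05)
    return i

async def awaited_in_loop():
    # awaiting inside the loop: each item completes before the next starts
    results = []
    for i in range(5):
        results.append(await work(i))
    return results

async def scheduled_up_front():
    # all coroutines scheduled first, awaited together: they overlap
    return await asyncio.gather(*(work(i) for i in range(5)))

start = time.perf_counter()
sequential_result = asyncio.run(awaited_in_loop())
sequential_time = time.perf_counter() - start

start = time.perf_counter()
concurrent_result = asyncio.run(scheduled_up_front())
concurrent_time = time.perf_counter() - start
```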

On the other hand, while it can be really helpful in a desktop application, it makes little or no sense in a server environment – like ASP.NET. Unfortunately, VS2013 generates sample ASP.NET projects with async-await all over. This is wrong! Why?

  • Servers are optimized to handle a lot of parallel requests – they will manage the threads for you, no need for extra "optimizations"
  • A single request should never take too long – use queues if that’s the case and refresh the client using Javascript instead


Unfortunately, Microsoft decided to return Task<> in the new ASP.NET API calls, forcing us either to call .Wait() or to use async-await. As a typical HTTP request takes around 100-300ms, the whole idea of asynchronously serving requests in an essentially synchronous environment is somewhat flawed. Remember, the user is actually waiting for your HTTP response.

Myth #2: Creating a thread is expensive

The biggest argument against creating a new thread for your long running task is that thread creation is expensive. But how expensive exactly? I did two quick measurements:

  • Start 1000 threads and wait (Thread.Sleep) in all for 1 second – all threads completed in 1136ms, meaning that the overhead for creating, starting, and stopping a single thread was around 0.13ms
  • Start 1000 threads and do not do any work in them – all threads completed in 95ms, meaning that the overhead was 0.095ms per thread


Let's say the overhead is 0.1ms per thread; creating even 1000 threads is not too stressful for any modern machine – including my average spec laptop. Say you create a brand new thread for every request (a bad idea) and a request takes 100ms to complete. That's 0.1% overhead. I wouldn't really rush to optimize that.
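A similar back-of-the-envelope measurement is easy to reproduce. This Python sketch (absolute numbers will of course differ from the .NET figures above) times the creation, start, and join of 1000 no-op threads:

```python
import threading
import time

N = 1000
done = [0]
lock = threading.Lock()

def work():
    # no-op body: we only measure thread management overhead
    with lock:
        done[0] += 1

start = time.perf_counter()
threads = [threading.Thread(target=work) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

per_thread_ms = elapsed * 1000 / N  # create + start + join cost per thread
```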

However, if you have more than 1000 requests per second (which is more than 86 million requests per day), you really should use some kind of queuing to limit the number of concurrent threads – but server environments like ASP.NET already do that for you, so don't worry.

Myth #3: Concurrent programming can be easy

In an average environment, concurrent programming is pretty much unnecessary – there is no real need to squeeze every last bit of computing out of the CPU, so a standard sequential program will do just fine. For the rare cases when we need to update the UI of a desktop app or do a little background processing, async-await will be more than enough.

However, for any high performance computing project, forget about the features offered by any language or framework. You will need to understand work scheduling, synchronization, hardware bottlenecks, data layout patterns in memory, and in some cases even SIMD instructions and branch prediction. No IDE, programming language, or framework will understand that for you. Even in concurrency-oriented languages like Google's Go you have to understand synchronization quite well to use your resources efficiently.

As currently we operate with a few CPU cores on a large, common, shared memory, concurrent programming is far from trivial – my guess is that until we change our hardware architecture it always will be.

Measurements

Out of curiosity, I measured how long it takes to calculate the square root of a large amount of numbers (0, 1 million, and 3 million numbers, 1000 times over), divided into 1000 work units, using different approaches (source code below):

  • Single thread – do all calculations on a single thread
  • Lots of threads – try to start 1000 threads to do it (in practice only around 8 get to run at once because the CPU is busy with the other calculations – a surprisingly good self-limiting side effect)
  • Lot of async – async done in a loop (which will essentially run sequentially)
  • Parallel.For – using the built in .NET parallel library
  • Task WhenAll – using the WhenAll() synchronizer and simply creating and running the tasks
  • ThreadPool – Using the default QueueUserWorkItem on the default thread pool




A couple of things to take away from the chart:


  • Parallel.For is slightly faster than other solutions
  • Misusing async-await will yield sequential execution, without any compiler warning
  • Creating a lot of threads to do the work is only slightly slower than the best performing solution (4% slower in this case).

So is async-await bad?

I think it is crucial to understand the internals of async-await before jumping in, because really surprising side effects can and will pop up – there are just too many questions about the weird behavior of async-await on Stack Overflow. However, the idea is very useful in situations where callback code would severely undermine readability, especially in desktop apps. Anywhere else? I'm not too sure.


Code snippets

private static void SingleThread()
{
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < n; i++)
    {
        Sq();
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
}

public static void ParallelForeach()
{
    var sw = Stopwatch.StartNew();

    Parallel.For(0, n, i =>
    {
        Sq();
    });

    Console.WriteLine(n + " parallels foreach finished in " + sw.ElapsedMilliseconds);
}

public static async void LotofAsync()
{
    var sw = Stopwatch.StartNew();

    for (int i = 0; i < n; i++)
    {
        await Task.Factory.StartNew(
        () =>
        {
            Sq();
        });
    }

    Console.WriteLine(n + " awaits started and finished in " + sw.ElapsedMilliseconds);
}

public static void TaskWhenAll()
{
    var sw = Stopwatch.StartNew();
    var tasks = new List<Task>();

    for (int i = 0; i < n; i++)
    {
        var t = Task.Factory.StartNew(
        () =>
        {
            Sq();
        });
        tasks.Add(t);
    }

    Task.WhenAll(tasks).Wait();

    Console.WriteLine(n + " tasks WhenAll finished in " + sw.ElapsedMilliseconds);
}

public static void Threadpool()
{
    var sw = Stopwatch.StartNew();
    var tasks = new List<Task>();

    int total = n;
    var re = new ManualResetEvent(false);

    for (int i = 0; i < n; i++)
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(o =>
        {
            Sq();
            if (Interlocked.Decrement(ref total) == 0)
            {
                re.Set(); // ThreadPool does not support waiting on tasks by itself.
            }
        }));
    }

    re.WaitOne();

    Console.WriteLine(n + " ThreadPool items finished in " + sw.ElapsedMilliseconds);
}

public static void LotofThreads()
{
    var sw = Stopwatch.StartNew();

    var threads = new List<Thread>();

    for (int i = 0; i < n; i++)
    {
        var t = new Thread(new ThreadStart(() =>
        {
            Sq();
        }));

        threads.Add(t);
        t.Start();
    }

    threads.ForEach(t => t.Join());

    Console.WriteLine(n + " threads started and finished in " + sw.ElapsedMilliseconds);
}

private static double Sq()
{
    double sum = 0;
    for (int j = 0; j < 1 * 1000 * 1000; j++)
    {
        sum += Math.Sqrt(j);
    }
    return sum;
}

Tuesday, August 13, 2013

Damn cool machines – Stirling engine, ram pump, and valveless pulse jet engine

Once in a while it is worth looking around at what kind of software – or, in this case, hardware – solutions were invented that did not catch enough attention. This post is about three hardware designs that amaze me.

Stirling engine

The Otto engine is by far the most successful engine we know these days, but it requires a special kind of fuel, like gasoline or gas, to operate. In some cases no such fuel is available, but a cheap heat source is – like burning solid materials or focusing solar energy with mirrors.

In such cases a Stirling engine can convert heat into mechanical movement in a simple way. The idea is based on the ideal gas law: the temperature of the gas is proportional to its pressure multiplied by its volume. This basically means that if we heat the gas up, it expands or its pressure rises. Similarly, if we decrease the pressure, the volume or the temperature has to drop (or both move together).

The Stirling engine is basically a two cylinder/piston engine that moves heat energy from the hot source to a cooling side using the above relation.
The phases of the engine are:
1 – Heat up the gas in the first cylinder; it expands, pushing the piston away and forcing gas to flow into the second cylinder
2 – The piston in the second cylinder moves away while the gas cools down, which reduces the pressure
3 – Both pistons in the first and second cylinders move back to their initial positions with the cool gas
4 – The heat source heats the gas again, increasing the pressure and restarting the cycle
(note – there are also single-cylinder, two-piston designs following the same logic)
The oscillation between the two cylinders is quite fast, but the torque is usually quite low. To increase the torque, multiple cylinders can be added in parallel.
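The pressure/volume/temperature relation the engine exploits is the ideal gas law, P·V = n·R·T. A tiny numeric sketch (the amount of gas and the cylinder volume are made-up illustration values):

```python
R = 8.314  # J/(mol*K), universal gas constant

def pressure(n_mol, volume_m3, temp_k):
    # ideal gas law solved for pressure: P = n*R*T / V
    return n_mol * R * temp_k / volume_m3

# 0.1 mol of working gas in a 1 liter cylinder (illustration values)
cold = pressure(0.1, 0.001, 300)  # roughly 2.5 bar at 300 K
hot = pressure(0.1, 0.001, 600)   # same volume heated to 600 K

# doubling the absolute temperature at fixed volume doubles the pressure;
# that pressure difference is what pushes the piston in the Stirling cycle
```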





Ram water pump

Electric pumps are extremely efficient at moving water, but they need electricity, which in some cases is not available – in remote areas or developing countries, for example. The idea behind the ram water pump is the same thing that "kicks" your water hose when you are watering or washing your car and suddenly release the handle of the sprinkler gun. The hose jumps because the water moving in it has to stop when you block it – with a sharp deceleration. The quicker it has to stop, the bigger the pressure spike in the hose for a very short period of time.

The ram pump is basically a long tube that lets the water gain speed in it, flowing in from a slow but plentiful source like a river. The tube has a self-closing valve at its end that shuts as the speed of the outflowing water increases, sharply raising the pressure in the tube. This pressure is now high enough to open a second, one-way valve at the end of the tube, which lets a small amount of high pressure water flow into the "output" of the pump. This can push water up quite high, but the higher it needs to go, the smaller the outflow will be: the pressure generated this way is limited (mostly by the speed of the water in the tube, the rigidity of the tube, and the speed of the closing valve).

The advantage of the ram pump is that it is extremely simple to install and operate; the drawback is that the output flow is not constant and fairly limited. It is best suited to lifting water from a river to a higher area, like a water tower, which can then provide constant flow at constant pressure.
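The size of the pressure spike that drives the pump can be estimated with the Joukowsky water-hammer relation, ΔP = ρ·c·Δv. The numbers below (pressure-wave speed, flow velocity) are assumed illustration values, and a real pump delivers far less than this instantaneous peak:

```python
rho = 1000.0  # kg/m^3, density of water
g = 9.81      # m/s^2, gravitational acceleration

# assumed values: pressure-wave speed in a fairly rigid pipe, and the
# flow velocity that the self-closing valve suddenly stops
c = 1400.0    # m/s
dv = 1.5      # m/s

dp = rho * c * dv      # pressure spike in pascals (~21 bar here)
head = dp / (rho * g)  # equivalent water column height in meters
```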



Valveless pulse jet

There are basically two types of jet engines: continuous combustion and non-continuous combustion. Most aircraft use the former, as its operation is smooth and very controllable, with practically constant thrust and no vibration. However, one of the most interesting non-continuous engines is the valveless pulse jet, which has no moving parts at all, making it extremely simple and essentially maintenance free.

The idea is quite simple: if we could somehow create a combustion chamber that pushes out most of the exhaust gases after igniting the fuel, it would suck in fresh air as the gases start to cool down, preparing for the next cycle. It turns out that it's quite easy to create such a shape! To help the forward thrust, both the intake (smaller) and exhaust (larger) pipes usually face the same direction, as some of the exhaust still exits through the intake – there are no valves in the system.

The result is a high frequency resonance of tiny explosions in the chamber, which generates almost constant thrust. On the downside, the system needs an external source of air flow to start, gets very hot, and the series of tiny explosions makes it extremely noisy.





Monday, August 12, 2013

Hosted Git (GitLab) in 5 minutes on Azure Virtual Machine

Microsoft finally realized they are not the only software vendor out there, so they started to support other suppliers on their cloud platform, Windows Azure. Apart from the stock images from Microsoft, BitNami publishes its own high quality virtual machine images as well – mostly based on some kind of Linux system.

One of their images is a completely set up and ready-to-use Git source control solution using GitLab. This can save a lot of time, as the installation guide seems to be a little longer than anyone would prefer to spend on setting up Git.

To install the image, simply browse the VM Depot, select the GitLab image, and create a new virtual machine from it. Microsoft went so far in supporting Linux systems that we can even enter our own username and password for the box during setup; no need to use any default logins on the fresh machine.

GitLab isn't particularly resource hungry, so we are perfectly fine with the extra small (A0) instance to host our git/wiki/bug tracking/code review solution. The whole setup takes around 15 minutes and is mostly automated – we don't need to spend more than a couple of minutes setting up the image. The monthly cost is around $20 for the XS instance, which is pretty cheap for an on-demand solution. However, if we need a build server as well, a beefier machine would probably be more suitable.

Wednesday, July 31, 2013

Languages and databases - a step back a day

I've started learning the Go language recently and, to be frank, I'm horrified. Horrified by what the future of programming looks like, and the NoSQL movement is just making it worse.

Languages - a step back

If you haven't read any Fortran code yet, do it by all means. It's a language that was invented in 1957, when 'computers' were practically non-existent. However, looking at it more than 50 years later, it's still readable and not really ugly – especially compared to some C++ code...

Then we got Python, Ruby, Java, C#, and all sorts of modern languages. In fact, a new language is born almost every day (have you heard about Rust, for instance?). The only issue is that the programming model hasn't really changed in the past 50 years. We write text files, line by line, instructing the computer what to do, step after step. Unfortunately, the processors didn't change much either.

And the horrifying part? Google comes along, creates a new language (Go), and builds big hype around it – and they are really good at building hype. As a result, people think this is the future, that finally this is the solution to our problems! Did you say you wanted a nice IDE to work with? Sure, you can use Vim or Emacs! Or wanted a great debugger? GDB is the perfect solution! How about inheritance? Event handlers? Lambda expressions? Language integrated queries? Generics? Annotations? You don't need them, right?

How about a language that is not a stripped down version of other great languages? Go is great: it's readable, it compiles to native code, and the code is usually faster than Java. Hurray! Oh, your site has about 10,000 impressions a day? That's less than 1 request per second if distributed over 8 hours, or about 3 requests per second if you have a one-hour spike! What framework cannot handle that? Even with 100,000 impressions in an hour it's still just about 30 requests per second – practically zero for any hardware and software stack today, yet such a site would already be among the bigger ones out there.

But no, let's invent a language that fits our specific purpose and shove it down everyone else's throat. Wait a second: why not actually evolve a language? Not strip it down, but make it THE next big thing? How about a language where I never have to write boilerplate again? Where validating a field from front end to back end doesn't make me write the same code again and again? Where I don't have to think about threads and mutexes when writing distributed code? How about never writing a for loop again? Or not bothering with connection strings? Or forgetting about beginning and ending transactions? Because these are the actual problems developers work on every day.

NoSQL - the NoSolution

And this brings us to the next big issue: the NoSQL movement. Just because a handful of companies deal with data volumes that cannot be handled by relational databases, it doesn't mean everyone does. Yet everyone seems to think it's the best storage idea ever. Before you jump in, have a look at what functional complexity these companies actually struggle with: a wall post? Uploading a photo? Synchronizing your plain text notes? Now check what complexity an ERP system deals with.

But again, when several parties criticized Google for this non-relational approach, they arrogantly responded: "it works for us". Sure it does, because you can invest thousands of engineering hours/weeks/years in creating a simple web application. Or in just adding a new field to an existing one. Or in creating a new report. Oh, a new report needs a new data structure? Well, it works for you, not for me!

How about inventing the next big thing in databases? Where I don't need to map my fields to columns, where I could compile my code against the database to eliminate 99% of useless integration tests? With the ability to create reports with ease? Never bothering with change scripts again?

We think we got it right - so we are stuck

The really horrifying part of all this is that the 'average Joe' developer doesn't know anything about this. He's happy copy-pasting code from Stack Overflow, day in, day out. Meanwhile, the 'top of the game' developers are mesmerized by the latest-and-greatest ideas from companies that do very different things from what they do.

However, if we stay stuck thinking that the way forward is a dumbed down language and a database as smart as a text file, we won't see much progress any time soon. Actually, we haven't seen that much since '57 either...

Given that you don't have to just follow everyone else's ideas: do you think it's all going in the right direction? I do not.

Saturday, July 27, 2013

The next big thing on the web – custom HTML elements

If you are an HTML component vendor or have ever wanted to use someone else's HTML module, you have already faced the same issue: components may not work together, and even when they do, they add a lot of noise to your clean HTML structure. This is going to change very soon.

Large component vendors like Telerik (with its Kendo UI suite) are creating amazing HTML components that, in theory, are just plug and play: whatever framework you are working with, it should be really easy to use these enhanced modules. However, even when they work as intended, they change and litter the HTML DOM, so any kind of debugging can be a bit of a pain.

Noisy HTML

The problem is that even if we start with a nice and clean HTML code that looks like this:

<select id="size">
<option>S - 6 3/4"</option>
<option>M - 7 1/4"</option>
<option>L - 7 1/8"</option>
<option>XL - 7 5/8"</option>
</select>

the frameworks turn this into a pile of different divs and spans to be able to apply the nice visuals. This creates unexpected nodes and structures in your HTML, which may or may not display correctly. The markup above is replaced with this at runtime (some attributes are removed for readability):

<span class="k-widget k-dropdown k-header" unselectable="on" …>
<span unselectable="on" class="k-dropdown-wrap k-state-default">
<span unselectable="on" class="k-input">S - 6 3/4"</span>
<span unselectable="on" class="k-select">
<span unselectable="on" class="k-icon k-i-arrow-s">select</span>
</span>
</span>
<select id="size" data-role="dropdownlist" style="display: none;">
<option value="S - 6 3/4&quot;" selected="selected">S - 6 3/4"</option>
<option value="M - 7 1/4&quot;">M - 7 1/4"</option>
<option value="L - 7 1/8&quot;">L - 7 1/8"</option>
<option value="XL - 7 5/8&quot;">XL - 7 5/8"</option>
</select>
</span>

Custom elements

Even though the custom HTML element definition is still just a W3C draft and basically none of the browsers support it natively (Chrome Canary and Firefox have partial support), it's still by far the most interesting upcoming HTML feature I've seen in a while.

With custom elements, you can define a custom tag and add any HTML or Javascript rendering within it – the browser will interpret that locally, but it won't be visible from the outside. The only requirement is that the tag name must contain a dash, so it will never clash with upcoming standard HTML tags (e.g. <hello> is not a valid custom tag, but <adams-hello> is).

Defining custom elements

To define a custom tag, it has to be declared in the following format:

<element name="adams-hello">
...
</element>

The rendering is a bit tricky though: just because we have some internal content, it will not be visible from the outside. In the above case "..." will not be rendered when someone uses the <adams-hello></adams-hello> tag. This can be a bit confusing, but it's for the greater good: any internal structure can be pre-stored and used when necessary, with no need to fiddle with what is and isn't visible – anything we have not explicitly rendered with Javascript is simply not visible.

Displaying time

To continue with our simple extension above, let's add some simple rendering to it to display the current time:

<element name="adams-hello">
<section>
Hey, the time is:
<time id='currentTime' />
</section>
<footer>It will not be visible</footer>
<script>
      var el;
      function tick(){
              el.innerHTML = (new Date()).toString();
      }
      if (this !== window) {
            var section = this.querySelector('section');
            this.register({
                prototype: {
                        readyCallback: function() {
                              var root = this.createShadowRoot();
                              root.innerHTML = section.innerHTML;
                              el = root.querySelector("#currentTime");
                              setInterval(tick, 1000);
                        }
                }
            });
      }
</script>
</element>

The code is pretty straightforward; however, there are a couple of things that are good to know:
- We need to register for the readyCallback to do any kind of rendering. Without it, the element won't display anything.
- If we want to hide the internal content of the DOM from debuggers, we can create a shadow root. This isn't widely supported yet, so the root element can simply be this; in that case the internal structure will appear in the DOM (note: even when we use a shadow root, the browser knows about it and can display it to the developer if we enable that option – it's not for security, more for easier debugging).
Without shadow root - Chrome will display internal structure


With shadow root - nice and clean markup


You can grab the full source code from here. Remember to run the samples on a local web server (like python -m SimpleHTTPServer 8080), otherwise the HTML import won't work.

How can I use it today?

As basically none of the browsers support it yet, the only solution is to use some kind of polyfill framework to "bring the browser up to speed". The only reasonable implementation I've seen is Google's Polymer Project. The framework is pretty well written, so even Internet Explorer can work with most of the features mentioned above (note: for me, shadow root did not work). Give it a go and be prepared: this is a really exciting HTML feature!