Caution: my English is far from perfect. (Русский тоже не всегда хорош).

Saturday, 9 February 2013

Quest for Immutable Static Files Hosting

For cl-test-grid I need online storage for static files: library test logs.

The requirements:

  1. Permissions:
    • Upload a new file: anyone (because the tests may be run by anyone).
    • Delete or modify an existing file: forbidden for the public.
  2. The current upload rate is around 50 000 small files per month, usually uploaded within several days after a new Quicklisp release is out. In the future it may increase to 100 000 or more. Each file is at most 100 KB.

After a file is uploaded it must be available at an HTTP URL.

None of the online file storage services I saw satisfies my permission requirements. Usually if a user can upload a file, he can also delete or modify it. Therefore each of the solutions below includes an intermediate server application which enforces my access rules.
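The access rules such an intermediate application has to enforce can be sketched as a write-once store. This is only an in-memory illustration, not the code used in cl-test-grid; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical write-once store: anyone may add a file, nobody may replace or delete it. */
class WriteOnceStore {
    private final ConcurrentMap<String, byte[]> files = new ConcurrentHashMap<>();

    /** Returns true if the file was stored, false if the name is already taken. */
    boolean upload(String name, byte[] content) {
        // putIfAbsent is atomic, so two concurrent uploads cannot overwrite each other.
        return files.putIfAbsent(name, content.clone()) == null;
    }

    /** Read access; there is deliberately no delete or overwrite method. */
    Optional<byte[]> download(String name) {
        byte[] stored = files.get(name);
        return stored == null ? Optional.empty() : Optional.of(stored.clone());
    }

    public static void main(String[] args) {
        WriteOnceStore store = new WriteOnceStore();
        byte[] log = "test log".getBytes(StandardCharsets.UTF_8);
        System.out.println(store.upload("log-1.txt", log)); // first upload is accepted
        System.out.println(store.upload("log-1.txt", log)); // same name again is rejected
    }
}
```

The point of the design is that immutability is a property of the server's interface: the public simply has no operation that changes an existing file.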

I have considered several variants:

  1. Google App Engine Blobstore

    This was the first solution I found, and it served me for more than a year. The free usage quota includes 5 GB of blobstore, which I hoped would be enough for years.

    But it showed some disadvantages. I ran into some bugs: #7619, #8032. Every blob upload also takes several Datastore operations, which exhausts the free Datastore quota. And the blob keys are 164 characters long, which makes the log URLs horrible: http://cl-test-grid.appspot.com/blob?key=AMIfv97wo4FQxGLBZOagyZyLZCqwMWavAfwsByKxjq8QiJQ5rIzEggGwGJ_kH2qRZLMb8N_el8aKIpLDbnr67Pxcy9r8RFKmBnjTQ1B44yaCcyZWtO2CSbBliyAINvoI41_R8uA8hoPia-yXPdlmADiJcavCCgpHGA

    To make the URLs shorter I save the blob key in a Datastore entity and use the integer entity ID as a reference to the log:

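        // Store the long blob key in a small Datastore entity...
        Entity shortKeyEntity = new Entity("ShortKey");
        shortKeyEntity.setProperty("blobKey", blobKey);
        datastore.put(shortKeyEntity);
        // ...and use the numeric ID the Datastore assigned to it as the short key.
        return shortKeyEntity.getKey().getId();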

    That way the URLs look better: http://cl-test-grid.appspot.com/blob?key=789714. But this increases the number of Datastore operations. Also, the Datastore creates internal indexes for the entities, and these take a huge amount of space. In total I have 1 GB of log files, but it takes 4 GB of Datastore to keep the mapping from blob keys to short IDs.

    So my application exceeds the free quota on several parameters and the paid plan must be used. It is > $108 per year. Expensive for just publishing 1 GB of static files.

  2. Google App Engine + Google Cloud Storage

    Cloud Storage is an Amazon S3 clone. I migrated to it from App Engine Blobstore recently.

    As you choose the file names yourself, there is no problem with long keys. Also, Cloud Storage doesn't involve any App Engine Datastore operations.

    There is an API to access it directly from App Engine, although it is experimental. I ran into bug #8592 and solved it with a workaround.

    Other difficulties to solve were:

    • Handling multipart/form-data file uploads, because Google App Engine does not support Servlet API 3.0 yet, and the Apache Commons FileUpload library doesn't work out of the box on App Engine because it tries to access the file system.
    • Java servlets on App Engine must handle every request within a 30-second time-out. A servlet is killed if it exceeds the time-out.
    • Writing files one by one to Cloud Storage via the Java API is relatively slow.

    I solved all this by customizing Apache Commons FileUpload to keep the uploaded files in memory while parsing the request body, and by writing to Cloud Storage from multiple threads. I now have a solution that allows uploading 300 logs in one request. And with the free 5 GB provided by Google for the first Cloud Storage project, the payments for this service will most likely be zero.
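The multi-threaded writing step might look roughly like the sketch below. It is not the actual servlet code: `LogWriter` is a hypothetical stand-in for the slow per-file write to Cloud Storage, and a plain `Executors` thread pool is assumed to keep the example self-contained (on App Engine itself, threads have to be obtained from the platform).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUpload {
    /** Hypothetical stand-in for one slow write to the storage backend. */
    interface LogWriter {
        void write(String name, byte[] content);
    }

    /** Writes all in-memory files concurrently and returns once every write has finished. */
    static void writeAll(Map<String, byte[]> files, LogWriter writer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                pending.add(pool.submit(() -> writer.write(e.getKey(), e.getValue())));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the write; fails if the write threw
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("upload failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new HashMap<>();
        for (int i = 0; i < 300; i++) {
            files.put("log-" + i + ".txt", new byte[] {(byte) i});
        }
        Map<String, byte[]> stored = new ConcurrentHashMap<>(); // pretend storage
        writeAll(files, stored::put, 8);
        System.out.println("stored files: " + stored.size());
    }
}
```

Because the request only returns after all futures complete, the whole batch still has to fit into the request time-out; the threads just let the individual writes overlap.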

  3. Amazon S3 + Heroku

    This is the next thing I'll try if I have any problems with Cloud Storage. A small app at Heroku creates pre-signed S3 URLs and returns them to the client. The client uploads the files, and after the pre-signed URLs expire he cannot modify the files.
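The idea behind an expiring signed URL can be illustrated with a plain HMAC. To be clear, this is not Amazon's actual signing algorithm; it is a minimal sketch of the same principle, and the class, method and query parameter names are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical signer; NOT the real S3 signing scheme, only the same idea. */
class ExpiringUploadUrl {
    private final byte[] secret;

    ExpiringUploadUrl(String secret) {
        this.secret = secret.getBytes(StandardCharsets.UTF_8);
    }

    /** Signature binding a path to an expiry time (seconds since the epoch). */
    String signature(String path, long expiresAt) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] raw = mac.doFinal((path + "\n" + expiresAt).getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The URL a client would upload to; the query parameter names are invented. */
    String signedUrl(String path, long expiresAt) {
        return path + "?expires=" + expiresAt + "&sig=" + signature(path, expiresAt);
    }

    /** Storage-side check: the signature must match and the URL must not be expired. */
    boolean isValid(String path, long expiresAt, String sig, long now) {
        byte[] expected = signature(path, expiresAt).getBytes(StandardCharsets.UTF_8);
        boolean matches = MessageDigest.isEqual(expected, sig.getBytes(StandardCharsets.UTF_8));
        return matches && now <= expiresAt;
    }

    public static void main(String[] args) {
        ExpiringUploadUrl signer = new ExpiringUploadUrl("demo-secret");
        long now = System.currentTimeMillis() / 1000;
        System.out.println(signer.signedUrl("/logs/run-1.txt", now + 600));
    }
}
```

The client never sees the secret, so it cannot mint a new URL for the same path once the old one has expired, which is what makes the uploaded file effectively immutable afterwards.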

    Amazon only allows uploading one file per PUT request. At a price of $0.01 per 1000 requests it may cost me around $10 a year.
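The estimate can be checked with the upload rates from the requirements, assuming exactly one PUT request per file and nothing else billed (storage and download traffic are ignored here):

```java
class S3PutCost {
    /** Yearly request cost in dollars, assuming one PUT request per uploaded file. */
    static double yearlyCost(long filesPerMonth, double dollarsPer1000Requests) {
        return filesPerMonth * 12 / 1000.0 * dollarsPer1000Requests;
    }

    public static void main(String[] args) {
        // Current rate and the possible future rate from the requirements above.
        System.out.printf("50 000 files/month:  $%.2f per year%n", yearlyCost(50_000, 0.01));
        System.out.printf("100 000 files/month: $%.2f per year%n", yearlyCost(100_000, 0.01));
    }
}
```

That gives about $6 a year at the current rate and about $12 at 100 000 files per month, which is where the "around $10" comes from.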

    A tempting property of this solution is that the Heroku app may be written in Common Lisp, thus getting rid of Java servlets in cl-test-grid.

Tuesday, 5 February 2013

Multicore Forth processors provide Go-like concurrency primitives in hardware

In the Go Concurrency Patterns presentation Rob Pike demonstrates how unbuffered channels are
enough for many concurrency tasks. (NB: use Left/Right arrow keys to scroll the presentation)

It reminded me of the Forth chips produced by Chuck Moore and colleagues.

The current version contains 144 cores on a one square centimeter chip. The machine language is Forth and
each core is equipped with its own little data and control Forth stacks, making it a fully fledged
independent computer (that's why the more precise term is "multi-computer chip" rather than "multi-core").

The cores talk to each other via communication ports. Writing to a port suspends the core until
the peer reads the value, and vice versa: reading suspends the core until the peer writes.

These semantics correspond to unbuffered Go channels.
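The same rendezvous behavior can be tried out in software. The sketch below uses Java's `SynchronousQueue` as an analogy to a port or an unbuffered channel (Java rather than Go only to match the other code on this blog); the class and method names are hypothetical.

```java
import java.util.concurrent.SynchronousQueue;

class Rendezvous {
    /** Passes one value from a writer thread to the calling (reader) thread. */
    static int sendAndReceive(int value) {
        SynchronousQueue<Integer> port = new SynchronousQueue<>();
        Thread writer = new Thread(() -> {
            try {
                port.put(value); // suspends the writer until the reader takes the value
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        try {
            int received = port.take(); // suspends the reader until the writer offers a value
            writer.join();
            return received;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndReceive(42));
        // There is no buffer: a non-blocking write with no reader waiting simply fails.
        System.out.println(new SynchronousQueue<Integer>().offer(1));
    }
}
```

As with the ports, the value is never stored anywhere in between; it is handed over only at the moment both sides are present.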

The chips have other interesting properties. Quoting the doc:

A computer can read from multiple ports [Corresponds to Go's select]
and can execute instructions directly from those ports.

FINE GRAINED ENERGY CONTROL: ... The read or write instruction is automatically 
suspended in mid-operation if the address [one or more of communication ports and I/O pin] is inactive,
consuming energy only due to transistor leakage currents, resuming when the address becomes active. 

NO CLOCKS: Most computing devices have one or more clocks that synchronize all 
operations. When a conventional computer is powered up and waiting to respond 
quickly to stimuli, clock generation and distribution are consuming energy at a huge rate 
by our standards, yet accomplishing nothing. This is why “starting” and “stopping” the 
clock is a big deal and takes much time and energy for other architectures. Our 
architecture explicitly omits a clock, saving energy and time among other benefits.

Read more at the company website.

As Rob Pike says, the channel-like concurrency primitives are not new. It is interesting to see
them implemented in hardware.
