Caution: my English is far from perfect. (My Russian is not always good either.)

Saturday 9 February 2013

Quest for Immutable Static Files Hosting

For cl-test-grid I need online storage for static files: the library test logs.

The requirements:

  1. Permissions:
    • Upload new file - anyone (because the tests may be run by anyone)
    • Delete or modify existing file - forbidden for public.
  2. The current upload rate is around 50 000 small files per month, usually uploaded within several days after a new Quicklisp release is out. In the future it may increase to 100 000 or more. The size of each file is <= 100 KB.

After a file is uploaded, it must be available at an HTTP URL.

None of the online file storage services I saw satisfies my permission requirements. Usually, if a user can upload a file, he can also delete or modify it. Therefore each of the solutions below includes an intermediate server application which enforces my access rules.
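
The access rule itself is small. Here is a minimal sketch of it, assuming an in-memory map in place of the real storage backend; the class and method names are hypothetical and are not taken from cl-test-grid:

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical write-once store: anyone may add a file, nobody may replace or delete one. */
class WriteOnceStore {
    // In-memory stand-in for the real storage backend (an assumption of this sketch).
    private final ConcurrentMap<String, byte[]> files = new ConcurrentHashMap<>();

    /** Stores the content under the given name; returns false if the name is already taken. */
    boolean upload(String name, byte[] content) {
        // putIfAbsent is atomic, so a concurrent upload cannot overwrite an existing file.
        return files.putIfAbsent(name, content.clone()) == null;
    }

    /** Public read access. There is deliberately no update or delete method. */
    Optional<byte[]> read(String name) {
        byte[] content = files.get(name);
        return content == null ? Optional.empty() : Optional.of(content.clone());
    }

    public static void main(String[] args) {
        WriteOnceStore store = new WriteOnceStore();
        byte[] log = "test log".getBytes(StandardCharsets.UTF_8);
        System.out.println("first upload accepted: " + store.upload("log-1", log));
        System.out.println("second upload accepted: " + store.upload("log-1", log));
    }
}
```

The intermediate application would expose only `upload` and `read` over HTTP, so the public never gets an operation that changes an existing file.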

I have considered several variants:

  1. Google App Engine Blobstore

    This is the first solution I found, and it served me for more than a year. The free usage quota includes 5 GB of blobstore, which I hoped would be enough for years.

    But it showed some disadvantages. I ran into some bugs: #7619, #8032. Also, every blob upload takes several Datastore operations, thus exhausting the free Datastore quota. And the blob keys are 164 characters long, which makes the log URLs horrible.

    To make the URLs shorter, I save the blob key in a Datastore entity and use the integer entity ID as a reference to the log:

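        // Classes are from com.google.appengine.api.datastore.
        Entity shortKeyEntity = new Entity("ShortKey");
        shortKeyEntity.setProperty("blobKey", blobKey);
        // put() stores the entity and fills in its auto-allocated numeric ID.
        DatastoreServiceFactory.getDatastoreService().put(shortKeyEntity);
        return shortKeyEntity.getKey().getId();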

    That way the URLs look better. But this increases the number of Datastore operations. Also, the Datastore creates some internal indexes for the entities, which take a huge amount of space. In total I have 1 GB of log files, but it takes 4 GB of Datastore to keep the mapping from blob keys to short IDs.

    So my application exceeds the free quota on several parameters, and the paid plan must be used. It costs more than $108 per year. That is expensive for just publishing 1 GB of static files.

  2. Google App Engine + Google Cloud Storage

    Cloud Storage is an Amazon S3 clone. I migrated to it from App Engine Blobstore recently.

    As you choose the file names yourself, there is no problem with long keys. Also, Cloud Storage doesn't involve any App Engine Datastore operations.

    There is an API to access it directly from App Engine, although it is experimental. I ran into bug #8592 and solved it with a workaround.

    Other difficulties to solve were:

    • Handling multipart/form-data based file uploads, because Google App Engine does not support Servlet API 3.0 yet, and the Apache Commons FileUpload library doesn't work out of the box on App Engine because it tries to access the file system.
    • Java servlets on App Engine must handle every request within a 30-second timeout. The servlet is killed if it exceeds the timeout.
    • Writing files one by one to Cloud Storage via the Java API is relatively slow.

    I solved all this by customizing Apache Commons FileUpload to keep the uploaded files in memory while parsing the request body, and by writing to Cloud Storage from multiple threads. Now I have a solution that allows uploading 300 logs in one request. And with the free 5 GB provided by Google for the first Cloud Storage project, the payments for this service will most likely be zero.
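
    The multi-threaded writing can be sketched roughly as follows. This is only an illustration under assumptions: `FileSink` is a hypothetical stand-in for the Cloud Storage write call, the thread pool is plain java.util.concurrent (App Engine has its own restrictions on creating threads), and the files are assumed to be already parsed into memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sink for one file; in the real servlet this would write to Cloud Storage. */
interface FileSink {
    void write(String name, byte[] content) throws Exception;
}

class ParallelUploader {
    /** Writes all in-memory files through the sink from a fixed pool; returns the number written. */
    static int writeAll(Map<String, byte[]> files, FileSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                // A Callable (not a Runnable) so that checked exceptions from write() propagate.
                Callable<Void> task = () -> {
                    sink.write(file.getKey(), file.getValue());
                    return null;
                };
                pending.add(pool.submit(task));
            }
            for (Future<Void> f : pending) {
                f.get(); // waits for this write and surfaces its failure, if any
            }
            return pending.size();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("upload failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // A concurrent map plays the role of the storage in this demo.
        Map<String, byte[]> stored = new ConcurrentHashMap<>();
        Map<String, byte[]> files = Map.of("a.log", new byte[] {1}, "b.log", new byte[] {2});
        System.out.println(writeAll(files, stored::put, 4) + " files written");
    }
}
```

    Waiting on every `Future` before answering the request means a failed write is reported to the client instead of being lost silently.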

  3. Amazon S3 + Heroku

    This is the next thing I'll do if I have any problems with Cloud Storage: a small app at Heroku that creates pre-signed S3 URLs and returns them to the client. The client uploads the files, and after the pre-signed URLs expire it can no longer modify them.
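
    The general idea behind such an expiring signed URL can be sketched with a plain HMAC. This is not the S3 signing scheme (real pre-signed URLs would normally be produced by the AWS SDK); the class, the query parameter names, and the secret are all hypothetical and only show why the URL stops working after it expires.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical illustration of an expiring signed upload URL; NOT the S3 signature algorithm. */
class ExpiringUploadUrl {
    private final byte[] secret;

    ExpiringUploadUrl(String secret) {
        this.secret = secret.getBytes(StandardCharsets.UTF_8);
    }

    /** HMAC-SHA256 over the path and the expiry time, encoded so it is safe inside a URL. */
    String sign(String path, long expiresAtEpochSec) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] sig = mac.doFinal((path + "\n" + expiresAtEpochSec).getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** What the small server app would hand to the client. */
    String issue(String path, long nowEpochSec, long ttlSec) {
        long expires = nowEpochSec + ttlSec;
        return path + "?expires=" + expires + "&signature=" + sign(path, expires);
    }

    /** What the storage side would check before accepting a PUT. */
    boolean accepts(String path, long expires, String signature, long nowEpochSec) {
        if (nowEpochSec > expires) {
            return false; // the URL has expired, so the file can no longer be written
        }
        byte[] expected = sign(path, expires).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, signature.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ExpiringUploadUrl signer = new ExpiringUploadUrl("demo-secret");
        System.out.println(signer.issue("/logs/123", 1_000L, 60L));
    }
}
```

    Because the expiry time is part of the signed data, a client cannot extend the lifetime of a URL by editing the `expires` parameter.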

    Amazon allows uploading only one file per PUT request. At a price of $0.01 per 1000 requests, it may cost me around $10 a year.
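
    The estimate is simple arithmetic: one PUT request per file, twelve months a year. A small sketch, with a hypothetical class name, that counts only PUT requests and ignores storage and download costs:

```java
/** Hypothetical helper that estimates the yearly cost of S3 PUT requests only. */
class S3PutCostEstimate {
    /** Yearly cost in dollars, assuming one PUT request per uploaded file. */
    static double yearlyCost(long filesPerMonth, double pricePer1000Requests) {
        return filesPerMonth * 12 / 1000.0 * pricePer1000Requests;
    }

    public static void main(String[] args) {
        System.out.printf("current rate: $%.2f per year%n", yearlyCost(50_000, 0.01));
        System.out.printf("future rate:  $%.2f per year%n", yearlyCost(100_000, 0.01));
    }
}
```

    With the numbers from the requirements, this gives $6 a year at 50 000 files per month and $12 a year at 100 000.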

    A tempting property of this solution is that the Heroku app may be written in Common Lisp, thus getting rid of Java servlets in cl-test-grid.


Locke said...

Looks like anonymous FTP with write permission on the folder and a read-only mask on the files will do the trick.

A. V. said...

Almost, but I need to upload files via HTTP.

Pete said...

Make a tiny Heroku app that accepts the HTTP submit and then immediately stores the file somewhere your app can upload with full permissions, but the 'end users' cannot.

In essence, a micro-proxy that enforces the append-only policy that you want.

A. V. said...

Pete, yes, this is possible too. BTW, as Heroku apps run on Amazon EC2, Amazon S3 would be a good choice because it is accessed locally.

I will only need to ensure the app itself doesn't become a bottleneck when it transfers all uploads through itself, rather than returning pre-signed S3 URLs to the client (I am going to use only the free dyno at Heroku, because this project is not commercial).

Andrei F said...

Maybe this solution will also be of some help for you:
