Tuesday, August 6, 2013

7 Reasons You Should Use MongoDB over DynamoDB

Even though I recently migrated from MongoDB to DynamoDB and shared 3 reasons to use DynamoDB, I still love MongoDB; it is a really good NoSQL solution. Here are some points to help you decide whether to use MongoDB over DynamoDB.

Reason 1: Use MongoDB if your indexing fields might be altered later.
With DynamoDB, it's NOT possible to alter the indexes after the table has been created. I have to admit that there are workarounds. For example, you can create a new table and import the data from the old one. But none of them is straightforward, and every workaround involves some trade-off. Back to indexing: DynamoDB lets you define a hash key to keep the data well distributed, and then add a range key and secondary indexes. When you query a table, the hash key must be used, together with either the range key or one of the secondary indexes. No complex queries are supported. The hash key, range key, and secondary index key definitions can NOT be changed later, so your database structure must be well designed before going to production. By the way, a secondary index occupies additional storage. If you have 1 GB of data and you create an index and "project" all attributes into it, your actual storage cost will be that of 2 GB of data. If you project only the hash and range key values into the index, you need to query twice to get the whole record. Actually, the API lets you invoke the query only once, but the "read" capacity consumed is doubled. In addition, you can still "scan" the data and filter by conditions on un-indexed keys, but please check the numbers in my previous post: a scan can be 100 times (or more) slower than a query.

Reason 2: Use MongoDB if you need the features of a document database in your NoSQL solution.
If you are going to save documents like this:
{
  _id: 1,
  name: { first: 'John', last: 'Backus' },
  birth: new Date('Dec 03, 1924'),
  death: new Date('Mar 17, 2007'),
  contribs: [ 'Fortran', 'ALGOL', 'Backus-Naur Form', 'FP' ],
  awards: [{
      award: 'National Medal of Science',
      year: 1975,
      by: 'National Science Foundation'
  }, {
      award: 'Turing Award',
      year: 1977,
      by: 'ACM'
  }]
}
(sample document from the MongoDB documentation)
With a document database, you can query by name.first, or check whether some value exists in the awards sub-documents. DynamoDB, however, is a key-value database and supports only scalar values and sets; sub-documents are not supported, and neither are complex indexes or queries. It's not possible to save the sub-document { first: 'John', last: 'Backus' } under name, and accordingly not possible to query by name.first.
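One possible workaround is to flatten a nested document into dotted attribute names before storing it in a key-value store. The sketch below is only an illustration: the flatten helper is hypothetical, it leaves arrays such as awards untouched, and a flattened attribute like name.first can still only be looked up efficiently if it is part of a key or index.

```javascript
// Hypothetical helper: flatten nested objects into dotted attribute names
// so that each leaf value can be stored as its own key-value attribute.
function flatten(doc, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const isPlainObject =
      value !== null && typeof value === 'object' &&
      !Array.isArray(value) && !(value instanceof Date);
    if (isPlainObject) {
      flatten(value, path, out); // recurse into the sub-document
    } else {
      out[path] = value; // leaf: scalar, array, or Date
    }
  }
  return out;
}

const flat = flatten({ _id: 1, name: { first: 'John', last: 'Backus' } });
console.log(flat); // { _id: 1, 'name.first': 'John', 'name.last': 'Backus' }
```

The sub-document structure is lost in the process, so code that reads the item back has to rebuild it, which is part of the additional cost of the workaround.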

Reason 3: Use MongoDB if you are going to use Perl, Erlang, or C++.
The official AWS SDKs support Java, JavaScript, Ruby, PHP, Python, and .NET, while MongoDB supports more languages. I used node.js to build my backend server, and both the AWS SDK for node.js and the mongoose library for MongoDB work very well. It's really amazing to use mongoose with MongoDB. It's in active development, and the defects I reported to mongoose were fixed quickly. I also have experience with the AWS SDK for Java and morphia for MongoDB, and both of them work perfectly! The SDKs for AWS and MongoDB are all well designed and widely used. But if your programming language is not on the official support list, you may need to evaluate the quality of the SDK carefully. I once used a non-official Java SDK for AWS SimpleDB, and it was also good. But I could still run into defects easily. For example, when using a Boolean in the object persistence model, the Java SDK for SimpleDB could not handle this type and produced bad results.

Reason 4: Use MongoDB if you may exceed the limits of DynamoDB.
Please read the limits carefully if you are evaluating DynamoDB. It is easy to exceed some of them. For example, the value stored in an item (the value of a key) cannot exceed 64 KB. It's easy to exceed 64 KB when you allow users to input content. A user may paste a 100 KB text as an article title just by mistake. There is a workaround here too. If the content exceeds the limit, I divide it across multiple keys, and aggregate them back into one key in a post-processing stage after reading the data from the DynamoDB server. For example, the content of an article may exceed 64 KB, so in the pre-processing stage before storing to DynamoDB, I divide it into article.content0, article.content1, article.content2, and so on. After reading from DynamoDB, I check whether the key article.content0 exists; if it does, I continue to check article.content1 and so on, combine the values of these fields into article.content, and remove article.content0, article.content1, and so on. This adds complexity and an additional dependency to your code. MongoDB is far less restrictive here: its limit is 16 MB per document.
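A minimal sketch of the split-and-recombine workaround described above. The helper names splitContent and joinContent and the CHUNK_SIZE value are assumptions for illustration, not code from my server. The sketch counts characters rather than bytes, so the chunk size is set conservatively.

```javascript
// Hypothetical helpers for the content0, content1, ... workaround.
// CHUNK_SIZE is an assumed value counted in UTF-16 code units; each unit
// needs at most 3 bytes in UTF-8, so a chunk stays below the 64 KB limit.
const CHUNK_SIZE = 16000;

// Pre-processing: replace article.content with content0, content1, ...
function splitContent(article, chunkSize = CHUNK_SIZE) {
  const { content = '', ...rest } = article;
  if (content.length <= chunkSize) return { ...article };
  for (let i = 0; i * chunkSize < content.length; i++) {
    rest[`content${i}`] = content.slice(i * chunkSize, (i + 1) * chunkSize);
  }
  return rest;
}

// Post-processing: combine content0, content1, ... back into content.
function joinContent(item) {
  if (!('content0' in item)) return { ...item };
  const result = { ...item };
  let content = '';
  for (let i = 0; `content${i}` in result; i++) {
    content += result[`content${i}`];
    delete result[`content${i}`];
  }
  result.content = content;
  return result;
}

// A tiny chunk size makes the behaviour easy to see.
const stored = splitContent({ title: 'Demo', content: 'abcdefgh' }, 3);
console.log(stored); // title plus content0 'abc', content1 'def', content2 'gh'
console.log(joinContent(stored).content); // 'abcdefgh'
```

Every reader and writer of the table has to go through these two functions, which is exactly the extra dependency mentioned above.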

Reason 5: Use MongoDB if you are going to have data types other than string, number, and base64-encoded binary.
In addition to string, number, binary, and array, MongoDB supports date, boolean, and a MongoDB-specific type, "ObjectId". I use mongoose.js, and it supports these data types well. When you define the data structure for object mapping, you can specify the correct type. Date and Boolean are quite important types. With DynamoDB you can use a number as an alternative, but you still need additional logic in your code to handle them. With MongoDB you get all these data types natively.
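The "additional logic" could look roughly like the sketch below, which stores dates as epoch milliseconds and booleans as 0/1. The FIELD_TYPES table and the toItem/fromItem helpers are hypothetical names made up for this example.

```javascript
// Hypothetical mapping layer: dates become epoch milliseconds and
// booleans become 0/1 so that both fit into a plain number attribute.
const encodeDate = (date) => date.getTime();
const decodeDate = (n) => new Date(n);
const encodeBool = (flag) => (flag ? 1 : 0);
const decodeBool = (n) => n === 1;

// FIELD_TYPES is an assumed, hand-maintained schema for the example.
const FIELD_TYPES = { createdAt: 'date', done: 'bool' };

function toItem(obj) {
  const item = {};
  for (const [key, value] of Object.entries(obj)) {
    if (FIELD_TYPES[key] === 'date') item[key] = encodeDate(value);
    else if (FIELD_TYPES[key] === 'bool') item[key] = encodeBool(value);
    else item[key] = value;
  }
  return item;
}

function fromItem(item) {
  const obj = {};
  for (const [key, value] of Object.entries(item)) {
    if (FIELD_TYPES[key] === 'date') obj[key] = decodeDate(value);
    else if (FIELD_TYPES[key] === 'bool') obj[key] = decodeBool(value);
    else obj[key] = value;
  }
  return obj;
}

const task = { title: 'Write post', createdAt: new Date(Date.UTC(2013, 7, 6)), done: false };
const item = toItem(task);
console.log(item); // createdAt is now a number, done is 0
const restored = fromItem(item);
console.log(restored.createdAt instanceof Date, restored.done); // true false
```

The schema has to be kept in sync with the data by hand, which is the kind of bookkeeping mongoose does for you.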

Reason 6: Use MongoDB if you are going to query by regular expression.
RegEx queries might be an edge case, but they may come up in your situation. DynamoDB provides a way to query by checking whether a string or binary starts with some substring, and provides the "CONTAINS" and "NOT_CONTAINS" filters when you do a "scan". But as you know, "scan" is quite slow. With MongoDB, you can easily query on any key or sub-document with a RegEx. For example, if you want to query a user's name for "John" or "john", you can use a simple regular expression, {"name" => qr/[Jj]ohn/}, while this cannot be done in DynamoDB with a single query.
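To make the difference concrete, here is a small sketch in plain JavaScript that applies the same pattern to an in-memory list and compares it with a prefix check. The users array is made-up sample data; in the MongoDB shell the equivalent filter would be { name: /[Jj]ohn/ }.

```javascript
// Made-up sample data standing in for a users collection.
const users = [{ name: 'John Backus' }, { name: 'john smith' }, { name: 'Joan' }];

// The regular expression from the text: matches "John" or "john" anywhere.
const pattern = /[Jj]ohn/;
const regexMatches = users.filter((u) => pattern.test(u.name)).map((u) => u.name);
console.log(regexMatches); // [ 'John Backus', 'john smith' ]

// A prefix check is case-sensitive, so one value covers only one spelling.
const prefixMatches = users.filter((u) => u.name.startsWith('John')).map((u) => u.name);
console.log(prefixMatches); // [ 'John Backus' ]
```

Covering both spellings with prefix checks takes one lookup per spelling, which is why a single query is not enough.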

Reason 7: Use MongoDB if you are a big fan of document databases.
10gen is the company backing MongoDB. They are very active in the community. I asked a question on Stack Overflow, and Dylan, a Solution Architect at MongoDB, actively followed up on it, helped me analyze the issue, looked for the cause, and gave some very good suggestions on MongoDB. This was a really good experience. In addition, the MongoDB community is willing to listen to users. Amazon is a big company; it's not easy to get in touch with the people inside, not to mention influencing their decisions and roadmap.

Bonus Tip: Read the DynamoDB documentation carefully if you are going to use it.
For example, there is an API called "batchWriteItem". This API may return no error but include a field with the key "UnprocessedItems" in the result. This is somewhat of an anti-pattern. When I invoke a call, I expect the result to be either success or failure. But this API has a third status: "partially successful". You need to manually re-submit the "UnprocessedItems" until there are no items left. I didn't notice this because it never happened during testing. However, when there is heavy traffic and the number of requests to DynamoDB exceeds your provisioned quota for several seconds, it may happen.
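A minimal sketch of such a re-submit loop. It does not call the AWS SDK: writeBatch is a hypothetical function standing in for a batchWriteItem call, and it is assumed to resolve to an object with a flat UnprocessedItems array (the real API groups the unprocessed items by table name). The retry count and delays are also assumptions.

```javascript
// writeBatch is a hypothetical stand-in for a batchWriteItem call: it takes
// an array of items and resolves to { UnprocessedItems: [...] }.
async function writeAll(writeBatch, items, maxAttempts = 5, baseDelayMs = 100) {
  let pending = items;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await writeBatch(pending);
    pending = result.UnprocessedItems || [];
    if (pending.length === 0) return; // everything was written
    // Back off before re-submitting, since throttling caused the leftovers.
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
  throw new Error(`${pending.length} items still unprocessed after ${maxAttempts} attempts`);
}

// Fake client for the example: it accepts at most 2 items per call.
const written = [];
const fakeWriteBatch = async (batch) => {
  written.push(...batch.slice(0, 2));
  return { UnprocessedItems: batch.slice(2) };
};

const done = writeAll(fakeWriteBatch, ['a', 'b', 'c', 'd', 'e'], 5, 1);
done.then(() => console.log(written)); // all five items, written over three calls
```

Waiting a little longer before each retry gives the table time to recover from throttling instead of hitting it again immediately.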

Hold on: before you make the decision to use MongoDB, please read 3 Reasons You Should Use DynamoDB over MongoDB.

3 Reasons You Should Use DynamoDB over MongoDB

Recently I posted a blog entry sharing my experience of migrating from MongoDB to DynamoDB. The migration was smooth, and here is a summary of the 3 reasons we did it:

Reason 1: Use DynamoDB if you are NOT going to have an employee to manage the database servers. 
This is the number one reason I migrated from MongoDB to DynamoDB. We are launching a startup, and we have a long list of requirements from early adopters. We want to keep them satisfied. I need to develop the Windows/Mac OS/Ubuntu software and the iPhone/Android apps, and also work on the server that provides data synchronization among these apps. Kelly is not a technical person and has no experience managing servers. Some may say that anyone can become a web developer in 21 days. However, server troubleshooting is really not that easy. With only 15k users and 1.4 million records, I started to get into serious trouble. As described in the last post, the more data I stored, the worse the database latency became. Once I set up sharding with a replica set for each shard, I can imagine that database management would take up a big portion of my time. With DynamoDB, you can avoid database management entirely. AWS manages it very well. It has been one week since I migrated the database, and everything works very well.

Reason 2: Use DynamoDB if you don't have the budget for dedicated database servers.
I didn't have much traffic or many data records, so I used 2 Linode VPSes as database servers, each with 1 GB RAM and 24 GB disk. The 2 database servers were grouped as a replica set, with no sharding yet. Ideally they should have supported my data scale very well. However, that was not the case. Upgrading the database servers would cost more, and might still not resolve the issue. There are some managed MongoDB services, but I may not be able to afford them. With the current user base, the MongoDB database occupied 8 GB of disk for data and 2 GB for the journal files. With a managed MongoDB service, I would need to select a 25 GB plan, starting from a US$500 monthly fee. If I got more traffic and users, it would cost too much. Before the migration, I tested DynamoDB by migrating all the data, that is, 1.4 million records, to it. The actual space used was less than 300 MB. I'm not sure how a managed MongoDB service measures usage; I used a command in the mongo console to get the disk usage statistics. My first week of cost on DynamoDB was US$0.05. That was the last week of July; let's see how much it will cost in August.

Reason 3: Use DynamoDB if you are going to integrate with other Amazon Web Services.
Take full-text indexing of the database, for example. There are solutions for MongoDB, but you need to set up additional servers for indexing and search, and understand the system. The good thing is that MongoDB provides a full-text index, but I can imagine that full-text indexing for multiple languages is not easy, especially Chinese word segmentation. Amazon CloudSearch is a solution for full-text indexing of DynamoDB data. Another example is AWS Elastic MapReduce, which can be integrated with DynamoDB very easily. For database backup and restore, Amazon also has other services that integrate with DynamoDB. In my opinion, as the major NoSQL database in Amazon Web Services, DynamoDB will gain more and more features, and you can speed up development and reduce the cost of server management by integrating with Amazon Web Services.

However, DynamoDB has its shortcomings. Before you make the decision to use DynamoDB, please read 7 Reasons You Should Use MongoDB over DynamoDB.

Monday, July 29, 2013

LEAN7: Migrate from MongoDB to AWS DynamoDB + SimpleDB

Migrate from MongoDB to DynamoDB + SimpleDB: New Server Side Architecture Ready for More Users

We recently reached 14,000 registered users, a small portion of whom are paid users. I feel that TeamViz is gaining recognition, as more and more sales (even if still a very small number) are generated every month. However, I started to have trouble with our server architecture mentioned in this post. The issue is that the MongoDB-backed database gets locked for several minutes roughly every 2 hours for an unknown reason. Initially, all requests were held for 2 minutes every 2 hours and 7 minutes. Now it has become worse: all requests are held for 7 minutes every 2 hours and 7 minutes. I asked this question on Stack Overflow, but there is no answer yet. So I can either increase the capacity of the servers or shift to another database server. We are small, so I can try different solutions.

Because all connections are held for several minutes, the connections on the load balancer look like this. (At the beginning I thought the server was being attacked, but no one would attack a server every 2 hours and 7 minutes for 1 month, right? ^_^ )

So here are several possible solutions: use another NoSQL database, or use a managed NoSQL database. My first decision was to look for other NoSQL database servers. I read a comparison of NoSQL solutions, this link about a NoSQL benchmark, and this link about Couchbase. Every NoSQL database has its pros and cons.

I then talked with Kelly about the cost of servers, the cost of managed services, and the possibility of shifting to other NoSQL providers, or even to MySQL. The conclusion was that the current issue with MongoDB is just the start; we might spend more time managing databases and resolving performance problems or other unknown issues. This would cost a lot of energy. However, our focus is on providing a better product. There is a lot of fun in playing with NoSQL and other cutting-edge technology, but that's not our goal. Shifting to a managed database service helps us focus on delivering features and fixing issues in the product itself. After all, we have a long list of features and issues to resolve. So we shifted to Amazon AWS DynamoDB and, to reduce the cost, put part of the data on AWS SimpleDB. The server side was almost completely rewritten to handle the database change. I took this chance to practice the Promise pattern on node.js (it works great!) and leveraged the middleware technology provided by the Express framework. In addition, we hold data from DynamoDB and SimpleDB in memcache. Everything has worked great for 24 hours (except that I got some error logs on memcache).

Here is the picture 10 hours after the migration. The huge periodic traffic spikes have disappeared.

Here is the new architecture of the database and sync servers.

You may have concerns about accessing AWS from Linode; currently it's fine. We have more than 1.3 million items in one DynamoDB table, and the response time from DynamoDB to get one record by key is 25 ~ 45 ms from the Linode network. SimpleDB has fewer than 20k items, and also takes 25 ~ 45 ms.

Some notes about the new architecture:
- Why Linode: much cheaper than AWS EC2.
- Why AWS DynamoDB and SimpleDB: we don't want to worry about managing databases.
- memcached is supposed to work independently; we use Couchbase because it provides automatic clustering.
- Still, the design goal is to scale out. Every machine is independent. We can add more sync servers and memcached servers independently.
- Future plan: we still need a message queue. AWS SQS does not provide a way to post an event to multiple subscribers simultaneously; RabbitMQ can do it. But a message queue is not urgent so far.
- Future blog: I will share more experience of using SimpleDB and DynamoDB.

Sunday, July 14, 2013

LEAN6: 3 Reasons Not to Do an Unnecessary SDK Upgrade

I used ExtJS to build my productivity tool TeamViz. Recently ExtJS released 4.2.1 while I was still using 4.1.1a. After checking the release notes of 4.2.1, I was excited to see some fixes and performance improvements, so I decided to upgrade. I read the upgrade guide from 4.1 to 4.2, and estimated that it should be completed within 1 hour. However, I actually spent 2 days on it. Here I share more details about what happened in this upgrade.

  • Dependency Tools. My project was generated using Sencha Cmd, which can generate an initial framework based on Ext JS so you can start your work quickly. First I replaced the library with ExtJS 4.2, and it worked well. But when I used Sencha Cmd to compile the project, errors occurred. Some things changed in the ExtJS 4.2 framework, so just replacing the JS/CSS/resource files does not work; Sencha Cmd relies on some auto-generated config files. So I decided to upgrade Sencha Cmd from 3.0 to 3.1 as well. I also generated the project again using the command sencha -sdk ~/ext- generate app TeamViz ./TeamViz, and then replaced files based on the generated sample project. Later, when I compiled on 32-bit and 64-bit Ubuntu machines and on Windows, I also needed to upgrade the toolset for Sencha Cmd there.
  • Fixes or Regressions. Every time a new version of an app or SDK is released, there are bound to be some fixes and regressions. After the upgrade, I got some issues with mouse enter/leave events. My instant tools on items were broken. They worked in the normal case, but broke in some special scenarios. After digging into the code of Ext JS 4.2, I found it was a regression in Ext JS 4.2, and made a workaround to resolve it. The workaround could become technical debt for future releases, but it's the most efficient way to resolve it so far.
  • Undocumented API. When I implemented the complicated drag & drop in my app, I used undocumented APIs; in fact I injected some code into the drag & drop process of Ext JS. When I upgraded to ExtJS 4.2, the part I had hacked had changed. I needed to do a full test to find it, and then resolve it. I think there might be some other potential issues that I have not found so far.
Actually, the upgrade was not necessary: there was no bug report directly related to the SDK, and the existing version worked very well. For a startup, where every day is important, an upgrade may not be worth it when you compare the risk and the benefit.

Wednesday, July 10, 2013

SDK to Sync Tasks: Dropbox vs Evernote vs Google Apps Tasks vs Jira

Today Dropbox published a blog post about their new Datastore API; the amazing feature is offline support. I have investigated other popular task API providers before, and want to share a quick summary. I won't discuss Outlook/SkyDrive/calendar stuff, and will focus on companies that intend to be service providers.

1. Introduction to Providers

  • Dropbox: Datastore API in Beta, a well-designed and elegant API for tasks.
  • Evernote: Evernote does not provide a real SDK or functionality for tasks, but personally I want to make Evernote a task/project management tool. You can attach your own data to a note, which would be enough for client tools to filter the notes marked as tasks and categorize them. The API documentation is here.
  • Google: Google Apps Tasks API. Google has provided the Tasks API for several years, and there are some tools and Chrome plugins.
  • Jira: The enterprise project management tool. It also provides a REST API. Jira provides a best-in-class feature set.

2. Features, Pros, Cons

  • Dropbox
    • Features: 
      • Provides a datastore API to handle tables and records. The datastore API is a generic API for a remote key-value database. You can easily build your task management tool on top of it.
      • Supports temporary offline use. The SDK works when your app goes offline temporarily, with all its data stored locally. Accordingly, it provides a way to sync data and resolve conflicts.
      • SDK: Provides a JavaScript SDK for the web, and iOS/Android SDKs.
    • Pros:
      • Flexibility: Because the API handles a generic NoSQL database remotely, it gives app developers enough flexibility to add their own fields and store what they need.
      • Temporary Offline Support: this is essential for mobile apps because they can easily go offline. I can imagine that the Dropbox API would greatly improve the user experience on mobile devices.
      • SDKs for JavaScript, iOS, and Android can bootstrap the integration quickly.
      • Potentially, when you need larger storage for the content or attachments of a task, Dropbox would be the best candidate.
    • Cons:
      • It's still in Beta, so there is not enough support for search/filter on the server side. With a big data set, this would be a problem in the current release. However, I expect that Dropbox will improve it very quickly!
  • Evernote:
    • Features:
      • Evernote does not provide a way to directly create tasks and projects. It provides an SDK to create and manage notes. Notes can contain rich-format text, images, and other resources. You can categorize them by Notebooks or Tags. Application data can be attached to notes, so you can manage status/estimations/priorities with the application data of a note. A task management model for Evernote could be:
        • Put all task notes in a special notebook
        • Use Tags/Parent Tags to build hierarchy of projects
      • SDK: Objective-C, Java, PHP, Ruby, Python, Perl, C#, C++, ActionScript
    • Pros:
      • All your data is visible in the Evernote client tools on Web, Windows, Mac, iOS, and Android. The official Evernote apps are of very high quality.
      • You can search on the server, and leverage great Evernote features like OCR. This is unique compared with all the other providers.
    • Cons:
      • Even though you can add some tasks/checkboxes to a note, that's not a direct way to manage them.
      • Evernote is designed for notes; you need some workarounds to make it work as a task management tool.
  • Google Apps Tasks API:
    • Features:
    • Pros:
      • Better for integration with other Google apps.
      • Simple but complete features for task management.
    • Cons:
      • No way to extend. For example, if I want to add an estimation to a task, there is no task property for it, and there is no flexibility to add custom fields.
  • Jira
    • Features:
      • Jira is already an ENTERPRISE task management tool for team planning and project tracking.
      • SDK: REST API
    • Pros:
      • Really feature-rich, and generally you can get everything done on the web.
      • You can deploy Jira Server to your private cloud or internal network.

3. Summary of Unique Features

  • Dropbox: Allows temporary offline use and handles sync and conflict resolution well inside the SDK, so developers don't need to worry about it. Also provides the best flexibility for app design.
  • Evernote: Rich format for the contents of a note, and powerful search capability.
  • Google Apps Tasks: Complete API dedicated to simple task management.
  • Jira: Provides a way to deploy the server to your internal network.

Finally, let's get back to TeamViz, my task management tool. The goal is to support completely offline work. Users can use it as a standalone tool, and can also sync with other desktop apps and mobile apps. None of the models above meets my goal; the closest one is what Dropbox released today, the Datastore API. But it supports only temporary offline use; you still need to be online to access the data.

Wednesday, June 19, 2013

2 reasons why we select SimpleDB instead of DynamoDB

If you search Google for the keywords "SimpleDB vs DynamoDB", you will find a lot of helpful posts. Most of them give you 3 to 7 reasons to select DynamoDB. However, today I'll share some experience of using SimpleDB instead of DynamoDB.

I ran into some issues when using DynamoDB in production, and finally found that SimpleDB fits my case perfectly. I think the choice between SimpleDB and DynamoDB should NOT rely on the performance or the benefits of either one, but should instead be based on the limitations and the real requirements of the product.

Some background: I have some data previously saved in MongoDB; the amount of data will mostly not exceed 2 GB in SimpleDB. We have now decided not to maintain our MongoDB database servers, but to leverage AWS SimpleDB or DynamoDB to reduce the cost of ops.

Both SimpleDB and DynamoDB are key-value databases. There are some workarounds for storing a JSON document, but they introduce additional cost. The data structure in my MongoDB is not too complicated and can be converted to key-value pairs. So, before you choose SimpleDB or DynamoDB as your database backend, you must understand this fundamental point.

Reason 1: Not flexible on indexing. With DynamoDB you have to define the indexed fields when creating the table, and they cannot be modified afterwards. This really limits future changes. DynamoDB supports 2 modes of data lookup, "Query" and "Scan". "Query" is based on the hash key and secondary keys and has high performance. However, when you query data, the "hash" key must be set. For example, suppose we have an "id" key as the hash key. When we query by "id", all is well and we get the best performance. But when we query only by a field "name", we have to fall back to "Scan" because the hash key is not used. The performance of "Scan" is totally unacceptable because AWS scans every record. I created a sample DynamoDB table with 100,000 records, each with 6 fields. With "Scan", it takes 2 ~ 6 minutes to select ONE record by adding a condition on one field. Here is the testing code in Java:

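import java.util.Date;
// Package names below assume the dynamodbv2 API of the AWS SDK for Java.
import com.amazonaws.services.dynamodbv2.datamodeling.*; // DynamoDBMapper, DynamoDBScanExpression, PaginatedScanList
import com.amazonaws.services.dynamodbv2.model.*; // AttributeValue, ComparisonOperator, Condition

// mapper is a DynamoDBMapper; Book is the class mapped to the test table.
DynamoDBScanExpression scan = new DynamoDBScanExpression();
// Filter on the non-key attribute "count", which forces a full table scan.
scan.addFilterCondition("count", new Condition().withAttributeValueList(new AttributeValue().withN("70569")).withComparisonOperator(ComparisonOperator.EQ));
System.out.println("1=> " + new Date());
PaginatedScanList<Book> list = mapper.scan(Book.class, scan);
System.out.println("2=> " + new Date());
Object[] all = list.toArray(); // loads every page of results
System.out.println(all.length); // should be 1
System.out.println("3=> " + new Date()); // 2 ~ 6 minutes after the date printed at "2=>", in most cases around 2 minutes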

SimpleDB does not have this limitation. SimpleDB creates an index for EVERY field in a table (actually AWS uses the term "domain", and MongoDB uses "collection"). I modified the code a little and tested it on SimpleDB; here are the results:

  • Query 500 items with no condition (using "limit" to get the first 500 items in a "select" call): about 400 ms to complete. The sample application was running on my local machine. If it were running on EC2, it should be within 100 ms.
  • Query 500 items with 1 condition: also about 400 ms to complete.
Reason 2: Not cost effective for our case. DynamoDB charges by the provisioned capacity of reads/writes per second. Please note that the capacity is counted per record read or written, not per API call, no matter whether you use batch calls or not. Here are more details from my test. I used the batch API to send 1000 records of more than 1000 bytes each. It took 50 seconds to finish the batch when the write capacity was set to 20/second. While keeping my application running, I changed the capacity in the AWS console to 80/second, and it then took 12 to 25 seconds to complete one batch (ideally it should be 1000/80 = 12.5 seconds; the extra time comes from network latency, because I'm sending more than 1 megabyte of data per API call).
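The arithmetic behind these timings can be written down as a tiny helper. This is only a back-of-the-envelope sketch: batchSeconds is a made-up name, and it assumes each record consumes exactly one unit of write capacity, which is what the measured 50 seconds at 20/second suggests.

```javascript
// Ideal time to push a batch through a given provisioned write capacity.
// Assumes one capacity unit per record unless unitsPerRecord says otherwise.
function batchSeconds(recordCount, capacityPerSecond, unitsPerRecord = 1) {
  return (recordCount * unitsPerRecord) / capacityPerSecond;
}

console.log(batchSeconds(1000, 20)); // 50
console.log(batchSeconds(1000, 80)); // 12.5
```

Network latency comes on top of these ideal numbers, which explains the 12 to 25 seconds observed at 80/second.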

In our case, we may read 500 records from SimpleDB into memory, and then read nothing for the next 10 minutes. With SimpleDB we can complete this in 500 milliseconds. With DynamoDB, we would have to set the read capacity to 1000 reads/second to match that, which costs $94.46 per month (via the AWS Simple Monthly Calculator). With SimpleDB, it may cost less than 1 dollar.

Conclusion: DynamoDB is really designed for high performance. SimpleDB has more flexibility. What I mean by "really designed for high performance" is that if you choose DynamoDB, you must make sure your architecture is well designed for high-traffic dynamic content. If you have designed your architecture for high-traffic dynamic content and high performance, DynamoDB may match your requirements perfectly. In our case, SimpleDB is enough: excellent flexibility, and cost effective. Before looking for comparisons of SimpleDB and DynamoDB, design your architecture first. DynamoDB is good, but it is not a fit for everyone.

Sunday, June 2, 2013

Cross Platform - Initial Idea

I worked on a commercial product for 7 years. It has more than 400 million dollars of revenue per year. The product runs on Windows and Mac, with lite versions on the web, Android, and iPhone/iPad, and has data interoperability across all the platforms. We investigated various possible techniques to support cross-platform development using C#/C++/Objective-C, with frameworks like Qt, as well as other approaches like HTML+CSS+JavaScript for cross-platform features. I want to share my working experience with some technologies that support cross-platform development.

Decades ago, when the second operating system came into the world, the need for cross-platform development was born. We need to choose the target platforms based on current market share. Here are the major target platforms:
- Desktop
 - Microsoft Windows
 - Mac OS X, Apple Inc.
 - Linux(My favorite distribution is Ubuntu)
from https://www.netmarketshare.com/, February 2013

- Mobile
 - Google Android; there is also a difference between handheld and tablet.
 - Apple iOS; there is also a difference between iPhone and iPad.

from https://www.netmarketshare.com/, February 2013

For this series of articles, I'll start with this roadmap:
- Programming languages for cross-platform development.
- Review of frameworks that work across desktop operating systems, e.g. Qt, Mono, wxWidgets
- Review of web as platform: HTML 5, Native Client
- Review of frameworks supporting multiple mobile frameworks, e.g. PhoneGap, Appcelerator/Titanium