BitPim and databases

BitPim currently stores information as a Python dictionary. This information is saved in multiple files (one per information type) as sourceable Python code. The data can be easily inspected with a plain text editor.

Users never need to explicitly load and save data (ie there is no need for them to manage the transitioning of data between temporary storage - RAM - and persistent storage - disk.)

Problems and goals

BitPim currently has no undo functionality. Any edits take effect immediately and there is no ability to reverse mistakes.

It is currently not possible to do a sync. Syncing requires being able to examine two snapshots of data and generate a list of changes that were made (eg the name "John Smith" was changed to "John Smythe").
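
As a rough illustration of what "generate a list of changes" could mean, here is a minimal sketch. It assumes each snapshot is a dict mapping some stable record identifier to a record dict; the function name diff_snapshots and the tuple layout are hypothetical and not part of BitPim.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each mapping identifier -> record dict.

    Returns a list of (kind, identifier, field, before, after) tuples.
    """
    changes = []
    for uid in sorted(set(old) | set(new)):
        if uid not in new:
            changes.append(("deleted", uid, None, old[uid], None))
        elif uid not in old:
            changes.append(("added", uid, None, None, new[uid]))
        else:
            # Record exists in both snapshots: compare field by field.
            for field in sorted(set(old[uid]) | set(new[uid])):
                before = old[uid].get(field)
                after = new[uid].get(field)
                if before != after:
                    changes.append(("changed", uid, field, before, after))
    return changes


# Made-up sample data mirroring the example in the text.
old = {"0x4523": {"name": "John Smith", "number": "123456"}}
new = {"0x4523": {"name": "John Smythe", "number": "123456"}}
changes = diff_snapshots(old, new)
print(changes)
# [('changed', '0x4523', 'name', 'John Smith', 'John Smythe')]
```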

BitPim doesn't work correctly when run concurrently as the same user. The user is not prevented from starting a second instance, and multiple instances just continue oblivious to each other. The old solution of preventing multiple instances at startup is no longer appropriate since users can and do access their machines via different means (eg logging in on the console and logging in remotely). Some programs such as Mozilla/Firefox force you to have multiple independent profiles, which is very annoying.

BitPim currently doesn't support multiple information stores. These are needed when several people log in as the same user at the operating system level. There is some advice in the online doc, which amounts to changing the preferences behind BitPim's back before it starts, so that it uses a different main data directory.

Care also needs to be taken with version issues: BitPim may start up with saved data written by an older version, or the saved data may be in a newer format than the running version understands.

BitPim currently holds all data in memory. Memory consumption therefore grows with the amount of data, and can get very large.

Solution - SQLite

BitPim will be migrating to the SQLite database. SQLite is accessed using SQL syntax and is an embedded database - you have it as part of your program and do not contact it over the network. The Python wrapper is pysqlite. Both are available under appropriate licenses and on all platforms.

It has many other nice properties: it uses a single file, is safe for use in multi-threaded and multi-process environments, is ACID compliant (Atomic, Consistent, Isolated, and Durable), and survives power loss and unexpected program termination. There are no access control or other security issues to deal with. The only requirement is access to the single file (via normal filesystem and process permissions).

Version 3 of SQLite uses Unicode natively for strings, supports BLOBs (binary large objects), and places no practical limit on field size.

SQLite also differs from other databases in that the type of a field is attached to each value in each record, rather than to the column as a whole. This is very similar to how Python works, where the type of a value is attached to the value itself, not to the name it is given. (Contrast with C/Java, where the type is associated with the variable name, not the value.)
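
The per-value typing can be seen with a few lines of Python. This sketch uses the sqlite3 module from the Python standard library (an assumption; the text names pysqlite as the wrapper) and a throwaway table called demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The column is declared without any type at all.
conn.execute("CREATE TABLE demo (value)")
conn.executemany(
    "INSERT INTO demo (value) VALUES (?)",
    [(42,), (3.5,), ("hello",), (b"\x00\x01",), (None,)],
)
# typeof() reports the type stored with each individual value.
types = [row[0] for row in
         conn.execute("SELECT typeof(value) FROM demo ORDER BY rowid")]
print(types)
# ['integer', 'real', 'text', 'blob', 'null']
```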

Some more reading on SQLite:

Startup/meta information

One table will contain meta information. Primarily this will be the version of BitPim to which the database corresponds.

On startup, BitPim will inspect the version information. If it is older than the current version of BitPim then a copy of the file will be made.

For example, if the current version of BitPim is 1.2 and the database says it is for 1.1 then a copy will be made as foo-1.1-`date`
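
The startup check could look roughly like the sketch below. The meta table layout (key and value columns), the function names, and the date format in the backup name are all assumptions for illustration. Version strings are assumed to be purely numeric and dot separated.

```python
import shutil
import sqlite3
import time


def version_tuple(text):
    # "1.10" must sort after "1.9", so compare integers rather than strings.
    return tuple(int(part) for part in text.split("."))


def open_database(path, current_version):
    """Open path, first copying it aside if an older BitPim wrote it.

    Returns (connection, backup_path); backup_path is None if no copy
    was made.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute(
        "SELECT value FROM meta WHERE key = 'version'").fetchone()
    stored = row[0] if row else None
    backup = None
    if stored is not None and \
            version_tuple(stored) < version_tuple(current_version):
        conn.close()  # nothing may be pending while the file is copied
        backup = "%s-%s-%s" % (path, stored, time.strftime("%Y%m%d"))
        shutil.copyfile(path, backup)
        conn = sqlite3.connect(path)
    if stored is None or backup is not None:
        # New database, or one that has just been backed up: record
        # the version that is now using it.
        conn.execute(
            "INSERT OR REPLACE INTO meta (key, value) VALUES ('version', ?)",
            (current_version,))
        conn.commit()
    # A database newer than current_version is left untouched here; the
    # text above does not decide how that case should be handled.
    return conn, backup
```

Calling open_database("foo", "1.2") on a file last written by 1.1 would leave a copy named foo-1.1- followed by the date next to the original.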

Main method

The main data type used in BitPim will be the dict, as is currently the case. dicts will be saved to tables with each dict key being a column in the table. None values in Python are mapped to null in SQL. When reading from the table, a dict is produced based on the columns. Note that columns with a null value will not have any key in the returned dict.

When saving to a table, the support code will automatically create columns as needed. New columns default to null.
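
A minimal sketch of that support code follows, using the standard-library sqlite3 module. The function names save_dict and load_dicts and the id column are hypothetical. Table and column names are interpolated into the SQL, so the sketch assumes they come from trusted code rather than user input.

```python
import sqlite3


def save_dict(conn, table, record):
    """Append one dict as a row, adding any columns the table lacks."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "%s" (id INTEGER PRIMARY KEY)' % table)
    existing = {row[1] for row in
                conn.execute('PRAGMA table_info("%s")' % table)}
    for key in record:
        if key not in existing:
            # Earlier rows read back as NULL for a newly added column.
            conn.execute('ALTER TABLE "%s" ADD COLUMN "%s"' % (table, key))
    keys = list(record)
    if keys:
        sql = 'INSERT INTO "%s" (%s) VALUES (%s)' % (
            table,
            ", ".join('"%s"' % k for k in keys),
            ", ".join("?" for _ in keys))
        conn.execute(sql, [record[k] for k in keys])  # None becomes NULL
    else:
        conn.execute('INSERT INTO "%s" DEFAULT VALUES' % table)
    conn.commit()


def load_dicts(conn, table):
    """Return one dict per row, leaving out keys whose value is NULL."""
    cur = conn.execute('SELECT * FROM "%s" ORDER BY id' % table)
    names = [d[0] for d in cur.description]
    return [{n: v for n, v in zip(names, row)
             if v is not None and n != "id"}
            for row in cur]


conn = sqlite3.connect(":memory:")
save_dict(conn, "phonebook", {"name": "John Smith", "number": "123456"})
save_dict(conn, "phonebook", {"name": "Fred Bloggs", "email": None,
                              "memo": "met at work"})
rows = load_dicts(conn, "phonebook")
# rows[0] has no "email" or "memo" key; rows[1] has no "email" key.
```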

Each table will be a journal. Existing rows will never be modified, only new rows added at the end. A distinct entry is identified by a unique identifier and is stored in a column named __uid__. Consequently a table will typically look like this:

primary key   Name          Phone number   __uid__
(integer)
0             John Smith    123456         0x4523
1             Fred Bloggs   7676987897     0x8769
2             John Smythe   123456         0x4523
3             Spiderman     435435345      0x7888
4             John Smythe   123888         0x4523

You can see how the "John Smith" record was edited at row 2, and again at row 4. To produce the list of "current" records, the last entry in the table for each uid is used.

There will be an additional __timestamp__ column. That will allow for retrieving old values for any record, as well as archiving off very old values (eg if the user doesn't care about anything older than 6 months).

There will also be a __deleted__ column which is set to true when a particular uid is deleted.
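
The "last entry per uid" rule can be expressed as a single query. The sketch below is one possible formulation, not a decided design: the timestamps are made-up integers, the final row (a deletion of Spiderman) is added for illustration, and 0/1 are assumed as the false/true values for __deleted__.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE phonebook (
    id INTEGER PRIMARY KEY, name, number,
    __uid__, __timestamp__, __deleted__)""")
journal = [
    (0, "John Smith", "123456", "0x4523", 100, 0),
    (1, "Fred Bloggs", "7676987897", "0x8769", 100, 0),
    (2, "John Smythe", "123456", "0x4523", 200, 0),
    (3, "Spiderman", "435435345", "0x7888", 300, 0),
    (4, "John Smythe", "123888", "0x4523", 400, 0),
    (5, None, None, "0x7888", 500, 1),  # Spiderman is deleted
]
conn.executemany("INSERT INTO phonebook VALUES (?, ?, ?, ?, ?, ?)", journal)

# Keep only the newest row for each uid, then drop deleted records.
CURRENT_RECORDS = """
    SELECT name, number, __uid__ FROM phonebook
    WHERE id IN (SELECT MAX(id) FROM phonebook GROUP BY __uid__)
      AND COALESCE(__deleted__, 0) = 0
    ORDER BY id"""
current = conn.execute(CURRENT_RECORDS).fetchall()
print(current)
# [('Fred Bloggs', '7676987897', '0x8769'), ('John Smythe', '123888', '0x4523')]
```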

The uid will actually be a long unique string. For phonebook records, it will be the bitpim serial.

This scheme will allow easy undo since you can always find out what any particular record used to look like. You can also track a mass action (eg 10 records being selected and then all deleted at the same time) since the affected rows will have the same timestamp.

Undo will also be possible between runs of the program, and even amongst multiple running concurrent instances!
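
The text does not say how undo would be implemented, so the following is only one possible sketch. It treats all rows sharing the newest timestamp as a single action and undoes it by appending rows that restore the previous state, so existing rows are never modified. The function name, the integer timestamps and the 0/1 values for __deleted__ are assumptions.

```python
import sqlite3


def undo_last_action(conn, now):
    """Append rows that restore every uid touched by the newest timestamp.

    Returns the list of uids that were restored.
    """
    (last,) = conn.execute(
        "SELECT MAX(__timestamp__) FROM phonebook").fetchone()
    uids = [row[0] for row in conn.execute(
        "SELECT DISTINCT __uid__ FROM phonebook WHERE __timestamp__ = ?",
        (last,))]
    for uid in uids:
        previous = conn.execute(
            "SELECT name, number, __deleted__ FROM phonebook "
            "WHERE __uid__ = ? AND __timestamp__ < ? "
            "ORDER BY id DESC LIMIT 1", (uid, last)).fetchone()
        if previous is None:
            # The action created this record, so undoing it deletes it.
            previous = (None, None, 1)
        conn.execute(
            "INSERT INTO phonebook "
            "(name, number, __uid__, __timestamp__, __deleted__) "
            "VALUES (?, ?, ?, ?, ?)",
            (previous[0], previous[1], uid, now, previous[2]))
    conn.commit()
    return uids


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE phonebook (
    id INTEGER PRIMARY KEY, name, number,
    __uid__, __timestamp__, __deleted__)""")
conn.executemany(
    "INSERT INTO phonebook "
    "(name, number, __uid__, __timestamp__, __deleted__) "
    "VALUES (?, ?, ?, ?, ?)",
    [("Fred Bloggs", "7676987897", "0x8769", 100, 0),
     ("Spiderman", "435435345", "0x7888", 100, 0),
     (None, None, "0x8769", 200, 1),   # both records deleted together
     (None, None, "0x7888", 200, 1)])
restored = undo_last_action(conn, now=300)
```

Because the journal lives in the database file, the same call would work from a later run or from another running instance. Note that calling it twice in a row undoes the undo, which a real implementation would probably want to handle differently.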

Implementation plan

The initial implementation will create a new database.py file. The existing phonebook code will point to this new module. Code in database.py will continue to use the existing routines that read and write index.idx, and will also read and write the sqlite database. The data will be compared between the two to ensure it is working correctly. Once we are certain the code talking to sqlite is correct, the code using index.idx will be switched off.

The process will then be repeated for the other data types (calendar, wallpaper and ringtones).

New data sources (SMS, call history, voice and text memos) will just use the sqlite database exclusively.

Undecided issue - lists of dicts

Almost every field in a phonebook entry is a list of dicts. This can be stored as a single value (the string representation), or a new table can be created and the field can refer to rows in it. Both approaches are shown below for the phone numbers column, with most other columns omitted for clarity.

Stringized

primary key   Name         Phone numbers
(integer)
0             John Smith   [{'number': '1234567890', 'type': 'home'}, {'number': '233423423', 'type': 'work'}]

Indirect table

primary key   Name         Phone numbers
(integer)
0             John Smith   __numbertable:0,2

_numbertable is:

primary key   number       type
(integer)
0             1234567890   home
1             76547657     cell
2             233423423    work

I am leaning towards implementing both schemes in database.py, since they are not mutually exclusive, and then seeing how it goes. My instinct is inclined towards the indirect table since it will save space and allow faster searches down the road.
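
To make the comparison concrete, here is a small sketch of both schemes side by side. The table names pb_string and pb_indirect, the helper resolve, and the exact "table:ids" reference syntax are assumptions for illustration. Row ids here are assigned by SQLite and start at 1.

```python
import ast
import sqlite3

numbers = [{"number": "1234567890", "type": "home"},
           {"number": "233423423", "type": "work"}]
conn = sqlite3.connect(":memory:")

# Scheme 1: stringized. The whole list is stored as its Python repr.
conn.execute("CREATE TABLE pb_string (id INTEGER PRIMARY KEY, name, numbers)")
conn.execute("INSERT INTO pb_string (name, numbers) VALUES (?, ?)",
             ("John Smith", repr(numbers)))
(text,) = conn.execute("SELECT numbers FROM pb_string").fetchone()
from_string = ast.literal_eval(text)  # parses literals only, never runs code

# Scheme 2: indirect table. Each dict becomes a row; the field holds a
# reference naming the table and the row ids.
conn.execute("CREATE TABLE _numbertable (id INTEGER PRIMARY KEY, number, type)")
conn.execute("CREATE TABLE pb_indirect (id INTEGER PRIMARY KEY, name, numbers)")
ids = []
for entry in numbers:
    cur = conn.execute(
        "INSERT INTO _numbertable (number, type) VALUES (?, ?)",
        (entry["number"], entry["type"]))
    ids.append(cur.lastrowid)
ref = "_numbertable:" + ",".join(str(i) for i in ids)
conn.execute("INSERT INTO pb_indirect (name, numbers) VALUES (?, ?)",
             ("John Smith", ref))


def resolve(conn, ref):
    """Turn a 'table:id,id' reference back into a list of dicts."""
    table, _, idlist = ref.partition(":")
    result = []
    for rowid in idlist.split(","):
        number, kind = conn.execute(
            'SELECT number, type FROM "%s" WHERE id = ?' % table,
            (int(rowid),)).fetchone()
        result.append({"number": number, "type": kind})
    return result


from_table = resolve(conn, ref)
```

With the indirect table, a search for a phone number is an ordinary query on the number column, while the stringized form can only be searched by substring matching or by parsing every row.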

Undecided issue - bitmaps and wallpapers

sqlite does allow storing blobs (binary large objects) in the database itself, so we could store the actual files directly in the database. The alternative is to store the files on disk with nondescript names (eg 0000001.jpg) and then point to the file from the relevant records.

My instinct is for the latter approach for wallpaper and ringtones since it will keep the database smaller. For other file-like items such as text memos and SMS messages, I would be inclined to keep them directly in the database.
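
The two storage options might be sketched as follows. The table names memos and wallpaper, the function names, and the way the file name is derived from the next row id are assumptions. Picking the id this way is not safe against two instances inserting at the same moment, which a real implementation would need to address.

```python
import os
import sqlite3
import tempfile


def store_inline(conn, name, data):
    """Keep the bytes in the database itself, as a blob."""
    conn.execute("INSERT INTO memos (name, data) VALUES (?, ?)", (name, data))
    conn.commit()


def store_on_disk(conn, directory, name, data, extension):
    """Write the bytes to a nondescript file and record only its name."""
    (next_id,) = conn.execute(
        "SELECT COALESCE(MAX(id), 0) + 1 FROM wallpaper").fetchone()
    filename = "%07d%s" % (next_id, extension)
    with open(os.path.join(directory, filename), "wb") as f:
        f.write(data)
    conn.execute("INSERT INTO wallpaper (name, filename) VALUES (?, ?)",
                 (name, filename))
    conn.commit()
    return filename


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memos (id INTEGER PRIMARY KEY, name, data)")
conn.execute("CREATE TABLE wallpaper (id INTEGER PRIMARY KEY, name, filename)")
media_dir = tempfile.mkdtemp()

store_inline(conn, "shopping", b"buy milk")
fake_image = b"\xff\xd8\xff\xe0 placeholder bytes, not a real image"
filename = store_on_disk(conn, media_dir, "sunset", fake_image, ".jpg")
print(filename)
# 0000001.jpg
```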