Package CedarBackup2 :: Module filesystem :: Class BackupFileList

Type BackupFileList

object --+        
         |        
      list --+    
             |    
FilesystemList --+
                 |
                BackupFileList


List of files to be backed up.

A BackupFileList is a FilesystemList containing a list of files to be backed up. It only contains files, not directories (soft links are treated like files). On top of the generic functionality provided by FilesystemList, this class adds functionality to keep a hash (checksum) for each file in the list, and it also provides a method to calculate the total size of the files in the list and a way to export the list into tar form.
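As a quick orientation, here is a minimal usage sketch (the directory and output paths are illustrative):

  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")    # recursively add files beneath /home/user
  print("Total size: %d bytes" % backupList.totalSize())
  backupList.generateTarfile("/tmp/backup.tar.gz", mode="targz")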
Method Summary
  __init__(self)
Initializes a list with no configured exclusions.
  addDir(self, path)
Adds a directory to the list.
  totalSize(self)
Returns the total size of all files in the list.
  generateSizeMap(self)
Generates a mapping from file to file size in bytes.
  generateDigestMap(self, stripPrefix)
Generates a mapping from file to file digest.
  generateFitted(self, capacity, algorithm)
Generates a list of items that fit in the indicated capacity.
  generateTarfile(self, path, mode, ignore, flat)
Creates a tar file containing the files in the list.
  removeUnchanged(self, digestMap, captureDigest)
Removes unchanged entries from the list.
  generateSpan(self, capacity, algorithm)
Splits the list of items into sub-lists that fit in a given capacity.
  _generateDigest(path)
Generates an SHA digest for a given file on disk. (Static method)
  _getKnapsackFunction(algorithm)
Returns a reference to the function associated with an algorithm name. (Static method)
  _getKnapsackTable(self, capacity)
Converts the list into the form needed by the knapsack algorithms.
    Inherited from FilesystemList
  addDirContents(self, path, recursive, addSelf)
Adds the contents of a directory to the list.
  addFile(self, path)
Adds a file to the list.
  normalize(self)
Normalizes the list, ensuring that each entry is unique.
  removeDirs(self, pattern)
Removes directory entries from the list.
  removeFiles(self, pattern)
Removes file entries from the list.
  removeInvalid(self)
Removes from the list all entries that do not exist on disk.
  removeLinks(self, pattern)
Removes soft link entries from the list.
  removeMatch(self, pattern)
Removes from the list all entries matching a pattern.
  verify(self)
Verifies that all entries in the list exist on disk.
  _addDirContentsInternal(self, path, includePath, recursive)
Internal implementation of addDirContents.
  _getExcludeBasenamePatterns(self)
Property target used to get the exclude basename patterns list.
  _getExcludeDirs(self)
Property target used to get the exclude directories flag.
  _getExcludeFiles(self)
Property target used to get the exclude files flag.
  _getExcludeLinks(self)
Property target used to get the exclude soft links flag.
  _getExcludePaths(self)
Property target used to get the absolute exclude paths list.
  _getExcludePatterns(self)
Property target used to get the exclude patterns list.
  _getIgnoreFile(self)
Property target used to get the ignore file.
  _setExcludeBasenamePatterns(self, value)
Property target used to set the exclude basename patterns list.
  _setExcludeDirs(self, value)
Property target used to set the exclude directories flag.
  _setExcludeFiles(self, value)
Property target used to set the exclude files flag.
  _setExcludeLinks(self, value)
Property target used to set the exclude soft links flag.
  _setExcludePaths(self, value)
Property target used to set the exclude paths list.
  _setExcludePatterns(self, value)
Property target used to set the exclude patterns list.
  _setIgnoreFile(self, value)
Property target used to set the ignore file.
    Inherited from list
  __add__(x, y)
x.__add__(y) <==> x+y
  __contains__(x, y)
x.__contains__(y) <==> y in x
  __delitem__(x, y)
x.__delitem__(y) <==> del x[y]
  __delslice__(x, i, j)
x.__delslice__(i, j) <==> del x[i:j] (use of negative indices is not supported)
  __eq__(x, y)
x.__eq__(y) <==> x==y
  __ge__(x, y)
x.__ge__(y) <==> x>=y
  __getattribute__(...)
x.__getattribute__('name') <==> x.name
  __getitem__(x, y)
x.__getitem__(y) <==> x[y]
  __getslice__(x, i, j)
x.__getslice__(i, j) <==> x[i:j] (use of negative indices is not supported)
  __gt__(x, y)
x.__gt__(y) <==> x>y
  __hash__(x)
x.__hash__() <==> hash(x)
  __iadd__(x, y)
x.__iadd__(y) <==> x+=y
  __imul__(x, y)
x.__imul__(y) <==> x*=y
  __iter__(x)
x.__iter__() <==> iter(x)
  __le__(x, y)
x.__le__(y) <==> x<=y
  __len__(x)
x.__len__() <==> len(x)
  __lt__(x, y)
x.__lt__(y) <==> x<y
  __mul__(x, n)
x.__mul__(n) <==> x*n
  __ne__(x, y)
x.__ne__(y) <==> x!=y
  __new__(T, S, ...)
T.__new__(S, ...) -> a new object with type S, a subtype of T
  __repr__(x)
x.__repr__() <==> repr(x)
  __reversed__(...)
L.__reversed__() -- return a reverse iterator over the list
  __rmul__(x, n)
x.__rmul__(n) <==> n*x
  __setitem__(x, i, y)
x.__setitem__(i, y) <==> x[i]=y
  __setslice__(x, i, j, y)
x.__setslice__(i, j, y) <==> x[i:j]=y (use of negative indices is not supported)
  append(...)
L.append(object) -- append object to end
  count(L, value)
L.count(value) -> integer -- return number of occurrences of value
  extend(...)
L.extend(iterable) -- extend list by appending elements from the iterable
  index(...)
L.index(value, [start, [stop]]) -> integer -- return first index of value
  insert(...)
L.insert(index, object) -- insert object before index
  pop(L, index)
L.pop([index]) -> item -- remove and return item at index (default last)
  remove(...)
L.remove(value) -- remove first occurrence of value
  reverse(...)
L.reverse() -- reverse *IN PLACE*
  sort(...)
L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*; cmp(x, y) -> -1, 0, 1
    Inherited from object
  __delattr__(...)
x.__delattr__('name') <==> del x.name
  __reduce__(...)
helper for pickle
  __reduce_ex__(...)
helper for pickle
  __setattr__(...)
x.__setattr__('name', value) <==> x.name = value
  __str__(x)
x.__str__() <==> str(x)

Property Summary
    Inherited from FilesystemList
  excludeBasenamePatterns: List of regular expression patterns (matching basename) to be excluded.
  excludeDirs: Boolean indicating whether directories should be excluded.
  excludeFiles: Boolean indicating whether files should be excluded.
  excludeLinks: Boolean indicating whether soft links should be excluded.
  excludePaths: List of absolute paths to be excluded.
  excludePatterns: List of regular expression patterns (matching complete path) to be excluded.
  ignoreFile: Name of file which will cause directory contents to be ignored.

Instance Method Details

__init__(self)
(Constructor)

Initializes a list with no configured exclusions.
Overrides:
CedarBackup2.filesystem.FilesystemList.__init__

addDir(self, path)

Adds a directory to the list.

Note that this class does not allow directories to be added by themselves (a backup list contains only files). However, since links to directories are technically files, we allow them to be added.

This method is implemented in terms of the superclass method, with one additional validation: the superclass method is only called if the passed-in path is both a directory and a link. All of the superclass's existing validations and restrictions apply.
Parameters:
path - Directory path to be added to the list
           (type=String representing a path on disk)
Returns:
Number of items added to the list.
Raises:
ValueError - If path is not a directory or does not exist.
ValueError - If the path could not be encoded properly.
Overrides:
CedarBackup2.filesystem.FilesystemList.addDir
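
A sketch of this behavior (the paths and the link are hypothetical):

  import os
  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  os.symlink("/tmp/somedir", "/tmp/dirlink")   # hypothetical soft link to a directory
  backupList.addDir("/tmp/dirlink")            # accepted: the link itself is a file
  backupList.addDir("/tmp/somedir")            # plain directory: adds nothing, per above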

totalSize(self)

Returns the total size of all files in the list. Only files are counted. Soft links that point at files are ignored. Entries which do not exist on disk are ignored.
Returns:
Total size, in bytes

generateSizeMap(self)

Generates a mapping from file to file size in bytes. The mapping does include soft links, which are listed with size zero. Entries which do not exist on disk are ignored.
Returns:
Dictionary mapping file to file size
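
A short sketch covering this method and totalSize above (the directory path is illustrative):

  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  backupList.addDirContents("/etc")
  sizeMap = backupList.generateSizeMap()       # maps each path to its size in bytes
  for path in sorted(sizeMap.keys()):
      print("%10d  %s" % (sizeMap[path], path))
  print("Total: %d bytes" % backupList.totalSize())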

generateDigestMap(self, stripPrefix=None)

Generates a mapping from file to file digest.

Currently, the digest is an SHA hash, which should be pretty secure. In the future, this might be a different kind of hash, but we guarantee that the type of the hash will not change unless the library major version number is bumped.

Entries which do not exist on disk are ignored.

Soft links are ignored. We would end up generating a digest for the file that the soft link points at, which doesn't make any sense.

If stripPrefix is passed in, then that prefix will be stripped from each key when the map is generated. This can be useful in generating two "relative" digest maps to be compared to one another.
Parameters:
stripPrefix - Common prefix to be stripped from paths
           (type=String with any contents)
Returns:
Dictionary mapping file to digest value

See Also: removeUnchanged
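
A sketch of capturing a digest map and saving it off for later comparison, as suggested above (file locations are illustrative):

  import pickle
  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")
  digestMap = backupList.generateDigestMap(stripPrefix="/home/user")
  with open("/tmp/digests.pickle", "wb") as f:  # saved for a later removeUnchanged call
      pickle.dump(digestMap, f)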

generateFitted(self, capacity, algorithm='worst_fit')

Generates a list of items that fit in the indicated capacity.

Sometimes, callers would like to include every item in a list, but are unable to because not all of the items fit in the space available. This method returns a copy of the list, containing only the items that fit in a given capacity. A copy is returned so that we don't lose any information if for some reason the fitted list is unsatisfactory.

The fitting is done using the functions in the knapsack module. By default, the worst fit algorithm is used, but you can also choose from first fit, best fit, and alternate fit.
Parameters:
capacity - Maximum total size of the files in the new list
           (type=Integer, in bytes)
algorithm - Knapsack (fit) algorithm to use
           (type=One of "first_fit", "best_fit", "worst_fit", "alternate_fit")
Returns:
Copy of list with total size no larger than indicated capacity
Raises:
ValueError - If the algorithm is invalid.
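
For instance, to select a subset that fits on a single 650 MB disc (the capacity value is illustrative):

  from CedarBackup2.filesystem import BackupFileList

  CAPACITY = 650 * 1024 * 1024                 # 650 MB, in bytes

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")
  fitted = backupList.generateFitted(CAPACITY)                          # worst fit by default
  alternate = backupList.generateFitted(CAPACITY, algorithm="best_fit")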

generateTarfile(self, path, mode='tar', ignore=False, flat=False)

Creates a tar file containing the files in the list.

By default, this method will create uncompressed tar files. If you pass in mode 'targz', then it will create gzipped tar files, and if you pass in mode 'tarbz2', then it will create bzipped tar files.

The tar file will be created as a GNU tar archive, which enables extended file name lengths, etc. Since GNU tar is so prevalent, I've decided that the extra functionality outweighs the disadvantage of not being "standard".

If you pass in flat=True, then a "flat" archive will be created, and all of the files will be added to the root of the archive. So, the file /tmp/something/whatever.txt would be added as just whatever.txt.

By default, the whole method call fails if there are problems adding any of the files to the archive, resulting in an exception. Under these circumstances, callers are advised that they might want to call removeInvalid() and then attempt to create the tar file a second time, since the most common cause of failures is a missing file (a file that existed when the list was built, but is gone again by the time the tar file is built).

If you want to, you can pass in ignore=True, and the method will ignore errors encountered when adding individual files to the archive (but not errors opening and closing the archive itself).

We'll always attempt to remove the tar file from disk if an exception is thrown.
Parameters:
path - Path of tar file to create on disk
           (type=String representing a path on disk)
mode - Tar creation mode
           (type=One of either 'tar', 'targz' or 'tarbz2')
ignore - Indicates whether to ignore certain errors.
           (type=Boolean)
flat - Creates "flat" archive by putting all items in root
           (type=Boolean)
Raises:
ValueError - If mode is not valid
ValueError - If list is empty
ValueError - If the path could not be encoded properly.
TarError - If there is a problem creating the tar file

Notes:

  • No validation is done as to whether the entries in the list are files, since only files or soft links should be in an object like this. However, to be safe, everything is explicitly added to the tar archive non-recursively so it's safe to include soft links to directories.
  • The Python tarfile module, which is used internally here, is supposed to deal properly with long filenames and links. In my testing, I have found that it appears to be able to add really long filenames to archives, but doesn't do a good job reading them back out, even out of an archive it created. Fortunately, all Cedar Backup does is add files to archives.
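
A sketch of the retry pattern suggested above (paths are illustrative):

  from tarfile import TarError
  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")
  try:
      backupList.generateTarfile("/tmp/backup.tar.bz2", mode="tarbz2")
  except TarError:
      backupList.removeInvalid()               # drop entries that vanished from disk
      backupList.generateTarfile("/tmp/backup.tar.bz2", mode="tarbz2")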

removeUnchanged(self, digestMap, captureDigest=False)

Removes unchanged entries from the list.

This method relies on a digest map as returned from generateDigestMap. For each entry in digestMap, if the entry also exists in the current list and the entry in the current list has the same digest value as in the map, the entry in the current list will be removed.

This method offers a convenient way for callers to filter unneeded entries from a list. The idea is that a caller will capture a digest map from generateDigestMap at some point in time (perhaps the beginning of the week), and will save off that map using pickle or some other method. Then, the caller could use this method sometime in the future to filter out any unchanged files based on the saved-off map.

If captureDigest is passed in as True, then digest information will be captured for the entire list, using the same rules as generateDigestMap, before the removal step occurs. The check will then involve a lookup into this complete digest map.

If captureDigest is passed in as False, we will only generate a digest value for files we actually need to check, and we'll ignore any entry in the list which isn't a file that currently exists on disk.

The return value varies depending on captureDigest, as well. To preserve backwards compatibility, if captureDigest is False, then we'll just return a single value representing the number of entries removed. Otherwise, we'll return a tuple of (entries removed, digest map). The returned digest map will be in exactly the form returned by generateDigestMap.
Parameters:
digestMap - Dictionary mapping file name to digest value.
           (type=Map as returned from generateDigestMap.)
captureDigest - Indicates that digest information should be captured.
           (type=Boolean)
Returns:
Number of entries removed if captureDigest is False; otherwise, a tuple of (entries removed, digest map).

Note: For performance reasons, this method actually ends up rebuilding the list from scratch. First, we build a temporary dictionary containing all of the items from the original list. Then, we remove items as needed from the dictionary (which is faster than the equivalent operation on a list). Finally, we replace the contents of the current list based on the keys left in the dictionary. This should be transparent to the caller.
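
Putting the pieces together, a sketch of the incremental workflow described above (the pickle path is illustrative):

  import pickle
  from CedarBackup2.filesystem import BackupFileList

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")

  with open("/tmp/digests.pickle", "rb") as f: # map saved by an earlier run
      oldMap = pickle.load(f)

  # Drop unchanged files and capture a fresh map to save for next time.
  (removed, newMap) = backupList.removeUnchanged(oldMap, captureDigest=True)
  with open("/tmp/digests.pickle", "wb") as f:
      pickle.dump(newMap, f)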

generateSpan(self, capacity, algorithm='worst_fit')

Splits the list of items into sub-lists that fit in a given capacity.

Sometimes, callers need to split a backup file list into a set of smaller lists. For instance, you could use this to "span" the files across a set of discs.

The fitting is done using the functions in the knapsack module. By default, the worst fit algorithm is used, but you can also choose from first fit, best fit, and alternate fit.
Parameters:
capacity - Maximum total size of the files in each generated sub-list
           (type=Integer, in bytes)
algorithm - Knapsack (fit) algorithm to use
           (type=One of "first_fit", "best_fit", "worst_fit", "alternate_fit")
Returns:
List of SpanItem objects.
Raises:
ValueError - If the algorithm is invalid.
ValueError - If it's not possible to fit some items

Note: If any of your items are larger than the capacity, then it won't be possible to find a solution. In this case, a ValueError will be raised.
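
For example, to span the list across 650 MB discs (the capacity is illustrative, and the SpanItem attribute names used below are assumptions, since they are not documented here):

  from CedarBackup2.filesystem import BackupFileList

  CAPACITY = 650 * 1024 * 1024                 # illustrative disc capacity, in bytes

  backupList = BackupFileList()
  backupList.addDirContents("/home/user")
  for item in backupList.generateSpan(CAPACITY):
      # fileList and size are assumed SpanItem attributes, not confirmed above.
      print("%d files, %d bytes" % (len(item.fileList), item.size))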

_getKnapsackTable(self, capacity=None)

Converts the list into the form needed by the knapsack algorithms.
Returns:
Dictionary mapping file path to a tuple of (file path, file size).

Static Method Details

_generateDigest(path)

Generates an SHA digest for a given file on disk.

The original code for this function used this simplistic implementation, which requires reading the entire file into memory at once in order to generate a digest value:
  sha.new(open(path).read()).hexdigest()

Not surprisingly, this isn't an optimal solution. The Python Cookbook recipe for simple file hashing describes how to incrementally generate a hash value by reading a file in chunks rather than reading it all at once. The recipe relies on the update() method of the various Python hashing algorithms.

In my tests using a 110 MB file on CD, the original implementation requires 111 seconds. This implementation requires only 40-45 seconds, which is a pretty substantial speed-up.

Practice shows that reading in around 4kB (4096 bytes) at a time yields the best performance. Smaller reads are quite a bit slower, and larger reads don't make much of a difference. The 4kB number makes me a little suspicious, and I think it might be related to the size of a filesystem read at the hardware level. However, I've decided to just hardcode 4096 until I have evidence that shows it's worthwhile making the read size configurable.
Parameters:
path - Path to generate digest for.
Returns:
ASCII-safe SHA digest for the file.
Raises:
OSError - If the file cannot be opened.
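
A minimal sketch of the incremental approach described above, using hashlib (the modern successor to the old sha module) and the 4 kB read size discussed in the text:

  import hashlib

  def generateDigest(path):
      # Incrementally hash the file in 4 kB chunks, per the recipe above.
      digest = hashlib.sha1()
      with open(path, "rb") as f:
          while True:
              chunk = f.read(4096)             # 4 kB reads performed best in testing
              if not chunk:
                  break
              digest.update(chunk)             # update() accumulates the hash state
      return digest.hexdigest()                # ASCII-safe hex digest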

_getKnapsackFunction(algorithm)

Returns a reference to the function associated with an algorithm name. The algorithm name must be one of "first_fit", "best_fit", "worst_fit", or "alternate_fit".
Parameters:
algorithm - Name of the algorithm
Returns:
Reference to knapsack function
Raises:
ValueError - If the algorithm name is unknown.
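
A sketch of the dispatch pattern this method implies; the knapsack function names below are hypothetical stand-ins, since the actual names in the knapsack module are not given here:

  def firstFit(items, capacity): pass          # hypothetical placeholders for the
  def bestFit(items, capacity): pass           # real functions in the knapsack module
  def worstFit(items, capacity): pass
  def alternateFit(items, capacity): pass

  KNAPSACK_TABLE = {
      "first_fit": firstFit,
      "best_fit": bestFit,
      "worst_fit": worstFit,
      "alternate_fit": alternateFit,
  }

  def getKnapsackFunction(algorithm):
      # Map an algorithm name to its function, raising ValueError if unknown.
      if algorithm not in KNAPSACK_TABLE:
          raise ValueError("Unknown algorithm: %s" % algorithm)
      return KNAPSACK_TABLE[algorithm]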
