Crystallize

Overview

A collection of tools for working with files and filesystems.

Note: There is one oddity in the version numbering. Version 2.15.5.9.1 was followed immediately by version 2.15.5.9.3, skipping 2.15.5.9.2. Version 2.15.5.9.2 was then released after version 2.15.5.9.3, and is identical to it aside from the version number. Normal numbering then resumed with version 2.15.5.9.4, .5, .6, and so on, following version 2.15.5.9.2 (skipping version 2.15.5.9.3, as that number was already used).

Documentation

Usage

Crystallize provides the following primary user-facing command-line scripts.

crystallize
Archives a snapshot of a file or files and removes it/them (unless --keep is specified), providing an address and/or a pointer file that can be used to retrieve it/them. crystallize --update is an alias for crystallize-update. Synopsis: crystallize ( ([--version] | [--update]) | ([--leave-locked] [--keep] [--passphrase <passphrase-to-use>] [--] [--leave-pointer] <file>...) )
crystallize-backup
Backs up crystallized files with their associated metadata to the current directory. Synopsis: crystallize-backup [--passphrase <passphrase-to-use>] <crystal-address>
crystallize-editconf
Open the Crystallize configuration file in the default editor. Synopsis: crystallize-editconf
decrystallize
Retrieves files stored using crystallize to the current directory, given their address. Synopsis: decrystallize [--passphrase <passphrase-to-use>] [--lock-override|--no-lock-override] <crystal-address> [--here]
decrystallize-pointer
Retrieves files stored using crystallize to the current directory, given a path to a pointer directory. Does not handle single-file pointers: use sreg_read_stream instead. Synopsis: decrystallize-pointer [--passphrase <passphrase-to-use>] [--lock-override|--no-lock-override] <file> [--here]

The following public-facing scripts are either used internally by the other tools and are mainly useful for writing other shell scripts, or are not thoroughly tested.

crystal-search
Show the addresses of crystals that contain files (or, optionally, a maximum of one file per crystal) with path names matching the given pattern. Synopsis: crystal-search [--single] [--] <search-key>
crystallize-getconf
Retrieve a configuration value from the configuration file used by Crystallize tools. Synopsis: crystallize-getconf <configuration-key-name>
crystallize-update
Attempt to update Crystallize. Synopsis: crystallize-update
CrystallyCopy
Copy the specified source files or directories to the destination directory. Crystallizes the source files, and outputs a csum file for the source files to the destination directory. Synopsis: CrystallyCopy <source-item>... <destination-directory>
CrystallyMove
Move the specified source files or directories to the destination directory. Crystallizes the source files, and outputs a csum file for the source files to the destination directory. Synopsis: CrystallyMove <source-item>... <destination-directory>
depbz
Extract files stored in the "pbz" or "pbze" formats using the pbz (or pbz.py) programs. Files come in sets of three: either a Packed-*.pbz, a Packed-*.pdx, and a Packed-*.pmbz, or a Packed-*.pbze, a Packed-*.pdxe, and a Packed-*.pmbze; sets of the second type are encrypted. When extracting an encrypted pbz file, it's necessary to either provide a passphrase using the --passphrase command-line argument, or to have a file in your home directory called .pbz containing the passphrase. The set of files to extract must be in the current directory. They will be extracted into a directory with a name beginning with "depbz-". That directory will be in the current directory by default, but a different directory to extract to can be specified. Synopsis: depbz [--passphrase <passphrase-to-use>] <pbz-date> [<destination-directory>]
dequicklify
Retrieve a file stored using quickliquid. Synopsis: dequicklify <URL>
fcache_init
Create an "fcache" cache directory: fcache is a naïve caching layer for non-changing URLs. Synopsis: fcache_init <directory-name> <cache-size-limit-in-bytes> (if the cache directory already exists, the current size limit will override the one provided as an argument)
fcache_request
Get an item using the specified "fcache" cache. Synopsis: fcache_request <cache-directory> <URL> [--lock-override|--no-lock-override]
mount.srfs
Mount and unmount a translation FUSE filesystem for a folder containing hash pointers: mount.*-style wrapper for srfs. Synopsis: mount.srfs <root-folder-name> <mountpoint>
ndu
Alternative way of invoking rubberfs usage. Synopsis: ndu
quickliquid
Quickly upload a file to the Internet Archive. Synopsis: quickliquid <file>
rubberfs
Not ready for production use! Provides tools for managing filesystems. Synopsis: rubberfs ( ((create|mount|soft-mount|remount|rename|cd|unmount|soft-unmount|attach|check|save|freeze|gc|thaw|patch|status|list|usage-write|destroy|destroy-no-upload|historybak|historypull) [<RubberFS-name>]) | usage | whereami | stub | (stash <file>...) | (delta [<RubberFS-name> [--keep]]) )
s3-streaming-upload
Not ready for production use! Streaming upload to Amazon S3–compatible endpoints, supporting some of the Internet Archive's extensions to S3. Synopsis: s3-streaming-upload <host-name> <collection> <identifier> <remote-file-name> <file-size-estimate> <title> <description> <keywords> [access-key-id] [secret-access-key] (if the access keys are not provided, s3-streaming-upload will attempt to retrieve them from ia's configuration file)
sreg_build_backup_set
Make a clone of a stream registry database where the bodies of the streams are stored instead of the pointers. Synopsis: sreg_build_backup_set [--sreg-dir <directory>] <target-directory>
sreg_check_failed
Check whether streams that could not be read in the past and were moved to the Failed Fsck directory have become readable in the meantime, and return them to the database if so. Synopsis: sreg_check_failed [--sreg-dir <directory>] [--skip-cache]
sreg_enroll_url
Given an Internet Archive URL to a file, path to a file (with identifier), or identifier with file name in the form [https://archive.org/download/]identifier/file/path, add that file to the current Stream Registry. Synopsis: sreg_enroll_url <input>
sreg_flush_localstore
Convert LocalStore pointers to finished pointers. Synopsis: sreg_flush_localstore [--sreg-dir <directory>]
sreg_folder_check
Go through the hashpointers in the specified directory, make sure that they are present in the stream registry, optionally verify those streams, and optionally remove any streams from the stream registry that are not referenced by the hashpointers in the specified directory (defaults to listing them only, add --delete to actually remove). If no directory is specified, the Ember Library directory is assumed. Synopsis: sreg_folder_check [--sreg-dir <directory>] [--verify] [ (--drop-unused | --drop) [--delete]] [<directory>]
sreg_fsck
Verify that all entries in the sreg stream database can be read correctly. Synopsis: sreg_fsck [--skip-cache] [--drop-failed] [<repository-directory>]
sreg_fsck_hashpointers
Verify that all sreg hash pointers in the specified directory can be read correctly (an alias for sreg_folder_check --verify). Synopsis: sreg_fsck_hashpointers [--sreg-dir <directory>] <path>
sreg_init
Prepare a directory (need not exist) as a stream-registry-backed repository. The stream registry tools, whose names are prefixed with "sreg" (stream registry), provide a virtual file system layer that allows files to be stored as small text-based pointers that can be tracked using Git while avoiding the need to have the entire repository stored locally (the most-used data are cached locally instead). The stream registry has some restrictions on what can be stored in it:
  • Special files are not supported, other than symbolic links
  • Files or folders may not be named '.git.686fc528-0e8e-4724-91bb-c103cdcdd592'
  • Folders may not be named '.sreg'
  • Files may not begin with any of the following ASCII strings:
    • a5e2f296-3085-49c0-8f48-24ea436b7a8b
    • c39f8657-384b-438b-a5a2-eece17147589
    • 2fae2004-94bb-4aa8-a01a-fc44298efc2c
    • 209fcfdf-d1ad-4345-8ef7-1fdc2d583d49
    • 760fa662-89cf-4ebd-9664-150b7637ddd4
While it is suboptimal to have these restrictions, they allowed the implementation of sreg to be simpler. Patches to fix these issues would be considered, if someone has the interest to write them (it's currently on the to-do list at issue #587). By default, the stream registry LocalStore folder (a temporary gathering location for small files that will be stored as a batch when sreg_flush_localstore is run — it will be run occasionally automatically, and can be run manually if desired) is added to a file called .gitignore in the target directory, under the assumption that the stream registry will be kept in Git version control; to suppress this behavior, use the --no-gitignore option. Synopsis: sreg_init [--passphrase <passphrase>] [--no-gitignore] [<path-to-folder-to-prepare>]
sreg_read_stream
Accepts a sreg pointer on stdin, and outputs the corresponding data from the stream registry. If a checksum is provided on the command line, the retrieved data will be checked against it. Synopsis: sreg_read_stream [--lock-override|--no-lock-override] [--sreg-dir <directory>] [--ignore-lock] [--checksum <checksum>] [--disallow-hash-pointer] [--skip-cache]
sreg_store_stream
Stores data provided on stdin into the stream registry, and sends a pointer to it to stdout (or, optionally, to a specified file: the --output-file option acts similarly to redirecting sreg_store_stream's standard output to the specified file, but has additional sanity checks to avoid writing to files needed by the stream registry; consequently, this option is generally preferable to a simple redirection, unless you know the redirection is going somewhere safe). A checksum, if one is known for the stream, can be provided on the command line for a slight performance improvement. Synopsis: sreg_store_stream [--sreg-dir <directory>] [--output-file <file>] [--assume-checksum <checksum>]
srfs
Mount and unmount a translation FUSE filesystem for a folder containing hash pointers. Synopsis: srfs [--sreg-dir <directory>] [ (mount ([<root-folder-name>] | [<root-folder-name> <mountpoint>])) | (unmount [<mountpoint>]) | ([<root-folder-name> <mountpoint>]) ]
srpull
Copy the first argument(s) into the destination (usually the last parameter) and replace any enclosed sreg pointers with their contents. If only one path is specified, the current directory will be used as the destination. The --replace option controls whether files that exist in the destination are overwritten (files that exist in the destination that do not exist in the source will not be removed) (the default is --replace). Synopsis: srpull [--skip <number-of-items-to-skip>] [--replace|--no-replace] ((<source-path>... <destination-directory>) | <source-path>)
srsync
Copy the first argument(s) into the destination (usually the last parameter) and replace non-pointerized or out-of-date files in the destination with their pointers. If only one path is specified, the current directory will be used as the destination. Synopsis: srsync [--sreg-dir <directory>] [--skip <number-of-items-to-skip>] [--verify|--no-verify] ((<source-path>... <destination-directory>) | <source-path>)
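
The restrictions listed under sreg_init can be checked before storing anything. The following is a minimal sketch, not part of Crystallize: the function name sreg_storable is hypothetical, and it assumes that "files may not begin with" refers to the first bytes of a file's contents rather than to its name.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not a Crystallize tool): succeed if the path
# looks storable under the restrictions documented for sreg_init.
sreg_storable() {
    local path=$1 name prefix reserved
    name=$(basename -- "$path")
    # Neither files nor folders may use this reserved name.
    [ "$name" != '.git.686fc528-0e8e-4724-91bb-c103cdcdd592' ] || return 1
    if [ -L "$path" ]; then
        return 0    # symbolic links are the only special files supported
    elif [ -d "$path" ]; then
        [ "$name" != '.sreg' ] || return 1
        return 0
    elif [ ! -f "$path" ]; then
        return 1    # devices, FIFOs, sockets, or a missing path
    fi
    # Assumption: the reserved strings are compared against the file contents.
    # Each reserved string is 36 bytes long, so only that much is read.
    prefix=$(head -c 36 < "$path" | tr -d '\0')
    for reserved in \
        a5e2f296-3085-49c0-8f48-24ea436b7a8b \
        c39f8657-384b-438b-a5a2-eece17147589 \
        2fae2004-94bb-4aa8-a01a-fc44298efc2c \
        209fcfdf-d1ad-4345-8ef7-1fdc2d583d49 \
        760fa662-89cf-4ebd-9664-150b7637ddd4
    do
        [ "$prefix" != "$reserved" ] || return 1
    done
    return 0
}

# Demonstration with throwaway files.
demo=$(mktemp -d)
printf 'plain text\n' > "$demo/ok.txt"
printf 'a5e2f296-3085-49c0-8f48-24ea436b7a8b and more\n' > "$demo/reserved.txt"
mkdir "$demo/.sreg"
for p in "$demo/ok.txt" "$demo/reserved.txt" "$demo/.sreg"; do
    if sreg_storable "$p"; then
        echo "ok: ${p##*/}"
    else
        echo "rejected: ${p##*/}"
    fi
done
rm -rf "$demo"
```

In the demonstration only ok.txt is accepted; the other two paths are rejected for their contents and their folder name respectively.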

In addition, Crystallize provides the following scripts, which it uses internally and which are not supported for independent use.

crystallize-bash_setup
Set up the bash environment shared by Crystallize tools. Synopsis: source crystallize-bash_setup
crystallize-logsession
The main logic for crystallize. Synopsis: crystallize-logsession <true-if-using-custom-passphrase> <custom-passphrase-if-using> <log-file> <crystal-address> <file>... (needs specific environment variables set)
localstorecache_init
Create a "localstorecache" cache directory: localstorecache is a naïve caching layer for LocalStore crystals (variant of "fcache"). Synopsis: localstorecache_init <directory-name> <cache-size-limit-in-bytes> (if the cache directory already exists, the current size limit will override the one provided as an argument)
localstorecache_request
Get an item (returned as a file path) using the specified "localstorecache" cache. Synopsis: localstorecache_request [--sreg-dir <directory>] <cache-directory> <crystal-address>
scache_gc
Drop old items from the specified (s/f/localstore)cache. Defaults to scache. Synopsis: scache_gc [--verbose] <cache-directory> [s|f|localstore]
sregi_bundle_pointer
Given a LocalStore pointer, replace it with a remote pointer. If a tracking file (should contain only an integer) is specified, the file's value will be incremented (to not specify it, pass an empty string as that argument). Synopsis: sregi_bundle_pointer <path-to-instance-file> [--sreg-dir <directory>] <tracking-file> <path-to-remote-pointer-data> <path-to-pointer-to-replace> <crystalWorkdir-config-value>
sregi_check_failed_entry
Check that the specified pointer not in the stream registry database can be retrieved, and if so, move it into the stream registry database. If a tracking file (should contain only an integer) is specified, the file's value will be incremented. Synopsis: sregi_check_failed_entry <path-to-instance-file> [--sreg-dir <directory>] <path-to-pointer> [tracking-file] [--skip-cache]
sregi_copy_read
Copy the first argument (must be a file) to the first argument appended to the destination folder, and if it is a sreg pointer, replace it with its contents. If a tracking file (should contain only an integer) is specified, the file's value will be incremented (to not specify it, pass an empty string as that argument). characters-to-trim is the number of characters to remove from the source filename to give the location of the destination file relative to the enclosing destination directory. Synopsis: sregi_copy_read <path-to-instance-file> [--skip <number-of-items-to-skip>] [--sreg-dir <directory>] [--replace] <path-to-file> <destination-folder> <tracking-file> <characters-to-trim>
sregi_copy_write
Copy the first argument (must be a file) to the first argument appended to the destination folder, and replace it with a sreg pointer. If a tracking file (should contain only an integer) is specified, the file's value will be incremented (to not specify it, pass an empty string as that argument). characters-to-trim is the number of characters to remove from the source filename to give the location of the destination file relative to the enclosing destination directory. Synopsis: sregi_copy_write <path-to-instance-file> [--skip <number-of-items-to-skip>] [--sreg-dir <directory>] [--no-verify] <path-to-file> <destination-folder> <tracking-file> <characters-to-trim>
sregi_drop_single_unused
Remove the specified pointer from the stream registry if it is not listed in the specified ID list (newline-separated list of pointer IDs). Synopsis: sregi_drop_single_unused <path-to-instance-file> [--sreg-dir <directory>] <path-to-pointer> <path-to-ID-list> <tracking-file>
sregi_file_backup
Accepts a sreg database entry as an argument, and replaces the entry with the entry's contents (but does nothing if this has already been done). The "instance file" is used to check whether a previous instance of this command has failed to exit successfully when this command is called repeatedly using find's -exec option, which lacks a convenient way to abort execution immediately upon failure. Synopsis: sregi_file_backup <path-to-instance-file> [--sreg-dir <directory>] <path-to-database-entry>
sregi_find_dir
Report the location of the stream registry applicable to the current directory, or (if specified) the given path. Synopsis: sregi_find_dir [--sreg-dir <directory> [--full-check]] [<path>]
sregi_fuse.py
Mount a FUSE filesystem overlay for sreg. Synopsis: sregi_fuse.py <sreg-repository-to-mount> <mount-point> <sreg-directory>
sregi_get_passphrase
Print a stream registry passphrase, either the system default or the passphrase for the sreg repository provided as an argument. Synopsis: sregi_get_passphrase [<sreg-repository>]
sregi_get_length_from_pointer
Return the length (size) in bytes of the path specified, or of the data referenced by a pointer if the path is a pointer. Synopsis: sregi_get_length_from_pointer [--sreg-dir <directory>] <path>
sregi_hashpointer_sane
Check that the pointer corresponding to the specified hash pointer (it's OK to toss pretty much any files at it — hash pointers, other pointers, and files that aren't pointers at all; non–hash pointer pointers will have hash pointers generated on demand for testing, and non-pointer files will be ignored and reported as success: this allows checking folders containing a mix of file types) is present in the stream registry, and add the specified hash pointer to the specified ID list for use by sregi_drop_single_unused. If instead of a file name "-" is given, the input to check will be read from standard input (use "./-" to check a file named "-" in the current directory). Synopsis: sregi_hashpointer_sane [--fail-check <path-to-instance-file>] [--sreg-dir <directory>] [<path-to-hashpointer>|-] [<path-to-ID-list> <tracking-file> [--verify]]
sregi_verify_backup
Basic sanity check of whether the specified backup can be used. If a tracking file (should contain only an integer) is specified, the file's value will be incremented. Synopsis: sregi_verify_backup <path-to-instance-file> <path-to-file-to-check>
sregi_verify_entry
Check that the specified pointer can be retrieved. If a tracking file (should contain only an integer) is specified, the file's value will be incremented. Synopsis: sregi_verify_entry [--lock-override|--no-lock-override] [--fail-check <path-to-instance-file>] [--sreg-dir <directory>] <path-to-pointer> [tracking-file] [--quick] [--skip-cache|--skip-drop-failed|--drop-failed]
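
Several of the sregi_ scripts above take a tracking file and an instance file. The source does not describe how the scripts implement these internally, so the following is only one plausible scheme, with hypothetical function names: the tracking file holds a counter, and the instance file marks that an earlier invocation failed, which lets later invocations from a find -exec run (or any loop) refuse to continue.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; these helpers are illustrative and not Crystallize's own.

# Increment the integer stored in a tracking file. An empty argument means
# "no tracking file", mirroring the convention described above.
bump_tracking_file() {
    local file=$1 count
    [ -n "$file" ] || return 0
    count=$(cat "$file" 2>/dev/null)
    count=${count#"${count%%[!0]*}"}    # strip leading zeros so 08 is not read as octal
    case $count in
        ''|*[!0-9]*) count=0 ;;         # assumption: missing or non-numeric counts as zero
    esac
    printf '%s\n' "$((count + 1))" > "$file"
}

# Run a command unless an earlier run recorded a failure in the instance
# file; record a failure if this run fails. (Assumed semantics.)
run_with_instance_file() {
    local instance=$1
    shift
    [ ! -e "$instance" ] || return 1
    "$@" || { : > "$instance"; return 1; }
}

# Demonstration: the third item fails, so the fourth is never attempted.
work=$(mktemp -d)
for item in a b fail c; do
    run_with_instance_file "$work/instance" test "$item" != fail \
        && bump_tracking_file "$work/count"
done
cat "$work/count"    # prints 2: the items handled before the first failure
rm -rf "$work"
```

The marker file is what makes up for find -exec carrying on after a failed command: each later invocation sees the marker and exits without doing any work.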

crystallize-bash_setup provides these bash functions.

rubberfs
Wrapper around the rubberfs script: this function should be used instead. Synopsis: rubberfs ( ((create|mount|soft-mount|remount|rename|cd|unmount|soft-unmount|attach|check|save|freeze|delta|gc|thaw|patch|status|list|usage-write|destroy|destroy-no-upload|historybak|historypull) [<RubberFS-name>]) | usage | whereami | stub | (stash <file>...) )
git-absolute-path
???? Synopsis: git-absolute-path <file>
git-escape-path
???? Synopsis: git-escape-path <path>
sregi_available_by_hash
Basic sanity check for a stream registry entry given the checksum corresponding to the entry to check. Synopsis: sregi_available_by_hash <checksum> [<path-to-hashpointer>]
sregi_hash_from_file
Retrieves and prints the checksum from a pointer. Synopsis: sregi_hash_from_file <file>

Installation

For Wreathe 7.3, an ebuild (app-misc/crystallize) is available in the Wreathe overlay (this may also work for similar operating systems such as Ututo XS GNU/Linux).

For other operating systems, use the following installation instructions.

Instructions for installation without ebuild

Requirements

Wreathe 7.3 is required for full support. The simple invocations of the 'crystallize' and 'decrystallize' commands (with filenames as the only arguments) are also supported on Ubuntu GNU/Linux and macOS 10.12 in the interest of promoting the preservation of knowledge (although Ember strongly advises not using non-libre software such as those operating systems), and will probably work on many other UNIX-like operating systems; these instructions only cover that basic support. This is not as well tested as using the software in Wreathe. Please report issues if you encounter them.

While using full functionality is not supported without using the ebuild, the following additional requirements are needed for it (this list is probably incomplete/incorrect).

Download

To download Crystallize, run:

git clone https://github.com/ethus3h/crystallize.git

Copy scripts

To install the downloaded scripts, run cd crystallize; make.

Setup

Edit the configuration file; see the "Configuration file format" section for documentation of this.

Finally, run sudo make install.

Configuration file format

The configuration file is located in your system configuration directory (probably /usr/local/etc or /etc), and is named crystallize.conf. It is a list of key-value pairs in the format Key,Value, with pairs separated by line feeds (0x0A). The following keys are used:

InstallationIdentifier
Installation UUID
Collection
Internet Archive collection identifier (write access to the collection is required)
Passphrase
Passphrase (must be a valid GPG passphrase)
WorkDirectory
Directory for working data (should be writeable and have sufficient free space to hold approximately three times the amount of data being crystallized at any one time). The path MUST NOT contain a newline. A space in the path may cause degraded performance (see comments in scache_gc), and will cause failure if the part of the path name preceding the space is also a pathname that exists; ideally, don't have a space in the path.
EmberLibrary
Path to a directory tree of the format used by the Ember Library
SrfsMountpoint
The mountpoint to be used when srfs is run without one specified.
SrfsDefaultRoot
An optional override for the default stream registry directory to be mounted when srfs is run without one specified (by default, the value of the EmberLibrary configuration field is used instead).

Note that there is currently no facility for storing configuration values containing line feeds.
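
To illustrate the Key,Value format, here is a minimal sketch of looking up one key. It is not how crystallize-getconf is implemented; the function name conf_get and the sample values are hypothetical, and splitting each line at its first comma only is an assumption about how values containing commas are treated.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Crystallize tool): print the value stored for a
# key in a Key,Value configuration file, or fail if the key is absent.
conf_get() {
    local file=$1 key=$2 line
    # The "|| [ -n ... ]" part also handles a last line with no trailing line feed.
    while IFS= read -r line || [ -n "$line" ]; do
        # Everything before the first comma is the key; the rest is the value.
        if [ "${line%%,*}" = "$key" ] && [ "$line" != "$key" ]; then
            printf '%s\n' "${line#*,}"
            return 0
        fi
    done < "$file"
    return 1
}

# Demonstration with a throwaway file and made-up values.
conf=$(mktemp)
printf 'Collection,example-collection\nWorkDirectory,/var/tmp/crystallize-work\n' > "$conf"
conf_get "$conf" WorkDirectory    # prints /var/tmp/crystallize-work
rm -f "$conf"
```

Because each pair occupies exactly one line, a value containing a line feed cannot be represented, which matches the limitation noted above.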

Development

To learn about contributing to this project, visit the development page.