2023-07-25 00:34:29 +00:00
|
|
|
# Planning for implementing a postgres repo for pict-rs
|
|
|
|
|
2023-07-25 00:48:45 +00:00
|
|
|
## Technology
|
|
|
|
|
|
|
|
I've identified these crates as useful for achieving a reasonable postgres experience
|
|
|
|
- [diesel-async](https://docs.rs/diesel-async/latest/diesel_async/)
|
|
|
|
- [refinery](https://docs.rs/refinery/latest/refinery/)
|
|
|
|
- [deadpool](https://docs.rs/deadpool/latest/deadpool/)
|
|
|
|
- [tokio-postgres](https://docs.rs/tokio-postgres/latest/tokio_postgres/)
|
|
|
|
|
|
|
|
tokio-postgres will actually do the work of talking to postgres in all cases. diesel-async can use
|
|
|
|
tokio-postgres to execute queries. refinery can use tokio-postgres to run migrations. deadpool can
|
|
|
|
pool tokio-postgres connections.
|
|
|
|
|
|
|
|
I've chosen this stack specifically to avoid depending on `libpq`, the c implementation of a
|
|
|
|
postgres client. This is not because it's written in C. 99.9% of postgres client libraries use libpq
|
|
|
|
to great success. It is to keep the build process for pict-rs simple. Sticking with a full stack of
|
|
|
|
rust means that only a rust compiler is required to build pict-rs.
|
|
|
|
|
|
|
|
|
|
|
|
## Plan
|
|
|
|
|
|
|
|
pict-rs isolates different concepts between different "Repo" traits. There's a single top-level
|
|
|
|
FullRepo that depends on the others to ensure everything gets implemented properly. Since there's
|
|
|
|
only been one repo implementation so far, it's not optimized for network databases and some things
|
|
|
|
are less efficient than they could be.
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
|
|
|
|
### HashRepo
|
|
|
|
This is a little complicated because one of the things a HashRepo can do is return a stream of
|
|
|
|
hashes from the repo. This can likely be implemented as a batch-retrieval operation that fetches
|
|
|
|
1000 hashes at once and then drains them on each call to `poll_next`
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- size
|
|
|
|
- hashes
|
2023-09-01 23:42:18 +00:00
|
|
|
- hash_page
|
|
|
|
- hash_page_by_date
|
|
|
|
- bound
|
|
|
|
- create_hash
|
|
|
|
- create_hash_with_timestamp
|
2023-07-26 01:08:27 +00:00
|
|
|
- update_identifier
|
|
|
|
- identifier
|
2023-07-25 00:34:29 +00:00
|
|
|
- relate_variant_identifier
|
|
|
|
- variant_identifier
|
2023-07-26 01:08:27 +00:00
|
|
|
- variants
|
2023-07-25 00:34:29 +00:00
|
|
|
- remove_variant
|
|
|
|
- relate_motion_identifier
|
2023-07-26 01:08:27 +00:00
|
|
|
- motion_identifier
|
2023-09-01 23:42:18 +00:00
|
|
|
- cleanup_hash
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE hashes (
|
|
|
|
hash BYTEA PRIMARY KEY,
|
|
|
|
identifer TEXT NOT NULL,
|
|
|
|
motion_identifier TEXT,
|
2023-09-01 23:42:18 +00:00
|
|
|
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
|
2023-07-25 00:34:29 +00:00
|
|
|
);
|
|
|
|
|
2023-09-01 23:42:18 +00:00
|
|
|
-- paging through hashes
|
|
|
|
CREATE INDEX ordered_hash_index ON hashes (created_at, hash);
|
|
|
|
|
2023-07-25 00:38:12 +00:00
|
|
|
|
2023-07-25 00:34:29 +00:00
|
|
|
CREATE TABLE variants (
|
|
|
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
|
|
|
hash BYTEA REFERENCES hashes(hash) ON DELETE CASCADE,
|
|
|
|
variant TEXT NOT NULL,
|
|
|
|
identifier TEXT NOT NULL
|
|
|
|
);
|
|
|
|
|
2023-07-25 00:38:12 +00:00
|
|
|
|
2023-07-25 00:34:29 +00:00
|
|
|
CREATE UNIQUE INDEX hash_variant_index ON variants (hash, variant);
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### AliasRepo
|
|
|
|
Used to relate Aliases to Hashes, and to relate Delete Tokens to Aliases. Hashes are always bytes,
|
|
|
|
but could be hex-encoded. postgres's `bytea` type can work with hex-encoding on storage and
|
|
|
|
retrieval so maybe this can be used. Delete Tokens are not always UUIDs, even though they have been
|
|
|
|
UUIDs in all recent versions of pict-rs.
|
|
|
|
|
|
|
|
methods:
|
2023-09-01 23:42:18 +00:00
|
|
|
- create_alias
|
2023-07-26 01:08:27 +00:00
|
|
|
- delete_token
|
2023-07-25 00:34:29 +00:00
|
|
|
- hash
|
2023-09-01 23:42:18 +00:00
|
|
|
- aliases_for_hash
|
|
|
|
- cleanup_alias
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE aliases (
|
|
|
|
alias VARCHAR(30) PRIMARY KEY,
|
|
|
|
hash BYTEA NOT NULL REFERENCES hashes(hash) ON DELETE CASCADE,
|
|
|
|
delete_token VARCHAR(30) NOT NULL
|
|
|
|
);
|
2023-09-01 23:42:18 +00:00
|
|
|
|
|
|
|
|
|
|
|
CREATE INDEX alias_hashes_index ON aliases (hash);
|
2023-07-25 00:34:29 +00:00
|
|
|
```
|
|
|
|
|
|
|
|
|
2023-09-01 23:42:18 +00:00
|
|
|
### SettingsRepo
|
2023-07-25 00:34:29 +00:00
|
|
|
This is used for generic server-level storage. The file & object stores keep their current path
|
|
|
|
generator values here. This is also used in some migrations to mark completion.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- set
|
|
|
|
- get
|
|
|
|
- remove
|
|
|
|
|
|
|
|
could be a simple table with String key & values
|
|
|
|
|
|
|
|
pict-rs currently treats the value here as Bytes, so it could either be hex-encoded or translated to
|
|
|
|
a string
|
|
|
|
```sql
|
|
|
|
CREATE TABLE settings (
|
|
|
|
key VARCHAR(80) PRIMARY KEY,
|
|
|
|
value VARCHAR(80) NOT NULL
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
|
|
|
|
2023-09-01 23:42:18 +00:00
|
|
|
### DetailsRepo
|
2023-07-25 00:34:29 +00:00
|
|
|
Used to relate details (image metadata) to identifiers (image paths). Identifiers are currently
|
|
|
|
treated as bytes, so may need hex-encoding to store in the database. They _should_ be valid strings
|
|
|
|
in most environments, so it might be possible to drop the bytes requirement & instead have a string
|
|
|
|
requirement.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- relate_details
|
|
|
|
- details
|
2023-09-01 23:42:18 +00:00
|
|
|
- cleanup_details
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE details (
|
|
|
|
identifier TEXT PRIMARY KEY,
|
|
|
|
details JSONB NOT NULL,
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### QueueRepo
|
|
|
|
This is going to be the troublesome table. It represents jobs that will be processed. Jobs are
|
|
|
|
pushed as Bytes, but at a higher level are actually JSON strings. The QueueRepo API could be updated
|
2023-07-26 02:14:11 +00:00
|
|
|
to take `T: Serialize` as input rather than bytes, and then we can store it as JSONB. With the
|
2023-07-25 00:34:29 +00:00
|
|
|
current API, the repo doesn't need to know the shape of a job, and maybe that is a benefit. We
|
|
|
|
should take care in the future not to query on the contents of the job.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- push
|
|
|
|
- pop
|
2023-09-01 23:42:18 +00:00
|
|
|
- heartbeat
|
|
|
|
- complete_job
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TYPE job_status AS ENUM ('new', 'running');
|
|
|
|
|
2023-07-25 00:38:12 +00:00
|
|
|
|
2023-07-25 00:34:29 +00:00
|
|
|
CREATE TABLE queue (
|
|
|
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
|
|
|
queue VARCHAR(30) NOT NULL,
|
|
|
|
job JSONB NOT NULL,
|
|
|
|
status job_status NOT NULL DEFAULT 'new',
|
2023-09-01 23:42:18 +00:00
|
|
|
queue_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
|
|
|
|
heartbeat TIMESTAMP
|
2023-07-25 00:34:29 +00:00
|
|
|
);
|
|
|
|
|
2023-07-25 00:38:12 +00:00
|
|
|
|
2023-07-25 00:34:29 +00:00
|
|
|
CREATE INDEX queue_status_index ON queue INCLUDE status;
|
2023-09-01 23:42:18 +00:00
|
|
|
CREATE INDEX heartbeat_index ON queue
|
2023-07-25 00:34:29 +00:00
|
|
|
```
|
|
|
|
|
|
|
|
claiming a job can be
|
|
|
|
```sql
|
2023-09-01 23:42:18 +00:00
|
|
|
UPDATE queue SET status = 'new', heartbeat = NULL
|
|
|
|
WHERE
|
|
|
|
heartbeat IS NOT NULL AND heartbeat < NOW - INTERVAL '2 MINUTES';
|
2023-07-25 00:34:29 +00:00
|
|
|
|
2023-09-01 23:42:18 +00:00
|
|
|
UPDATE queue SET status = 'running', heartbeat = CURRENT_TIMESTAMP
|
2023-07-25 00:34:29 +00:00
|
|
|
WHERE id = (
|
|
|
|
SELECT id
|
|
|
|
FROM queue
|
2023-09-01 23:42:18 +00:00
|
|
|
WHERE status = 'new' AND queue = '$QUEUE'
|
2023-07-25 00:34:29 +00:00
|
|
|
ORDER BY queue_time ASC
|
|
|
|
FOR UPDATE SKIP LOCKED
|
|
|
|
LIMIT 1
|
|
|
|
)
|
|
|
|
returning *;
|
|
|
|
```
|
|
|
|
|
2023-07-25 00:38:12 +00:00
|
|
|
notifying pict-rs of a ready job could be
|
|
|
|
```sql
|
|
|
|
CREATE OR REPLACE FUNCTION queue_status_notify()
|
|
|
|
RETURNS trigger AS
|
|
|
|
$$
|
|
|
|
BEGIN
|
|
|
|
PERFORM pg_notify('queue_status_channel', NEW.id::text);
|
|
|
|
RETURN NEW;
|
|
|
|
END;
|
|
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
|
|
|
|
|
|
|
|
CREATE TRIGGER queue_status
|
|
|
|
AFTER INSERT OR UPDATE OF status
|
|
|
|
ON queue
|
|
|
|
FOR EACH ROW
|
|
|
|
EXECUTE PROCEDURE queue_status_notify();
|
|
|
|
```
|
|
|
|
|
2023-07-25 00:51:39 +00:00
|
|
|
Postgres queue implementation from this blog post: https://webapp.io/blog/postgres-is-the-answer/
|
|
|
|
|
2023-07-25 00:34:29 +00:00
|
|
|
|
2023-09-01 23:42:18 +00:00
|
|
|
### StoreMigrationRepo
|
2023-07-25 00:34:29 +00:00
|
|
|
This is used for migrating from local storage to object storage. It keeps track of which identifiers
|
|
|
|
have been migrated, and on a successful migration, it is fully cleared.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- is_continuing_migration
|
|
|
|
- mark_migrated
|
|
|
|
- is_migrated
|
|
|
|
- clear
|
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE migrations (
|
|
|
|
identifier TEXT PRIMARY KEY,
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### ProxyRepo
|
|
|
|
This is used for keeping track of URLs that map to Aliases for proxied media.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- relate_url
|
|
|
|
- related
|
|
|
|
- remove_relation
|
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE proxies (
|
|
|
|
url PRIMARY KEY,
|
|
|
|
alias VARCHAR(30) NOT NULL REFERENCES aliases(alias)
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### AliasAccessRepo
|
|
|
|
This is used for keeping track of aliases that are "cached" in pict-rs and can be safely removed
|
|
|
|
when they are not useful to keep around. This might be able to piggyback on the aliases table or the
|
|
|
|
proxies table.
|
|
|
|
|
|
|
|
methods:
|
2023-09-01 23:42:18 +00:00
|
|
|
- accessed_alias
|
|
|
|
- set_accessed_alias
|
|
|
|
- alias_accessed_at
|
2023-07-25 00:34:29 +00:00
|
|
|
- older_aliases
|
2023-09-01 23:42:18 +00:00
|
|
|
- remove_alias_access
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
ALTER TABLE aliases ADD COLUMN accessed TIMESTAMP;
|
|
|
|
```
|
|
|
|
or
|
|
|
|
```sql
|
|
|
|
ALTER TABLE proxies ADD COLUMN accessed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### VariantAccessRepo
|
|
|
|
This is used for keeping track of access times for variants of an image to enable freeing up space
|
|
|
|
from seldom-accessed variants. This might be able to piggyback on the variants table.
|
|
|
|
|
|
|
|
methods:
|
2023-09-01 23:42:18 +00:00
|
|
|
- accessed_variant
|
|
|
|
- set_accessed_variant
|
|
|
|
- variant_accessed_at
|
2023-07-25 00:34:29 +00:00
|
|
|
- older_variants
|
2023-09-01 23:42:18 +00:00
|
|
|
- remove_variant_access
|
2023-07-25 00:34:29 +00:00
|
|
|
|
|
|
|
```sql
|
|
|
|
ALTER TABLE variants ADD COLUMN accessed TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;
|
|
|
|
```
|
2023-09-01 23:42:18 +00:00
|
|
|
|
|
|
|
### UploadRepo
|
|
|
|
Used to keep track of backgrounded uploads.
|
|
|
|
|
|
|
|
methods:
|
|
|
|
- create_upload
|
|
|
|
- wait
|
|
|
|
- claim
|
|
|
|
- complete_upload
|
|
|
|
|
|
|
|
```sql
|
|
|
|
CREATE TABLE uploads (
|
|
|
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
|
|
|
result JSONB,
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
|
|
|
Waiting for an upload
|
|
|
|
```sql
|
|
|
|
CREATE OR REPLACE FUNCTION upload_completion_notify()
|
|
|
|
RETURNS trigger AS
|
|
|
|
$$
|
|
|
|
BEGIN
|
|
|
|
PERFORM pg_notify('upload_completion_channel', NEW.id::text);
|
|
|
|
RETURN NEW;
|
|
|
|
END;
|
|
|
|
$$ LANGUAGE plpgsql;
|
|
|
|
|
|
|
|
|
|
|
|
CREATE TRIGGER upload_result
|
|
|
|
AFTER INSERT OR UPDATE OF result
|
|
|
|
ON uploads
|
|
|
|
FOR EACH ROW
|
|
|
|
EXECUTE PROCEDURE upload_completion_notify();
|
|
|
|
```
|