The sync process was getting stuck because we never handled the case where
an update to Gluon failed. This caused the flush stage to exit, but the
sync process would continue until it eventually got stuck due to lack of
progress.
There was an issue where new attachment download requests would hang
forever because we did not check whether the context was cancelled. At
that point there were no more workers left to consume the channel messages.
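As a rough illustration (the names and signature below are hypothetical, not the actual Bridge code), a download worker needs a `ctx.Done()` branch alongside its channel receive:

```go
package main

import "context"

// attachmentWorker consumes download requests from a channel. Without the
// ctx.Done() case, the worker (or anything sending on the channel) could
// block forever once the remaining workers have exited.
func attachmentWorker(ctx context.Context, requests <-chan string, download func(context.Context, string) error) {
	for {
		select {
		case <-ctx.Done():
			return // the sync was cancelled: stop instead of hanging on the channel
		case id, ok := <-requests:
			if !ok {
				return // channel closed, no more requests
			}
			_ = download(ctx, id)
		}
	}
}
```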
This feature was not restored in a previous MR. Attachments are now
downloaded in parallel. There is a pool of maxParallelDownloads attachment
downloaders shared by all message downloads.
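A minimal sketch of the shared pool idea, assuming a channel-based semaphore; the constant value and all names are illustrative only:

```go
package main

import "sync"

const maxParallelDownloads = 20 // illustrative value

// downloadAttachments fetches all attachments of a single message. The
// semaphore sem is created once, e.g. make(chan struct{}, maxParallelDownloads),
// and shared by every message download, capping total parallelism.
func downloadAttachments(sem chan struct{}, attachmentIDs []string, download func(string) error) {
	var wg sync.WaitGroup

	for _, id := range attachmentIDs {
		wg.Add(1)
		sem <- struct{}{} // acquire a downloader slot from the shared pool

		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			_ = download(id)
		}(id)
	}

	wg.Wait()
}
```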
Updates go-proton-api and Gluon to include memory reduction changes and
modifies the sync process to take into account how much memory is used
during the sync stage.
The sync process now has an extra stage which first downloads the message
metadata to ensure that we only download up to `syncMaxDownloadRequesMem`
worth of messages, or 250 messages total. This allows the download
request to scale automatically to accommodate many small or a few very
large messages.
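A rough sketch of this chunking rule, using an illustrative metadata type and a maxMem stand-in for `syncMaxDownloadRequesMem`:

```go
package main

// messageMetadata is an illustrative stand-in for the metadata returned by
// the new stage; only the fields needed for chunking are shown.
type messageMetadata struct {
	ID   string
	Size int64
}

// chunkBySize groups message IDs so that each download request stays under
// maxMem bytes or maxCount messages (250 in the description above),
// whichever is reached first.
func chunkBySize(metadata []messageMetadata, maxMem int64, maxCount int) [][]string {
	var (
		chunks  [][]string
		current []string
		curMem  int64
	)

	for _, m := range metadata {
		if len(current) > 0 && (curMem+m.Size > maxMem || len(current) >= maxCount) {
			chunks = append(chunks, current)
			current, curMem = nil, 0
		}

		current = append(current, m.ID)
		curMem += m.Size
	}

	if len(current) > 0 {
		chunks = append(chunks, current)
	}

	return chunks
}
```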
The IDs are then sent to a download go-routine which downloads the
message and its attachments. The result is then forwarded to another
go-routine which builds the actual message. This stage tries to ensure
that we don't use more than `syncMaxMessageBuildingMem` to build these
messages.
Finally, the result is sent to a final go-routine which applies the
changes to Gluon and waits for them to be completed.
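A simplified sketch of how such a pipeline can be wired together in Go; all types and helpers below are illustrative stand-ins, and cancellation and error propagation between stages are omitted for brevity:

```go
package main

import "context"

// Illustrative stand-in types and helpers; the real ones live in the sync code.
type rawMessage struct{ ID string }
type builtMessage struct{ ID string }

func downloadMessage(ctx context.Context, id string) rawMessage { return rawMessage{ID: id} }
func buildMessage(raw rawMessage) builtMessage                  { return builtMessage{ID: raw.ID} }
func applyToGluon(ctx context.Context, msg builtMessage) error  { return nil }

// runSyncPipeline wires the stages together: download -> build -> apply.
func runSyncPipeline(ctx context.Context, batches <-chan []string) error {
	downloaded := make(chan rawMessage)
	built := make(chan builtMessage)

	go func() {
		defer close(downloaded)
		for batch := range batches {
			for _, id := range batch {
				downloaded <- downloadMessage(ctx, id) // message + attachments
			}
		}
	}()

	go func() {
		defer close(built)
		for raw := range downloaded {
			built <- buildMessage(raw) // kept under the build memory budget
		}
	}()

	for msg := range built {
		if err := applyToGluon(ctx, msg); err != nil { // apply to Gluon and wait
			return err
		}
	}

	return nil
}
```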
The new process is currently limited to 2GB. Dynamic scaling will be
implemented in a follow-up. For systems with less than 2GB of memory we
limit these parameters to values that are known to work.
For every update sent to Gluon, wait and check the returned error to see
whether an error occurred.
Note: updates can't be inspected at the call site, as that can lead to
deadlocks.
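A hedged sketch of this pattern, assuming an illustrative update type with an error channel rather than the actual Gluon API:

```go
package main

import (
	"context"
	"log"
)

// update is an illustrative stand-in: err receives the result once the
// update has been applied on the Gluon side.
type update struct {
	err chan error
}

// publish sends an update for processing and hands it to a separate waiter;
// waiting at the call site could deadlock if the caller is itself needed to
// make progress on applying updates.
func publish(updates chan<- *update, waiters chan<- *update) {
	up := &update{err: make(chan error, 1)}
	updates <- up
	waiters <- up
}

// waitOnUpdates checks the result of every published update so that apply
// failures are no longer silently dropped.
func waitOnUpdates(ctx context.Context, waiters <-chan *update) {
	for up := range waiters {
		select {
		case err := <-up.err:
			if err != nil {
				log.Printf("failed to apply update: %v", err)
			}
		case <-ctx.Done():
			return
		}
	}
}
```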
Make sure that we are using at least 16 workers for sync; otherwise,
multiply the current sync worker count by 2.
Finally, this patch also logs how long it takes to transfer all the
messages from the server.
Revise syncing work distribution. Sync time can be reduced by up to 50%.
Rework the sync so that it pipelines better with bigger batch counts at
each stage. We now use 3 separate stages: Download, Updates and Sync.
The Download stage downloads messages in maxBatchSize intervals using
1.5x syncWorkers. Once the current batch has finished downloading, it's
forwarded to the Updates stage and we proceed to download the next
batch.
The Updates stage converts everything into Gluon updates and prepares a
collection of no-op updates that the Sync stage can wait on for termination.
Finally, the Sync stage waits until the updates have been applied in
Gluon so that the vault information can be updated. We currently allow up
to 4 pending wait operations to be queued so as not to block the
pipeline.
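A rough sketch of this bounded wait queue, assuming an illustrative batch type; the buffer size of 4 mirrors the limit mentioned above, and all names are stand-ins:

```go
package main

import "context"

// appliedBatch is an illustrative stand-in for a batch handed to the Sync stage.
type appliedBatch struct {
	wait       func(context.Context) error // blocks until Gluon has applied the batch
	lastSynced string                      // ID used to advance the vault sync position
}

// syncStage queues at most 4 pending wait operations so the earlier stages
// can keep producing while Gluon applies previous batches.
func syncStage(ctx context.Context, batches <-chan appliedBatch, updateVault func(string) error) error {
	pending := make(chan appliedBatch, 4) // at most 4 queued wait operations

	done := make(chan error, 1)
	go func() {
		for batch := range pending {
			if err := batch.wait(ctx); err != nil {
				done <- err
				return
			}
			if err := updateVault(batch.lastSynced); err != nil {
				done <- err
				return
			}
		}
		done <- nil
	}()

	for batch := range batches {
		select {
		case pending <- batch: // queue another wait without blocking the pipeline
		case err := <-done:
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	close(pending)

	return <-done
}
```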
This change implements safe.Mutex and safe.RWMutex, which wrap the
sync.Mutex and sync.RWMutex types and are assigned a globally unique
integer ID. The safe.Lock and safe.RLock methods sort the mutexes
by this integer ID before locking to ensure that locks for a given
set of mutexes are always performed in the same order, avoiding
deadlocks.
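A minimal sketch of the ordered-locking idea, mirroring the description above rather than the exact implementation:

```go
package main

import (
	"sort"
	"sync"
	"sync/atomic"
)

var nextMutexID uint64

// Mutex wraps sync.Mutex with a globally unique ID used for lock ordering.
type Mutex struct {
	mu sync.Mutex
	id uint64
}

func NewMutex() *Mutex {
	return &Mutex{id: atomic.AddUint64(&nextMutexID, 1)}
}

// Lock acquires all the given mutexes in ascending ID order, runs fn, then
// releases them in reverse order. Because every caller locks any set of
// mutexes in the same global order, two callers can never hold a pair in
// opposite orders, which is what causes lock-ordering deadlocks.
func Lock(fn func(), mutexes ...*Mutex) {
	sorted := append([]*Mutex(nil), mutexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].id < sorted[j].id })

	for _, m := range sorted {
		m.mu.Lock()
	}
	defer func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			sorted[i].mu.Unlock()
		}
	}()

	fn()
}
```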
This fixes various race conditions and leaks related to the user's sync
and API event stream. It was possible for a sync/stream to begin after a
user was already closed; this change prevents that by managing the
goroutines related to sync/stream within cancellable groups.
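A simplified sketch of such a cancellable group, with illustrative names: starting is refused after close, and Close cancels and waits for everything tied to the user's lifetime.

```go
package main

import (
	"context"
	"sync"
)

// group ties sync/stream goroutines to a single lifetime.
type group struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup

	mu     sync.Mutex
	closed bool
}

func newGroup() *group {
	ctx, cancel := context.WithCancel(context.Background())
	return &group{ctx: ctx, cancel: cancel}
}

// Go starts fn unless the group has already been closed.
func (g *group) Go(fn func(ctx context.Context)) bool {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.closed {
		return false // the user was already closed; refuse to start a new sync/stream
	}

	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		fn(g.ctx)
	}()

	return true
}

// Close cancels all running goroutines and waits for them to exit.
func (g *group) Close() {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()

	g.cancel()
	g.wg.Wait()
}
```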
We need to unlock the user keyring anyway to unlock the address keyring,
so we should just return it instead of re-unlocking the user keyring
when sending a message.