## Requirements

Functional:

1. Users can upload files.
2. Users can download files.
3. Users can update files.
4. Users can share files with other users.
5. Users can automatically sync files across devices.

Non-functional:

1. Support files up to 50 GB.
2. Availability > consistency.
3. Reads far outnumber writes.
4. 1B users and 10M daily file downloads/updates.

Out of scope:

1. Fine-grained access control.
2. Directory structure.
3. Versioning.

## Entities

1. User
2. File
3. FileMetadata
4. Access

## API

### Upload API

Request a list of presigned URLs to upload file chunks to S3.

```
POST /files
{ FileMetadata }
->
{
  FileId,
  PresignedUrl[],
}
```

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class FileMetadata:
    file_id: Optional[int]  # For List API
    file_fingerprint: int
    chunk_fingerprints: Sequence[int]
```

`file_fingerprint` is a hash of the file content, used by the server to verify the integrity of the uploaded file.

Large files are split into smaller **chunks** for upload. Each chunk is hashed individually, and the hashes are stored in `chunk_fingerprints`.

### Download API

Similar to the Upload API, the user receives a list of presigned URLs to download a file from S3.

```
GET /files/{file_id}
->
PresignedUrl[]
```

### Update API

Update a file.

```
PATCH /files/{file_id}
{ FileMetadata }
->
{
  PresignedUrl[],
  ChunkFingerprint[],
}
```

### Share API

Share a file with other users.

```
POST /files/{file_id}/share
{ UserId[] }
```

### List API

View all files accessible to the user.

```
GET /files
->
{
  FileMetadata[],
}
```

## High Level Design

### Storage

Since file content can be large, we store it in cloud object storage such as AWS S3. We use a SQL database to store metadata.

FileOwnership table:

1. file_id (primary key)
2. file_name
3. created_by
4. created_time

Access table:

1. file_id
2. accessible_user_id

ChunkFingerprint table:

1. chunk_fingerprint (primary key)
2. s3_path
3. upload_status

```python
from enum import Enum, auto

class UploadStatus(Enum):
    NOT_UPLOADED = auto()
    UPLOADING = auto()
    UPLOADED = auto()
```

We use a NoSQL database to map each `file_id` to an ordered list of `chunk_fingerprint`s.

### 1 - User can upload files

1. Client chunks the file and computes fingerprints to create the `FileMetadata`.
2. Client calls the Upload API.
3. Server updates the DB.
4. Server returns a list of presigned URLs.
5. Client uploads the chunks to the presigned URLs.

### 2 - User can download files

1. Client calls the Download API.
2. Server validates user access.
   - If the user does not have access or the file does not exist, return 404 Not Found. Returning the same status in both cases avoids leaking whether the file exists.
3. Server creates and returns a list of presigned URLs.
4. Client downloads the file from the presigned URLs.

### 3 - User can update files

1. Client recomputes the chunks and fingerprints.
2. Client calls the Update API.
3. Server updates the DB.
4. Server returns a list of presigned URLs and the fingerprints of the chunks that require re-upload.

### 4 - User can share files with other users

1. Client calls the Share API.
2. Server validates file ownership.
   - If the user does not own the file or the file does not exist, return 404 Not Found.
3. Update the Access table, ignoring duplicates.

### 5 - User can automatically sync files across devices

#### Client to remote

1. Client listens for local file changes.
   - Can use Watchman (Meta's file-watching service).
2. On new file creation, perform an upload.
3. On existing file change, perform an update.

#### Remote to client

##### Desktop / Mobile Foreground

1. Client performs 90s long-polling.
2. On remote file changes, the client syncs.

##### Mobile Background

When the app is not in use, the OS suspends it to conserve battery.

1. On remote file changes, the server sends a push notification.
2. Client wakes up and syncs.

Fallback:

1. Client performs periodic polling.
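Step 1 of the upload flow (chunk the file, fingerprint the chunks) can be sketched as follows. This is a minimal illustration, assuming fixed-size 4 MiB chunks and SHA-256 digests interpreted as integers; `build_metadata` and `fingerprint` are hypothetical helpers, not part of the API above.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB; an illustrative, tunable choice

@dataclass
class FileMetadata:
    file_id: Optional[int]
    file_fingerprint: int
    chunk_fingerprints: Sequence[int]

def fingerprint(data: bytes) -> int:
    # Interpret the SHA-256 digest as an integer fingerprint.
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def build_metadata(content: bytes) -> FileMetadata:
    # Fixed-size chunking for simplicity; the Deep Dives section
    # argues for content-defined chunking instead.
    chunks = [content[i:i + CHUNK_SIZE] for i in range(0, len(content), CHUNK_SIZE)]
    return FileMetadata(
        file_id=None,  # Not yet assigned by the server.
        file_fingerprint=fingerprint(content),
        chunk_fingerprints=[fingerprint(c) for c in chunks],
    )
```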
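On an update, the server only asks the client to upload chunks it does not already hold. A minimal sketch of that comparison, with the ChunkFingerprint table abstracted as a dict (hypothetical helper, illustrative names):

```python
from enum import Enum, auto
from typing import Dict, List

class UploadStatus(Enum):
    NOT_UPLOADED = auto()
    UPLOADING = auto()
    UPLOADED = auto()

def chunks_needing_upload(
    incoming: List[int],
    status_by_fingerprint: Dict[int, UploadStatus],
) -> List[int]:
    # A chunk already UPLOADED (by this or any other file) can be skipped;
    # this skip is also what enables block-level deduplication.
    return [
        fp for fp in incoming
        if status_by_fingerprint.get(fp) is not UploadStatus.UPLOADED
    ]
```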
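The remote-to-client long-polling loop above can be sketched as a driver function. The blocking HTTP call is abstracted as an injected `fetch_changes(cursor)` (a hypothetical stand-in) that returns a new change cursor when remote files changed, or `None` when the 90s window expires:

```python
from typing import Callable, Optional

def long_poll_loop(
    fetch_changes: Callable[[str], Optional[str]],
    sync: Callable[[str], None],
    cursor: str,
    max_rounds: int,
) -> str:
    """Drive remote-to-client sync; each round is one long-poll request.

    The server holds each request open until a change arrives or the
    window expires; the client syncs on change and immediately re-polls.
    """
    for _ in range(max_rounds):
        new_cursor = fetch_changes(cursor)
        if new_cursor is not None:
            sync(new_cursor)  # e.g. download changed files via the Download API
            cursor = new_cursor
    return cursor
```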
## Deep Dives

### Chunking Strategy

We should use [[Content-Defined Chunking]] so that an insertion or deletion in one part of the file does not shift the boundaries (and therefore the fingerprints) of all subsequent chunks, as it would with fixed-size chunking.

### Chunk Packing

If we store many small chunks as individual S3 objects, per-request and per-object costs add up. We can pack multiple small chunks into a single S3 object and store each chunk's offset and size in our database. When downloading, we use an HTTP Range request to fetch just the chunk we need.

Updated ChunkFingerprint schema:

1. chunk_fingerprint (primary key)
2. s3_path
3. **offset**
4. **size**
5. upload_status

### Why not S3 multipart upload / download?

S3 multipart upload abstracts away the complexity of concatenating file chunks. However, giving up control over individual chunks means we cannot perform delta sync or block-level deduplication.
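Given the `offset` and `size` columns, fetching one packed chunk is a single ranged GET against the presigned URL. A sketch of the header construction (hypothetical helpers; `read_chunk` locally simulates what S3 returns for that range):

```python
def range_header(offset: int, size: int) -> dict:
    # HTTP Range is inclusive on both ends: bytes=first-last.
    return {"Range": f"bytes={offset}-{offset + size - 1}"}

def read_chunk(packed_object: bytes, offset: int, size: int) -> bytes:
    # Simulates the 206 Partial Content body S3 would return
    # for the Range header above.
    return packed_object[offset:offset + size]
```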
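A minimal Gear-style rolling-hash sketch of content-defined chunking. The parameters and Gear-table construction are illustrative assumptions (real implementations such as FastCDC also enforce minimum and maximum chunk sizes):

```python
import hashlib

MASK_BITS = 6               # expected chunk size ~ 2**6 bytes; tiny for illustration
MASK = (1 << MASK_BITS) - 1

# Gear table: one pseudo-random 64-bit value per possible byte value.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big")
    for b in range(256)
]

def cdc_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries using a Gear rolling hash.

    A boundary is declared when the low MASK_BITS bits of the hash are all
    zero. Those bits depend only on the last few bytes seen, so boundary
    positions are determined by local content: inserting bytes early in
    the file does not shift the boundaries found later.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the final chunk
    return chunks
```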