Conversation
|
Not specific feedback for the RFC, just sharing some context links and previous discussion that was helpful in getting the apache/datafusion#16779 Also tagging @ggershinsky in case he has any cycles to read this. His guidance was instrumental on the DataFusion and Arrow-rs PME work, as well as Iceberg Java encryption implementation. |
|
Thanks, I'll be glad to have a look. |
docs/rfcs/0003_table_encryption.md
Outdated
| ``` | ||
| Master Key (in KMS) | ||
| └── wraps → KEK (Key Encryption Key) — stored in table metadata as EncryptedKey | ||
| └── wraps → DEK (Data Encryption Key) — stored in StandardKeyMetadata per file |
There was a problem hiding this comment.
some DEKs (those for manifest list files) are also stored in table metadata as EncryptedKey. These DEKs are indeed packaged in a StandardKeyMetadata (along with AAD prefix and file length). The serialized StandardKeyMetadata is encrypted/wrapped by the KEK, and stored in the table metadata / encrypted_keys structure.
The manifest file DEKs are packaged in StandardKeyMetadata, and stored as-is (without encryption) in manifest list files. The latter are encrypted then.
The data file DEKs are packaged in StandardKeyMetadata, and stored as-is (without encryption) in manifest files. The latter are encrypted then.
docs/rfcs/0003_table_encryption.md
Outdated
|
|
||
| - **Master keys** live in the KMS and never leave it | ||
| - **KEKs** are wrapped by the master key and stored in `TableMetadata.encryption_keys` | ||
| - **DEKs** are wrapped by a KEK and stored per-file in `StandardKeyMetadata` |
There was a problem hiding this comment.
Only manifest list DEKs are wrapped by a KEK. Other DEKs are encrypted in the parent files, by the parent DEKs
| │ | ||
| ▼ | ||
| load_manifest_list(file_io, table_metadata) | ||
| 1. Look up encryption_key_id in table_metadata.encryption_keys |
There was a problem hiding this comment.
Also need to unwrap the KEK (via a KMS client)
docs/rfcs/0003_table_encryption.md
Outdated
| a. file_io.new_encrypted_output(path) → AGS1-encrypting OutputFile | ||
| b. em.wrap_key_metadata() → EncryptedKey for table metadata | ||
| c. Store key_id on Snapshot.encryption_key_id | ||
| 3. Table updates include AddEncryptionKey for new KEKs |
There was a problem hiding this comment.
Also need to wrap the KEK (via a KMS client)
|
Thanks for taking a look @ggershinsky I've tried to fill in some of the details here |
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @xanderbailey for this pr, generally LGTM, left some suggestions.
|
|
||
| ### Iceberg Spec: Encryption | ||
|
|
||
| The [Iceberg table spec](https://iceberg.apache.org/spec/#table-metadata) defines encryption |
| │ ├── kms/ | ||
| │ │ ├── mod.rs | ||
| │ │ └── in_memory.rs # InMemoryKms (testing only) | ||
| │ └── integration_tests.rs # End-to-end encryption round-trip tests |
There was a problem hiding this comment.
In rust we usually put integration tests in tests dir, next to src dir.
|
|
||
| ```rust | ||
| #[async_trait] | ||
| pub trait EncryptionManager: Debug + Send + Sync { |
There was a problem hiding this comment.
Do we really need this to be a trait? I think a no-op kms client would be enough?
There was a problem hiding this comment.
Yeah we actually don't need this to be a trait the standard encryption manager is the only one we need
| `InputFile` and `OutputFile` are enums with three variants each: | ||
|
|
||
| ```rust | ||
| pub enum InputFile { |
There was a problem hiding this comment.
I think we this enum should sth as following:
pub enum EncryptedInputFile {
Encrypted {
key_metadata: KeyMetadata
inner: InputFile
},
NativeedEncryted {
key_material: NativeKeyMaterial,
inner: InputFile
}
}
There was a problem hiding this comment.
I think this requires a lot more changes to the current code since encrypted inputs and outputs won't be handled transparently. Or is the suggestion to have InputFile as an enum of Plain, EncryptedInputFile and EncryptedInputFile is itself an enum?
There was a problem hiding this comment.
I find it a little difficult to imagine what it will look like given different choice? Could you give raise some examples? I'm even thinking if we actually need an enum. InputFile, EncryptedInputFile and NativeEncryptedInputFile are used in differenent cases. Let's use an example, ManifestReader, if we remove the enum, what we need is to add an extra api for read(EncryptedInputFile). This is acceptable to me, some we have a FileRead trait to abstract out the descryption process.
There was a problem hiding this comment.
I'm leaning towards the example snippet @blackmwk above and I don't think we should expand Input/OutputFile to an enum. EncryptedFile should be a wrapper of Input/OutputFile rather than a variation.
By the same logic, having a wrapper Storage like EncryptionStorage may not be ideal. For encryption, we only need to (un)wrap Input/OutputFile in the FileIO level
There was a problem hiding this comment.
I think we can do this if you both think this is better but this requires some reasonable surgery on structs like ManifestListWriter which today takes an OutputFile and will now need to take a Box<dyn FileWrite> and ManifestWriter will need Box<dyn FileWrite> and a location.
There was a problem hiding this comment.
Okay I tried it out and it's actually not too bad. If you're both happy with the changes I mentioned above then I think this is a good approach.
| // Via FileIOBuilder extension (works with RestCatalog and any extension-aware catalog) | ||
| let file_io = FileIOBuilder::new("s3") | ||
| .with_prop("s3.region", "us-east-1") | ||
| .with_extension(encryption_manager) |
| .build()?; | ||
|
|
||
| // Or via convenience method on FileIO | ||
| let file_io = file_io.with_encryption_manager(encryption_manager); |
| // Option A: EncryptionManager on the catalog | ||
| let catalog = GlueCatalogBuilder::default() | ||
| .with_storage_factory(Arc::new(OpenDalStorageFactory::S3)) | ||
| .with_encryption_manager(encryption_manager) |
There was a problem hiding this comment.
I think what we actually need from user is KmsClientFactory?
There was a problem hiding this comment.
Oh interesting, where they can provide a way to construct a number of KMS clients?
There was a problem hiding this comment.
I mean the code to be as following:
let catalog = GlueCatalogBuilder::default()
.with_storage_factory(Arc::new(OpenDalStorageFactory::S3))
.with_kms_client_factory(new AwsKmsClientFactory())
| .load("my_catalog", props) | ||
| .await?; | ||
|
|
||
| // Option B: Wrapping StorageFactory |
There was a problem hiding this comment.
I prefer option A due to https://github.com/apache/iceberg-rust/pull/2183/changes#r2927906969
Which issue does this PR close?
RFC for table encryption
Part of: #2034
Rough draft with some of the key parts: #2042
What changes are included in this PR?
Are these changes tested?