feat: Add file_converter extension module (Issue #54) by vaibhav45sktech · Pull Request #56 · dbpedia/databus-python-client

vaibhav45sktech · 2026-02-16T14:58:00Z

Description

This PR adds a new file_converter.py extension module to address Issue #54. The module provides a streaming pipeline for file format conversion with support for gzip decompression and checksum validation.

Changes

Create new databusclient/extensions/file_converter.py module
Implement FileConverter class with streaming support
Add decompress_gzip_stream() method with optional checksum validation
Add compress_gzip_stream() method for gzip compression
Add validate_checksum_stream() method for SHA256 checksum validation

Related Issues
Fixes #54

Type of change

New feature (non-breaking change which adds functionality)

Summary by CodeRabbit

New Features
- Gzip-based compression and decompression with streaming for memory-efficient handling of large files.
- Optional checksum validation to verify data integrity during decompression; validation failures surface as errors.
- Incremental chunked I/O processing for safe, efficient file transformations of any size.

- Create new file_converter.py extension module in databusclient/extensions/ - Implements FileConverter class with streaming pipeline support - Supports gzip decompression with optional checksum validation - Provides methods for compress_gzip_stream and decompress_gzip_stream - Minimal version as suggested in issue to start with gzip + checksum - Can be extended later to support more compression formats

coderabbitai · 2026-02-16T14:58:18Z

📝 Walkthrough

Walkthrough

Adds a new FileConverter class in databusclient/extensions/file_converter.py offering streaming gzip compress/decompress and SHA-256 checksum validation using 8KB chunks and only Python stdlib.

Changes

Cohort / File(s)	Summary
FileConverter Extension Module `databusclient/extensions/file_converter.py`	New `FileConverter` class with three static methods: `decompress_gzip_stream(input_stream, output_stream, validate_checksum=False)` which streams gzip decompression in 8KB chunks and optionally returns a SHA-256 hex digest; `compress_gzip_stream(input_stream, output_stream)` which streams gzip compression; `validate_checksum_stream(input_stream, expected_checksum)` which computes and compares SHA-256 (raises `IOError` on mismatch). Uses `gzip`, `hashlib`.
Package Exports `databusclient/extensions/__init__.py`	Exports `FileConverter` by importing it and adding to `__all__`, making the class publicly importable from `databusclient.extensions`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	❓ Inconclusive	The PR implements gzip operations but Issue `#54` requests support for multiple compression formats; unclear if minimal scope was mutually agreed upon or represents incomplete implementation.	Clarify whether limiting to gzip-only is intentional minimum scope or if bz2/xz support was deferred to future work.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: adding a new file_converter extension module to address Issue `#54`.
Linked Issues check	✅ Passed	The PR fulfills core coding requirements from Issue `#54`: refactoring conversion logic into file_converter.py module and implementing decompress/compress/validate methods.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	The pull request description includes all required sections: a clear description of changes, related issues (Issue `#54`), and the type of change marked as 'New feature'. However, the checklist items are incomplete - no testing confirmations or code style validations are marked.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@databusclient/extensions/file_converter.py`:
- Around line 38-50: The code creates source_hasher from expected_checksum but
never uses it, so remove the unused expected_checksum parameter and the
source_hasher/hash-from-compressed-stream logic from the FileConverter method
that reads gzip (the block using gzip.open, FileConverter.CHUNK_SIZE, hasher,
and output_stream), keep only validate_checksum-driven hasher for decompressed
chunks, and update the method signature and return value accordingly; then
update callers to perform checksum validation on the compressed input stream
(e.g., via validate_checksum_stream) before calling this decompression routine.

🧹 Nitpick comments (1)

databusclient/extensions/file_converter.py (1)
70-103: seek(0) assumes a seekable stream; return type is misleading.

Two concerns:

Line 88: input_stream.seek(0) will raise on non-seekable streams (e.g., network response bodies, pipes). Since the PR objective targets integration with the download pipeline, callers will need to be aware of this constraint. Consider either documenting the seekable requirement, removing the seek(0) call (let the caller manage stream position), or accepting a non-seekable stream and removing the seek.

Return type: The method signature says -> bool but it never returns False — it either returns True or raises IOError. Consider returning bool without raising (let the caller decide), or changing the return type to None and only raising on failure. Pick one contract and be consistent.
Suggested approach: remove seek, return bool without raising
     `@staticmethod`
     def validate_checksum_stream(
         input_stream: BinaryIO,
         expected_checksum: str
     ) -> bool:
         """Validate SHA256 checksum of a stream.
 
         Args:
-            input_stream: Input stream to validate
+            input_stream: Input stream to validate (must be positioned at start)
             expected_checksum: Expected SHA256 checksum
 
         Returns:
-            True if checksum matches
+            True if checksum matches, False otherwise
-
-        Raises:
-            IOError: If checksum validation fails
         """
         hasher = hashlib.sha256()
-        input_stream.seek(0)
-        
+
         while True:
             chunk = input_stream.read(FileConverter.CHUNK_SIZE)
             if not chunk:
                 break
             hasher.update(chunk)
-        
+
         computed = hasher.hexdigest()
-        if computed.lower() != expected_checksum.lower():
-            raise IOError(
-                f"Checksum mismatch: expected {expected_checksum}, "
-                f"got {computed}"
-            )
-        
-        return True
+        return computed.lower() == expected_checksum.lower()

databusclient/extensions/file_converter.py

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

databusclient/extensions/file_converter.py (3)

102-105: IOError is semantically incorrect for a checksum mismatch; consider ValueError.

IOError (aliased to OSError in Python 3) conventionally signals operating-system-level I/O failures (file not found, disk full, permission denied). A checksum mismatch is a data-integrity error — ValueError or a custom ChecksumMismatchError would let callers distinguish between a genuine I/O failure and bad data without catching all OSErrors.

♻️ Proposed change

-            raise IOError(
+            raise ValueError(
                 f"Checksum mismatch: expected {expected_checksum}, "
                 f"got {computed}"
             )

Also update the docstring:

-        Raises:
-            IOError: If checksum validation fails
+        Raises:
+            ValueError: If checksum validation fails

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@databusclient/extensions/file_converter.py` around lines 102 - 105, Replace
the semantically incorrect IOError raised for checksum mismatches with a more
appropriate exception: change the raise in file_converter.py (the
checksum-checking block that currently raises IOError("Checksum mismatch:
expected..., got ...")) to raise ValueError with the same message, or define and
raise a custom ChecksumMismatchError class and use that instead; also update the
surrounding function/class docstring (the docstring for the checksum
verification routine in file_converter.py) to document the new exception type so
callers know to catch ValueError or ChecksumMismatchError.

21-21: validate_checksum parameter name should be compute_checksum.

The parameter only computes and returns the digest — it performs no comparison. The docstring correctly describes the behavior as "compute", but the parameter name implies validation. The past review had proposed this rename; it wasn't carried through.

♻️ Proposed rename

     def decompress_gzip_stream(
         input_stream: BinaryIO,
         output_stream: BinaryIO,
-        validate_checksum: bool = False,
+        compute_checksum: bool = False,
     ) -> Optional[str]:
         """Decompress gzip stream with optional checksum computation.
         ...
             validate_checksum: Whether to compute a SHA-256 checksum of
+            compute_checksum: Whether to compute a SHA-256 checksum of
                 the decompressed output.
         ...
         """
-        hasher = hashlib.sha256() if validate_checksum else None
+        hasher = hashlib.sha256() if compute_checksum else None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@databusclient/extensions/file_converter.py` at line 21, Rename the parameter
validate_checksum to compute_checksum throughout the file to match behaviour
described in the docstring: update the function/method signature(s) that
currently declare validate_checksum (and any default value False) to
compute_checksum: bool = False, update all internal variable references and any
return/tuple keys or comments that use validate_checksum to compute_checksum,
and update any call sites inside databusclient/extensions/file_converter.py (and
its unit tests if present) so callers use the new name; ensure type hints,
docstring example/parameter list, and any logging/messages reflect the new name.

12-107: No tests provided for the new module.

The PR adds a non-trivial streaming pipeline but no unit tests. At minimum, these cases should be covered:

Round-trip: compress → decompress restores original bytes.
validate_checksum_stream passes on a correct hash and raises on a bad hash.
decompress_gzip_stream with validate_checksum=True returns the correct hex digest.
Non-seekable stream handling for validate_checksum_stream.

Would you like me to generate a tests/test_file_converter.py skeleton covering the above cases, or open a follow-up issue to track this?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@databusclient/extensions/file_converter.py` around lines 12 - 107, Add unit
tests for the FileConverter class to cover the streaming pipeline: create
tests/test_file_converter.py and include (1) a round-trip test that writes
random bytes, uses FileConverter.compress_gzip_stream to compress into a buffer
and then FileConverter.decompress_gzip_stream to decompress and assert original
bytes are restored; (2) tests for validate_checksum_stream that assert True on a
correct SHA-256 and that an IOError is raised on a bad hash; (3) a test that
calls FileConverter.decompress_gzip_stream with validate_checksum=True and
asserts the returned hex digest equals the SHA-256 of the decompressed bytes;
and (4) a test for non-seekable input to validate_checksum_stream using a custom
non-seekable wrapper (or io.BufferedReader over a pipe-like object) to ensure
validation still works without calling seek; use BinaryIO-compatible buffers
(io.BytesIO) and reference FileConverter.compress_gzip_stream,
FileConverter.decompress_gzip_stream, and FileConverter.validate_checksum_stream
in assertions.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@databusclient/extensions/file_converter.py`:
- Around line 91-107: validate_checksum_stream currently reads to EOF and
returns without resetting the stream, which breaks the validate-then-decompress
pattern used by decompress_gzip_stream; change validate_checksum_stream to seek
input_stream back to 0 before returning and update its docstring to state that
input_stream must be seekable (and that the function will reset the stream
position), so callers like FileConverter.decompress_gzip_stream can safely read
from the start after validation.
- Around line 12-13: The package API currently doesn't export FileConverter from
the extensions package, forcing consumers to import
databusclient.extensions.file_converter.FileConverter directly; update
extensions/__init__.py to import and export FileConverter so it can be accessed
as databusclient.extensions.FileConverter (e.g., add "from .file_converter
import FileConverter" and include it in __all__), or document that only direct
module imports are supported—modify the __init__ export to reference the
FileConverter class name to restore the expected package-level import.

---

Nitpick comments:
In `@databusclient/extensions/file_converter.py`:
- Around line 102-105: Replace the semantically incorrect IOError raised for
checksum mismatches with a more appropriate exception: change the raise in
file_converter.py (the checksum-checking block that currently raises
IOError("Checksum mismatch: expected..., got ...")) to raise ValueError with the
same message, or define and raise a custom ChecksumMismatchError class and use
that instead; also update the surrounding function/class docstring (the
docstring for the checksum verification routine in file_converter.py) to
document the new exception type so callers know to catch ValueError or
ChecksumMismatchError.
- Line 21: Rename the parameter validate_checksum to compute_checksum throughout
the file to match behaviour described in the docstring: update the
function/method signature(s) that currently declare validate_checksum (and any
default value False) to compute_checksum: bool = False, update all internal
variable references and any return/tuple keys or comments that use
validate_checksum to compute_checksum, and update any call sites inside
databusclient/extensions/file_converter.py (and its unit tests if present) so
callers use the new name; ensure type hints, docstring example/parameter list,
and any logging/messages reflect the new name.
- Around line 12-107: Add unit tests for the FileConverter class to cover the
streaming pipeline: create tests/test_file_converter.py and include (1) a
round-trip test that writes random bytes, uses
FileConverter.compress_gzip_stream to compress into a buffer and then
FileConverter.decompress_gzip_stream to decompress and assert original bytes are
restored; (2) tests for validate_checksum_stream that assert True on a correct
SHA-256 and that an IOError is raised on a bad hash; (3) a test that calls
FileConverter.decompress_gzip_stream with validate_checksum=True and asserts the
returned hex digest equals the SHA-256 of the decompressed bytes; and (4) a test
for non-seekable input to validate_checksum_stream using a custom non-seekable
wrapper (or io.BufferedReader over a pipe-like object) to ensure validation
still works without calling seek; use BinaryIO-compatible buffers (io.BytesIO)
and reference FileConverter.compress_gzip_stream,
FileConverter.decompress_gzip_stream, and FileConverter.validate_checksum_stream
in assertions.

databusclient/extensions/file_converter.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@databusclient/extensions/file_converter.py`:
- Around line 102-109: The input_stream is only seeked back on success so when
the checksum check raises an IOError the stream stays at EOF; move the
input_stream.seek(0) to before the comparison of computed vs expected_checksum
(i.e., seek the input_stream right after computing computed = hasher.hexdigest()
and before comparing computed.lower() != expected_checksum.lower()) so the
stream is rewound regardless of an exception, and update the function docstring
to drop the word "successful" (so it documents that the stream will be rewound
even when a checksum mismatch occurs).

---

Duplicate comments:
In `@databusclient/extensions/file_converter.py`:
- Line 21: The parameter name validate_checksum is misleading because the
function only computes and returns a digest; rename the parameter to
compute_checksum across the FileConverter API in
databusclient/extensions/file_converter.py (update the function signature where
validate_checksum appears, any internal uses/branches that check it, and all
docstrings that reference "validate_checksum") and update all callers/tests to
use compute_checksum so the name matches behavior and the docstrings that
already say "compute" remain consistent.

databusclient/extensions/file_converter.py

vaibhav45sktech · 2026-02-18T13:43:07Z

Greetings @Integer-Ctrl,Could you please review this pr whenever available . I might be mistaken in some of the changes, so your guidance and feedback would be very helpful. Thank you!

coderabbitai bot reviewed Feb 16, 2026

View reviewed changes

databusclient/extensions/file_converter.py Outdated Show resolved Hide resolved

vaibhav45sktech closed this Feb 17, 2026

vaibhav45sktech reopened this Feb 17, 2026

vaibhav45sktech added 2 commits February 18, 2026 02:00

Merge branch 'dbpedia:main' into feature/file-converter-extension

2ef168c

pipeline

4698136

coderabbitai bot reviewed Feb 17, 2026

View reviewed changes

databusclient/extensions/file_converter.py Show resolved Hide resolved

databusclient/extensions/file_converter.py Show resolved Hide resolved

minorchnages

3d9ad5c

coderabbitai bot reviewed Feb 17, 2026

View reviewed changes

databusclient/extensions/file_converter.py Show resolved Hide resolved

fixclean

3428c92

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add file_converter extension module (Issue #54)#56

feat: Add file_converter extension module (Issue #54)#56
vaibhav45sktech wants to merge 5 commits intodbpedia:mainfrom
vaibhav45sktech:feature/file-converter-extension

vaibhav45sktech commented Feb 16, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Feb 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

vaibhav45sktech commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

vaibhav45sktech commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Type of change

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

vaibhav45sktech commented Feb 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

vaibhav45sktech commented Feb 16, 2026 •

edited

Loading

coderabbitai bot commented Feb 16, 2026 •

edited

Loading