Validate downloaded file integrity and raise ValueError on hash mismatch#8833
Validate downloaded file integrity and raise ValueError on hash mismatch#8833e-mny wants to merge 11 commits intoProject-MONAI:devfrom
Conversation
for more information, see https://pre-commit.ci
DCO Remediation Commit for Enoch Mok <enochmokny@gmail.com> I, Enoch Mok <enochmokny@gmail.com>, hereby add my Signed-off-by to this commit: d02da49 Signed-off-by: Enoch Mok <enochmokny@gmail.com>
for more information, see https://pre-commit.ci
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
…tch; used ./runtests.sh autofix Signed-off-by: Enoch Mok <enochmokny@gmail.com>
📝 WalkthroughWalkthroughmonai/apps/utils.py: changed hash-failure exceptions from RuntimeError to ValueError for both newly downloaded and existing files; moved post-download hash validation to validate the temporary download file before renaming and removed the later final-file check; expanded failure message to include URL, expected hash, and corruption/tamper guidance. tests/test_utils.py: updated download-failure sentinel to "hash check", added ValueError handling in skip_if_downloading_fails(), introduced SAMPLE_TIFF, SAMPLE_TIFF_HASH, SAMPLE_TIFF_HASH_TYPE, and added TestDownloadUrl covering success and hash-mismatch failure. tests/apps/test_download_and_extract.py: accept ValueError/"hash check" for bad-hash assertions. tests/testing_data/data_config.json: updated all artifact URLs to Hugging Face. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
monai/apps/utils.py (2)
263-276:⚠️ Potential issue | 🟠 MajorValidate the temporary file before caching it.
The file is moved to
filepathbefore the hash check, so a mismatch leaves the corrupted/tampered file in the cache and later retries hit the bad existing file.Proposed fix
- file_dir = filepath.parent - if file_dir: - os.makedirs(file_dir, exist_ok=True) - shutil.move(f"{tmp_name}", f"{filepath}") # copy the downloaded to a user-specified cache. + if not check_hash(tmp_name, hash_val, hash_type): + raise ValueError( + f"{hash_type} hash check of downloaded file failed: URL={url}, " + f"filepath={filepath}, expected {hash_type}={hash_val}, " + f"The file may be corrupted or tampered with. " + "Please retry the download or verify the source." + ) + file_dir = filepath.parent + if file_dir: + os.makedirs(file_dir, exist_ok=True) + shutil.move(f"{tmp_name}", f"{filepath}") # copy the downloaded to a user-specified cache. except (PermissionError, NotADirectoryError): # project-monai/monai issue `#3613` `#3757` for windows pass logger.info(f"Downloaded: {filepath}") - if not check_hash(filepath, hash_val, hash_type): - raise ValueError( - f"{hash_type} hash check of downloaded file failed: URL={url}, " - f"filepath={filepath}, expected {hash_type}={hash_val}, " - f"The file may be corrupted or tampered with. " - "Please retry the download or verify the source." - )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 263 - 276, The code moves the downloaded temporary file (tmp_name) to the final cache path (filepath) before calling check_hash, which can leave a corrupted file cached; instead, run check_hash against the temporary file (use check_hash(tmp_name, hash_val, hash_type)), only create the target directory and shutil.move(tmp_name, filepath) after the hash passes, and if the hash fails delete the temporary file and raise the ValueError (preserve the same error message). Update the logic around tmp_name, filepath and the check_hash call accordingly in the same function where tmp_name and filepath are used so that corrupted downloads are never moved into the cache.
215-224:⚠️ Potential issue | 🟡 MinorDocument the new
ValueErrorcontract.Line 223 still documents
RuntimeError, but downloaded hash mismatches now raiseValueError.Proposed fix
- RuntimeError: When the hash validation of the ``url`` downloaded file fails. + ValueError: When the hash validation of the ``url`` downloaded file fails.As per coding guidelines, "Docstrings should be present for all definition which describe each variable, return value, and raised exception in the appropriate section of the Google-style of docstrings."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 215 - 224, Update the docstring Raises section for the download routine in monai/apps/utils.py so it documents the new ValueError contract: replace the RuntimeError entry that states "When the hash validation of the ``url`` downloaded file fails." (and any other RuntimeError entry referring to downloaded hash mismatch) with ValueError, and ensure the entry that mentions the existing ``filepath`` hash validation also reflects the correct exception type if it changed; keep other urllib-related exceptions (URLError, HTTPError, ContentTooShortError, IOError) as-is and keep messaging consistent with the parameters ``filepath`` and ``url``.tests/test_utils.py (1)
78-87:⚠️ Potential issue | 🔴 CriticalDo not skip or swallow
ValueErrorhash mismatches.This defeats the new
ValueErrorbehavior, and non-matchingValueErrors fall through silently because the branch does not re-raise.Proposed fix
DOWNLOAD_FAIL_MSGS = ( "unexpected EOF", # incomplete download "network issue", "gdown dependency", # gdown not installed - "hash check", # check hash value of downloaded file "limit", # HTTP Error 503: Egress is over the account limit "authenticate", "timed out", # urlopen error [Errno 110] Connection timed out "HTTPError", # HTTPError: 429 Client Error: Too Many Requests for huggingface hub ) @@ - except ValueError as v_e: - if "hash check" in str(v_e): - raise unittest.SkipTest(f"Hash value error while downloading: {v_e}") from v_eAlso applies to: 179-187
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_utils.py` around lines 78 - 87, The code currently swallows ValueError exceptions when matching messages in DOWNLOAD_FAIL_MSGS; change the exception handling where DOWNLOAD_FAIL_MSGS is consulted so that if the caught exception is a ValueError it is re-raised immediately (do not treat it as a benign download failure), while preserving the existing fallback logic for other exception types; update both places that check DOWNLOAD_FAIL_MSGS (the exception handlers that inspect exception messages around the DOWNLOAD_FAIL_MSGS use) to first if isinstance(e, ValueError): raise before matching messages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 198-218: The test asserts the wrong exception type: update
test_download_url to expect ValueError (not RuntimeError) when download_url
detects a hash mismatch, and also update the test's docstring "Raises:" section
to list ValueError instead of RuntimeError; locate the assertion block that
calls download_url with a bad hash and change the exception class accordingly
and adjust the docstring text to reflect ValueError as the raised exception from
download_url.
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 263-276: The code moves the downloaded temporary file (tmp_name)
to the final cache path (filepath) before calling check_hash, which can leave a
corrupted file cached; instead, run check_hash against the temporary file (use
check_hash(tmp_name, hash_val, hash_type)), only create the target directory and
shutil.move(tmp_name, filepath) after the hash passes, and if the hash fails
delete the temporary file and raise the ValueError (preserve the same error
message). Update the logic around tmp_name, filepath and the check_hash call
accordingly in the same function where tmp_name and filepath are used so that
corrupted downloads are never moved into the cache.
- Around line 215-224: Update the docstring Raises section for the download
routine in monai/apps/utils.py so it documents the new ValueError contract:
replace the RuntimeError entry that states "When the hash validation of the
``url`` downloaded file fails." (and any other RuntimeError entry referring to
downloaded hash mismatch) with ValueError, and ensure the entry that mentions
the existing ``filepath`` hash validation also reflects the correct exception
type if it changed; keep other urllib-related exceptions (URLError, HTTPError,
ContentTooShortError, IOError) as-is and keep messaging consistent with the
parameters ``filepath`` and ``url``.
In `@tests/test_utils.py`:
- Around line 78-87: The code currently swallows ValueError exceptions when
matching messages in DOWNLOAD_FAIL_MSGS; change the exception handling where
DOWNLOAD_FAIL_MSGS is consulted so that if the caught exception is a ValueError
it is re-raised immediately (do not treat it as a benign download failure),
while preserving the existing fallback logic for other exception types; update
both places that check DOWNLOAD_FAIL_MSGS (the exception handlers that inspect
exception messages around the DOWNLOAD_FAIL_MSGS use) to first if isinstance(e,
ValueError): raise before matching messages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9b3e7460-f1d0-4896-8762-66d40df2812f
📒 Files selected for processing (3)
monai/apps/utils.pytests/test_utils.pytests/testing_data/data_config.json
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/utils.py (1)
258-272:⚠️ Potential issue | 🔴 Critical🐛 Critical: hash check runs against the wrong path — always fails for new downloads.
At line 262 you call
check_hash(filepath, ...), but the downloaded bytes live attmp_name.filepathisn't created until theshutil.moveon line 272.check_hashwill hit itsopen(...)exceptbranch, log an error, and returnFalse, so every successfully-downloaded file will raiseValueError("... hash check of downloaded file failed ...")regardless of whether the hash actually matches.The reason the new
TestDownloadUrl.test_download_url"passes" today is thatskip_if_downloading_fails()converts this ValueError into aSkipTest— the success path isn't actually being exercised.🔧 Proposed fix
if not tmp_name.exists(): raise RuntimeError( f"Download of file from {url} to {filepath} failed due to network issue or denied permission." ) - if not check_hash(filepath, hash_val, hash_type): + if not check_hash(tmp_name, hash_val, hash_type): raise ValueError( f"{hash_type} hash check of downloaded file failed: URL={url}, " f"filepath={filepath}, expected {hash_type}={hash_val}, " f"The file may be corrupted or tampered with. " "Please retry the download or verify the source." ) file_dir = filepath.parent if file_dir: os.makedirs(file_dir, exist_ok=True) shutil.move(f"{tmp_name}", f"{filepath}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 258 - 272, The hash check is being done against filepath which doesn't exist yet; change the code to run check_hash on the temporary download path tmp_name (i.e., call check_hash(tmp_name, hash_val, hash_type)) before creating/moving to filepath, then ensure the destination directory (file_dir) is created and finally shutil.move(tmp_name, filepath); update any related error messages if needed to reference the tmp_name when reporting the downloaded file being validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 212-218: The failing test is network-fragile because the second
download_url call expecting a ValueError can instead raise network errors; wrap
the bad-hash scenario in the same network-skip guard or separate network from
hash checking: first call download_url under skip_if_downloading_fails() with
the correct hash to ensure the file is present, then call the pure hash checker
(e.g., check_hash or whatever validation routine used by download_url) against
that file to assert it raises ValueError for a bad hash, or re-enter
skip_if_downloading_fails() around the bad-hash download and convert network
exceptions to skips while asserting that a returned/created file triggers the
ValueError path. Ensure you reference download_url and skip_if_downloading_fails
(and the module's hash validation function) when making the change.
- Around line 185-187: The except ValueError handler in tests/test_utils.py
currently only raises SkipTest when the exception message contains "hash check"
and otherwise swallows the ValueError; modify the except ValueError as v_e block
to re-raise v_e when the message does not indicate a hash-check failure (i.e.,
keep the unittest.SkipTest(f"Hash value error while downloading: {v_e}") from
v_e branch for "hash check" and add a plain raise v_e for all other cases) so
genuine ValueErrors from the test body are propagated.
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 258-272: The hash check is being done against filepath which
doesn't exist yet; change the code to run check_hash on the temporary download
path tmp_name (i.e., call check_hash(tmp_name, hash_val, hash_type)) before
creating/moving to filepath, then ensure the destination directory (file_dir) is
created and finally shutil.move(tmp_name, filepath); update any related error
messages if needed to reference the tmp_name when reporting the downloaded file
being validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3be07b5-74d1-45d2-abef-997ad1617025
📒 Files selected for processing (2)
monai/apps/utils.pytests/test_utils.py
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/apps/utils.py (1)
215-223:⚠️ Potential issue | 🟡 MinorUpdate the stale raised-exception docs.
Line 216 still says existing-file hash validation raises
RuntimeError, but Line 231 now raisesValueError.Proposed fix
Raises: - RuntimeError: When the hash validation of the ``filepath`` existing file fails. + ValueError: When the hash validation of the ``filepath`` existing file fails. RuntimeError: When a network issue or denied permission prevents the file download from ``url`` to ``filepath``.As per coding guidelines, "Docstrings should be present for all definition which describe each variable, return value, and raised exception in the appropriate section of the Google-style of docstrings."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/utils.py` around lines 215 - 223, The docstring's Raises section is inconsistent: it still claims a RuntimeError is raised for an existing-file hash mismatch (refers to ``filepath``) while the implementation now raises ValueError for the downloaded-URL hash check (refers to ``url``). Edit the Raises block in the function's Google-style docstring (the one mentioning filepath and url) so the exception types and descriptions match the current behavior: replace the stale RuntimeError mention with the correct exception type(s) and ensure the entries for the filepath/hash check and the url/download/hash check accurately state ValueError (or RuntimeError only where actually raised), and keep other URL/HTTP/IO exceptions unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@monai/apps/utils.py`:
- Around line 231-233: Tests expecting a RuntimeError for hash-mismatch should
be updated to expect ValueError and the revised message text; change the
assertions in tests/apps/test_download_and_extract.py (the block that calls
download_url(... wrong_md5) and the other occurrence around the same check) to
catch ValueError instead of RuntimeError and assert the message contains "md5
hash check" (or more generically "{hash_type} hash check") to match the new
raise in monai/apps/utils.py (the ValueError raised in the hash check path).
In `@tests/test_utils.py`:
- Around line 213-215: The test currently calls
download_url(filepath=model_path, hash_val="0"*64,
hash_type=SAMPLE_TIFF_HASH_TYPE) without the required url, causing a TypeError
before the intended hash check; update the test to pass a valid url (e.g., the
same model_path or a known test URL) as the url argument when calling
download_url so the function runs to the hash validation and raises ValueError
as expected (adjust the call in tests/test_utils.py where download_url is
invoked inside the with self.assertRaises(ValueError) block).
---
Outside diff comments:
In `@monai/apps/utils.py`:
- Around line 215-223: The docstring's Raises section is inconsistent: it still
claims a RuntimeError is raised for an existing-file hash mismatch (refers to
``filepath``) while the implementation now raises ValueError for the
downloaded-URL hash check (refers to ``url``). Edit the Raises block in the
function's Google-style docstring (the one mentioning filepath and url) so the
exception types and descriptions match the current behavior: replace the stale
RuntimeError mention with the correct exception type(s) and ensure the entries
for the filepath/hash check and the url/download/hash check accurately state
ValueError (or RuntimeError only where actually raised), and keep other
URL/HTTP/IO exceptions unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a321ffa5-076e-4483-8541-5bec4e0da138
📒 Files selected for processing (2)
monai/apps/utils.pytests/test_utils.py
Signed-off-by: Enoch Mok <enochmokny@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_utils.py (1)
78-83:⚠️ Potential issue | 🟠 MajorDo not suppress hash-integrity failures here.
This reintroduces the behavior the PR is meant to avoid: hash mismatches become
SkipTestinstead of visible failures underskip_if_downloading_fails().Proposed fix
DOWNLOAD_FAIL_MSGS = ( "unexpected EOF", # incomplete download "network issue", "gdown dependency", # gdown not installed - "hash check", # check hash value of downloaded file "limit", # HTTP Error 503: Egress is over the account limit "authenticate", "timed out", # urlopen error [Errno 110] Connection timed out "HTTPError", # HTTPError: 429 Client Error: Too Many Requests for huggingface hub ) @@ - except ValueError as v_e: - if "hash check" in str(v_e): - raise unittest.SkipTest(f"Hash value error while downloading: {v_e}") from v_e - - raise v_e - -Also applies to: 179-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_utils.py` around lines 78 - 83, The DOWNLOAD_FAIL_MSGS list used by skip_if_downloading_fails() is incorrectly treating hash-integrity failures as transient download failures; remove the "hash check" entry from DOWNLOAD_FAIL_MSGS so hash mismatches raise test failures rather than being converted to SkipTest, and make the same change to the other identical failure-message tuple used later in the file (the second DOWNLOAD_FAIL_MSGS-style list referenced by skip_if_downloading_fails()). Ensure skip_if_downloading_fails() continues to match only genuine download/network/gdown/limit messages and not integrity/hash errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_utils.py`:
- Around line 197-217: The test reuses model_path so download_url sees the
existing file and skips re-downloading, preventing the hash-mismatch path from
being exercised; update TestDownloadUrl.test_download_url to either delete
model_path (os.remove(model_path)) before calling download_url for the mismatch
case or use a new filepath (e.g., model_path_mismatch) and wrap that mismatch
download call with skip_if_downloading_fails() so the test forces a fresh
download and still skips in quick/network-failing environments; reference
download_url, model_path, TestDownloadUrl.test_download_url, and
skip_if_downloading_fails when making the change.
---
Outside diff comments:
In `@tests/test_utils.py`:
- Around line 78-83: The DOWNLOAD_FAIL_MSGS list used by
skip_if_downloading_fails() is incorrectly treating hash-integrity failures as
transient download failures; remove the "hash check" entry from
DOWNLOAD_FAIL_MSGS so hash mismatches raise test failures rather than being
converted to SkipTest, and make the same change to the other identical
failure-message tuple used later in the file (the second
DOWNLOAD_FAIL_MSGS-style list referenced by skip_if_downloading_fails()). Ensure
skip_if_downloading_fails() continues to match only genuine
download/network/gdown/limit messages and not integrity/hash errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b5d0b2c-e70c-4e3d-bd08-86cee7116a1d
📒 Files selected for processing (2)
tests/apps/test_download_and_extract.pytests/test_utils.py
| class TestDownloadUrl(unittest.TestCase): | ||
| """Exercise ``download_url`` success and hash-mismatch paths.""" | ||
|
|
||
| def test_download_url(self): | ||
| """ | ||
| Download a sample TIFF and validate hash handling. | ||
| """ | ||
| with tempfile.TemporaryDirectory() as tempdir: | ||
| model_path = os.path.join(tempdir, "model.tiff") | ||
|
|
||
| with skip_if_downloading_fails(): | ||
| download_url( | ||
| url=SAMPLE_TIFF, filepath=model_path, hash_val=SAMPLE_TIFF_HASH, hash_type=SAMPLE_TIFF_HASH_TYPE | ||
| ) | ||
|
|
||
| # Verify file exists before testing hash mismatch | ||
| if not os.path.exists(model_path): | ||
| self.skipTest("File was not downloaded successfully") | ||
|
|
||
| with self.assertRaises(ValueError): | ||
| download_url(url=SAMPLE_TIFF, filepath=model_path, hash_val="0" * 64, hash_type=SAMPLE_TIFF_HASH_TYPE) |
There was a problem hiding this comment.
Exercise the downloaded-file mismatch path and skip in quick mode.
Line 217 reuses model_path, so download_url() validates the existing file and never reaches the new temp-download hash check. This also adds a network test to quick discovery.
Proposed fix
+@unittest.skipIf(os.environ.get(quick_test_var, "").lower() == "true", "Skipping slow tests")
class TestDownloadUrl(unittest.TestCase):
"""Exercise ``download_url`` success and hash-mismatch paths."""
@@
- with self.assertRaises(ValueError):
- download_url(url=SAMPLE_TIFF, filepath=model_path, hash_val="0" * 64, hash_type=SAMPLE_TIFF_HASH_TYPE)
+ bad_model_path = os.path.join(tempdir, "model_bad.tiff")
+ with self.assertRaisesRegex(ValueError, "hash check"):
+ download_url(
+ url=Path(model_path).resolve().as_uri(),
+ filepath=bad_model_path,
+ hash_val="0" * 64,
+ hash_type=SAMPLE_TIFF_HASH_TYPE,
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_utils.py` around lines 197 - 217, The test reuses model_path so
download_url sees the existing file and skips re-downloading, preventing the
hash-mismatch path from being exercised; update
TestDownloadUrl.test_download_url to either delete model_path
(os.remove(model_path)) before calling download_url for the mismatch case or use
a new filepath (e.g., model_path_mismatch) and wrap that mismatch download call
with skip_if_downloading_fails() so the test forces a fresh download and still
skips in quick/network-failing environments; reference download_url, model_path,
TestDownloadUrl.test_download_url, and skip_if_downloading_fails when making the
change.
Fixes #8832 .
Description
Adds a check on downloaded files to verify their hash against the expected value and raises a
ValueErrorif there is a mismatch, indicating possible corruption or tampering.Types of changes
./runtests.sh -f -u --net --coverage../runtests.sh --quick --unittests --disttests.