Fixed Issues raised upon reviews for integrating the arxiv fetching functionality in arxiv_fetch.py #189

@Goziee-git

Description

#188

Fix arxiv_fetch.py: Replace urllib with requests, remove unnecessary plan index, fix license extraction logic

Overview

This issue addresses the review feedback on the PR that integrates the arxiv_fetch.py script. The changes improve reliability, consistency, and code quality by aligning the script with the project patterns established in gcs_fetch.py.

Changes Made

1. Replace urllib.request with requests library

Issue: Script used urllib.request instead of the preferred requests library.

Fix:

• Added requests, HTTPAdapter, and Retry imports
• Created get_requests_session() function with exponential backoff retry strategy
• Replaced urllib.request.urlopen() with session.get() calls
• Added proper timeout handling (30 seconds)
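
A minimal sketch of what such a session factory could look like. The function name comes from this issue, and the retry values are the ones listed under change 4. The body is an assumption and not necessarily the code that was merged.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def get_requests_session():
    """Return a requests session that retries transient HTTP failures."""
    # Values taken from change 4 of this issue: 5 retries, backoff factor
    # of 3, and the listed retryable status codes.
    retry_strategy = Retry(
        total=5,
        backoff_factor=3,
        status_forcelist=[408, 429, 500, 502, 503, 504],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry_strategy)
    # Mount the adapter for both schemes so every request uses the retry
    # strategy.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A call would then pass the timeout explicitly, for example `session.get(url, params=params, timeout=30)`, because requests applies no timeout by default.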

2. Remove unnecessary plan index system

Issue: The script used a plan index system similar to the GCS fetch script, but the arXiv API has no quotas that would require one. The relevant characteristics of the arXiv API are:
• No authentication required
• No daily quotas
• Only rate limiting: 3-second delay recommended
• Max 30,000 results per call, 2,000 per slice
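
Because the only constraints are the slice size and the overall cap, paging can be driven by plain arithmetic with no plan index. The sketch below is illustrative: `iter_slices` is a hypothetical helper, and the pairs it yields are meant to feed the arXiv API's `start` and `max_results` query parameters.

```python
def iter_slices(total_results, slice_size=2000, max_results=30000):
    """Yield (start, size) pairs covering the results in bounded slices."""
    # Never request more than the overall cap, even if more results exist.
    limit = min(total_results, max_results)
    for start in range(0, limit, slice_size):
        # The last slice may be shorter than slice_size.
        yield start, min(slice_size, limit - start)


for start, size in iter_slices(4500):
    print({"start": start, "max_results": size})
```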

Fix:

• Removed PLAN_INDEX from all CSV headers:
  • HEADER_COUNT = ["TOOL_IDENTIFIER", "COUNT"]
  • HEADER_CATEGORY = ["TOOL_IDENTIFIER", "CATEGORY", "COUNT"]
  • HEADER_YEAR = ["TOOL_IDENTIFIER", "YEAR", "COUNT"]
  • HEADER_AUTHOR = ["TOOL_IDENTIFIER", "AUTHOR_COUNT", "COUNT"]
• Simplified save_count_data() function to remove plan index tracking
• Updated all CSV writing operations
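
As an illustration of the simplified data model, the sketch below writes count rows with the two-column header. `HEADER_COUNT` is taken from this issue. `write_count_rows` and the sample counts are hypothetical and do not reproduce the actual save_count_data() implementation.

```python
import csv
import io

HEADER_COUNT = ["TOOL_IDENTIFIER", "COUNT"]


def write_count_rows(file_obj, license_counts):
    """Write one row per license identifier, with no plan index column."""
    writer = csv.DictWriter(
        file_obj, fieldnames=HEADER_COUNT, lineterminator="\n"
    )
    writer.writeheader()
    # Sort for a stable, diff-friendly row order.
    for identifier, count in sorted(license_counts.items()):
        writer.writerow({"TOOL_IDENTIFIER": identifier, "COUNT": count})


# Sample data invented for the example.
buffer = io.StringIO()
write_count_rows(buffer, {"CC BY": 12, "CC BY-SA": 3})
print(buffer.getvalue())
```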

3. Fix license extraction logic inconsistency

Issue: extract_license_info() function converted text to lowercase then assigned uppercase values, creating logic conflicts.

Fix:

• Start with license_info = entry.rights.upper()
• Use uppercase comparisons throughout: "CC BY" instead of "cc by"
• Applied same logic to both rights and summary field processing
• Proper fallback to "Unknown" when no matches found
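
The uppercase-first approach can be sketched as follows. The function name and the `rights` and `summary` fields come from this issue. The pattern list and its ordering are assumptions made for illustration, not the script's actual matching rules.

```python
from types import SimpleNamespace

# Hypothetical pattern list. More specific identifiers come first so that
# "CC BY-NC-SA" is not reported as plain "CC BY".
LICENSE_PATTERNS = [
    "CC BY-NC-SA",
    "CC BY-NC-ND",
    "CC BY-NC",
    "CC BY-SA",
    "CC BY-ND",
    "CC BY",
    "CC0",
]


def extract_license_info(entry):
    """Return the first license identifier found in rights, then summary."""
    for field in ("rights", "summary"):
        # Normalize once to uppercase, then compare uppercase to uppercase.
        text = (getattr(entry, field, "") or "").upper()
        for pattern in LICENSE_PATTERNS:
            if pattern in text:
                return pattern
    return "Unknown"


entry = SimpleNamespace(rights="Licensed under cc by-sa 4.0", summary="")
print(extract_license_info(entry))
```

Normalizing the input once keeps the comparison values and the returned values in the same case, which removes the lowercase/uppercase mismatch described above.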

4. Improve error handling and rate limiting

Issue: Complex manual retry loops and inconsistent error handling.

Fix:

• Added retry strategy with 5 retries, 3-second backoff factor
• Status codes for retry: [408, 429, 500, 502, 503, 504]
• Simplified to use arXiv's recommended 3-second delay between calls
• Better exception handling with requests.RequestException
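
A rough sketch of how the fixed delay and the exception handling could fit together once retries are delegated to the session. `fetch_pages`, its parameters, and the constant names are hypothetical. The 3-second delay and 30-second timeout are the values stated in this issue.

```python
import time

import requests

ARXIV_DELAY_SECONDS = 3  # delay between calls, as recommended by arXiv
REQUEST_TIMEOUT_SECONDS = 30


def fetch_pages(session, url, param_sets, sleep=time.sleep):
    """Fetch each parameter set in turn and return the response bodies."""
    bodies = []
    for index, params in enumerate(param_sets):
        if index > 0:
            # Pause between consecutive calls, but not before the first.
            sleep(ARXIV_DELAY_SECONDS)
        try:
            response = session.get(
                url, params=params, timeout=REQUEST_TIMEOUT_SECONDS
            )
            response.raise_for_status()
        except requests.RequestException as error:
            # Reached only after the session's own retries are exhausted.
            print(f"Request failed for {params}: {error}")
            break
        bodies.append(response.text)
    return bodies
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the delay be replaced in tests without waiting.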

Code Quality Improvements

• Removed complex manual retry loops in favor of the retry mechanism that requests provides through HTTPAdapter and Retry
• Cleaner, more maintainable code structure
• Consistent with gcs_fetch.py patterns

Testing

• [x] Python syntax validation and static analysis checks all pass
• [x] Script structure follows project patterns
• [x] CSV headers align with simplified data model

Implementation

  • This feature has been implemented with love
