Updates to retraction status checker #370
Merged
Commits
20 commits
0c8ebca initial_commits (geemi725)
bd24406 first commit - retraction scripts (geemi725)
9381296 Merge branch 'september-2024-release' of https://github.com/Future-Ho… (geemi725)
26cdad5 removed: pandas depdencency, retractions.csv (geemi725)
f3ee3d1 Merge branch 'september-2024-release' of https://github.com/Future-Ho… (geemi725)
e7b1441 not recording anymore (geemi725)
d9edda0 relative import ..types -> paperqa.types (geemi725)
71a1623 remove RetrationDataPostProcessor as a default client (geemi725)
5fb3b71 test commit: remove RetrationDataPostProcessor from ALL_CLIENTS (geemi725)
ccb984f Added: formatted citation, tenancity retry (geemi725)
3b38f25 removed: tqdm, added: citation deets to formatted_citation (geemi725)
f6199d3 Merge branch 'september-2024-release' of https://github.com/Future-Ho… (geemi725)
84b22d6 Check if citation is none, download method moved to crossref.py (geemi725)
4612797 crossref mailto made modular (geemi725)
4024778 moved all to one gitinore file (geemi725)
e3613aa Merge branch 'main' of https://github.com/Future-House/paper-qa into … (geemi725)
a6925dd fix: W0621 (geemi725)
06cfc99 Update paperqa/clients/crossref.py (geemi725)
8eab028 crossref_mailto() -> get_crossref_mailto() (geemi725)
2b6f7ca Merge branch 'main' into issue-366 (whitead)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
|
||
# ignore downloaded files retractions.csv | ||
retractions.csv |
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -11,6 +11,8 @@ | |||||||||
from urllib.parse import quote | ||||||||||
|
||||||||||
import aiohttp | ||||||||||
from anyio import open_file | ||||||||||
from tenacity import retry, stop_after_attempt, wait_exponential | ||||||||||
|
||||||||||
from paperqa.types import CITATION_FALLBACK_DATA, DocDetails | ||||||||||
from paperqa.utils import ( | ||||||||||
|
@@ -104,6 +106,20 @@ def crossref_headers() -> dict[str, str]: | |||||||||
return {} | ||||||||||
|
||||||||||
|
||||||||||
def crossref_mailto() -> str: | ||||||||||
"""Crossref mailto if available, otherwise a default.""" | ||||||||||
|
||||||||||
crossref_mailto = os.getenv("CROSSREF_MAILTO") | ||||||||||
|
||||||||||
if not crossref_mailto: | ||||||||||
logger.warning( | ||||||||||
"CROSSREF_MAILTO environment variable not set. Crossref API rate limits may" | ||||||||||
" apply." | ||||||||||
) | ||||||||||
return "[email protected]" | ||||||||||
|
||||||||||
return crossref_mailto | ||||||||||
|
||||||||||
|
||||||||||
async def doi_to_bibtex( | ||||||||||
doi: str, | ||||||||||
session: aiohttp.ClientSession, | ||||||||||
|
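The hunk above moves the `CROSSREF_MAILTO` lookup into one helper so that every caller gets the same fallback and warning. A minimal, self-contained sketch of that pattern follows. The name `resolve_mailto`, the injectable `env` parameter, and the placeholder address are assumptions made for illustration and are not part of the PR.

```python
from __future__ import annotations

import logging
import os
from collections.abc import Mapping

logger = logging.getLogger(__name__)

# Hypothetical placeholder; the default address used in the PR is not reproduced here.
PLACEHOLDER_MAILTO = "anonymous@example.org"


def resolve_mailto(env: Mapping[str, str] | None = None) -> str:
    """Return CROSSREF_MAILTO from `env`, or a placeholder after logging a warning."""
    # Passing `env` explicitly (an assumption of this sketch) keeps the helper testable.
    env = os.environ if env is None else env
    mailto = env.get("CROSSREF_MAILTO")
    if not mailto:
        logger.warning(
            "CROSSREF_MAILTO environment variable not set. Crossref API rate limits may"
            " apply."
        )
        return PLACEHOLDER_MAILTO
    return mailto
```

An unset or empty variable falls back to the placeholder, so call sites never have to repeat the check.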
@@ -251,12 +267,7 @@ async def get_doc_details_from_crossref( # noqa: PLR0912 | |||||||||
|
||||||||||
inputs_msg = f"DOI {doi}" if doi is not None else f"title {title}" | ||||||||||
|
||||||||||
if not (CROSSREF_MAILTO := os.getenv("CROSSREF_MAILTO")): | ||||||||||
logger.warning( | ||||||||||
"CROSSREF_MAILTO environment variable not set. Crossref API rate limits may" | ||||||||||
" apply." | ||||||||||
) | ||||||||||
CROSSREF_MAILTO = "[email protected]" | ||||||||||
CROSSREF_MAILTO = crossref_mailto() | ||||||||||
quoted_doi = f"/{quote(doi, safe='')}" if doi else "" | ||||||||||
url = f"{CROSSREF_BASE_URL}/works{quoted_doi}" | ||||||||||
params = {"mailto": CROSSREF_MAILTO} | ||||||||||
|
@@ -335,6 +346,46 @@ async def get_doc_details_from_crossref( # noqa: PLR0912 | |||||||||
return await parse_crossref_to_doc_details(message, session, query_bibtex) | ||||||||||
|
||||||||||
|
||||||||||
@retry( | ||||||||||
stop=stop_after_attempt(3), | ||||||||||
wait=wait_exponential(multiplier=5, min=5), | ||||||||||
reraise=True, | ||||||||||
) | ||||||||||
async def download_retracted_dataset( | ||||||||||
retraction_data_path: os.PathLike | str, | ||||||||||
) -> None: | ||||||||||
""" | ||||||||||
Download the retraction dataset from Crossref. | ||||||||||
|
||||||||||
Saves the retraction dataset to `retraction_data_path`. | ||||||||||
""" | ||||||||||
CROSSREF_MAILTO = crossref_mailto() | ||||||||||
url = f"https://api.labs.crossref.org/data/retractionwatch?{CROSSREF_MAILTO}" | ||||||||||
|
||||||||||
|
||||||||||
async with ( | ||||||||||
aiohttp.ClientSession() as session, | ||||||||||
session.get( | ||||||||||
url, | ||||||||||
timeout=aiohttp.ClientTimeout(total=300), | ||||||||||
) as response, | ||||||||||
): | ||||||||||
response.raise_for_status() | ||||||||||
|
||||||||||
logger.info( | ||||||||||
f"Retraction data was not cashed. Downloading retraction data from {url}..." | ||||||||||
) | ||||||||||
|
||||||||||
async with await open_file(str(retraction_data_path), "wb") as f: | ||||||||||
while True: | ||||||||||
chunk = await response.content.read(1024) | ||||||||||
if not chunk: | ||||||||||
break | ||||||||||
await f.write(chunk) | ||||||||||
|
||||||||||
if os.path.getsize(str(retraction_data_path)) == 0: | ||||||||||
raise RuntimeError("Retraction data is empty") | ||||||||||
|
||||||||||
|
||||||||||
class CrossrefProvider(DOIOrTitleBasedProvider): | ||||||||||
async def _query(self, query: TitleAuthorQuery | DOIQuery) -> DocDetails | None: | ||||||||||
if isinstance(query, DOIQuery): | ||||||||||
|
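The `@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=5, min=5), reraise=True)` decorator on `download_retracted_dataset` retries the download a bounded number of times with growing waits and then surfaces the last error. The sketch below approximates that behavior using only the standard library. It is not tenacity, the exact wait schedule of `wait_exponential` may differ, and the names `retry_with_backoff`, `base_delay`, and `sleep` are hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await `func` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                # Comparable to reraise=True: the last error propagates unchanged.
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

Injecting `sleep` is a design choice of this sketch: it lets a test substitute a no-op coroutine instead of actually waiting.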
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,14 +5,12 @@ | |
import logging | ||
import os | ||
|
||
import aiohttp | ||
from anyio import open_file | ||
from pydantic import ValidationError | ||
from tenacity import retry, stop_after_attempt, wait_exponential | ||
|
||
from paperqa.types import DocDetails | ||
|
||
from .client_models import DOIQuery, MetadataPostProcessor | ||
from .crossref import download_retracted_dataset | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
@@ -52,40 +50,6 @@ def _has_cache_expired(self) -> bool: | |
def _is_csv_cached(self) -> bool: | ||
return os.path.exists(self.retraction_data_path) | ||
|
||
@retry( | ||
stop=stop_after_attempt(3), | ||
wait=wait_exponential(multiplier=5, min=5), | ||
reraise=True, | ||
) | ||
async def _download_retracted_dataset(self) -> None: | ||
|
||
if not (CROSSREF_MAILTO := os.getenv("CROSSREF_MAILTO")): | ||
CROSSREF_MAILTO = "[email protected]" | ||
url = f"https://api.labs.crossref.org/data/retractionwatch?{CROSSREF_MAILTO}" | ||
|
||
async with ( | ||
aiohttp.ClientSession() as session, | ||
session.get( | ||
url, | ||
timeout=aiohttp.ClientTimeout(total=300), | ||
) as response, | ||
): | ||
response.raise_for_status() | ||
|
||
logger.info( | ||
f"Retraction data was not cashed. Downloading retraction data from {url}..." | ||
) | ||
|
||
async with await open_file(self.retraction_data_path, "wb") as f: | ||
while True: | ||
chunk = await response.content.read(1024) | ||
if not chunk: | ||
break | ||
await f.write(chunk) | ||
|
||
if os.path.getsize(self.retraction_data_path) == 0: | ||
raise RuntimeError("Retraction data is empty") | ||
|
||
def _filter_dois(self) -> None: | ||
with open(self.retraction_data_path, newline="", encoding="utf-8") as csvfile: | ||
reader = csv.DictReader(csvfile) | ||
|
@@ -96,7 +60,7 @@ def _filter_dois(self) -> None: | |
|
||
async def load_data(self) -> None: | ||
if not self._is_csv_cached() or self._has_cache_expired(): | ||
await self._download_retracted_dataset() | ||
await download_retracted_dataset(self.retraction_data_path) | ||
|
||
self._filter_dois() | ||
|
||
|
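`load_data` downloads the dataset only when the CSV is missing or the cache has expired. The body of `_has_cache_expired` is not shown in this diff, so the sketch below is one plausible way to make that decision, based on the file's modification time. The function name `needs_download` and the parameters `max_age_seconds` and `now` are assumptions for illustration.

```python
from __future__ import annotations

import os
import time


def needs_download(
    path: str | os.PathLike,
    max_age_seconds: float,
    now: float | None = None,
) -> bool:
    """Return True if `path` is missing or older than `max_age_seconds`."""
    if not os.path.exists(path):
        # Nothing cached yet, mirroring the `not self._is_csv_cached()` branch.
        return True
    # `now` is injectable (an assumption of this sketch) so the check can be tested.
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_seconds
```

A caller could guard the download with `if needs_download(path, max_age_seconds): ...` and then proceed to filter the DOIs either way, which matches the order of operations in `load_data`.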