User talk:Dalba

Latest revision as of 23:49, 14 May 2024

Kew POWO citations format

For Kew Plants of the World Online citations, can the format be changed from this:

<ref name="Plants of the World Online k345">{{cite web | title=Melocactus estevesii P.J.Braun | website=Plants of the World Online | url=https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:938363-1 | access-date=2024-04-29}}</ref>

to this:

<ref name="Plants of the World Online k345">{{BioRef|powo | title=''Melocactus estevesii'' P.J.Braun | id=938363-1 | access-date=2024-04-29}}</ref>

One of the users complained on my talk page about the cites. -Cs california (talk) 05:34, 4 May 2024 (UTC)

That's a great suggestion about italics in the title. Unfortunately, italicizing the scientific name within the title field is currently difficult. Plants of the World Online doesn't provide distinct metadata for the scientific name and author.

The BioRef template offers a cleaner format, but it's not widely adopted across Wikipedias.

Continuing with the 'cite web' template ensures compatibility with most other wikis.

And, honestly, the main issue for me right now is that maintaining additional code for alternative citation formats can be challenging. However, I'll certainly keep this feedback in mind for future development if resources allow. Dalba 15:02, 5 May 2024 (UTC)

Can we get the ref name shortened? There's no need for it to be that long. "POWO" would be sufficient, or "POWO k345" if a distinguisher is needed, though it is cryptic and therefore no better than ":2". - UtherSRG (talk) 10:23, 7 May 2024 (UTC)

Sure, but the algorithm needs to be general. Using the website's acronym does not work in general, since many citations don't have a site name. One should also try to choose unique ref names ... I'm going to change it once again to just a random string. (The last time I changed it was nearly 8 months ago; see [1] for the related discussion.) Dalba 16:20, 7 May 2024 (UTC)

HTTPError

My assumption is that you would rather hear about issues than not. The changes you made to present PDF citations in partial form have been a terrific help. I just need to add the title and the author. However, the following URL

https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf

produces: HTTPError

How popular is Citer? Do you keep track of how many uses per day it is getting? Best regards. Swood100 (talk) 19:28, 10 December 2023 (UTC)

I do, thank you. I wish I had more time to work on parsing PDF files; it might be possible to extract more information from them, but I'm concerned about the performance. Anyway, the problem with this particular URL is that it is behind some Cloudflare restriction mechanism. I'm not sure why, but I cannot download the file from the command line either:

 
$ wget https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
--2023-12-15 14:58:52--  https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
Resolving www.icj-cij.org (www.icj-cij.org)... 104.22.41.99, 172.67.26.159, 104.22.40.99, ...
Connecting to www.icj-cij.org (www.icj-cij.org)|104.22.41.99|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-12-15 14:58:52 ERROR 403: Forbidden.

Citer cannot access the URL over HTTP, hence the HTTPError. I guess the result could be improved by returning a partial cite web template instead, but it may take a while before I can get to it.
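
A fallback of the kind suggested above could be as small as the following sketch. It is an illustration only: `partialCiteWeb` is a hypothetical name, and the choice of fields (an empty `title`, plus `url` and `access-date`) is an assumption, not Citer's actual output.

```javascript
// Hypothetical helper, not part of Citer: build a partial {{cite web}}
// template from nothing but the URL, for cases where the page itself
// cannot be fetched (for example, a 403 response).
function partialCiteWeb(url, accessDate) {
  // The title is left empty on purpose so that the editor can fill it in.
  return `{{cite web | title= | url=${url} | access-date=${accessDate}}}`;
}
```

The editor would then complete the title and author by hand, as already happens with partial PDF citations.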

Regarding popularity, I really don't know, and I regularly clear the limited logs that Toolforge provides. But since you asked, I just looked, and in the past 6 hours around 324 requests have been processed. I'm not sure how many of them are unique, though; the logs are anonymized.

Dalba 15:23, 15 December 2023 (UTC)

HTTPStatusError

Hi again,

This link:

https://www.jpeds.com/article/S0022-3476(22)00185-8/fulltext

produces the above error, though supplying the DOI listed on that page works fine:

doi.org/10.1016/j.jpeds.2022.03.005

Best regards, Swood100 (talk) 15:20, 27 December 2023 (UTC)

Unfortunately the website has blocked toolforge's IP address. :( Dalba 09:53, 28 December 2023 (UTC)

Seems to be fixed using curl-impersonate. Dalba 07:27, 25 February 2024 (UTC)

ConnectError

Hi again, when I ran this URL I got the above message:

https://web.archive.org/web/20161105162350/https:/thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria/

However, when I switched at random to this different saved version it worked fine:

https://web.archive.org/web/20171106084816/http://thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria

I see what the problem is. In the first one the second https: is only followed by a single '/' instead of two. Looks like a screwball error from the page I got this URL from, because I got another URL from that page:

https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16

This one also has a single '/' after the http: but it results in a ref that retains the error in two locations:

<ref name="Arnold 2016 i520">{{cite web | last=Arnold | first=James | title=The Weekly Digest: 8-24-16 | website=web.archive.org | date=24 August 2016 | url=http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-url=https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-date=9 December 2016 | url-status=dead | access-date=29 December 2023}}</ref>

This results in a "{{cite web}}: Check |url= value (help)" red error message, a reference to this page, and a tooltip when I hover over the link:

Arnold, James (24 August 2016). "The Weekly Digest: 8-24-16". web.archive.org. Archived from [http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 the original] on 9 December 2016. Retrieved 29 December 2023. {{cite web}}: Check |url= value (help)

When I add another '/' to the http: in the "url" param in the produced ref the error goes away. I suppose it is asking too much for Citer to correct errors in the URLs it is supplied.

Swood100 (talk) 04:28, 29 December 2023 (UTC)

Hi there! For me, none of the URLs work. I believe this is another case of toolforge's IP address being blocked by a third party server. Unfortunately, there is not much I can do in these cases. There might be some workarounds, but it will take me a while to implement and test. Dalba 04:07, 31 December 2023 (UTC)
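
The single-slash problem described in this thread can be repaired before a URL is submitted. The sketch below is only an illustration; `repairSchemeSlashes` is a hypothetical helper and not something Citer is known to do. It restores the second slash after `http:` or `https:` wherever it is missing, including in the original URL embedded in a web.archive.org link.

```javascript
// Hypothetical helper, not part of Citer: restore a missing slash after
// "http:" or "https:" anywhere in the string.
function repairSchemeSlashes(url) {
  // Match "http:/" or "https:/" only when it is NOT followed by another "/",
  // so that well-formed "http://" and "https://" prefixes are left alone.
  return url.replace(/(https?:)\/(?!\/)/g, "$1//");
}
```

Applied to the third URL above, this leaves the outer `https://web.archive.org/...` part unchanged and turns the embedded `http:/adflegal.org/...` into `http://adflegal.org/...`.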

HTTPStatusError

Hi again,

This URL:

https://www.reuters.com/world/middle-east/iraq-pays-last-chunk-524-billion-gulf-war-reparations-un-2022-02-09/

results in the above error. Another website blocking toolforge's IP address? Why do they do that? Is it always rate-limiting? Best regards, Swood100 (talk) 20:25, 6 January 2024 (UTC)

Hi. Yes, reuters.com has blocked the IP address of Toolforge. It's completely blocked as far as I can tell; there is no rate limiting here. I can only guess, but I believe that after the recent OpenAI and New York Times confrontation, websites have become more stringent about who can access their content. Toolforge, being the host of several citation-generating tools, is sending more requests than usual, and therefore websites have started blocking its IP address. Dalba 08:17, 12 January 2024 (UTC)

This seems to be fixed now that citer is using curl-impersonate. Dalba 07:25, 25 February 2024 (UTC)

Allowing citer requests from en.wikipedia.org

Hi Dalba, I'm writing a citation script for myself on en.wikipedia.org and encountered a CORS error when trying to use citer.toolforge.org. Would it be possible to enable CORS by setting the "Access-Control-Allow-Origin" header appropriately on the citer web server? This page has more information. Your tool is awesome, by the way. Thanks. Daniel Quinlan (talk) 08:40, 6 February 2024 (UTC)

Hi there! Done. Just note that since I'm not maintaining a stable API yet, the response format might change in the future without any deprecation period. (I have had some thoughts about using Citoid response format, but it's unlikely I'll be able to implement it anytime soon.) Dalba 17:27, 6 February 2024 (UTC)

Thank you so much! One thing that might help scripts a bit would be adding a parameter to get a raw text response (if you have to choose, just the latter format). I haven't really used Citoid because it doesn't seem to extract enough information to make it worthwhile. Daniel Quinlan (talk) 13:43, 7 February 2024 (UTC)

Not sure how you are using it right now, but if you send a POST request instead of a GET request and send the user_input in the body of the request, then citer will return a JSON response, which I guess might be more easily digestible by scripts. Something like

await (await fetch('https://citer.toolforge.org/', {'method': 'POST', 'body': 'https://example.com/somepath.html' })).json()

should work. Dalba 07:39, 8 February 2024 (UTC)

I've barely started, but I was doing a GET request and parsing the document. JSON is so much better. For easier updates in the future, you might consider returning a JSON dictionary with named keys like "sfn", "cite", and "ref-name". Also, can the date format be included in the POST request? Thanks again. Daniel Quinlan (talk) 13:49, 8 February 2024 (UTC)

All parameters of a GET request also work on a POST request if they remain in the URL. The only difference between GET and POST is that the `user_input` value should be in the body, not in the URL. My previous example with a date_format parameter would become:

await (await fetch('https://citer.toolforge.org/?date_format=%Y-%m-%d', {'method': 'POST', 'body': 'https://citer.toolforge.org/' })).json()

You are right about returning a dictionary; it's more flexible and easier to understand. I will probably change it in the future. Dalba 14:06, 8 February 2024 (UTC)

Thanks! Daniel Quinlan (talk) 07:28, 9 February 2024 (UTC)

Citing via archive links

Hello again Dalba. I've been having some issues trying to use citer with archive.org links. It frequently returns a 500 code with "ConnectError" in the JSON almost immediately. archive.org can be exceptionally slow at retrieving archives: it often takes 15 to 30 seconds, and sometimes probably even more than that. It's also possible citer is just being rate limited by archive.org and my limited testing might be enough to drive it from bad to worse. Any ideas?

I've also tried using archive.today links like https://archive.today/N3fQ (they also use archive.is and archive.ph, and probably a few more aliases) and that always seems to result in a ReadTimeout error from citer. Would it be possible to support archive.today archive links?

By the way, I did reach out to archive.org to request that they enable CORS for *.wikipedia.org. If they do that, it's possible that clients could make the request to archive.org and then POST the archive link and the entire web page result to citer for data extraction. That might help if rate limits are the issue. Anyhow, I'll let you know if my request goes anywhere. Regards. Daniel Quinlan (talk) 07:13, 13 February 2024 (UTC)

The more I look at it, the more archive.today is starting to look like a good addition for dead links. They do comment out scripts including application/ld+json, but that's easy to work around. I'm not sure how aggressive the server is about blocking non-interactive clients, but the maintainer has been willing to whitelist IP addresses in the past. Daniel Quinlan (talk) 18:48, 13 February 2024 (UTC)

Hi!

archive.org: I currently cannot reproduce this. It's probably a rate limit. Citer is set to wait for 10 seconds before aborting the request, so if you are getting the response immediately, it is not a timeout; perhaps the server declined the request sooner, or there is some other issue. There might be some clues in the logs; I might need to dig into them. Let me know if they enable CORS for Wikipedia, and I'll implement a way to submit HTML content to citer.
archive.today: I would love to add support, but apparently the server does not reply to toolforge requests, no matter the timeout. Here is the verbose output of a curl call:

$ time curl -I https://archive.today/N3fQ --connect-timeout 300 -v
*   Trying 51.38.69.52...
* TCP_NODELAY set
* Connected to archive.today (51.38.69.52) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=archive.today
*  start date: Feb  4 02:20:57 2024 GMT
*  expire date: May  4 02:20:56 2024 GMT
*  subjectAltName: host "archive.today" matched cert's "archive.today"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x5645340fd110)
> HEAD /N3fQ HTTP/2
> Host: archive.today
> User-Agent: curl/7.64.0
> Accept: */*
>
* TLSv1.2 (IN), TLS alert, close notify (256):
* Empty reply from server
* Connection #0 to host archive.today left intact
curl: (52) Empty reply from server
real    1m0.404s
user    0m0.030s
sys     0m0.009s

Copying `User-Agent` and other headers from a browser did not help either. I suspect they have blacklisted Toolforge. Dalba 06:26, 22 February 2024 (UTC)

I suspect archive.today has done something to block non-interactive requests. It might be necessary to use something like Selenium. As an alternative, would it be possible for Citer to support submitting the web page content in a POST request along with the original link and the archive link (if the content is from an archive server)? That would help with sites blocking tools like curl and it might help with rate limits and timeouts too.

Also, archive.today responded positively to two of my requests: CORS requests now work and they also added back some <meta> tags as <old-meta>. The application/ld+json data is available as well (it's commented out, but easy to extract). Daniel Quinlan (talk) 07:00, 22 February 2024 (UTC)

They are using SSL handshake fingerprinting to detect non-browser requests. I was able to access the website using https://github.com/lwthiker/curl-impersonate . I might be able to embed that into citer, it just might take me some time.

The POST request idea is also possible and I do plan to implement it. Dalba 13:11, 22 February 2024 (UTC)

OK, archive.today URLs are now expected to work (not tested thoroughly though).

Also, you can now submit HTML using a POST request. To implement this, I had to change the POST submit format. Now all parameters should be submitted within the body of the request in JSON format. To submit HTML, "input_type" should be set to "html" and "user_input" should be an object containing two keys: {"html": "<HTML string of the page>", "url": "<URL>"}. Dalba 17:00, 23 February 2024 (UTC)
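
The JSON body described above can be assembled as in the following sketch. Only the `input_type` and `user_input` keys and the `html`/`url` object shape come from the comment above; the helper name `buildHtmlSubmission` and the commented-out request are illustrative assumptions, and the response format is not stable according to the earlier thread.

```javascript
// Hypothetical helper: serialize an already-fetched page into the JSON body
// format described in the comment above.
function buildHtmlSubmission(html, url) {
  return JSON.stringify({
    input_type: "html",
    user_input: { html: html, url: url },
  });
}

// Possible use from a script (untested sketch; whether any extra headers
// are required is not stated in the discussion):
// const response = await fetch("https://citer.toolforge.org/", {
//   method: "POST",
//   body: buildHtmlSubmission(pageHtml, pageUrl),
// });
// const result = await response.json();
```

Other parameters, such as a date format, would presumably go into the same JSON object, since all parameters are now expected in the body.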

RequestsError

Hi again,

I copied a DOI address from a web page. It was split over two lines which resulted in a space being placed in the middle:

https://doi.org/10.1371/%20journal.pgph.0000245

This resulted in Citer returning the message: "RequestsError". When I removed the '%20' from the string I got the right result. If it is true that a space is never appropriate in the middle of a DOI string, then stripping any such spaces before running the query might result in more satisfied and less confused users (or in the alternative, substitute the message, "You did not enter a valid DOI. Please check your source."). Swood100 (talk) 20:45, 15 March 2024 (UTC)

Hi. Thanks for the suggestion. I had to refer to the DOI Handbook to see whether a space is a valid character. According to section 3.2.1 GENERAL CHARACTERISTICS OF THE DOI SYNTAX: "The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode." Apparently, space is considered both a graphic character and a printable character. That being said, I have not seen any DOI containing the space character.

Currently citer does not consider the space a valid DOI character, but https://doi.org/10.1371/%20journal.pgph.0000245 is still a valid URL, so citer tries to connect to its server and fails with RequestsError because the server responds with a 404 error code.

It is possible to add a separate input type for DOIs. That way citer would not confuse a DOI for a URL. However, I believe a separate input type would be a little less convenient for users. For now I'm going to leave citer as it is, but I might reconsider if other users report similar issues. Dalba 08:33, 22 March 2024 (UTC)
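
Until something like that exists, a script calling Citer could apply the suggested clean-up itself. The sketch below is a heuristic under a stated assumption: it treats any whitespace in the input as a copy-and-paste accident, even though the DOI Handbook passage quoted above does not strictly forbid spaces. `stripDoiWhitespace` is a hypothetical name, not a Citer function.

```javascript
// Hypothetical client-side clean-up, not part of Citer: remove literal and
// percent-encoded spaces from a DOI or doi.org URL before submitting it.
function stripDoiWhitespace(input) {
  return input
    .replace(/%20/g, " ") // treat encoded spaces like literal ones
    .replace(/\s+/g, ""); // drop spaces, tabs, and line breaks
}
```

With the URL from this thread, the call returns `https://doi.org/10.1371/journal.pgph.0000245`, which is the form that worked.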

Twin ISSN generated by Citer in cite journal

In quite a few cases, Citer generates a twin ISSN in the form issn=<ISSN1>, <ISSN2> in the {{Cite journal}}. The magazines now routinely declare twin ISSNs, one for Internet, one for print. Is it possible to channel the second ISSN into eissn= ? Thank you in advance! Викидим (talk) 19:29, 23 April 2024 (UTC)

Could you provide an example input that has this issue? Dalba 05:14, 26 April 2024 (UTC)

For example, https://www.jstor.org/stable/1687467 produces "issn=00368075, 10959203" that does not work with cite templates. The first ISSN is print, the second - online. Викидим (talk) 18:19, 27 April 2024 (UTC)

Fixed. AFAICT, JSTOR does not provide any info about which ISSN is the electronic one, so I decided to ignore the second one and use the first as |issn=. Dalba 18:03, 2 May 2024 (UTC)
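
The approach described in this reply (keep the first value, drop the rest) can be sketched as follows. `firstIssn` is a hypothetical helper and not Citer's code. Inserting the hyphen is an extra assumption on top of what the reply says; it follows the conventional NNNN-NNNN way of writing an ISSN.

```javascript
// Hypothetical helper, not Citer's implementation: keep only the first ISSN
// from a comma-separated value such as "00368075, 10959203".
function firstIssn(raw) {
  const first = raw.split(",")[0].trim();
  // Assumption: add the conventional hyphen when the value is eight bare
  // characters (seven digits plus a digit or "X" check character).
  if (/^\d{7}[\dXx]$/.test(first)) {
    return first.slice(0, 4) + "-" + first.slice(4);
  }
  return first; // leave anything else untouched
}
```

For the JSTOR example above, this yields `0036-8075` and discards `10959203`.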

DOI 10.1109/5992.805138

With input 10.1109/5992.805138, the result is unexpected: the submit button stays grayed out, and I have to close the window to continue. There is no result either. While at it, this is a truly great tool! Thank you! Викидим (talk) 18:25, 27 April 2024 (UTC)

Thank you! Should be fixed now. Dalba 17:58, 2 May 2024 (UTC)

Ref names

Hi Dalba, thanks again for this amazing tool. I had a question about the ref names that are generated by the tool. I noticed that, until a week ago, the tool would include the author's last name and the publication date in the reference name, e.g.:

<ref name="Valenti 2024 n238">{{cite web | last=Valenti | first=John | title=60 years ago, the World's Fair showcased dazzling inventions and international cultures | website=Newsday | date=April 20, 2024 | url=https://www.newsday.com/news/new-york/worlds-fair-60th-anniversary-v7xgi3gr | access-date=May 12, 2024}}</ref>

Recently, however, it appears the last name and publication date are not included in the reference name at all, so the references come out like this:

<ref name="n238">{{cite web | last=Valenti | first=John | title=60 years ago, the World's Fair showcased dazzling inventions and international cultures | website=Newsday | date=April 20, 2024 | url=https://www.newsday.com/news/new-york/worlds-fair-60th-anniversary-v7xgi3gr | access-date=May 12, 2024}}</ref>

Is this an intentional change? I am not sure about other projects, but on English Wikipedia, Help:Footnotes says that the reference names "should have semantic value, so that they can be more easily distinguished from each other by human editors who are looking at the wikitext". I am concerned that the current reference names might not be doing that. Epicgenius (talk) 18:00, 12 May 2024 (UTC)

Hi! You're right, I did change it (again!) after another user complained that the generated names can sometimes be too long.[2] I'm aware of the guideline, but in practice, how much do you rely on the semantic meaning of the reference name? Personally, I don't find the reference name that important; using the browser's "find in page" function or a page preview works fine for me. That being said, I'm happy to revert the change (again!!) if you think the older method was better. I'm undecided on this one. Dalba 15:27, 14 May 2024 (UTC)

Thanks for the response. I mainly rely on the author's last name (or the name of the publication, if there's no author). I could see why someone may think "Plants of the World Online" is too long, but for that particular case, spelling out the whole name may also be useful to people who wouldn't know what "POWO" stands for.

I personally am not too bothered if you leave the names as is, since I primarily use Citer in conjunction with VisualEditor, which allows editors to reuse references without actually knowing the ref name. However, for those who use the wikitext editor, the reference names might be more helpful to them. Epicgenius (talk) 23:49, 14 May 2024 (UTC)
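
The two naming schemes compared in this thread can be sketched side by side. This is a reconstruction from the examples quoted on this page ("Valenti 2024 n238", "Plants of the World Online k345", "n238"), not Citer's actual algorithm. The function names, the `year` field, and the letter-plus-three-digits suffix are all assumptions.

```javascript
// Hypothetical: a short suffix shaped like "n238" (one letter, three digits).
// The random source is injectable so that the function can be tested.
function randomSuffix(random = Math.random) {
  const letter = String.fromCharCode(97 + Math.floor(random() * 26));
  const digits = String(Math.floor(random() * 1000)).padStart(3, "0");
  return letter + digits;
}

// Hypothetical: build either the older, semantic ref name or the newer,
// suffix-only one from the same citation data.
function buildRefName(citation, suffix, semantic = true) {
  if (!semantic) return suffix;
  // Fall back to the website name when there is no author, and skip any
  // part that is missing.
  const lead = citation.last || citation.website;
  return [lead, citation.year, suffix].filter(Boolean).join(" ");
}
```

Because both forms end in the same suffix, a tool could offer the semantic form as an option without giving up the uniqueness that the suffix is meant to provide.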