[Q] Legality of Scraping the NS forum

Bug reports, general help, ideas for improvements, and questions about how things are meant to work.
The Ice States
GA Secretariat
 
Posts: 2902
Founded: Jun 23, 2022
Compulsory Consumerist State

[Q] Legality of Scraping the NS forum

Postby The Ice States » Sun Nov 20, 2022 7:35 pm

Greetings!

Is it legal to run scripts against the forums to scrape a forum page without being logged in? I'm currently working on a script that sends automated requests to the NS forum while logged out, in order to save the HTML of forum threads for archival. I would appreciate confirmation on whether this is fully legal, as I have been advised that scraping the NS forum is frowned upon.

Thanks!
Factbooks · 46x World Assembly Author · Festering Snakepit Wiki · WACampaign · GA Stat Effects Data

Posts in the WA forums are Ooc and unofficial, absent indication otherwise.
Please check out my roleplay thread The Battle of Glass Tears!
WA 101 Guides to GA authorship, campaigning, and more.

United Calanworie
Technical Moderator
 
Posts: 3843
Founded: Dec 12, 2018
Democratic Socialists

Postby United Calanworie » Sun Nov 20, 2022 8:22 pm

Please note that this is not a ruling.

Most likely, yes. Our rules prohibit sending requests as a logged-in user automatically. So presuming that you will not be doing that, you would have to adhere to no more than ten requests per minute. Obviously though, it's heavily discouraged to scrape the forums. If you're going to persist in this idea, please consider not going anywhere close to the ratelimit.
Trans rights are human rights.
||||||||||||||||||||
Discord: Aav#7546 @queerlyfe
She/Her/Hers
My telegrams are not for Moderation enquiries, those belong in a GHR. Feel free to reach out if you want to just chat.

Valentine Z
Postmaster-General
 
Posts: 13041
Founded: Nov 08, 2015
Scandinavian Liberal Paradise

Postby Valentine Z » Sun Nov 20, 2022 9:24 pm

For clarity's sake, I have been doing this for a long while at this point, again while I'm not logged in. I didn't use scripts or anything of that sort, but I do have a URL opener (it's a browser extension) that lets me open multiple URLs at once (which slows down the computer a bit). Then I have another extension to download webpages as HTML.

In essence, open 500 tabs, then save them as HTML, both requiring my manual input.

Does the rate limit apply to this as well? As in, can I open all 500 tabs in one go, or should I open them in small batches? Thanks in advance!

EDIT: Here's the ongoing project of mine for reference: https://www.nationstates.net/page=dispatch/id=1543370
Last edited by Valentine Z on Sun Nov 20, 2022 10:52 pm, edited 3 times in total.
Val's Stuff. ♡ ^_^ ♡ For You
If you are reading my sig, I want you to have the best day ever ! You are worth it, do not let anyone get you down !
Glory to De Geweldige Sierlijke Katachtige Utopia en Zijne Autonome Machten ov Valentine Z !
(✿◠‿◠) ☆ \(^_^)/ ☆

Issues Thread Photography Stuff Project: Save F7. Stats Analysis

The Sixty! Valentian Stories! Gwen's Adventures!

• Never trouble trouble until trouble troubles you.
• World Map is a cat playing with Australia.
Let Fate sort it out.

Racoda
Technical Moderator
 
Posts: 579
Founded: Aug 12, 2014
Democratic Socialists

Postby Racoda » Sun Nov 20, 2022 10:09 pm

Valentine Z wrote:For clarity's sake, I have been doing this for a long while at this point, again while I'm not logged in. I didn't use scripts or anything of that sort, but I do have a URL opener (it's a browser extension) that lets me open multiple URLs at once (which slows down the computer a bit). Then I have another extension to download webpages as HTML.

In essence, open 500 tabs, then save them as HTML, both requiring my manual input.

Does the rate limit apply to this as well? As in, can I open all 500 tabs in one go, or should I open them in small batches? Thanks in advance!

EDIT: Here's the ongoing project of mine for reference: https://www.nationstates.net/page=dispatch/id=1543370

According to some earlier questions*, it is a tool and does fall under the rate limit:
viewtopic.php?p=32647607#p32647607

*I'm saying questions, because I'm sure there was another more recent thread on the matter, but I can't find it.

Acting as a player unless accompanied by mod action or reddish text
Any pronouns

Imperium Anglorum
GA Secretariat
 
Posts: 12664
Founded: Aug 26, 2013
Left-Leaning College State

Postby Imperium Anglorum » Sun Nov 20, 2022 10:44 pm

I tried running requests (the Python package) against the forum around September this year and got a pile of Cloudflare gobbledygook. Is that intentional, or can nothing be done about it?

Author: 1 SC and 56+ GA resolutions
Maintainer: GA Passed Resolutions
Developer: Communiqué and InfoEurope
GenSec (24 Dec 2021 –); posts not official unless so indicated
Delegate for Europe
Elsie Mortimer Wellesley
Ideological Bulwark 285, WALL delegate
Twice-commended toxic villainous globalist kittehs

Valentine Z
Postmaster-General
 
Posts: 13041
Founded: Nov 08, 2015
Scandinavian Liberal Paradise

Postby Valentine Z » Sun Nov 20, 2022 10:52 pm

Racoda wrote:
Valentine Z wrote:For clarity's sake, I have been doing this for a long while at this point, again while I'm not logged in. I didn't use scripts or anything of that sort, but I do have a URL opener (it's a browser extension) that lets me open multiple URLs at once (which slows down the computer a bit). Then I have another extension to download webpages as HTML.

In essence, open 500 tabs, then save them as HTML, both requiring my manual input.

Does the rate limit apply to this as well? As in, can I open all 500 tabs in one go, or should I open them in small batches? Thanks in advance!

EDIT: Here's the ongoing project of mine for reference: https://www.nationstates.net/page=dispatch/id=1543370

According to some earlier questions*, it is a tool and does fall under the rate limit:
viewtopic.php?p=32647607#p32647607

*I'm saying questions, because I'm sure there was another more recent thread on the matter, but I can't find it.

I forgot that I had asked that before in that thread. My bad, then!

For the subsequent archives I'm doing, I'll follow the rate limit. Slowing down will actually help me, in all honesty, because having a lot of tabs loading at the same time does cause a bit of a slowdown.
Last edited by Valentine Z on Sun Nov 20, 2022 10:54 pm, edited 1 time in total.

Roavin
Admin
 
Posts: 1778
Founded: Apr 07, 2016
Democratic Socialists

Postby Roavin » Mon Nov 21, 2022 12:58 am

The Ice States wrote:Is it legal to run scripts against the forums to scrape a forum page without being logged in? I'm currently working on a script that sends automated requests to the NS forum while logged out, in order to save the HTML of forum threads for archival. I would appreciate confirmation on whether this is fully legal, as I have been advised that scraping the NS forum is frowned upon.


United Calanworie already said the important bits. Beyond that, it might help if you explain what you're trying to achieve — "archival" is sufficiently vague that it can range anywhere from "I want to save a handful of RPs I've been involved in" (which is fine) to "I want to mirror the forum" (dear god, no, that's hundreds of gigabytes of data).
Helpful Resources: One Stop Rules Shop | API documentation | NS Coders Discord
About me: Longest serving Prime Minister in TSP | Former First Warden of TGW | aka Curious Observations

Feel free to TG me, but not about moderation matters.

Valentine Z
Postmaster-General
 
Posts: 13041
Founded: Nov 08, 2015
Scandinavian Liberal Paradise

Postby Valentine Z » Mon Nov 21, 2022 1:29 am

Roavin wrote:
The Ice States wrote:Is it legal to run scripts against the forums to scrape a forum page without being logged in? I'm currently working on a script that sends automated requests to the NS forum while logged out, in order to save the HTML of forum threads for archival. I would appreciate confirmation on whether this is fully legal, as I have been advised that scraping the NS forum is frowned upon.


United Calanworie already said the important bits. Beyond that, it might help if you explain what you're trying to achieve — "archival" is sufficiently vague that it can range anywhere from "I want to save a handful of RPs I've been involved in" (which is fine) to "I want to mirror the forum" (dear god, no, that's hundreds of gigabytes of data).

I can vouch for The Ice States, if I may, since we have talked about this a while back and did a collaboration of sorts.

For all intents and purposes, we are not trying to mirror the entire forum, or even all of a subsection (Forum 7); our aim is to archive noteworthy and effortful threads that have hit 500 pages and are locked, in order to save them as part of F7's history before they are wiped by the 7-day window. So this is an occasional archive because it does take a while to hit 500 pages (even popular threads take months).

I hope this helps!
Last edited by Valentine Z on Mon Nov 21, 2022 1:29 am, edited 1 time in total.

Site-
Political Columnist
 
Posts: 3
Founded: Dec 07, 2020
Inoffensive Centrist Democracy

Wrong nation logged in

Postby Site- » Mon Nov 21, 2022 4:29 am

Valentine Z wrote:For clarity's sake, I have been doing this for a long while at this point, again while I'm not logged in. I didn't use scripts or anything of that sort, but I do have a URL opener (it's a browser extension) that lets me open multiple URLs at once (which slows down the computer a bit). Then I have another extension to download webpages as HTML.

In essence, open 500 tabs, then save them as HTML, both requiring my manual input.

Does the rate limit apply to this as well? As in, can I open all 500 tabs in one go, or should I open them in small batches? Thanks in advance!

EDIT: Here's the ongoing project of mine for reference: https://www.nationstates.net/page=dispatch/id=1543370


I was previously told in private by an admin that using a multilink tool that opens multiple links with one action is illegal.
This was more than 2 years ago.
In my case the tool was Snap Links Plus.


Edit: Forgot to log in to my main, this was Vylixan posting :).
Last edited by Site- on Mon Nov 21, 2022 4:30 am, edited 1 time in total.

The Ice States
GA Secretariat
 
Posts: 2902
Founded: Jun 23, 2022
Compulsory Consumerist State

Postby The Ice States » Mon Nov 21, 2022 10:26 am

Valentine Z wrote:
Roavin wrote:
United Calanworie already said the important bits. Beyond that, it might help if you explain what you're trying to achieve — "archival" is sufficiently vague that it can range anywhere from "I want to save a handful of RPs I've been involved in" (which is fine) to "I want to mirror the forum" (dear god, no, that's hundreds of gigabytes of data).

I can vouch for The Ice States, if I may, since we have talked about this a while back and did a collaboration of sorts.

For all intents and purposes, we are not trying to mirror the entire forum, or even all of a subsection (Forum 7); our aim is to archive noteworthy and effortful threads that have hit 500 pages and are locked, in order to save them as part of F7's history before they are wiped by the 7-day window. So this is an occasional archive because it does take a while to hit 500 pages (even popular threads take months).

I hope this helps!

For the record, I can confirm this.

Roavin
Admin
 
Posts: 1778
Founded: Apr 07, 2016
Democratic Socialists

Postby Roavin » Mon Nov 21, 2022 11:35 am

If you stay below the rate limit, identify your script to the server via User Agent, and only do it occasionally to pull such threads, that should be fine.
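The conditions above (stay under the rate limit, identify the script via User Agent) could look roughly like the following in Python. This is only a minimal sketch, not an official or endorsed client: the User-Agent string, the 30-second delay, and the function names are all assumptions chosen for illustration, and the site's anti-bot measures may still block a script like this.

```python
import time
import urllib.request

# Hypothetical identification string: replace with your own script name and nation.
USER_AGENT = "ThreadArchiveScript/0.1 (run by nation: Example Nation)"

# Assumed pacing: one request every 30 seconds (2 per minute),
# well under the 10 requests per minute mentioned in this thread.
SECONDS_BETWEEN_REQUESTS = 30.0


def build_request(url):
    """Create a request that identifies the script through its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_pages(urls, delay=SECONDS_BETWEEN_REQUESTS,
                opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch each URL in order, sleeping `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        with opener(build_request(url), timeout=30) as response:
            pages[url] = response.read()
    return pages
```

Calling `fetch_pages` with a list of thread page URLs returns a dictionary mapping each URL to the raw bytes of the page, which can then be written to disk. The `opener` and `sleep` parameters exist only so the pacing can be checked without touching the network.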

United Calanworie
Technical Moderator
 
Posts: 3843
Founded: Dec 12, 2018
Democratic Socialists

Postby United Calanworie » Mon Nov 21, 2022 5:46 pm

My concern would be making sure that you spread out those 500 requests sensibly. I would say that you likely shouldn't do more than one or two requests per minute, as 10 requests a minute would result in nearly an hour of high scripting load for the forums.
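To put numbers on that estimate, here is a small Python calculation of how long a 500-page thread takes at different request rates. The function name is made up for this example; the rates are the ones mentioned in this thread.

```python
def minutes_needed(pages, requests_per_minute):
    """Total minutes needed to fetch `pages` pages at a steady request rate."""
    return pages / requests_per_minute


for rate in (10, 2, 1):
    minutes = minutes_needed(500, rate)
    print(f"{rate} per minute: {minutes:.0f} minutes ({minutes / 60:.1f} hours)")

# 10 per minute: 50 minutes (0.8 hours)
# 2 per minute: 250 minutes (4.2 hours)
# 1 per minute: 500 minutes (8.3 hours)
```

At one or two requests per minute the job takes several hours, but the load at any given moment stays low.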

Valentine Z
Postmaster-General
 
Posts: 13041
Founded: Nov 08, 2015
Scandinavian Liberal Paradise

Postby Valentine Z » Tue Nov 22, 2022 12:23 am

Hmm, okay. Dumb question, sorry about that, I just want to make sure that I am clear on this.

What about browser extensions? With those, there is no way to identify myself as the one using them. Here's the extension in question, into which I drop a bunch of links to open them on my computer: https://github.com/htrinter/Open-Multiple-URLs/

Thanks, and sorry to ask again!

EDIT: Or should I move on to using a script that has a user agent instead of using a browser extension? Either way should work for me.
Last edited by Valentine Z on Tue Nov 22, 2022 12:27 am, edited 2 times in total.

[violet]
Executive Director
 
Posts: 16207
Founded: Antiquity

Postby [violet] » Tue Nov 22, 2022 4:19 am

We ban all bot activity from the forums, but that's mostly just because it's being constantly attacked by them, and I have to deploy aggressive counter-measures to keep things working.

I have no problem with people very slowly scraping relatively small amounts of data for archive, but I can't explicitly support it (because our anti-bot stuff, which I don't want to disable, may block you), and I also don't want to have to deal with well-meaning authors who write bots that hit an unexpected corner case and then start smashing the site at maximum speed, which happens more often than you'd think.
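One way a script author might guard against the runaway case described above is to give the script a hard request budget and make it stop at the first error instead of retrying. The Python sketch below is only an illustration under those assumptions: `run_guarded`, `ScrapeAborted`, and the default numbers are hypothetical names and values, not anything the site provides or requires.

```python
import time


class ScrapeAborted(Exception):
    """Raised when the script stops itself instead of carrying on."""


def run_guarded(urls, fetch, max_requests=500, delay=30.0, sleep=time.sleep):
    """Fetch URLs one at a time with a fixed pause and a hard request budget.

    `fetch` is any callable that takes a URL and returns the page content.
    The loop never retries: the first failure ends the whole run.
    """
    results = {}
    for count, url in enumerate(urls, start=1):
        if count > max_requests:
            # Refuse to go past the budget, even if the URL list is longer.
            raise ScrapeAborted(f"request budget of {max_requests} exhausted")
        if count > 1:
            sleep(delay)  # the pause happens on every iteration after the first
        try:
            results[url] = fetch(url)
        except Exception as error:
            # Stop here rather than retrying, so a bug cannot turn into a tight loop.
            raise ScrapeAborted(f"stopping after failure on {url}") from error
    return results
```

Because the pause sits inside the loop and there is no retry path, an unexpected response or a bug in the URL list ends the run rather than speeding it up.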

