Discussion:
[Xmldatadumps-l] Collecting data on page revisions over time
SEAN CHRISTOPHER BUCHANAN
2017-08-29 15:39:21 UTC
Permalink
Hello,
My name is Sean Buchanan. I am a professor of Business Administration at the Asper School of Business at the University of Manitoba.
My colleagues and I are trying to collect a certain type of data from Wikipedia and would like some advice on the most efficient and user-friendly way of collecting it.

We are looking to collect data on the differences between revisions over
the lifetimes of three Wikipedia pages (see attached screenshot).
We haven't found a way to do that through the channels on the web page and
were wondering if you have any ideas on how such data could be collected?

We are interested in the revision history for the following pages:

1) Capitalism
2) Socialism
3) Communism
Thank you for your help! I look forward to hearing from you.
Sincerely,
Sean Buchanan
Caner GÜRÇAY
2017-08-31 13:10:10 UTC
Permalink
Looks good

Jérémie Roquet
2017-08-31 13:10:23 UTC
Permalink
Dear Sean,

2017-08-29 17:39 GMT+02:00 SEAN CHRISTOPHER BUCHANAN
Post by SEAN CHRISTOPHER BUCHANAN
We are looking to collect data on the differences between revisions over
the lifetimes of three Wikipedia pages (see attached screenshot).
We haven’t found a way to do that through the channels on the web page and
were wondering if you have any ideas on how such data could be collected?
If you are interested in past revisions, the simplest way I can think
of is through Special:Export:

1. go to https://en.wikipedia.org/wiki/Special:Export;
2. write Capitalism, Socialism and Communism in the textarea, each on
its own line (or repeat the whole process thrice with only one line
each time);
3. uncheck the “Include only the current revision, not the full
history” option to get the full history;
4. click “Export” and download the file.

You will get a large XML file containing every single revision of each article.
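
If you would rather script the download than use the form, a short
Python sketch along the following lines should work. It is untested
here: the parameter names (pages, history, action=submit) are the
documented Special:Export parameters, and note that very long
histories may be capped (historically around 1,000 revisions per
request), so a single request may not return everything.

import requests

# The three article titles come from the thread; output names are arbitrary.
PAGES = ("Capitalism", "Socialism", "Communism")

def export_history(title, outfile):
    """POST the Special:Export form and save the resulting XML dump."""
    resp = requests.post(
        "https://en.wikipedia.org/w/index.php",
        data={
            "title": "Special:Export",
            "pages": title,   # a newline-separated list of titles also works
            "history": "1",   # i.e. do NOT restrict to the current revision
            "action": "submit",
        },
        headers={"User-Agent": "revision-history-research-script"},
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    with open(outfile, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 16):
            f.write(chunk)

for page in PAGES:
    export_history(page, page + ".xml")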

In addition, if you are interested in getting new revisions as they
are made (i.e. in real time), you might want to have a look at
EventStreams¹, but it is somewhat less user-friendly (unless the user
is well versed in programming, that is).
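
For the record, here is a rough Python sketch of what consuming
EventStreams looks like. It relies on the third-party sseclient
package and the public recent-changes stream; the field names follow
the documented recentchange schema, but treat it as illustrative
rather than production-ready.

import json
from sseclient import SSEClient  # pip install sseclient

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
WATCHED = {"Capitalism", "Socialism", "Communism"}

# The stream carries edits for every Wikimedia wiki, so filter down to
# English Wikipedia edits on the three articles of interest.
for event in SSEClient(STREAM_URL):
    if event.event != "message" or not event.data:
        continue  # skip keep-alives and empty events
    change = json.loads(event.data)
    if change.get("wiki") == "enwiki" and change.get("title") in WATCHED:
        print(change["timestamp"], change["title"], change.get("user"))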

Best regards,

PS: something that could be incredibly useful for diving into article
histories would be to import them into git², as it would allow the user
to see diffs between revisions the way you see them online, to look for
when a given sentence was added or removed, etc. There are some very
user-friendly tools for presenting the histories to non-technical users
once the import is done; a rough sketch of the import follows after the
footnotes.

¹ https://wikitech.wikimedia.org/wiki/EventStreams
² https://git-scm.com/
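
To make the idea concrete, here is a hypothetical sketch of such an
import: it replays each <revision> of a Special:Export XML dump as a
git commit. The namespace version, file name, and committer identity
are assumptions to adjust for your dump; I have not run this against a
real one.

import os
import subprocess
import xml.etree.ElementTree as ET

# The export namespace version varies between dumps; check the root element.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def replay_into_git(xml_path, repo_dir):
    """Commit each <revision> of a Special:Export dump to a fresh git repo."""
    subprocess.run(["git", "init", repo_dir], check=True)
    article = os.path.join(repo_dir, "article.wiki")
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == NS + "revision":
            with open(article, "w", encoding="utf-8") as f:
                f.write(elem.findtext(NS + "text") or "")
            subprocess.run(["git", "-C", repo_dir, "add", "article.wiki"],
                           check=True)
            subprocess.run(
                ["git", "-C", repo_dir,
                 "-c", "user.name=importer",
                 "-c", "user.email=importer@example.invalid",
                 "commit", "-q", "--allow-empty",
                 "-m", elem.findtext(NS + "timestamp") or "revision"],
                check=True,
            )
            elem.clear()  # keep memory bounded on multi-gigabyte dumps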
--
Jérémie
Platonides
2017-08-31 18:08:04 UTC
Permalink
Post by Jérémie Roquet
Hi Platonides,
Post by Platonides
Post by Jérémie Roquet
PS: something that could be incredibly useful for diving into article
histories would be to import them into git², as it would allow the user
to see diffs between revisions the way you see them online, to look for
when a given sentence was added or removed, etc. There are some very
user-friendly tools for presenting the histories to non-technical users
once the import is done.
Not as much as you think. I did that once, but the results were worse
than expected. git (and other SCMs) diffing is line-based. In code you
have many relatively independent lines, and the diff is based on that,
whereas in Wikipedia articles each line is a full paragraph. Thus, as
soon as someone added a sentence (or a word), the full paragraph showed
as changed.
Good point, thanks!
Did you try with git's builtin diff UI, or with some other frontend? I
have never tried on Wikimedia dumps (I really should!) but I have to
diff XML files with horribly long lines on a regular basis — which is
something I naively believe to be very close to what diffing Wikimedia
dumps would look like — and diff-so-fancy and vimdiff do wonders with
that. Unfortunately, “user-friendly” GUIs like GitKraken, which I'd
have recommended to non-technical users, appear to handle diffs as
poorly as git's builtin UI.

Best regards,
--
Jérémie
I think I attempted to use git gui blame, and perhaps git bisect. I'm not
sure how I finally handled whatever I was looking for; it was a long time
ago.
You might be able to get better results with some preprocessing, though.
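
For example, a naive sketch of one such preprocessing step (untested
on real dumps): reflow the text to one sentence per line before each
commit, so git's line-based diff reports sentence-level changes rather
than whole paragraphs.

import re

def one_sentence_per_line(wikitext):
    """Naively split each paragraph so every sentence sits on its own line."""
    lines = []
    for paragraph in wikitext.split("\n"):
        # Split after sentence-ending punctuation followed by whitespace;
        # a real splitter needs to handle abbreviations, refs, etc.
        lines.extend(re.split(r"(?<=[.!?])\s+", paragraph))
    return "\n".join(lines)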

Cheers
