Discussion:
[Xmldatadumps-l] Missing pages in enwiki pages-articles-multistream dumps
Ryan Hitchman
2018-02-27 08:45:17 UTC
Multiple pages are missing from the enwiki pages-articles-multistream dumps
from 20180201 and 20180220.

Page id 88444, "Phosphor", doesn't appear in the index or in the data
stream. This also happens for TARDIS, Psalm 132, and many others.

Why would the dump be partial?
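
A minimal sketch of how such a check against the multistream index could be done, assuming the index is the usual bz2-compressed text file with one `offset:page_id:title` entry per line. The helper names (`parse_index_line`, `find_missing`, `open_index`) and the file name in the usage note are illustrative, not part of the dump tooling.

```python
import bz2


def parse_index_line(line):
    """Split one index entry into (byte offset, page id, title).

    Titles may themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def find_missing(index_lines, wanted_titles):
    """Return the wanted titles that never appear in the index, sorted."""
    wanted = set(wanted_titles)
    seen = set()
    for line in index_lines:
        _, _, title = parse_index_line(line)
        if title in wanted:
            seen.add(title)
    return sorted(wanted - seen)


def open_index(path):
    """Open a bz2-compressed index file as an iterable of text lines."""
    return bz2.open(path, "rt", encoding="utf-8")
```

Usage, with an assumed local file name: `find_missing(open_index("enwiki-20180220-pages-articles-multistream-index.txt.bz2"), {"Phosphor", "TARDIS", "Psalm 132"})` would list whichever of those titles have no index entry.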
Ariel Glenn WMF
2018-02-27 12:10:04 UTC
Permalink
It turns out that this happens for exactly 27 pages, those at the end of
each enwiki-20180220-stub-articlesXX.xml.gz file. Tracking here:
https://phabricator.wikimedia.org/T188388

Ariel
_______________________________________________
Xmldatadumps-l mailing list
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Ryan Hitchman
2018-02-27 19:11:07 UTC
Thanks for the quick fix! I'll verify it too with the next run.

I discovered this while building a link graph directly from the
pages-articles dump, and finding that I had more broken links (missing
target articles) than expected.
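
A rough sketch of the kind of check described here, assuming the link graph is held as a mapping from source title to a list of target titles. The function name `broken_links` and the data layout are hypothetical and are not taken from the actual tool.

```python
def broken_links(titles, links):
    """Map each link target that is absent from `titles` to the pages linking to it.

    titles: iterable of page titles found in the dump
    links:  dict of source title -> list of target titles
    """
    known = set(titles)
    missing = {}
    for source, targets in links.items():
        for target in targets:
            if target not in known:
                missing.setdefault(target, []).append(source)
    return missing
```

An unexpectedly large result for well-known articles would point to pages missing from the dump rather than to genuine red links.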