Discussion:
[Xmldatadumps-l] [Analytics] Missing categorylinks and pages in Wikipedia dumps
Tilman Bayer
2017-11-01 17:40:48 UTC
CCing the data dumps mailing list, which is the recommended venue for
questions like this
(https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help ).

On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <
Also, important categories like Computer Architecture, Human-based
computation, Programming language theory, Software Engineering, and Theory
of Computation are missing from the subcategories of Areas of Computer
Science.
*Regards,*
*Shubhanshu Mishra*
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
*Website:* http://shubhanshu.com
*LinkedIn Profile: *http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <
Hi,
When using the Wikipedia dump files, I am unable to find many categories
and pages in the dump.
E.g. under the Areas_of_computer_science category I get only 13
subcategories and 2 pages instead of 17 subcategories and 2 pages.
Furthermore, one page, "Computational_creativity", is not present as a
subcategory.
I am using the following wikipedia dump files to extract the
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
files, but I also tried searching in the sql.gz files and couldn't find any
entry for 16300571 in the page.sql.gz and in category.sql.gz
files. 16300571 supposedly refers to the Computational_creativity page as
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
More details can be found at:
https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from
the dumps?
_______________________________________________
Analytics mailing list
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Ariel Glenn WMF
2017-11-07 11:00:59 UTC
I checked the files directly, both the page.sql.gz and the
categorylinks.sql.gz files for 20170920. The page is listed:

$ zcat enwiki-20170920-page.sql.gz | sed -e 's/),/),\n/g;' | grep Computational_creativity | more
(16300571,0,'Computational_creativity','',0,0,0,0.718037721126,'20170903222622','20170903222623',798803037,59318,'wikitext',NULL),
(16390036,1,'Computational_creativity','',0,0,0,0.20741249006,'20170831064438','20170831084246',786288354,107057,'wikitext',NULL),

The first entry is the page, the second is the talk page.
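
For anyone who would rather do the same lookup from Python than with sed and grep, here is a minimal sketch. It assumes the dump consists of mysqldump-style "INSERT INTO ... VALUES (...),(...);" lines, and the helper names (open_dump, iter_rows, find_page) are made up for this example; they are not part of WikiUtils or any Wikimedia tool. Unlike splitting on "),", it tracks quoted strings, so a title containing "),", or an escaped quote should not break a row apart.

```python
import gzip


def open_dump(path):
    """Open a .sql.gz dump as text; undecodable bytes are replaced, not fatal."""
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")


def iter_rows(line):
    """Yield each (...) tuple of one INSERT line as a list of raw field strings.

    Quotes are stripped, NULL stays the string 'NULL', and a backslash simply
    keeps the next character literally (so \\' becomes '); sequences such as
    \\n are not decoded. This is a simplification, not a full SQL parser.
    """
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start < 0:
        return
    i, n = start + len(" VALUES "), len(line)
    fields, buf, in_str, in_row = [], [], False, False
    while i < n:
        ch = line[i]
        if in_str:
            if ch == "\\" and i + 1 < n:
                buf.append(line[i + 1])  # keep escaped char, drop the backslash
                i += 2
                continue
            if ch == "'":
                in_str = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = True
        elif ch == "(" and not in_row:
            in_row, fields, buf = True, [], []
        elif ch == "," and in_row:
            fields.append("".join(buf))
            buf = []
        elif ch == ")" and in_row:
            fields.append("".join(buf))
            yield fields
            in_row = False
        elif in_row:
            buf.append(ch)
        i += 1


def find_page(stream, page_id):
    """Return the first row whose first column equals page_id, else None."""
    wanted = str(page_id)
    for line in stream:
        for row in iter_rows(line):
            if row[0] == wanted:
                return row
    return None
```

Something like find_page(open_dump("enwiki-20170920-page.sql.gz"), 16300571) should then return the first of the two rows shown above, if the local copy of the file is intact.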

$ zcat enwiki-20170920-categorylinks.sql.gz | sed -e 's/),/),\n/g;' | grep 16300571 | cat -vte
(16300571,'All_NPOV_disputes','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-01-27 10:43:57','','uca-default-u-kn','page'),$
(16300571,'All_articles_needing_additional_references','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_unsourced_statements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-11-21 10:36:21','','uca-default-u-kn','page'),$
(16300571,'Areas_of_computer_science','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_needing_additional_references_from_May_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'Articles_with_French-language_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-06-20 04:05:59','','uca-default-u-kn','page'),$
(16300571,'Articles_with_dead_external_links_from_November_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_permanently_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_December_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2015-12-01 14:40:27','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_January_2010','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2010-01-09 05:50:15','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_October_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-10-10 21:27:12','','uca-default-u-kn','page'),$
(16300571,'Artificial_intelligence','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-03-19 03:45:58','','uca-default-u-kn','page'),$
(16300571,'Arts','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'CS1_maint:_Extra_text:_authors_list','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-06-04 08:45:09','','uca-default-u-kn','page'),$
(16300571,'Cognitive_psychology','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Computational_fields_of_study','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-10 15:53:12','','uca-default-u-kn','page'),$
(16300571,'Creativity_techniques','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'NPOV_disputes_from_January_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 15:48:55','','uca-default-u-kn','page'),$
(16300571,'Philosophical_movements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-07 20:24:38','','uca-default-u-kn','page'),$
(16300571,'Webarchive_template_wayback_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-27 20:04:18','','uca-default-u-kn','page'),$
(16300571,'Wikipedia_articles_needing_clarification_from_November_2008','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2009-02-13 10:49:28','','uca-default-u-kn','page'),$
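
The last column of each row above is cl_type, and it is 'page' for every link from 16300571, so Computational_creativity would be counted among the pages of a category rather than its subcategories (those carry 'subcat'). A rough sketch of tallying a category's members by type follows. It assumes rows have already been parsed into tuples in the seven-column order shown above; members_by_type is a name invented for this example, and the third sample row is made up purely to show a subcategory link.

```python
from collections import Counter

# Column positions, taken from the order of the categorylinks rows above.
CL_FROM, CL_TO, CL_TYPE = 0, 1, 6


def members_by_type(rows, category):
    """Count links into `category`, grouped by the cl_type column."""
    counts = Counter()
    for row in rows:
        if row[CL_TO] == category:
            counts[row[CL_TYPE]] += 1
    return counts


# Sample rows; the binary sortkey column is left empty for readability.
rows = [
    ("16300571", "Areas_of_computer_science", "", "2016-04-15 15:40:40",
     "", "uca-default-u-kn", "page"),
    ("16300571", "Artificial_intelligence", "", "2008-03-19 03:45:58",
     "", "uca-default-u-kn", "page"),
    # Made-up row (hypothetical id and timestamp) showing a subcategory link.
    ("1", "Areas_of_computer_science", "", "2000-01-01 00:00:00",
     "", "uca-default-u-kn", "subcat"),
]

counts = members_by_type(rows, "Areas_of_computer_science")
print(dict(counts))
```

Comparing the 'subcat' and 'page' counts from such a tally against the live category page is one way to tell whether rows are missing from the dump or are being dropped by the parsing step.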

That list of categorylinks entries matches your results.
Is it possible that your download of the page.sql.gz file is corrupted? Do
the md5 sums check out? Or perhaps it is an issue with the tools.
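
In case it helps with the md5 check, here is a small sketch using only Python's standard hashlib. It assumes the checksum list published alongside the dump has been saved locally and holds one "<md5> <filename>" pair per line; the function names are illustrative, and the checksum file name in the usage note is an assumption about how it is published.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte dumps fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(sums_text, filename):
    """Look up `filename` in checksum text with '<md5> <filename>' lines."""
    for line in sums_text.splitlines():
        parts = line.split()
        # md5sum marks binary mode with a leading '*'; tolerate it.
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0].lower()
    return None


def verify(path, sums_text, filename):
    """True only if the file is listed and its computed md5 matches."""
    want = expected_md5(sums_text, filename)
    return want is not None and md5_of_file(path) == want
```

Usage would be along the lines of verify("enwiki-20170920-page.sql.gz", open("enwiki-20170920-md5sums.txt").read(), "enwiki-20170920-page.sql.gz"). A False result would point at the download rather than at the dump itself.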

Ariel
_______________________________________________
Xmldatadumps-l mailing list
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l