git@vger.kernel.org mailing list mirror (one of many)
* [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments
@ 2012-06-12 20:14 Pavel Volek
  2012-06-12 20:14 ` [PATCHv3 2/2] git-remote-mediawiki: refactoring get_mw_pages function Pavel Volek
  2012-06-12 21:05 ` [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments Junio C Hamano
  0 siblings, 2 replies; 4+ messages in thread
From: Pavel Volek @ 2012-06-12 20:14 UTC
  To: git; +Cc: Pavel VOlek, NGUYEN Kim Thuat, ROUCHER IGLESIAS Javier,
	Matthieu Moy

From: Pavel VOlek <Pavel.Volek@ensimag.imag.fr>

The current version of git-remote-mediawiki supports only import and export
of pages; it does not support import and export of file attachments, which
are also exposed by the MediaWiki API. This patch adds the functionality to
import file attachments and the description pages for these files.

Changes version2 -> version3:
Fixes in comments.
Variable '$file' -> '$file_content' refactoring to be clearer.

Signed-off-by: Pavel Volek <Pavel.Volek@ensimag.imag.fr>
Signed-off-by: NGUYEN Kim Thuat <Kim-Thuat.Nguyen@ensimag.imag.fr>
Signed-off-by: ROUCHER IGLESIAS Javier <roucherj@ensimag.imag.fr>
Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
---
 contrib/mw-to-git/git-remote-mediawiki | 223 ++++++++++++++++++++++++++++++++-
 1 file changed, 218 insertions(+), 5 deletions(-)
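
Note (not part of the commit message): with this patch, media import is
opt-in. A minimal usage sketch, assuming a remote named "origin" that points
to a wiki, would be:

	git config remote.origin.mediaimport true
	git fetch origin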

diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
index c18bfa1..04d3959 100755
--- a/contrib/mw-to-git/git-remote-mediawiki
+++ b/contrib/mw-to-git/git-remote-mediawiki
@@ -13,9 +13,6 @@
 #
 # Known limitations:
 #
-# - Only wiki pages are managed, no support for [[File:...]]
-#   attachments.
-#
 # - Poor performance in the best case: it takes forever to check
 #   whether we're up-to-date (on fetch or push) or to fetch a few
 #   revisions from a large wiki, because we use exclusively a
@@ -71,6 +68,9 @@ chomp(@tracked_pages);
 my @tracked_categories = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".categories"));
 chomp(@tracked_categories);
 
+# Import media files too.
+my $import_media = run_git("config --get --bool remote.". $remotename .".mediaimport");
+
 my $wiki_login = run_git("config --get remote.". $remotename .".mwLogin");
 # TODO: ideally, this should be able to read from keyboard, but we're
 # inside a remote helper, so our stdin is connect to git, not to a
@@ -225,6 +225,10 @@ sub get_mw_pages {
 			get_mw_first_pages(\@slice, \%pages);
 			@some_pages = @some_pages[51..$#some_pages];
 		}
+
+		if ($import_media) {
+			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
+		}
 	}
 	if (@tracked_categories) {
 		$user_defined = 1;
@@ -244,6 +248,11 @@ sub get_mw_pages {
 			foreach my $page (@{$mw_pages}) {
 				$pages{$page->{title}} = $page;
 			}
+
+			if ($import_media) {
+				my @titles = map $_->{title}, @{$mw_pages};
+				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
+			}
 		}
 	}
 	if (!$user_defined) {
@@ -263,10 +272,186 @@ sub get_mw_pages {
 		foreach my $page (@{$mw_pages}) {
 			$pages{$page->{title}} = $page;
 		}
+
+		if ($import_media) {
+			# Attach list of all pages for media files from the API,
+			# they are in a different namespace, only one namespace
+			# can be queried at the same moment
+			my $mw_pages = $mediawiki->list({
+				action => 'query',
+				list => 'allpages',
+				apnamespace => get_mw_namespace_id("File"),
+				aplimit => 500
+			});
+			if (!defined($mw_pages)) {
+				print STDERR "fatal: could not get the list of pages for media files.\n";
+				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
+				exit 1;
+			}
+			foreach my $page (@{$mw_pages}) {
+				$pages{$page->{title}} = $page;
+			}
+		}
 	}
 	return values(%pages);
 }
 
+sub get_mw_pages_for_linked_mediafiles {
+	my $titles = shift;
+	my @titles = @{$titles};
+	my $pages = shift;
+
+	# pattern 'page1|page2|...' required by the API
+	my $mw_titles = join('|', @titles);
+
+	# Media files could be included or linked from
+	# a page, get all related
+	my $query = {
+		action => 'query',
+		prop => 'links|images',
+		titles => $mw_titles,
+		plnamespace => get_mw_namespace_id("File"),
+		pllimit => 500
+	};
+	my $result = $mediawiki->api($query);
+
+	while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
+		my @titles;
+		if (defined($page->{links})) {
+			my @link_titles = map $_->{title}, @{$page->{links}};
+			push(@titles, @link_titles);
+		}
+		if (defined($page->{images})) {
+			my @image_titles = map $_->{title}, @{$page->{images}};
+			push(@titles, @image_titles);
+		}
+		if (@titles) {
+			get_mw_first_pages(\@titles, \%{$pages});
+		}
+	}
+}
+
+# Return MediaWiki id for a canonical namespace name.
+# Ex.: "File", "Project".
+# Looks for the namespace id in the local configuration
+# variables, if it is not found asks MW API.
+sub get_mw_namespace_id {
+	mw_connect_maybe();
+
+	my $name = shift;
+
+	# Look at configuration file, if the record
+	# for that namespace is already stored.
+	# Namespaces are stored in form: "Name_of_namespace:Id_namespace",
+	# Ex.: "File:6".
+	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
+	chomp(@tracked_namespaces);
+	if (@tracked_namespaces) {
+		foreach my $ns (@tracked_namespaces) {
+			my @ns_split = split(/:/, $ns);
+			if ($ns_split[0] eq $name) {
+				return $ns_split[1];
+			}
+		}
+	}
+
+	# NS not found => get namespace id from MW and store it in
+	# configuration file.
+	my $query = {
+		action => 'query',
+		meta => 'siteinfo',
+		siprop => 'namespaces'
+	};
+	my $result = $mediawiki->api($query);
+
+	while (my ($id, $ns) = each(%{$result->{query}->{namespaces}})) {
+		if (defined($ns->{canonical}) && ($ns->{canonical} eq $name)) {
+			run_git("config --add remote.". $remotename .".namespaces ". $name .":". $ns->{id});
+			return $ns->{id};
+		}
+	}
+	die "Namespace $name was not found on MediaWiki.";
+}
+
+sub get_mw_mediafile_for_page_revision {
+	# Name of the file on Wiki, with the prefix.
+	my $mw_filename = shift;
+	my $timestamp = shift;
+	my %mediafile;
+
+	# Search MediaWiki for a media file revision with the given
+	# timestamp. If one exists, download the file.
+	my $query = {
+		action => 'query',
+		prop => 'imageinfo',
+		titles => $mw_filename,
+		iistart => $timestamp,
+		iiend => $timestamp,
+		iiprop => 'timestamp|archivename|url',
+		iilimit => 1
+	};
+	my $result = $mediawiki->api($query);
+
+	my ($fileid, $file) = each ( %{$result->{query}->{pages}} );
+	# If imageinfo is not defined, there is no revision of the file
+	# for the given timestamp.
+	if (defined($file->{imageinfo})) {
+		# Get real name of media file.
+		my $filename;
+		if (index($mw_filename, 'File:') == 0) {
+			$filename = substr $mw_filename, 5;
+		} else {
+			$filename = substr $mw_filename, 6;
+		}
+		$mediafile{title} = $filename;
+
+		my $fileinfo = pop(@{$file->{imageinfo}});
+		$mediafile{timestamp} = $fileinfo->{timestamp};
+		# If this is an old version of the file, the file has to be
+		# obtained from the archive. Otherwise it can be downloaded
+		# by MediaWiki API download() function.
+		if (defined($fileinfo->{archivename})) {
+			$mediafile{content} = download_mw_mediafile_from_archive($fileinfo->{url});
+		} else {
+			$mediafile{content} = download_mw_mediafile($mw_filename);
+		}
+	}
+	return %mediafile;
+}
+
+sub download_mw_mediafile_from_archive {
+	my $url = shift;
+	my $file;
+
+	my $ua = LWP::UserAgent->new;
+	my $response = $ua->get($url);
+	if ($response->is_success) {
+		$file = $response->decoded_content;
+	} else {
+		print STDERR "Error downloading a file from archive.\n";
+	}
+
+	return $file;
+}
+
+sub download_mw_mediafile {
+	my $filename = shift;
+
+	$mediawiki->{config}->{files_url} = $url;
+
+	my $file_content = $mediawiki->download( { title => $filename } );
+	if (!defined($file_content)) {
+		print STDERR "\tFile \'$filename\' could not be downloaded.\n";
+		exit 1;
+	} elsif ($file_content eq "") {
+		print STDERR "\tFile \'$filename\' does not exist on the wiki.\n";
+		exit 1;
+	} else {
+		return $file_content;
+	}
+}
+
 sub run_git {
 	open(my $git, "-|:encoding(UTF-8)", "git " . $_[0]);
 	my $res = do { local $/; <$git> };
@@ -466,6 +651,13 @@ sub import_file_revision {
 	my %commit = %{$commit};
 	my $full_import = shift;
 	my $n = shift;
+	my $mediafile_import = shift;
+	my $mediafile;
+	my %mediafile;
+	if ($mediafile_import) {
+		$mediafile = shift;
+		%mediafile = %{$mediafile};
+	}
 
 	my $title = $commit{title};
 	my $comment = $commit{comment};
@@ -485,6 +677,10 @@ sub import_file_revision {
 	if ($content ne DELETED_CONTENT) {
 		print STDOUT "M 644 inline $title.mw\n";
 		literal_data($content);
+		if ($mediafile_import) {
+			print STDOUT "M 644 inline $mediafile{title}\n";
+			literal_data($mediafile{content});
+		}
 		print STDOUT "\n\n";
 	} else {
 		print STDOUT "D $title.mw\n";
@@ -580,6 +776,7 @@ sub mw_import_ref {
 
 		$n++;
 
+		my $page_title = $result->{query}->{pages}->{$pagerevid->{pageid}}->{title};
 		my %commit;
 		$commit{author} = $rev->{user} || 'Anonymous';
 		$commit{comment} = $rev->{comment} || '*Empty MediaWiki Message*';
@@ -596,9 +793,25 @@ sub mw_import_ref {
 		}
 		$commit{date} = DateTime::Format::ISO8601->parse_datetime($last_timestamp);
 
-		print STDERR "$n/", scalar(@revisions), ": Revision #$pagerevid->{revid} of $commit{title}\n";
+		# Differentiates classic pages and media files.
+		my @prefix = split (":", $page_title);
 
-		import_file_revision(\%commit, ($fetch_from == 1), $n);
+		my %mediafile;
+		if ($prefix[0] eq "File" || $prefix[0] eq "Image") {
+			# The name of the file is the same as the media page.
+			my $filename = $page_title;
+			%mediafile = get_mw_mediafile_for_page_revision($filename, $rev->{timestamp});
+		}
+		# If this is a revision of the media page for a new version
+		# of a file, do one common commit for both the file and the
+		# media page. Otherwise, commit only the page.
+		print STDERR "$n/", scalar(@revisions), ": Revision #$pagerevid->{revid} of $commit{title}\n";
+		if (%mediafile) {
+			print STDERR "\tDownloading file $mediafile{title}, version $mediafile{timestamp}\n";
+			import_file_revision(\%commit, ($fetch_from == 1), $n, 1, \%mediafile);
+		} else {
+			import_file_revision(\%commit, ($fetch_from == 1), $n, 0);
+		}
 	}
 
 	if ($fetch_from == 1 && $n == 0) {
-- 
1.7.10.2.552.gaa3bb87

* [PATCHv3 2/2] git-remote-mediawiki: refactoring get_mw_pages function
  2012-06-12 20:14 [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments Pavel Volek
@ 2012-06-12 20:14 ` Pavel Volek
  2012-06-12 21:05 ` [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments Junio C Hamano
  1 sibling, 0 replies; 4+ messages in thread
From: Pavel Volek @ 2012-06-12 20:14 UTC
  To: git; +Cc: Pavel VOlek, NGUYEN Kim Thuat, ROUCHER IGLESIAS Javier,
	Matthieu Moy

From: Pavel VOlek <Pavel.Volek@ensimag.imag.fr>

Splits the code in the get_mw_pages function into three separate functions:
one that gets the list of all pages and all file attachments, one that gets
the pages in the categories specified in the configuration file (and the
files related to these pages), and one that gets from MediaWiki the list of
user-specified pages with their related file attachments.

Changes version2 -> version3:
Fixes in comments.

Signed-off-by: Pavel Volek <Pavel.Volek@ensimag.imag.fr>
Signed-off-by: NGUYEN Kim Thuat <Kim-Thuat.Nguyen@ensimag.imag.fr>
Signed-off-by: ROUCHER IGLESIAS Javier <roucherj@ensimag.imag.fr>
Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
---
 contrib/mw-to-git/git-remote-mediawiki | 140 ++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 63 deletions(-)

diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
index 04d3959..ace1868 100755
--- a/contrib/mw-to-git/git-remote-mediawiki
+++ b/contrib/mw-to-git/git-remote-mediawiki
@@ -212,89 +212,103 @@ sub get_mw_pages {
 	my $user_defined;
 	if (@tracked_pages) {
 		$user_defined = 1;
-		# The user provided a list of pages titles, but we
-		# still need to query the API to get the page IDs.
-
-		my @some_pages = @tracked_pages;
-		while (@some_pages) {
-			my $last = 50;
-			if ($#some_pages < $last) {
-				$last = $#some_pages;
-			}
-			my @slice = @some_pages[0..$last];
-			get_mw_first_pages(\@slice, \%pages);
-			@some_pages = @some_pages[51..$#some_pages];
-		}
-
-		if ($import_media) {
-			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
-		}
+		get_mw_tracked_pages(\%pages);
 	}
 	if (@tracked_categories) {
 		$user_defined = 1;
-		foreach my $category (@tracked_categories) {
-			if (index($category, ':') < 0) {
-				# Mediawiki requires the Category
-				# prefix, but let's not force the user
-				# to specify it.
-				$category = "Category:" . $category;
-			}
-			my $mw_pages = $mediawiki->list( {
-				action => 'query',
-				list => 'categorymembers',
-				cmtitle => $category,
-				cmlimit => 'max' } )
-			    || die $mediawiki->{error}->{code} . ': ' . $mediawiki->{error}->{details};
-			foreach my $page (@{$mw_pages}) {
-				$pages{$page->{title}} = $page;
-			}
-
-			if ($import_media) {
-				my @titles = map $_->{title}, @{$mw_pages};
-				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
-			}
-		}
+		get_mw_tracked_categories(\%pages);
 	}
 	if (!$user_defined) {
-		# No user-provided list, get the list of pages from
-		# the API.
+		get_mw_all_pages(\%pages);
+	}
+	return values(%pages);
+}
+
+sub get_mw_all_pages {
+	my $pages = shift;
+	# No user-provided list, get the list of pages from the API.
+	my $mw_pages = $mediawiki->list({
+		action => 'query',
+		list => 'allpages',
+		aplimit => 500
+	});
+	if (!defined($mw_pages)) {
+		print STDERR "fatal: could not get the list of wiki pages.\n";
+		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+		print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
+		exit 1;
+	}
+	foreach my $page (@{$mw_pages}) {
+		$pages->{$page->{title}} = $page;
+	}
+
+	if ($import_media) {
+		# Attach list of all pages for media files from the API,
+		# they are in a different namespace, only one namespace
+		# can be queried at the same moment
 		my $mw_pages = $mediawiki->list({
 			action => 'query',
 			list => 'allpages',
-			aplimit => 500,
+			apnamespace => get_mw_namespace_id("File"),
+			aplimit => 500
 		});
 		if (!defined($mw_pages)) {
-			print STDERR "fatal: could not get the list of wiki pages.\n";
+			print STDERR "fatal: could not get the list of pages for media files.\n";
 			print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
 			print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
 			exit 1;
 		}
 		foreach my $page (@{$mw_pages}) {
-			$pages{$page->{title}} = $page;
+			$pages->{$page->{title}} = $page;
+		}
+	}
+}
+
+sub get_mw_tracked_pages {
+	my $pages = shift;
+	# The user provided a list of pages titles, but we
+	# still need to query the API to get the page IDs.
+	my @some_pages = @tracked_pages;
+	while (@some_pages) {
+		my $last = 50;
+		if ($#some_pages < $last) {
+			$last = $#some_pages;
+		}
+		my @slice = @some_pages[0..$last];
+		get_mw_first_pages(\@slice, \%{$pages});
+		@some_pages = @some_pages[51..$#some_pages];
+	}
+
+	if ($import_media) {
+		get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%{$pages});
+	}
+}
+
+sub get_mw_tracked_categories {
+	my $pages = shift;
+	foreach my $category (@tracked_categories) {
+		if (index($category, ':') < 0) {
+			# Mediawiki requires the Category
+			# prefix, but let's not force the user
+			# to specify it.
+			$category = "Category:" . $category;
+		}
+		my $mw_pages = $mediawiki->list( {
+			action => 'query',
+			list => 'categorymembers',
+			cmtitle => $category,
+			cmlimit => 'max' } )
+			|| die $mediawiki->{error}->{code} . ': '
+				. $mediawiki->{error}->{details};
+		foreach my $page (@{$mw_pages}) {
+			$pages->{$page->{title}} = $page;
 		}
 
 		if ($import_media) {
-			# Attach list of all pages for media files from the API,
-			# they are in a different namespace, only one namespace
-			# can be queried at the same moment
-			my $mw_pages = $mediawiki->list({
-				action => 'query',
-				list => 'allpages',
-				apnamespace => get_mw_namespace_id("File"),
-				aplimit => 500
-			});
-			if (!defined($mw_pages)) {
-				print STDERR "fatal: could not get the list of pages for media files.\n";
-				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
-				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
-				exit 1;
-			}
-			foreach my $page (@{$mw_pages}) {
-				$pages{$page->{title}} = $page;
-			}
+			my @titles = map $_->{title}, @{$mw_pages};
+			get_mw_pages_for_linked_mediafiles(\@titles, \%{$pages});
 		}
 	}
-	return values(%pages);
 }
 
 sub get_mw_pages_for_linked_mediafiles {
-- 
1.7.10.2.552.gaa3bb87

* Re: [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments
  2012-06-12 20:14 [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments Pavel Volek
  2012-06-12 20:14 ` [PATCHv3 2/2] git-remote-mediawiki: refactoring get_mw_pages function Pavel Volek
@ 2012-06-12 21:05 ` Junio C Hamano
  2012-06-13 13:37   ` volekp
  1 sibling, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2012-06-12 21:05 UTC
  To: Pavel Volek; +Cc: git, NGUYEN Kim Thuat, ROUCHER IGLESIAS Javier, Matthieu Moy

Pavel Volek <Pavel.Volek@ensimag.imag.fr> writes:

> From: Pavel VOlek <Pavel.Volek@ensimag.imag.fr>

Did you really mean this?  It does not match your S-o-b: line below.

>
> The current version of git-remote-mediawiki supports only import and export
> of pages; it does not support import and export of file attachments, which
> are also exposed by the MediaWiki API. This patch adds the functionality to
> import file attachments and the description pages for these files.
>
> Changes version2 -> version3:
> Fixes in comments.
> Variable '$file' -> '$file_content' refactoring to be clearer.

These three lines do not belong here above the three-dash lines, I think.

> Signed-off-by: Pavel Volek <Pavel.Volek@ensimag.imag.fr>
> Signed-off-by: NGUYEN Kim Thuat <Kim-Thuat.Nguyen@ensimag.imag.fr>
> Signed-off-by: ROUCHER IGLESIAS Javier <roucherj@ensimag.imag.fr>
> Signed-off-by: Matthieu Moy <Matthieu.Moy@imag.fr>
> ---
>  contrib/mw-to-git/git-remote-mediawiki | 223 ++++++++++++++++++++++++++++++++-
>  1 file changed, 218 insertions(+), 5 deletions(-)
>
> diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
> index c18bfa1..04d3959 100755
> --- a/contrib/mw-to-git/git-remote-mediawiki
> +++ b/contrib/mw-to-git/git-remote-mediawiki
> @@ -13,9 +13,6 @@
>  #
>  # Known limitations:
>  #
> -# - Only wiki pages are managed, no support for [[File:...]]
> -#   attachments.
> -#

Nice.

> @@ -71,6 +68,9 @@ chomp(@tracked_pages);
>  my @tracked_categories = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".categories"));
>  chomp(@tracked_categories);
>  
> +# Import media files too.
> +my $import_media = run_git("config --get --bool remote.". $remotename .".mediaimport");
> +
>  my $wiki_login = run_git("config --get remote.". $remotename .".mwLogin");
>  # TODO: ideally, this should be able to read from keyboard, but we're
>  # inside a remote helper, so our stdin is connect to git, not to a
> @@ -225,6 +225,10 @@ sub get_mw_pages {
>  			get_mw_first_pages(\@slice, \%pages);
>  			@some_pages = @some_pages[51..$#some_pages];
>  		}
> +
> +		if ($import_media) {
> +			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
> +		}

I am guessing that the loop above is to avoid fetching and
processing too many pages at once.  Doesn't the call to
get_mw_pages_for_linked_mediafiles() need a similar consideration,
or what the function does is significantly different from what
get_mw_first_pages() does and there is no need to worry?

By the way, does it really have to be that overly long name?

> @@ -244,6 +248,11 @@ sub get_mw_pages {
>  			foreach my $page (@{$mw_pages}) {
>  				$pages{$page->{title}} = $page;
>  			}
> +
> +			if ($import_media) {
> +				my @titles = map $_->{title}, @{$mw_pages};
> +				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
> +			}
>  		}
>  	}
>  	if (!$user_defined) {
> @@ -263,10 +272,186 @@ sub get_mw_pages {
>  		foreach my $page (@{$mw_pages}) {
>  			$pages{$page->{title}} = $page;
>  		}
> +
> +		if ($import_media) {
> +			# Attach list of all pages for media files from the API,
> +			# they are in a different namespace, only one namespace
> +			# can be queried at the same moment
> +			my $mw_pages = $mediawiki->list({
> +				action => 'query',
> +				list => 'allpages',
> +				apnamespace => get_mw_namespace_id("File"),
> +				aplimit => 500
> +			});
> +			if (!defined($mw_pages)) {
> +				print STDERR "fatal: could not get the list of pages for media files.\n";
> +				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
> +				exit 1;
> +			}
> +			foreach my $page (@{$mw_pages}) {
> +				$pages{$page->{title}} = $page;
> +			}
> +		}

For categories you need to call pages-for-mediafiles with the titles
you learned (the hunk starting at l.224), but there is no need to
call pages-for-mediafiles in this hunk?

Not a rhetorical question to suggest that you should; just
wondering.

> +sub get_mw_pages_for_linked_mediafiles {
> +	my $titles = shift;
> +	my @titles = @{$titles};

Do you really need to make a copy of this array?  Wouldn't it
suffice to say

	my $mw_titles = join('|', @$titles);

at the only location in this function that uses this parameter?

> +	my $pages = shift;
> +
> +	# pattern 'page1|page2|...' required by the API
> +	my $mw_titles = join('|', @titles);

Nobody seems to be making sure there won't be more than 500 (I am
assuming that this script is considered a 'bot) pages in $mw_titles
variable.  Shouldn't the API call be split into such batches?

> +	# Media files could be included or linked from
> +	# a page, get all related
> +	my $query = {
> +		action => 'query',
> +		prop => 'links|images',
> +		titles => $mw_titles,
> +		plnamespace => get_mw_namespace_id("File"),
> +		pllimit => 500
> +	};
> +	my $result = $mediawiki->api($query);
> +
> +	while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
> +		my @titles;
> +		if (defined($page->{links})) {
> +			my @link_titles = map $_->{title}, @{$page->{links}};
> +			push(@titles, @link_titles);
> +		}
> +		if (defined($page->{images})) {
> +			my @image_titles = map $_->{title}, @{$page->{images}};
> +			push(@titles, @image_titles);
> +		}
> +		if (@titles) {
> +			get_mw_first_pages(\@titles, \%{$pages});
> +		}
> +	}
> +}
> +
> +# Return MediaWiki id for a canonical namespace name.
> +# Ex.: "File", "Project".
> +# Looks for the namespace id in the local configuration
> +# variables, if it is not found asks MW API.
> +sub get_mw_namespace_id {
> +	mw_connect_maybe();
> +
> +	my $name = shift;
> +
> +	# Look at configuration file, if the record
> +	# for that namespace is already stored.
> +	# Namespaces are stored in form: "Name_of_namespace:Id_namespace",
> +	# Ex.: "File:6".
> +	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
> +	chomp(@tracked_namespaces);
> +	if (@tracked_namespaces) {
> +		foreach my $ns (@tracked_namespaces) {
> +			my @ns_split = split(/:/, $ns);
> +			if ($ns_split[0] eq $name) {
> +				return $ns_split[1];
> +			}
> +		}
> +	}

Does it make sense to call out to run_git() every time this function
is called?  Shouldn't this part be caching the result in a hash,
something like

	if (!exists $namespace_id{$name}) {
		@temp = ... run_git() ...;
                foreach my $ns (@temp) {
			my ($n, $s) = split(/:/, $ns);
			$namespace_id{$n} = $s;
		}
	}

	if (!exists $namespace_id{$name}) {
		... similarly, ask MW API and store in %namespace_id{}
	}

        if (exists $namespace_id{$name}) {
        	return $namespace_id{$name};
	}
	die "No such namespace $name";

* Re: [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments
  2012-06-12 21:05 ` [PATCHv3 1/2] git-remote-mediawiki: import "File:" attachments Junio C Hamano
@ 2012-06-13 13:37   ` volekp
  0 siblings, 0 replies; 4+ messages in thread
From: volekp @ 2012-06-13 13:37 UTC
  To: Junio C Hamano
  Cc: Pavel Volek, git, NGUYEN Kim Thuat, ROUCHER IGLESIAS Javier,
	Matthieu Moy

On Tue, 12 Jun 2012 14:05:21 -0700, Junio C Hamano wrote:
> Pavel Volek <Pavel.Volek@ensimag.imag.fr> writes:
>
>> From: Pavel VOlek <Pavel.Volek@ensimag.imag.fr>
>
> Did you really mean this?  It does not match your S-o-b: line below.

You're right, I will change it.

>>
>> The current version of git-remote-mediawiki supports only import and export
>> of pages; it does not support import and export of file attachments, which
>> are also exposed by the MediaWiki API. This patch adds the functionality to
>> import file attachments and the description pages for these files.
>>
>> Changes version2 -> version3:
>> Fixes in comments.
>> Variable '$file' -> '$file_content' refactoring to be clearer.
>
> These three lines do not belong here above the three-dash lines, I 
> think.

OK. But should I mention somewhere the changes I made from the previous
version of the patch to the current one?

>> @@ -71,6 +68,9 @@ chomp(@tracked_pages);
>>  my @tracked_categories = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".categories"));
>>  chomp(@tracked_categories);
>>
>> +# Import media files too.
>> +my $import_media = run_git("config --get --bool remote.". $remotename .".mediaimport");
>> +
>>  my $wiki_login = run_git("config --get remote.". $remotename .".mwLogin");
>>  # TODO: ideally, this should be able to read from keyboard, but we're
>>  # inside a remote helper, so our stdin is connect to git, not to a
>> @@ -225,6 +225,10 @@ sub get_mw_pages {
>>  			get_mw_first_pages(\@slice, \%pages);
>>  			@some_pages = @some_pages[51..$#some_pages];
>>  		}
>> +
>> +		if ($import_media) {
>> +			get_mw_pages_for_linked_mediafiles(\@tracked_pages, \%pages);
>> +		}
>
> I am guessing that the loop above is to avoid fetching and
> processing too many pages at once.  Doesn't the call to
> get_mw_pages_for_linked_mediafiles() need a similar consideration,
> or what the function does is significantly different from what
> get_mw_first_pages() does and there is no need to worry?
>
> By the way, does it really have to be that overly long name?

Yes, it would be better to split the processing in a similar way. Right now
there is a possibility that, because of the MediaWiki limits, we won't get
all the media files.
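
Something along these lines, perhaps (just a sketch, not tested; it reuses the
existing $mediawiki handle, get_mw_namespace_id() and get_mw_first_pages(), and
slices the titles into chunks of 50 like the existing get_mw_first_pages()
loop, which also keeps each query well under the API limits):

sub get_mw_pages_for_linked_mediafiles {
	my $titles = shift;
	my $pages = shift;

	my @remaining = @{$titles};
	while (@remaining) {
		# Query at most 50 titles per request, as the existing
		# get_mw_first_pages loop already does.
		my $last = $#remaining < 49 ? $#remaining : 49;
		my @slice = @remaining[0..$last];
		@remaining = @remaining[($last + 1)..$#remaining];

		# pattern 'page1|page2|...' required by the API
		my $mw_titles = join('|', @slice);

		# Media files can be included in or linked from a page,
		# ask for both kinds of relations for this slice.
		my $result = $mediawiki->api({
			action => 'query',
			prop => 'links|images',
			titles => $mw_titles,
			plnamespace => get_mw_namespace_id("File"),
			pllimit => 500
		});

		while (my ($id, $page) = each(%{$result->{query}->{pages}})) {
			my @media_titles;
			if (defined($page->{links})) {
				push(@media_titles,
				     map { $_->{title} } @{$page->{links}});
			}
			if (defined($page->{images})) {
				push(@media_titles,
				     map { $_->{title} } @{$page->{images}});
			}
			if (@media_titles) {
				get_mw_first_pages(\@media_titles, $pages);
			}
		}
	}
}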

For the name of the method I wanted to be precise and clear; I didn't find
a more fitting name. Any suggestions are welcome.

>> @@ -244,6 +248,11 @@ sub get_mw_pages {
>>  			foreach my $page (@{$mw_pages}) {
>>  				$pages{$page->{title}} = $page;
>>  			}
>> +
>> +			if ($import_media) {
>> +				my @titles = map $_->{title}, @{$mw_pages};
>> +				get_mw_pages_for_linked_mediafiles(\@titles, \%pages);
>> +			}
>>  		}
>>  	}
>>  	if (!$user_defined) {
>> @@ -263,10 +272,186 @@ sub get_mw_pages {
>>  		foreach my $page (@{$mw_pages}) {
>>  			$pages{$page->{title}} = $page;
>>  		}
>> +
>> +		if ($import_media) {
>> +			# Attach list of all pages for media files from the API,
>> +			# they are in a different namespace, only one namespace
>> +			# can be queried at the same moment
>> +			my $mw_pages = $mediawiki->list({
>> +				action => 'query',
>> +				list => 'allpages',
>> +				apnamespace => get_mw_namespace_id("File"),
>> +				aplimit => 500
>> +			});
>> +			if (!defined($mw_pages)) {
>> +				print STDERR "fatal: could not get the list of pages for media files.\n";
>> +				print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
>> +				print STDERR "fatal: make sure '$url/api.php' is a valid page.\n";
>> +				exit 1;
>> +			}
>> +			foreach my $page (@{$mw_pages}) {
>> +				$pages{$page->{title}} = $page;
>> +			}
>> +		}
>
> For categories you need to call pages-for-mediafiles with the titles
> you learned (the hunk starting at l.224), but there is no need to
> call pages-for-mediafiles in this hunk?

Actually it is not necessary. Here the user doesn't specify the pages or
categories to be imported, so the whole wiki is imported. That's why we can
skip searching for relations between pages and files and directly ask for
all files.

>> +sub get_mw_pages_for_linked_mediafiles {
>> +	my $titles = shift;
>> +	my @titles = @{$titles};
>
> Do you really need to make a copy of this array?  Wouldn't it
> suffice to say
>
> 	my $mw_titles = join('|', @$titles);
>
> at the only location in this function that uses this parameter?

You are right, I will change it.

>> +	my $pages = shift;
>> +
>> +	# pattern 'page1|page2|...' required by the API
>> +	my $mw_titles = join('|', @titles);
>
> Nobody seems to be making sure there won't be more than 500 (I am
> assuming that this script is considered a 'bot) pages in $mw_titles
> variable.  Shouldn't the API call be split into such batches?

Right, it would be better.

>> +	# Media files could be included or linked from
>> +	# a page, get all related
>> +	my $query = {
>> +		action => 'query',
>> +		prop => 'links|images',
>> +		titles => $mw_titles,
>> +		plnamespace => get_mw_namespace_id("File"),
>> +		pllimit => 500
>> +	};
>> +	my $result = $mediawiki->api($query);
>> +
>
>
>> +
>> +# Return MediaWiki id for a canonical namespace name.
>> +# Ex.: "File", "Project".
>> +# Looks for the namespace id in the local configuration
>> +# variables, if it is not found asks MW API.
>> +sub get_mw_namespace_id {
>> +	mw_connect_maybe();
>> +
>> +	my $name = shift;
>> +
>> +	# Look at configuration file, if the record
>> +	# for that namespace is already stored.
>> +	# Namespaces are stored in form: "Name_of_namespace:Id_namespace",
>> +	# Ex.: "File:6".
>> +	my @tracked_namespaces = split(/[ \n]/, run_git("config --get-all remote.". $remotename .".namespaces"));
>> +	chomp(@tracked_namespaces);
>> +	if (@tracked_namespaces) {
>> +		foreach my $ns (@tracked_namespaces) {
>> +			my @ns_split = split(/:/, $ns);
>> +			if ($ns_split[0] eq $name) {
>> +				return $ns_split[1];
>> +			}
>> +		}
>> +	}
>
> Does it make sense to call out to run_git() every time this function
> is called?  Shouldn't this part be caching the result in a hash,
> something like
>
> 	if (!exists $namespace_id{$name}) {
> 		@temp = ... run_git() ...;
>                 foreach my $ns (@temp) {
> 			my ($n, $s) = split(/:/, $ns);
> 			$namespace_id{$n} = $s;
> 		}
> 	}
>
> 	if (!exists $namespace_id{$name}) {
> 		... similarly, ask MW API and store in %namespace_id{}
> 	}
>
>         if (exists $namespace_id{$name}) {
>         	return $namespace_id{$name};
> 	}
> 	die "No such namespace $name";

Good idea, I will look closer on it!
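
Something like this, maybe (only a sketch on top of the current code; it
assumes a new file-scope hash %namespace_id and reuses the existing
$remotename, $mediawiki, run_git() and mw_connect_maybe()):

my %namespace_id;

sub get_mw_namespace_id {
	mw_connect_maybe();
	my $name = shift;

	if (!exists $namespace_id{$name}) {
		# First fill the cache from the local configuration;
		# entries are stored as "Name_of_namespace:Id_namespace".
		foreach my $ns (split(/\n/, run_git("config --get-all remote."
				. $remotename . ".namespaces"))) {
			my ($n, $id) = split(/:/, $ns);
			$namespace_id{$n} = $id if defined $id;
		}
	}

	if (!exists $namespace_id{$name}) {
		# Not in the config: ask the MediaWiki API once and cache
		# every canonical namespace it reports.
		my $result = $mediawiki->api({
			action => 'query',
			meta => 'siteinfo',
			siprop => 'namespaces'
		});
		foreach my $ns (values %{$result->{query}->{namespaces}}) {
			if (defined($ns->{canonical})) {
				$namespace_id{$ns->{canonical}} = $ns->{id};
			}
		}
		if (exists $namespace_id{$name}) {
			# Remember the answer for future runs.
			run_git("config --add remote." . $remotename
				. ".namespaces " . $name . ":"
				. $namespace_id{$name});
		}
	}

	if (exists $namespace_id{$name}) {
		return $namespace_id{$name};
	}
	die "No such namespace $name on MediaWiki.";
}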

Thanks for your review!
